[chef] Re: Chef Server Hardware Reqs (was Re: Chef stability?)


  • From: Allan Carroll < >
  • Subject: [chef] Re: Chef Server Hardware Reqs (was Re: Chef stability?)
  • Date: Wed, 17 Nov 2010 17:27:35 -0700

That's likely the same problem I'm having. I've been trying to run my Chef server off of a machine with 700MB of RAM (EC2 micro instance).

This raises the larger question: what size of machine is recommended for running Chef? It seems to need a pretty beefy system with all the parts running.

-Allan

On Nov 17, 2010, at 4:52 PM, Blake Barnett wrote:

I found that solr would crash reliably if the machine had a shortage of memory.  If I increased the RAM allocated to the VM to ~2GB, it behaved much more reliably.

-Blake
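
[Editor's sketch] One way to check a server against the memory figure Blake mentions is to read the kernel's reported total. This is only an illustration, not something from the thread: it assumes a Linux host exposing /proc/meminfo, the function names are hypothetical, and the 1800 MB default is an assumed threshold set a little under 2GB because a "2GB" VM usually reports slightly less than 2048 MB.

```python
def parse_meminfo_kb(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def has_enough_ram(meminfo_text, minimum_mb=1800):
    """Return True if MemTotal is at least minimum_mb megabytes.

    minimum_mb is an assumed threshold (roughly the ~2GB from the thread),
    not an official Chef or Solr requirement.
    """
    total_kb = parse_meminfo_kb(meminfo_text).get("MemTotal", 0)
    return total_kb >= minimum_mb * 1024
```

On the server itself this could be called as has_enough_ram(open("/proc/meminfo").read()); a micro instance with well under 1GB would return False.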

On Nov 18, 2010, at 4:37 AM, Allan Carroll wrote:

Whew. That makes it seem tractable. Thanks for helping zero in on this.


Here's what I dug up:


solr-indexer.log has no real clues. 

Lots of these:

INFO: Indexing node 37192f37-447a-41c7-8480-c048c878743e from chef status error Connection refused - connect(2)}

and lots of these:

INFO: Indexing cookbook_version 2bd0feeb-3e32-4bb2-867c-41e0cfa12806 from chef status ok}
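
[Editor's sketch] To see how often the indexer fails versus succeeds, lines like the two above can be tallied by item type and status. This is a rough illustration that assumes every relevant line follows the "Indexing <type> <id> from chef status <status>" shape of the samples quoted here; the function name and regular expression are hypothetical, not part of Chef.

```python
import re
from collections import Counter

# Matches the shape of the two sample lines quoted in this message.
LINE_RE = re.compile(r"Indexing (\S+) (\S+) from chef status (\w+)")


def tally_index_status(lines):
    """Count log lines by (item type, status), ignoring lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            item_type, _item_id, status = match.groups()
            counts[(item_type, status)] += 1
    return counts
```

Passing it an open solr-indexer.log file object returns a Counter keyed by pairs such as ("node", "error"), which makes it easy to see whether the errors cluster on one item type.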

solr.log also doesn't seem to have anything interesting, but here's the last set of output before it went away last time:


Here's a typical failure from the server log:

 

On Nov 17, 2010, at 11:16 AM, Adam Jacob wrote:

On Wed, Nov 17, 2010 at 10:09 AM, < > wrote:
I've been working the past few days on tweaking my chef scripts to go into
production on EC2 and struggling to get anything I feel good about trusting.
Chef looks like a great tool with a strong community. I'm hoping that there's
some Chef way of looking at the world I haven't been exposed to that you can
all enlighten me on.

I'm running Ubuntu 10.10 on EC2 with the version of chef from the Opscode Lucid
repo (0.9.8).

A few things going on:

I can't seem to keep chef-solr or chef-solr-indexer from crashing. I keep
having to restart them for some reason. I'm using It makes everything feel
really flakey, but I'm not convinced that's the only thing I'm running into.

This is going to be the source of several problems - can you send us a
gist of what you get in the logs when these crash?

Sometimes the webui (and knife) show the status of all the nodes and sometimes
it refuses saying that I have no nodes (even though the node list shows there
are some there). The error in the logs is only the same 500 internal server
error: connection refused that I see for lots of things.

Those pages both use search - if you are seeing consistent failures of
Solr, that's the source of these issues.
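
[Editor's sketch] Since "connection refused" means nothing is accepting connections on the search port, a quick TCP probe can confirm whether Solr is up before blaming the webui or knife. This is a minimal illustration: the function name is hypothetical, and the host and port defaults (localhost, 8983, Solr's usual default) are assumptions that should be checked against the actual chef-solr configuration.

```python
import socket


def solr_is_listening(host="localhost", port=8983, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise.

    The default host and port are assumptions; adjust them to match the
    chef-solr configuration on the server being checked.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

A False result here would match the 500 "connection refused" errors described above and would point at the Solr process rather than at the API server.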

Running chef-client by hand on a machine causes a different result than letting
the timer driven version work. Like it forces the client to reevaluate all the
data bags and search results and actually apply them.

In what way?  The code paths here are identical for the most part.  If
you're using data bags and search in the recipes, and you are seeing
failures of Solr, I would wager that these differences are actually
just a representation of the search service not being stable for you.

Sometimes the clients get new data/nodes and update everything fine, sometimes
they don't.

Again, if it's data that comes from search, that's your issue.

Yesterday I started 8 boxes to bring a whole cluster up. On a few of them, Chef
just randomly stopped working. Running chef-client by hand finished building
the box correctly. One of them built part of a configuration file using data
from a node that I had deleted off the Chef server a few hours earlier and then
could never get out of that state. Deleting the configuration file and
rerunning client fixed it.

All the symptoms you talk about sound search related - so we should
focus there. :)

Anyway, all of these small, but annoying, little glitches give me a really bad
feeling about trusting Chef to manage my production infrastructure. Of the
tools I've looked at, it's the most promising.

Sorry to hear that, but it's been quite stable for us (and for lots of
other folks).  We'll get you fixed up.

I'd really like to, given the promise of such powerful ability when it works,
the time that I've put into it, and the time it will save. Is anyone using Chef
at a large scale? Does it take handholding and massaging along the way, and
that's just the price for cutting-edge technology that will be solved as the
code matures?

There are people using Chef at the scale of many thousands of systems,
and Opscode manages a production multi-tenant infrastructure that is
also quite significant, using many of the same components that are in
the open source Chef.

Happy to help - hook us up with the logs.

Best,
Adam

--
Opscode, Inc.
Adam Jacob, CTO
T: (206) 508-7449





