[chef] Re: Re: Right sizing Chef11 Server


  • From: Mark Mzyk < >
  • To:
  • Subject: [chef] Re: Re: Right sizing Chef11 Server
  • Date: Sat, 31 Aug 2013 10:43:27 -0400

Hey Michael,

This isn't my area of expertise, so I'll poke one of the other Chef devs who knows more to weigh in, but I'm curious if you can provide more info on your setup. I'd like to know what your Chef 10 setup looked like, just for comparison to Chef 11. While your Chef 11 setup seems to be hurting, what was the baseline hardware/number of boxes you had with Chef 10? Also, for complete clarity - this is open source Chef, right? I just want to make sure we're all on the same page.

On to actually trying to solve your problem. Having everything on a single box might be causing some of the backup. While Chef 11 is much more performant than Chef 10, if you're throwing 5000 nodes at it with everything on a single box, that might hurt some. Typically we haven't seen much need to tune postgres. You might need to look at upping the connection count on postgres, but as far as I know that is usually the only tuning that is done.

I'm not aware of much rabbit tuning that typically happens either, but Solr, which sits on the other end of rabbit, might need some tuning. Out of the box it has some fairly vanilla settings, so you might see improvements if you look there.

What Jeff said is valid. Cutting down on node data sent frees up not only network but what Solr has to ingest.
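
To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes the 28KB Ohai figure from Jeff's mail is representative of every node and that each node saves once per 30-minute run; the function name and the 10% whitelist ratio are made up for illustration, so treat the result as an order of magnitude only.

```python
def node_data_per_hour_kb(nodes, payload_kb, interval_minutes):
    """Approximate KB of node data saved to the server per hour.

    Assumes every node saves one payload of payload_kb per run and
    runs once every interval_minutes.
    """
    runs_per_hour = 60 / interval_minutes
    return nodes * payload_kb * runs_per_hour


# Figures from this thread: 2500 nodes, ~28KB of Ohai data, 30-minute cycle.
full = node_data_per_hour_kb(2500, 28, 30)

# Hypothetical: a whitelist that keeps about 10% of the data.
trimmed = node_data_per_hour_kb(2500, 28 * 0.10, 30)

print(f"full node data: {full:,.0f} KB/hour")
print(f"whitelisted:    {trimmed:,.0f} KB/hour")
```

Everything that is not saved is also something Solr never has to index, which is the second half of the gain.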

Could you possibly do some more monitoring on the box and try to figure out where the bottleneck is? That would certainly make it easier to give recommendations.

In the meantime I'll ask one of the other engineers to weigh in. I'll also follow up and see if we can't get a doc page on ways to tune Chef, as that seems like it could prove helpful.

- Mark Mzyk
Opscode Software Engineer


Jeff Blaine
August 31, 2013 10:20 AM

One of the first things that comes to mind, having nothing else
to offer aside from "start finding the bottleneck", is reduction
of node data saved with every Chef run. That might help.

See: https://github.com/opscode/whitelist-node-attrs

"Allows you to provide a whitelist of node attributes to save on the server. All of the attributes are still available throughout the chef run, but only those specifically listed will be saved to the server."
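
The idea in that description can be sketched in a few lines of Python. This is not the cookbook's code (the cookbook itself is Ruby) and the attribute names and whitelist shape below are invented for illustration; it only shows the general technique of keeping the listed parts of a nested attribute tree before saving.

```python
def filter_by_whitelist(attrs, whitelist):
    """Return only the parts of attrs that the whitelist names.

    A whitelist value of True keeps the whole subtree under that key;
    a nested dict recurses into the matching subtree. Keys that are
    not listed are dropped.
    """
    kept = {}
    for key, rule in whitelist.items():
        if key not in attrs:
            continue
        if rule is True:
            kept[key] = attrs[key]
        elif isinstance(rule, dict) and isinstance(attrs[key], dict):
            kept[key] = filter_by_whitelist(attrs[key], rule)
    return kept


# Hypothetical node attributes, loosely shaped like Ohai output.
node_attrs = {
    "fqdn": "web01.example.com",
    "platform": "centos",
    "network": {"interfaces": {"eth0": {"mtu": "1500"}, "lo": {"mtu": "16436"}}},
    "kernel": {"modules": {"ext4": {}}},
}

# Hypothetical whitelist: keep the FQDN and only the eth0 interface.
whitelist = {"fqdn": True, "network": {"interfaces": {"eth0": True}}}

saved = filter_by_whitelist(node_attrs, whitelist)
print(saved)
```

The full attribute set stays available during the run; only the filtered copy would go to the server.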

Ohai's full output on the CentOS 6.4 box I just tested on
returns 28KB(!) of data, 99% of which I have never wanted
to query the server for yet. So you could find at least some
I/O gain by whitelisting most of it based on your needs. If
your needs change, change the whitelist.

Michael deMan
August 31, 2013 4:30 AM
Hi All,
I am wondering about any guidelines on 'right sizing' a Chef11 server. I
understand that your mileage may vary, but a popular community-supported
product that also has a commercial edition usually comes with at least basic
guidelines.

My situation is that we have approximately 2500 nodes distributed across four
data centers. We have about 350ms round trip to the worst case data center.

What we did was turn up a single instance of Chef11 with 8-CPU and 32GB RAM.
Guy before me went to all the Chef conferences, and I guess he must have drunk
the Kool-Aid because we migrated from Chef10, added a few hundred nodes, and
Chef11 tipped over with '500'.

In all fairness, our original setup was set to have all nodes converge within a
5-minute splay time with a standard 30 minute cycle time. Meanwhile, our
expectation was that Chef11 performs better. We also moved everything (no more
Couch, etc) - onto a single server.

Workaround we did was to increase splay time to 30-minutes within 30-minute
schedule for now.
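
As a rough illustration of why the wider splay helps, the Python sketch below estimates how many chef-client runs start per second. It assumes run starts are spread evenly across the splay window, which real clients only approximate, and the function name is made up for this example. Each run also makes several API requests, so the request rate on the server is a multiple of these numbers.

```python
def run_starts_per_second(nodes, window_minutes):
    """Average chef-client run starts per second if all nodes start
    within a window of window_minutes, spread evenly."""
    return nodes / (window_minutes * 60)


# Original setup: 2500 nodes converging within a 5-minute window.
print(f"{run_starts_per_second(2500, 5):.2f} run starts/sec")

# Workaround: the same nodes spread across the full 30 minutes.
print(f"{run_starts_per_second(2500, 30):.2f} run starts/sec")

# Target use case: 5000 nodes across 30 minutes.
print(f"{run_starts_per_second(5000, 30):.2f} run starts/sec")
```

Widening the window from 5 to 30 minutes cuts the average start rate by a factor of six for the same node count.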

My impression is that we just installed Chef11 and did not spend any time
tuning the right knobs? I have seen some posts where Postgres and such is
supposed to auto-size itself, but apparently that is based only on installation
and re-sizing does not work?

Sorry for lengthy post, sometimes context helps. Questions are:



For a use case of:
5000 nodes
5 data centers
expected latency being around 300ms


Are there any knobs, dials, or other things that should be tuned to ensure that
a single Chef11 instance can handle that? pg-sql, rabbit tuning jump to the
forefront for me.


Thanks in advance for help since this is my first post to the community and
generally like your product.

- Michael deMan


