chef - [chef] Re: Right sizing Chef11 Server


From: Seth Falcon
To:
Subject: [chef] Re: Right sizing Chef11 Server
Date: Mon, 02 Sep 2013 21:02:50 -0700

Hi Michael,

writes:
> What we did was turn up a single instance of Chef11 with 8-CPU and 32GB RAM.
> Guy before me went to all the Chef conferences, and I guess he must have
> drank the Kool-Aid because we migrated from Chef10, added a few hundred
> nodes, and Chef11 tipped over with '500'.

Providing details of how the server tipped over would help us help
you. When you write "tipped over with '500'", do you mean when you added
500 nodes, or that you started to see 500 errors from the server?

What do you observe when things tip over? Is there a hot process on the
Chef server? What do you see in the logs (sudo chef-server-ctl tail)?
What things, if any, have you tuned (what's in
/etc/chef-server/chef-server.rb)?

Upgrading from Chef 10 to Chef 11 for an infrastructure of your size was
a smart move. From everything I've seen, if you compare apples to
apples, 10 vs 11, you will see a large difference in server performance.

> In all fairness, our original setup was set to have all nodes converge
> within a 5-minute splay time with a standard 30 minute cycle time.
> Meanwhile, our expectation was that Chef11 performs better.  We also moved
> everything (no more Couch, etc) - onto a single server.
>
> Workaround we did was to increase splay time to 30-minutes within 30-minute
> schedule for now.
>
> My impression is that we just installed Chef11 and did not spend any time
> tuning the right knobs?  I have seen some posts where Postgres and such is
> supposed to auto-size itself, but apparently that is based only on
> installation and re-sizing does not work?
>
>
> For a use case of:
> 5000 nodes
> 5 data centers
> expected latency being around 300ms
>
>
> Are there any knobs, dials, or other things that should be tuned to ensure
> that a single Chef11 instance can handle that?   pg-sql, rabbit tuning jump
> to the forefront for me.

Here are a few things to look into:

1. Search the erchef logs for an error message containing
"no_connection". This is an indication that the pool of db client
connections is exhausted. You can tune this via `erchef['db_pool_size']`
in chef-server.rb. Keep in mind that erchef opens its pooled connections
on startup and is ultimately limited by the configured maximum in
PostgreSQL (see `postgresql['max_connections']`).

2. Do you have recipes that execute searches that typically return all
or nearly all nodes in your infrastructure (a dead giveaway is a query
like "*:*")? Such searches are relatively costly since all of the node
data will be fetched from the db and sent to the client. You can reduce
the impact by making use of more focused queries and by using the
partial search API.

3. Review the size of your node data. You may be able to disable some
ohai plugins and greatly reduce the size of the node data without losing
data of interest.
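
For item 1, the two settings would sit together in chef-server.rb along these lines. This is only a sketch: the numbers are illustrative assumptions, not recommendations, and the right values depend on your hardware and on what the logs show.

```ruby
# /etc/chef-server/chef-server.rb
# Illustrative values only (assumptions); size these for your own workload.

# Number of db client connections erchef keeps in its pool.
erchef['db_pool_size'] = 40

# Upper bound on connections PostgreSQL will accept. Keep this
# comfortably above the erchef pool size, since other services
# connect to the database as well.
postgresql['max_connections'] = 250
```

Changes to chef-server.rb take effect after running `sudo chef-server-ctl reconfigure`.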
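
For item 2, a recipe using partial search might look roughly like the following. It assumes the partial_search cookbook is available as a dependency of your cookbook; the `role:web` query and the key names on the left-hand side are hypothetical and only there for illustration.

```ruby
# Recipe fragment; assumes the partial_search cookbook is a dependency.
# 'role:web' is a hypothetical, more focused replacement for "*:*".
web_nodes = partial_search(:node, 'role:web',
                           :keys => {
                             'name' => ['name'],      # node name
                             'ip'   => ['ipaddress']  # a single attribute path
                           })

# Each result is a small hash holding only the requested keys,
# rather than the full node object.
web_nodes.each do |n|
  Chef::Log.info("web node #{n['name']} at #{n['ip']}")
end
```

Only the requested attributes are sent back to the client, which is where the savings come from when node objects are large.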

+ seth

--
Seth Falcon | Development Lead | Opscode | @sfalcon

[chef] Re: Right sizing Chef11 Server, Seth Falcon, 09/02/2013
- [chef] Re: Right sizing Chef11 Server, Michael DeMan (CHF), 09/03/2013