- From: "Michael DeMan (CHF)" <
>
- To:
- Subject: [chef] Re: Right sizing Chef11 Server
- Date: Tue, 3 Sep 2013 08:43:50 -0700
Hi All,
Thanks for the responses.
We are going to schedule a window to change the splay time from 30 minutes
back to 5 minutes, try to reproduce the problem and look at things in more
detail.
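For reference, the schedule change is just the interval/splay pair in the
client config; roughly along these lines (values shown only to illustrate the
schedule described above):

    # /etc/chef/client.rb (sketch)
    interval 1800   # start a chef-client run every 30 minutes
    splay 300       # stagger each run by up to 5 minutes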
Trimming down the amount of data ohai posts is another good idea, but we are
going to try that as a last resort.
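If we do end up trimming ohai, my understanding is that it mostly comes down
to disabling unneeded Ohai plugins in the client config, something like the
following (the plugin names are only examples; this is the Ohai 6 string-name
style our 11.4.4 clients use):

    # /etc/chef/client.rb (sketch; plugin names are examples only)
    Ohai::Config[:disabled_plugins] = ["passwd", "ec2", "rackspace"]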
Other background information is:
Original Chef10...
- was split out with separate Couch, API, web server, etc.
- the hosts have been decommissioned, and I seem to recall they had a variety
of CPU/RAM configurations depending on their purpose.
New Chef11...
- Single server, 8-core, 32 GB RAM, running Chef server 11.0.8 on CentOS 6.4;
almost all the clients are 11.4.4.
- The 500s were reported by the clients when connecting to the server; they
started somewhere as we grew from about 1200 clients to about 1400 clients.
- The 500s were reported on all clients.
When it happened we were a bit panicked, and intuitively decided to reduce
client load (moving the splay time from 5 minutes to 30 minutes), which fixed
the problem, but...
- No processes jumped out at us as 'hot'
- Nothing jumped out at us in the logs, but we were not really sure what to
be looking for; we know more now.
- There were plenty of CPU and RAM resources available on the host.
- We did not see any file/socket resource constraints via lsof.
- WAN connectivity to the data centers all seemed fine.
The fact that there was plenty of unused CPU/RAM available on the server, yet
it seemed it could not keep up, is what sent us in the direction of wondering
about tuning.
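For what it is worth, the first knobs we plan to look at are the ones Seth
mentions below; a rough sketch of what that might look like in
/etc/chef-server/chef-server.rb (the numbers here are placeholders, not
recommendations):

    # /etc/chef-server/chef-server.rb (sketch; numbers are placeholders)
    erchef['db_pool_size'] = 40          # db connections erchef opens at startup
    postgresql['max_connections'] = 200  # must leave headroom above all pools
    # apply with: chef-server-ctl reconfigure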
Thanks,
- Mike
On Sep 2, 2013, at 9:02 PM, Seth Falcon wrote:
> Hi Michael,
>
> Michael DeMan (CHF) writes:
> > What we did was turn up a single instance of Chef11 with 8-CPU and 32GB RAM.
> > The guy before me went to all the Chef conferences, and I guess he must have
> > drunk the Kool-Aid, because we migrated from Chef10, added a few hundred
> > nodes, and Chef11 tipped over with '500'.
>
> Providing details of how the server tipped over would help us help you. When
> you write "tipped over with '500'", do you mean when you added 500 nodes, or
> that you started to see 500 errors from the server?
>
> What do you observe when things tip over? Is there a hot process on the Chef
> server? What do you see in the logs (sudo chef-server-ctl tail)? What things,
> if any, have you tuned (what's in /etc/chef-server/chef-server.rb)?
>
> Upgrading from Chef 10 to Chef 11 for an infrastructure of your size was a
> smart move. From everything I've seen, if you were to compare apples to
> apples, 10 vs 11, you would see a large difference.
>
> > In all fairness, our original setup was set to have all nodes converge
> > within a 5-minute splay time with a standard 30-minute cycle time.
> > Meanwhile, our expectation was that Chef11 performs better. We also moved
> > everything (no more Couch, etc.) onto a single server.
> >
> > The workaround we did was to increase the splay time to 30 minutes within
> > the 30-minute schedule for now.
> >
> > My impression is that we just installed Chef11 and did not spend any time
> > tuning the right knobs. I have seen some posts where Postgres and such is
> > supposed to auto-size itself, but apparently that is based only on the
> > initial installation and re-sizing does not work?
> >
> > For a use case of:
> > - 5000 nodes
> > - 5 data centers
> > - expected latency of around 300ms
> >
> > Are there any knobs, dials, or other things that should be tuned to ensure
> > that a single Chef11 instance can handle that? pg-sql and rabbit tuning
> > jump to the forefront for me.
>
> Here are a few things to look into:
>
> 1. Search the erchef logs for an error message containing "no_connection".
> This is an indication that the pool of db client connections is exhausted.
> You can tune this via `erchef['db_pool_size']` in chef-server.rb. Keep in
> mind that erchef will open connections on startup and is ultimately limited
> by the configured max in postgres (see `postgresql['max_connections']`).
>
> 2. Do you have recipes that execute searches that typically return all or
> nearly all nodes in your infrastructure (a dead giveaway is a query like
> "*:*")? Such searches are relatively costly, since all of the node data
> will be fetched from the db and sent to the client. You can reduce the
> impact by making use of more focused queries and by using the partial
> search API.
>
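(For illustration, the difference might look roughly like this in a recipe,
assuming the partial_search helper from the partial_search cookbook; the
query and attribute names below are placeholders.)

    # Recipe sketch; query and keys are placeholders.
    # Heavy: returns complete node objects for every match.
    web_nodes = search(:node, "role:web")

    # Lighter: only the requested attributes come back for each match
    # (partial_search is provided by the partial_search cookbook).
    web_ips = partial_search(:node, "role:web",
                             :keys => { "ip" => ["ipaddress"] }).map { |n| n["ip"] }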
> 3. Review the size of your node data. You may be able to disable some ohai
> plugins and greatly reduce the size of the node data without losing data of
> interest.
>
> + seth
>
> --
> Seth Falcon | Development Lead | Opscode | @sfalcon