[chef] Re: Re: Re: Re: Re: Re: Re: Re: [chef-dev] Re: Performance Tuning for Open Source Chef Server 11 to serve hundreds of nodes


Chronological Thread 
  • From: Jesse Hu
  • To:
  • Subject: [chef] Re: Re: Re: Re: Re: Re: Re: Re: [chef-dev] Re: Performance Tuning for Open Source Chef Server 11 to serve hundreds of nodes
  • Date: Mon, 01 Sep 2014 14:16:50 +0800

Hi Stephen,

Thanks for your valuable tips.  There is a Chef bug (CHEF-ISSUE-1904): chef-client doesn't retry when it gets HTTP 50x errors from the Chef Server. I made a patch for it: https://github.com/opscode/chef/pull/1912

With this patch, I'm able to provision a 500-node cluster using Chef 11 without any chef-client exceptions.  From the chef-client output, the chef-clients retried about 100+ times in total, and a single chef-client retried at most 2 times.
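
For anyone hitting this before the patch lands, the retry behaviour itself can also be tuned from client.rb. A minimal sketch, assuming the stock Chef 11 option names (http_retry_count / http_retry_delay are the standard Chef::Config settings, not anything added by my patch):

==== /etc/chef/client.rb (retry tuning sketch) ====

# how many times chef-client retries a failed Chef Server API request (stock default is 5)
http_retry_count 5
# seconds to wait between those retries (stock default is 5)
http_retry_delay 5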

Here is my chef version and configuration:

Open Source Chef Server 11.1.4 on CentOS 5.10 with a 4-core CPU and 8 GB of memory
Chef Client 11.14.2

==== /etc/chef-server/chef-server.rb ====

chef_server_webui['enable'] = false
nginx['ssl_port'] = 9443
nginx['non_ssl_port'] = 9080
chef_solr['heap_size'] = 3072
chef_solr['commit_interval'] = 3000
chef_solr['poll_seconds'] = 6
chef_expander['nodes'] = 4
erchef['s3_parallel_ops_timeout'] = 30000
erchef['s3_url_ttl'] = 3600
erchef['depsolver_timeout'] = 5000
erchef['depsolver_worker_count'] = 10  # default is 5 (for 2 CPU cores), so I set it to 10 for my 4 CPU cores
erchef['db_pool_size'] = 250 # increased because my cookbooks make a few Chef Node save/search API calls during chef-client runs
postgresql['max_connections'] = 350


==== /etc/chef/client.rb  ====

log_location     STDOUT
chef_server_url  "https://hostname:9443"
validation_client_name "chef-validator"
node_name "mynode"

log_level :info
no_lazy_load true
ssl_verify_mode :verify_peer
ssl_ca_path "/etc/chef/trusted_certs"

# this is to reduce the chef node data size for speeding up Chef Search API calls
Ohai::Config[:disabled_plugins] = [:Azure, :Filesystem, :Cloudv2, :Virtualization, :Virtualizationinfo, :Dmi, :Zpools, :Blockdevice, :Lsb, :Nodejs, :Languages, :Php, :Lua, :Perl, :C, :Java, :Python, :Erlang, :Groovy, :Ruby, :Mono, :Os, :Openstack, :Cloud, :Rackspace, :Ps, :Command, :Initpackage, :Rootgroup, :Keys, :Sshhostkey, :Ohai, :Chef, :Ohaitime, :Passwd, :Gce, :Systemprofile, :Linode, :Ipscopes, :Eucalyptus, :Ec2] 
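
As a rough sanity check that trimming the Ohai data actually shrinks the node object, something like the chef-shell snippet below works; this is just a sketch, assuming `node` is the node object chef-shell exposes in client (-z) mode:

# run inside `chef-shell -z` on a client; prints the size of the node as it would be saved
require 'chef/json_compat'
puts "node JSON size: #{Chef::JSONCompat.to_json(node).bytesize} bytes"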

Thanks
Jesse Hu


Stephen Delano wrote on Fri, 1 Aug 2014 09:30:12 -0700:
Hi Steven,

I found that erchef['depsolver_worker_count'] is listed on http://docs.getchef.com/config_rb_chef_server_enterprise_optional_settings.html , but not on http://docs.getchef.com/config_rb_chef_server_optional_settings.html

Does that mean depsolver_worker_count is not applicable to Open Source Chef Server 11?
If it works for Open Source Chef Server 11, how can I tell how many depsolver workers are running after I set it in /etc/chef-server/chef-server.rb?

Yes, this also works in recent versions of Open Source Chef Server 11 (11.1+), and the docs should be updated to reflect that. A quick way to verify the number of depsolver workers that are running is to execute `pgrep -fl depselector`. You'll see the ruby processes that are waiting to receive cookbook dependency problems to solve.
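
In config form that ends up as something like this (the worker count is just the value from your chef-server.rb; the verification commands are shown as comments):

==== /etc/chef-server/chef-server.rb (depsolver worker sketch) ====

# each worker shows up as one ruby "depselector" process
erchef['depsolver_worker_count'] = 10

# after editing, apply and check:
#   chef-server-ctl reconfigure
#   pgrep -fl depselector   # should list roughly 10 ruby processes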
 
And erchef['depsolver_timeout'] = 5000 means the server-side timeout when resolving cookbook dependencies? What's the chef-client-side timeout when syncing cookbooks, and how do I increase it?

Correct, 5000ms is the time that erchef gives depselector / gecode to reach a solution given the current cookbook universe, run list, and environment constraints.

What do you mean by "client side timeouts"? What timeouts are you seeing?
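
(If it's a plain HTTP read timeout on the client, the usual knob would be rest_timeout in client.rb -- that's a guess on my part until we see the actual errors.)

==== /etc/chef/client.rb (timeout sketch) ====

# read timeout in seconds for chef-client's HTTP requests to the Chef Server (stock default is 300)
rest_timeout 300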
 

Thanks
Jesse Hu


Steven Danna wrote on Thu, 31 Jul 2014 09:25:22 +0100:
Hi,

On Thu, Jul 31, 2014 at 6:05 AM, Jesse Hu wrote:

If I add more CPU cores, will increasing erchef['depsolver_worker_count'] and
nginx['worker_processes'] solve the 503 issue? Are both of these attributes'
default values tied to the number of CPU cores?

nginx['worker_processes'] should automatically scale based on CPU
count if you run a chef-server-ctl reconfigure.  You will, however,
need to manually bump the depsolver_worker_count.
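
Concretely, the difference looks like this (the depsolver value is just an example; match it to whatever core count you end up with):

==== /etc/chef-server/chef-server.rb (scaling sketch) ====

# recalculated from CPU count on every `chef-server-ctl reconfigure`,
# so normally you leave it alone:
# nginx['worker_processes'] = 4

# NOT recalculated automatically -- bump it yourself when you add cores:
erchef['depsolver_worker_count'] = 10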


<jesse>  Yes, the chef-client exits after getting a 503.  How can I tell the
chef-client to retry syncing cookbooks or calling the Chef Node or Search APIs?
Is there a setting in /etc/chef/client.rb?  If not, I may re-spawn
chef-client when it gets a 503 error.

By default, any API calls should be retried if they receive an HTTP
response code in the 500-599 range.  However, I believe that the
calls that actually download the cookbook content from s3/bookshelf are
not retried.

I suspected depsolver because of the server-side error messages from
erchef, but perhaps you are seeing other errors as well that just
didn't show up in your tail command on the server.  It's possible that
bookshelf or nginx is also failing.

For the server side, the Chef support team wrote a handy script that I've
uploaded to a gist here:

https://gist.githubusercontent.com/stevendanna/279658e5fb3961f4b347/raw/1bf2afae25a05a0b3699ded6cb80139fa6250046/gistfile1.txt

If you run it with the argument OSC, it will create a tarball of the
last bits of all of your log files.  So if you generate some failures
and then run it, it should contain all the easily reachable evidence
of what is going on.  If you'd like, you could email me (not the whole
list) the tarball and I can try to take a look.  It would also be
helpful if you sent along an example of a chef-client run that is
failing with debug logging turned on.

Is there a limit on the number of concurrent chef-clients that a single Chef Server 11 can serve?  The number of depsolver workers seems to be the bottleneck, unless I give it more CPU cores.

For any system there is a limit to how many concurrent connections it
can handle, based on the available resources (CPU, RAM, etc.) and
various limits imposed by the operating system (max file handles, max
processes, etc.).  Since EC has a number of components it is hard to
say what the hard limit is for a given machine.

Depsolving is limited by the number of depsolver workers.  But, in
most cases, the tasks the workers perform complete quickly.  Further,
chef-client *should* retry 503s when depsolving, so the number of
concurrent /chef-client runs/ that can be handled typically far
exceeds the number of depsolver workers.  However, at some
point, you do just need to feed depsolver more CPUs.

<jesse>  I tried 'no_lazy_load false' in my 300-node cluster, and it still threw 503 errors.  I see there is a commit from 7 days ago that makes no_lazy_load the default.  As it describes, my chef-client might run for a long time (1-2 hours), so I think I'd better set no_lazy_load to true.

Since you are running your own server, however, you can set your
s3_url_ttl to anything you want, avoiding the problem of expired
links.  That said, I think it is probably best to definitively identify
where your bottleneck is before tweaking more settings.
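
Put together, the two ways of avoiding expired cookbook URLs look roughly like this (the 3600 simply mirrors the value from your chef-server.rb, and I'm assuming the TTL is in seconds):

==== sketch: avoiding expired bookshelf URLs ====

# /etc/chef/client.rb -- download every cookbook file up front, before the run starts
no_lazy_load true

# /etc/chef-server/chef-server.rb -- lifetime of the signed bookshelf download URLs
erchef['s3_url_ttl'] = 3600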

Cheers,

Steven




--
Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104




