[chef] Re: Re: Re: Re: Re: Re: Re: Re: Re: [chef-dev] Re: PerformanceTuning for Open Source Chef Server 11 to serve hundreds of nodes


  • From: Jesse Hu
  • Subject: [chef] Re: Re: Re: Re: Re: Re: Re: Re: Re: [chef-dev] Re: PerformanceTuning for Open Source Chef Server 11 to serve hundreds of nodes
  • Date: Sat, 02 Aug 2014 16:40:54 +0800

Thanks Steven, James. 

I see there are 5 depselector processes, as expected.

$ pgrep -fl depselector
4243 ruby /opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb
4319 ruby /opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb
4321 ruby /opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb
4347 ruby /opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb
4349 ruby /opt/chef-server/embedded/service/erchef/lib/chef_objects-e6f9f10/priv/depselector_rb/depselector.rb

> Correct, 5000ms is the time that erchef gives depselector / gecode to reach a solution given the current cookbook universe, run list, and environment constraints.
>
> What do you mean by "client side timeouts"? What timeouts are you seeing?
 
What I mean is: if I increase depsolver_timeout to allow more time for resolving cookbook dependencies, do I also need to increase the chef-client timeout for syncing cookbooks, so that the chef-client timeout stays longer than depsolver_timeout?

Thanks
Jesse Hu


James Scott wrote on Fri, 1 Aug 2014 09:35:07 -0700:


On Fri, Aug 1, 2014 at 9:30 AM, Stephen Delano wrote:
Hi Steven,

I found that erchef['depsolver_worker_count'] is listed on http://docs.getchef.com/config_rb_chef_server_enterprise_optional_settings.html , but not on http://docs.getchef.com/config_rb_chef_server_optional_settings.html

Does this mean depsolver_worker_count is not applicable to Open Source Chef Server 11?
If it does work on Open Source Chef Server 11, how can I tell how many depsolver workers are running after I set it in /etc/chef-server/chef-server.rb?

Yes, this works as well in recent versions of Open Source Chef Server 11 (11.1+), and the docs should be updated to reflect this. A quick way to verify the number of depsolver workers that are running is to execute `pgrep -fl depselector`. You'll see the ruby processes that are waiting to receive cookbook dependency problems for solving.
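
As a rough illustration of the check described above, the `pgrep -fl depselector` output can be counted programmatically. This is only a sketch: `count_depsolver_workers` is a hypothetical helper name, not part of Chef, and it assumes each output line has the form "PID command line", as in the listing earlier in this thread.

```ruby
# Hypothetical helper (not part of Chef): count depsolver workers
# from the text printed by `pgrep -fl depselector`.
# Assumes each line looks like "<pid> <command line>".
def count_depsolver_workers(pgrep_output)
  pgrep_output.each_line.count do |line|
    pid, command = line.strip.split(' ', 2)
    # Keep only lines with a numeric PID whose command runs depselector.rb.
    pid.to_s.match?(/\A\d+\z/) && command.to_s.include?('depselector.rb')
  end
end

# Example use on a live server (requires pgrep to be installed):
#   puts count_depsolver_workers(`pgrep -fl depselector`)
```

The result should match the value of erchef['depsolver_worker_count'] once the server has been reconfigured.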
 
And does erchef['depsolver_timeout'] = 5000 set the server-side timeout for resolving cookbook dependencies? What is the chef-client side timeout when syncing cookbooks, and how can I increase it?

Correct, 5000ms is the time that erchef gives depselector / gecode to reach a solution given the current cookbook universe, run list, and environment constraints.

What do you mean by "client side timeouts"? What timeouts are you seeing?
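
For reference, the two settings discussed here would sit in /etc/chef-server/chef-server.rb roughly as follows. This is a minimal sketch: the worker count of 5 mirrors the five processes shown by pgrep earlier in the thread, and 5000 is the timeout value quoted above; neither is a recommendation.

```ruby
# /etc/chef-server/chef-server.rb
# Number of depselector ruby processes erchef keeps available (illustrative value).
erchef['depsolver_worker_count'] = 5
# Time, in milliseconds, that erchef gives depselector / gecode to reach a solution.
erchef['depsolver_timeout'] = 5000
```

Run chef-server-ctl reconfigure afterwards so the change takes effect.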
 

Thanks
Jesse Hu


Steven Danna wrote on Thu, 31 Jul 2014 09:25:22 +0100:
Hi,

On Thu, Jul 31, 2014 at 6:05 AM, Jesse Hu wrote:

If I add more CPU cores, will increasing erchef['depsolver_worker_count']
and nginx['worker_processes'] solve the 503 issue? Are the default values
of both attributes related to the number of CPU cores?

nginx['worker_processes'] should automatically scale based on CPU
count if you run a chef-server-ctl reconfigure.  You will, however,
need to manually bump the depsolver_worker_count.


<jesse>  Yes, the chef-client exits after getting a 503.  How can I tell the
chef-client to retry syncing cookbooks or calling the Chef Node or Search APIs?
Is there a setting in /etc/chef/client.rb?  If not, I may re-spawn
chef-client when it gets a 503 error.

By default, any API calls should be retried if they receive an HTTP
response code in the 500-599 range.  However, I believe that the
calls to actually download the cookbook content from s3/bookshelf are
not retried.
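
To answer the question about /etc/chef/client.rb: chef-client exposes http_retry_count and http_retry_delay settings that control these retries. The values below are illustrative only, and whether raising them helps depends on where the 503s originate.

```ruby
# /etc/chef/client.rb
# How many times a failed API request is retried (illustrative value).
http_retry_count 10
# Seconds to wait between retries (illustrative value).
http_retry_delay 10
```

As noted above, cookbook content downloads from s3/bookshelf may not go through this retry path.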

I suspected depsolver because of the serverside error messages from
erchef, but perhaps you are seeing other errors as well that just
didn't show up in your tail command on the server.  It's possible that
bookshelf or nginx is also failing.

For the server, the Chef support team wrote a handy script that I've
uploaded to a gist here:

https://gist.githubusercontent.com/stevendanna/279658e5fb3961f4b347/raw/1bf2afae25a05a0b3699ded6cb80139fa6250046/gistfile1.txt

If you run it with the argument OSC, it will create a tarball of the
last bits of all of your log files.  So if you generate some failures
and then run it, it should contain all the easily reachable evidence
of what is going on.  If you'd like, you could email me (not the whole
list) the tarball and I can try to take a look.  It would also be
helpful if you sent along an example of a chef-client run that is
failing with debug logging turned on.

Is there a limit on the number of concurrent chef clients that a single Chef Server 11 can serve?  The number of depsolver workers seems to be the bottleneck, unless more CPU cores are added.

For any system there is a limit of how many concurrent connections it
can handle based on the available resources (CPU, RAM, etc) and
various limits imposed by the operating system (max file handles, max
processes, etc).  Since EC has a number of components it is hard to
say what the hard limit is for a given machine.

Depsolving is limited by the number of depsolver workers.  But, in
most cases, the tasks the workers do complete quickly.  Further,
chef-client *should* retry 503s when depsolving, so the number of
concurrent /chef-client runs/ that can be handled typically far
exceeds simply the number of depsolver workers.  However, at some
point, you do just need to feed depsolver more CPUs.
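
The retry-on-503 idea described above can be sketched in a few lines of Ruby. This is not chef-client's actual implementation: `with_retries` and its parameters are hypothetical names, the fixed delay is an assumption, and the request is stubbed with a list of status codes so the example runs without a server.

```ruby
# Hypothetical sketch (not chef-client's code): re-run a request while it
# returns a status in the 500-599 range, up to max_retries extra attempts.
# The block must return an integer HTTP status code.
def with_retries(max_retries: 5, delay: 5, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  loop do
    attempts += 1
    status = yield
    # Anything outside 5xx is final; stop immediately.
    return status unless (500..599).cover?(status)
    # Give up once the initial attempt plus max_retries retries are used.
    return status if attempts > max_retries
    sleeper.call(delay) # fixed delay between attempts, for illustration
  end
end

# Stubbed usage: the "server" answers 503 twice, then 200.
responses = [503, 503, 200]
status = with_retries(delay: 0, sleeper: ->(_seconds) {}) { responses.shift }
puts status # => 200
```

Because each client waits and tries again, a handful of depsolver workers can serve many more concurrent runs than there are workers, as long as each solve finishes quickly.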

<jesse>  I tried 'no_lazy_load false' on my 300-node cluster, and it still threw 503 errors.  I see there is a commit from 7 days ago that makes no_lazy_load the default. As that commit describes, my chef-client runs might take a long time (1~2 hours), so I think I had better set no_lazy_load to true.

Since you are running your own server, however, you can set your
s3_url_ttl to anything you want, avoiding the problem of expired
links.  However, I think it is probably best to definitively identify
where your bottleneck is before tweaking more settings anyway.
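
For illustration, the two options mentioned in this exchange might look like the fragment below. It is a sketch under assumptions: the exact key erchef['s3_url_ttl'] and its unit (seconds) are assumed from the Chef Server settings naming, and 7200 is just an example sized for a 2-hour chef-client run.

```ruby
# /etc/chef-server/chef-server.rb (server side)
# Assumption: the TTL for cookbook download links is set under the erchef key,
# in seconds. 7200 is an illustrative value covering a 2-hour run.
erchef['s3_url_ttl'] = 7200

# /etc/chef/client.rb (client side)
# Download all cookbook files at the start of the run instead of lazily,
# so long runs do not hit expired links.
no_lazy_load true
```

Either change alone addresses expired links; neither addresses a depsolver or nginx bottleneck, which is why identifying the bottleneck first is the better order.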

Cheers,

Steven




--
Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104




