[chef] Re: Re: Re: Re: [chef-dev] Re: Performance Tuning for Open Source Chef Server 11 to serve hundreds of nodes


  • From: Jesse Hu < >
  • To:
  • Subject: [chef] Re: Re: Re: Re: [chef-dev] Re: Performance Tuning for Open Source Chef Server 11 to serve hundreds of nodes
  • Date: Thu, 31 Jul 2014 13:05:35 +0800

Thanks a lot Steven & Stephen!

I'm not sure I would go as far as to call it a recommendation, but I
typically start at a 1:1 ratio plus some headroom if I know the machine can
support that many postgresql connections.  Note that as you increase
the number of connections in the pool, erchef does take a bit longer
to start up.
<jesse> I will try a 1:1 ratio plus some headroom between erchef['db_pool_size'] and the number of concurrent chef-clients. It should consume more CPU and memory. 
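
A minimal sketch of what that could look like in /etc/chef-server/chef-server.rb for roughly 300 concurrent clients (the numbers are illustrative, not recommendations; postgresql['max_connections'] must stay above the pool size, and the OS has to allow that many postgres backends):

    # /etc/chef-server/chef-server.rb -- illustrative sizing for ~300 concurrent clients
    erchef['db_pool_size']        = 350   # ~1:1 with concurrent clients, plus headroom
    postgresql['max_connections'] = 400   # must exceed db_pool_size; leave room for other consumers

Running chef-server-ctl reconfigure is needed afterwards, and as noted above, erchef will take a bit longer to start with the larger pool.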

Is there a limit on the number of concurrent chef-clients served by a single Chef Server 11 ?  The number of depsolver workers seems to be the bottleneck, unless more CPU cores are added.

If I add more CPU cores, will increasing erchef['depsolver_worker_count'] and nginx['worker_processes'] solve the 503 issue? Are the default values of these 2 attributes both derived from the number of CPU cores ?
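
For reference, a sketch of those two knobs in /etc/chef-server/chef-server.rb, assuming the server were grown to 4 cores (values illustrative; as noted in the replies below, going past the physical core count mostly trades latency for throughput):

    # /etc/chef-server/chef-server.rb -- illustrative, for a 4-core server
    erchef['depsolver_worker_count'] = 4   # default is 5; keep at or below the core count
    nginx['worker_processes']        = 4   # typically sized to the number of cores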

By returning a 503, the client should retry said request with an exponential backoff of up to 60 seconds between retries. Are these 503s stopping chef-client runs?
<jesse>  Yes, the chef-client exits after getting a 503.  How can I tell the chef-client to retry syncing cookbooks or retry the Chef Node or Search API calls? Is there a setting in /etc/chef/client.rb?  If not, I may re-spawn chef-client when it gets a 503 error.
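
For what it's worth, chef-client does have generic HTTP retry knobs in client.rb; whether they cover a 503 from the cookbook_versions endpoint in this chef-client version is an assumption to verify, not a confirmed fix:

    # /etc/chef/client.rb -- generic retry knobs (verify they apply to 503s in your version)
    http_retry_count 10   # default is 5
    http_retry_delay 30   # seconds between attempts; default is 5

If they don't catch this case, re-spawning chef-client from an outer wrapper, as described above, remains the fallback.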

Some other observations and questions below:

    erchef['s3_parallel_ops_timeout'] = 30000

Did you see problems that led you to push this that high?

<jesse> With the default value on Chef Server 11.0.8, 'knife cookbook upload -a' would hang forever, so I increased it. I'm not sure whether this reproduces on the latest Chef Server 11.1.1.


    no_lazy_load true

You should consider keeping lazy_load on for bootstrapping unless you
really need it off; this will save a bit on the number of files you
need to sync concurrently.

<jesse>  I tried 'no_lazy_load false' in my 300-node cluster; it still threw 503 errors.  I see there is a commit from 7 days ago that makes no_lazy_load the default. As it describes, my chef-client might run for a long time (1~2 hours), so I think I'd better set no_lazy_load to true.


Thanks
Jesse Hu


Stephen Delano wrote On Tue, 29 Jul 2014 11:57:02 -0700 :
Like Steven mentioned, the 503s are actually coming from a lack of depsolver workers, and that can be bumped accordingly. A few more should help address this, but I typically suggest not increasing that number past the number of CPUs you have available. In this case, you're already 2.5x oversubscribed. By returning a 503, the client should retry said request with an exponential backoff of up to 60 seconds between retries. Are these 503s stopping chef-client runs?

I also agree with Steven on the no_lazy_load. This will actually increase load on the server at the start of a chef-client run instead of spreading it out across the run. If you've found that your node bootstraps are taking longer than the 1 hour expiry time of erchef['s3_url_ttl'], then you can actually just bump this value to compensate for that instead.
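
For example (value illustrative), if bootstraps run 1-2 hours as described above, the TTL could simply be raised past the expected run length:

    # /etc/chef-server/chef-server.rb -- raise the signed-URL expiry for long chef-client runs
    erchef['s3_url_ttl'] = 7200   # seconds; the configuration later in this thread uses 3600 (1 hour)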


On Tue, Jul 29, 2014 at 4:36 AM, Steven Danna < > wrote:
Hi,

> Is there a recommended ratio between erchef['db_pool_size'] and the number of concurrent chef clients ?

I'm not sure I would go as far as to call it a recommendation, but I
typically start at a 1:1 ratio plus some headroom if I know the machine can
support that many postgresql connections.  Note that as you increase
the number of connections in the pool, erchef does take a bit longer
to start up.

> Is the postgresql connection the only bottleneck to scale up Chef Server 11 ?

Based on this:

   2014-07-22 04:16:13.024 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}

My suspicion is that you might also be running out of depsolver
workers.  In order to take advantage of the gecode solver, erchef
starts a fixed number of workers that do nothing but solve dependency
problems (by default, we start 5). When all workers are busy, you'll
see a 503 on the API call in the error message above.  While the
number of workers is configurable, I'm skeptical about bumping the
number of workers as depsolving is largely a CPU bound activity and
you are already at 5 workers for your 2 CPUs.  However, if you are
willing to sacrifice latency for throughput, you might experiment with
bumping

  erchef['depsolver_worker_count']

Some other observations and questions below:

    erchef['s3_parallel_ops_timeout'] = 30000

Did you see problems that led you to push this that high?

    no_lazy_load true

You should consider keeping lazy_load on for bootstrapping unless you
really need it off; this will save a bit on the number of files you
need to sync concurrently.

Cheers,

Steven

On Tue, Jul 29, 2014 at 7:14 AM, Jesse Hu < > wrote:
> Thanks Adam & Ranjib.
>
> Is there a recommended ratio between erchef['db_pool_size'] and the number
> of concurrent chef clients ? Do we need to tune nginx and erchef itself
> ?  Is the postgresql connection the only bottleneck to scale up Chef Server
> 11 ?
>
> In my case, I have several chef node save calls and chef search API calls
> during the cookbook execution, so the Chef Server will throw a 503 or EOF
> exception when saving the node or calling chef search.  The chef node save
> calls are there to record the current bootstrap progress and the ongoing task,
> so we can tell the user what's being done on each node.  The chef search API
> is called when a node needs to find the dependent node that runs a specified
> service. The dependent node adds a tag to indicate it's running the
> service, and the other nodes search for nodes by that tag name.
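
(As an aside, a minimal recipe-level sketch of the tag-and-search pattern described above; 'my_service' is a hypothetical tag name, and each node.save and search call here is one of the API requests that lands on the server during a run:)

    # On the node that provides the service (hypothetical tag name 'my_service')
    tag('my_service')
    node.save                                   # persist the tag / bootstrap progress right away

    # On a node that depends on that service
    providers = search(:node, 'tags:my_service')
    Chef::Log.info("found #{providers.size} node(s) tagged my_service")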
>
> Thanks
> Jesse Hu | My Profile | Project Serengeti
>
> ________________________________
> Ranjib Dey wrote On Mon, 28 Jul 2014 21:56:28 -0700 :
>
> As Adam mentioned, the corresponding parameters are erchef['db_pool_size']
> & postgresql['max_connections']. You should also observe chef-server's cpu
> profile during the chef runs. We had similar issues, and we had to bump up
> both the connection limits as well as provision a bigger chef server
> instance.
> Even with the new configs & instance, we did stress testing to find out the
> max concurrent chef client runs we could do, and then added splay time between
> batches of chef runs. Usually it is only during the initial phase (cookbook sync,
> searches etc.) and the end phase (posting node data back) that the chef client needs
> to talk to the chef server; if you can design a way to distribute the
> communication over a time frame, it will give you a predictable, fleet-wide
> chef run, where the chef server load or fleet-wide chef run time will grow
> predictably as your cluster grows.
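
(Ranjib's batching is orchestrated outside Chef, but for steady-state daemonized runs the built-in interval/splay settings in client.rb give a similar spreading effect; values here are illustrative:)

    # /etc/chef/client.rb -- spread periodic runs so clients don't hit the server at once
    interval 1800   # run every 30 minutes
    splay    900    # plus a random per-node delay of up to 15 minutes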
>
>
>
> On Mon, Jul 28, 2014 at 9:23 PM, Adam Jacob < "> > wrote:
>>
>> You have 300 concurrent connections - try tuning pool size and max db
>> connections larger than that.
>>
>> On Jul 28, 2014 8:57 PM, "Jesse Hu" < > wrote:
>>>
>>> Hi Chefs,
>>>
>>> I'm using Chef to bootstrap a 300-node cluster, and a few chef-clients
>>> throw the error "503 Service Unavailable" when the 300 nodes start
>>> syncing cookbooks at the same time. The chef server log shows erchef
>>> crashed. Most of the time, syncing cookbooks on all the nodes again after
>>> the 1st failure will succeed.
>>>
>>> There is a ticket, CHEF-4423 (Chef Synchronization of Cookbook Slow), to
>>> speed up syncing cookbooks. I applied the new chef-client which contains the
>>> fix.  However, the fix doesn't work in my case.
>>>
>>> Are there any tips to tune Open Source Chef Server 11 to serve hundreds of
>>> nodes ?  Thanks a lot.
>>>
>>>
>>> Here is my chef configuration:
>>>
>>> ## Chef Server ##
>>>
>>> 2 cores + 8G Mem
>>>
>>> $ cat /etc/issue
>>> CentOS release 5.10 (Final)
>>>
>>> $ rpm -q chef-server
>>> chef-server-11.1.1-1.el5
>>>
>>> $ cat /etc/chef-server/chef-server.rb
>>> chef_server_webui['enable'] = false
>>> nginx['ssl_port'] = 9443
>>> nginx['non_ssl_port'] = 9080
>>> chef_solr['heap_size'] = 3072
>>> chef_solr['commit_interval'] = 3000
>>> chef_solr['poll_seconds'] = 6
>>> chef_expander['nodes'] = 4
>>> erchef['s3_parallel_ops_timeout'] = 30000
>>> erchef['s3_url_ttl'] = 3600
>>> erchef['db_pool_size'] = 150
>>> postgresql['max_connections'] = 200
>>> rabbitmq['node_port'] = 5672
>>>
>>>
>>> ## Chef Client ##
>>>
>>> $ cat /etc/issue
>>> CentOS release 6.4 (Final)
>>>
>>> $ rpm -q chef
>>> chef-11.14.0.rc.2-1.el6.x86_64
>>>
>>> $ cat /etc/chef/client.rb
>>> log_location     STDOUT
>>> chef_server_url  "https://<fqdn>:9443"
>>> validation_client_name "chef-validator"
>>> node_name "bigtop-master-0"
>>>
>>> no_lazy_load true
>>> ssl_verify_mode :verify_peer
>>> ssl_ca_path "/etc/chef/trusted_certs"
>>>
>>>
>>> ## chef-server-ctl tail ##
>>>
>>> full output of "chef-server-ctl tail" is attached.
>>>
>>> ==> /var/log/chef-server/erchef/crash.log.1 <==
>>> ** Last event was:
>>> {log,{lager_msg,[],[{application,chef_wm},{module,chef_wm_status},{function,log_failure},{line,77},{pid,"<0.23096.41>"},{node,' "> '}],error,{["2014",45,"07",45,"20"],["22",58,"40",58,"02",46,"080"]},{1405,896002,80894},[47,95,115,116,97,116,117,115,"\n","{{status,fail},{upstreams,{[{<<\"chef_solr\">>,<<\"fail\">>},{<<\"chef_sql\">>,<<\"pong\">>}]}}}"]}}
>>> ** When handler state ==
>>> {state,{mask,127},lager_default_formatter,["[",severity,"]
>>> ",message,"\n"],[]}
>>> ** Reason == {badarg,[{io,put_chars,[user,unicode,["[","error","]
>>> ",[47,95,115,116,97,116,117,115,"\n","{{status,fail},{upstreams,{[{<<\"chef_solr\">>,<<\"fail\">>},{<<\"chef_sql\">>,<<\"pong\">>}]}}}"],"\n"]],[]},{lager_console_backend,handle_event,2,[{file,"src/lager_console_backend.erl"},{line,95}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,504}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,486}]},{gen_event,handle_msg,5,[{file,"gen_event.erl"},{line,248}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
>>> 2014-07-20 22:40:02 =ERROR REPORT====
>>> {<<"method=GET; path=/_status; status=500; ">>,"Internal Server
>>> Error"}2014-07-20 22:40:02 =ERROR REPORT====
>>> ** gen_event handler lager_console_backend crashed.
>>> ** Was installed in lager_event
>>> ** Last event was:
>>> {log,{lager_msg,[],[{pid,<0.101.0>}],error,{["2014",45,"07",45,"20"],["22",58,"40",58,"02",46,"085"]},{1405,896002,85731},[103,101,110,95,101,118,101,110,116,32,"lager_console_backend",32,105,110,115,116,97,108,108,101,100,32,105,110,32,"lager_event",32,116,101,114,109,105,110,97,116,101,100,32,119,105,116,104,32,114,101,97,115,111,110,58,32,"bad
>>> argument in call to io:put_chars(user, unicode, [\"[\",\"error\",\"]
>>> \",[47,95,115,116,97,116,117,115,\"\\n\",\"{{status,fail},{upstreams,{[{<<\\\"chef_solr\\\">...\"],...])
>>> in lager_console_backend:handle_event/2 line 95"]}}
>>> ** When handler state ==
>>> {state,{mask,127},lager_default_formatter,["[",severity,"]
>>> ",message,"\n"],[]}
>>> ** Reason == {badarg,[{io,put_chars,[user,unicode,["[","error","]
>>> ",[103,101,110,95,101,118,101,110,116,32,"lager_console_backend",32,105,110,115,116,97,108,108,101,100,32,105,110,32,"lager_event",32,116,101,114,109,105,110,97,116,101,100,32,119,105,116,104,32,114,101,97,115,111,110,58,32,"bad
>>> argument in call to io:put_chars(user, unicode, [\"[\",\"error\",\"]
>>> \",[47,95,115,116,97,116,117,115,\"\\n\",\"{{status,fail},{upstreams,{[{<<\\\"chef_solr\\\">...\"],...])
>>> in lager_console_backend:handle_event/2 line
>>> 95"],"\n"]],[]},{lager_console_backend,handle_event,2,[{file,"src/lager_console_backend.erl"},{line,95}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,504}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,486}]},{gen_event,handle_msg,5,[{file,"gen_event.erl"},{line,248}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
>>>
>>>
>>> ==> /var/log/chef-server/erchef/erchef.log <==
>>> 2014-07-22 04:15:48.514 [error] Supervisor stats_hero_worker_sup had
>>> child stats_hero started with {stats_hero,start_link,undefined} at
>>> <0.30506.3> exit with reason
>>> {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"358","\n",[[["chefAPI",".application.byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".","10",".byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".application.allRequests"],58,"1866",124,"h","\n"],[["chefAPI",".","10",".allRequests"],58,"1866",124,"h","\n"],[["chefAPI",".application.byRequestType.","roles",".","GET"],58,"1866",124,"h","\n"],[["chefAPI",".upstreamRequests.","rdbms"],58,"1834",124,"h","\n"],[["chefAPI",".upstreamReques...",...],...],...]]}]}}
>>> in context child_terminated
>>> 2014-07-22 04:15:48.600 [error] Supervisor stats_hero_worker_sup had
>>> child stats_hero started with {stats_hero,start_link,undefined} at
>>> <0.30501.3> exit with reason
>>> {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"356","\n",[[["chefAPI",".application.byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".","10",".byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".application.allRequests"],58,"2000",124,"h","\n"],[["chefAPI",".","10",".allRequests"],58,"2000",124,"h","\n"],[["chefAPI",".application.byRequestType.","data",".","GET"],58,"2000",124,"h","\n"],[["chefAPI",".upstreamRequests.","rdbms"],58,"1987",124,"h","\n"],[["chefAPI",".upstreamRequest...",...],...],...]]}]}}
>>> in context child_terminated
>>> 2014-07-22 04:15:48.613 [error] Supervisor stats_hero_worker_sup had
>>> child stats_hero started with {stats_hero,start_link,undefined} at
>>> <0.30527.3> exit with reason
>>> {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"110","\n",[[["chefAPI",".application.allRequests"],58,"1",124,"m","\n"],[["chefAPI",".","10",".allRequests"],58,"1",124,"m","\n"],[["chefAPI",".application.byRequestType.","data",".","GET"],58,"1",124,"m","\n"]]]}]}}
>>> in context child_terminated
>>> 2014-07-22 04:16:12.850 [error] {<<"method=POST;
>>> path=/environments/_default/cookbook_versions; status=503; ">>,"Service
>>> Unavailable"}
>>> 2014-07-22 04:16:13.010 [error] {<<"method=POST;
>>> path=/environments/_default/cookbook_versions; status=503; ">>,"Service
>>> Unavailable"}
>>> 2014-07-22 04:16:13.024 [error] {<<"method=POST;
>>> path=/environments/_default/cookbook_versions; status=503; ">>,"Service
>>> Unavailable"}
>>> 2014-07-22 04:17:05.852 [error] {<<"method=POST;
>>> path=/environments/_default/cookbook_versions; status=503; ">>,"Service
>>> Unavailable"}
>>> 2014-07-22 04:17:13.356 [error] {<<"method=POST;
>>> path=/environments/_default/cookbook_versions; status=503; ">>,"Service
>>> Unavailable"}
>>> 2014-07-22 04:17:13.363 [error] {<<"method=POST;
>>> path=/environments/_default/cookbook_versions; status=503; ">>,"Service
>>> Unavailable"}
>>> 2014-07-22 04:17:13.433 [error] {<<"method=POST;
>>> path=/environments/_default/cookbook_versions; status=503; ">>,"Service
>>> Unavailable"}
>>>
>>>
>>>
>>> Thanks
>>> Jesse Hu
>>>
>
>



--
Stephen Delano
Software Development Engineer
Opscode, Inc.
1008 Western Avenue
Suite 601
Seattle, WA 98104



