- From: Steven Danna <>
- To:
- Subject: [chef] Re: Re: [chef-dev] Re: Performance Tuning for Open Source Chef Server 11 to serve hundreds of nodes
- Date: Tue, 29 Jul 2014 12:36:04 +0100
Hi,

> Is there a recommended ratio between erchef['db_pool_size'] and the
> number of concurrent chef clients?
I'm not sure I would go as far as to call it a recommendation, but I
typically start at a 1:1 ratio plus some headroom, if I know the
machine can support that many postgresql connections. Note that as you
increase the number of connections in the pool, erchef takes a bit
longer to start up.
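
As a rough sketch (the numbers are hypothetical and should be sized to
your own client count and hardware), that sizing would look like this
in /etc/chef-server/chef-server.rb:

    # Hypothetical sizing for ~300 concurrent clients: roughly 1:1
    # pool-to-client plus headroom, with postgresql max_connections
    # kept above the erchef pool so other consumers still fit.
    erchef['db_pool_size']        = 350
    postgresql['max_connections'] = 400

Run "chef-server-ctl reconfigure" after changing these for the new
values to take effect.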
> Is the postgresql connection the only bottleneck to scale up Chef
> Server 11?
Based on this:

    2014-07-22 04:16:13.024 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
My suspicion is that you might also be running out of depsolver
workers. In order to take advantage of the gecode solver, erchef
starts a fixed number of workers that do nothing but solve dependency
problems (by default, we start 5). When all workers are busy, you'll
see a 503 on the API call in the error message above. While the
number of workers is configurable, I'm skeptical about raising it,
as depsolving is largely a CPU-bound activity and you are already at
5 workers for your 2 CPUs. However, if you are willing to sacrifice
latency for throughput, you might experiment with bumping
erchef['depsolver_worker_count'].
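
For example (8 is an arbitrary illustration, not a recommendation):

    # /etc/chef-server/chef-server.rb -- more depsolver workers than
    # CPUs trades per-request latency for fewer 503s under bursts.
    erchef['depsolver_worker_count'] = 8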
Some other observations and questions below:
> erchef['s3_parallel_ops_timeout'] = 30000

Did you see problems that led you to push this that high?
> no_lazy_load true

You should consider keeping lazy_load on for bootstrapping unless you
really need it off; this will save a bit on the number of files you
need to sync concurrently.
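
Concretely, that means leaving the client at its default (shown
explicitly here only for illustration):

    # /etc/chef/client.rb -- with lazy loading (the default), cookbook
    # files are fetched when a resource first needs them rather than
    # all at once at the start of the run.
    no_lazy_load false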
Cheers,
Steven
On Tue, Jul 29, 2014 at 7:14 AM, Jesse Hu <> wrote:
> Thanks Adam & Ranjib.
>
> Is there a recommended ratio between erchef['db_pool_size'] and the
> number of concurrent chef clients? Do we need to tune nginx and erchef
> themselves? Is the postgresql connection the only bottleneck to scale
> up Chef Server 11?
>
> In my case, I have several chef node save calls and chef search API
> calls during cookbook execution, so the Chef Server throws 503 or EOF
> exceptions when saving the node or calling chef search. The node save
> calls record the current bootstrap progress and the task in flight, so
> we can tell the user what is being done on each node. The chef search
> API is called when a node needs to find the dependent node running a
> specified service. The dependent node adds a tag to indicate it is
> running the service, and the other nodes search for nodes by that tag
> name.
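
For reference, the tag-and-search pattern described above looks roughly
like this in recipe code; the tag name "zookeeper" is a hypothetical
stand-in:

    # On the node providing the service:
    tag('zookeeper')   # mark this node as running the service
    node.save          # one of the node-save calls that hits the server

    # On a dependent node:
    providers = search(:node, 'tags:zookeeper')
    providers.each { |n| Chef::Log.info("provider: #{n['fqdn']}") }

Each node.save and search() is a round trip to the server, which is why
these calls add up under concurrency.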
>
> Thanks
> Jesse Hu | My Profile | Project Serengeti
>
> ________________________________
> Ranjib Dey wrote on Mon, 28 Jul 2014 21:56:28 -0700:
>
> As Adam mentioned, the corresponding parameters are
> erchef['db_pool_size'] & postgresql['max_connections']. You should
> also observe the chef server's CPU profile during the chef runs. We
> had similar issues, and we had to bump up both the connection limits
> and provision a bigger chef server instance.
>
> Even with the new configs & instance, we did stress testing to find
> the max concurrent chef client runs we could do, and then added splay
> time between batches of chef runs. It is usually only during the
> initial phase (cookbook sync, searches, etc.) and the end phase
> (posting node data back) that the chef client needs to talk to the
> chef server. If you can design a way to distribute that communication
> over a time frame, it will give you a predictable, fleet-wide chef
> run, where the chef server load and fleet-wide chef run time grow
> predictably as your cluster grows.
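
The splay Ranjib describes is built into chef-client when it runs as a
daemon; a minimal sketch with hypothetical values:

    # /etc/chef/client.rb -- splay adds a random 0..N second delay to
    # each interval so the fleet does not hit the server in lockstep.
    interval 1800   # run every 30 minutes
    splay    300    # plus up to 5 minutes of random jitter

For one-shot bootstrap runs, the same effect can be had by staggering
when each batch of nodes starts its run.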
>
> On Mon, Jul 28, 2014 at 9:23 PM, Adam Jacob <> wrote:
> >
> > You have 300 concurrent connections; try tuning the pool size and
> > max db connections larger than that.
> >
> > On Jul 28, 2014 8:57 PM, "Jesse Hu" <> wrote:
> >>
> >> Hi Chefs,
> >>
> >> I'm using Chef to bootstrap a 300-node cluster, and a few
> >> chef-clients throw the error "503 Service Unavailable" when the 300
> >> nodes start syncing cookbooks at the same time. The chef server log
> >> shows that erchef crashed. Most of the time, syncing cookbooks on
> >> all the nodes again after the first failure will succeed.
> >>
> >> There is a ticket, CHEF-4423 (cookbook synchronization is slow), to
> >> speed up syncing cookbooks. I applied the new chef-client which
> >> contains the fix. However, the fix doesn't work in my case.
> >>
> >> Are there any tips for tuning Open Source Chef Server 11 to serve
> >> hundreds of nodes? Thanks a lot.
> >>
> >> Here is my chef configuration:
> >>
> >> ## Chef Server ##
> >>
> >> 2 cores + 8G Mem
> >>
> >> $ cat /etc/issue
> >> CentOS release 5.10 (Final)
> >>
> >> $ rpm -q chef-server
> >> chef-server-11.1.1-1.el5
> >>
> >> $ cat /etc/chef-server/chef-server.rb
> >> chef_server_webui['enable'] = false
> >> nginx['ssl_port'] = 9443
> >> nginx['non_ssl_port'] = 9080
> >> chef_solr['heap_size'] = 3072
> >> chef_solr['commit_interval'] = 3000
> >> chef_solr['poll_seconds'] = 6
> >> chef_expander['nodes'] = 4
> >> erchef['s3_parallel_ops_timeout'] = 30000
> >> erchef['s3_url_ttl'] = 3600
> >> erchef['db_pool_size'] = 150
> >> postgresql['max_connections'] = 200
> >> rabbitmq['node_port'] = 5672
> >>
> >> ## Chef Client ##
> >>
> >> $ cat /etc/issue
> >> CentOS release 6.4 (Final)
> >>
> >> $ rpm -q chef
> >> chef-11.14.0.rc.2-1.el6.x86_64
> >>
> >> $ cat /etc/chef/client.rb
> >> log_location STDOUT
> >> chef_server_url "https://<fqdn>:9443"
> >> validation_client_name "chef-validator"
> >> node_name "bigtop-master-0"
> >>
> >> no_lazy_load true
> >> ssl_verify_mode :verify_peer
> >> ssl_ca_path "/etc/chef/trusted_certs"
> >>
> >> ## chef-server-ctl tail ##
> >>
> >> The full output of "chef-server-ctl tail" is attached.
> >>
> >> ==> /var/log/chef-server/erchef/crash.log.1 <==
> >> ** Last event was:
> >> {log,{lager_msg,[],[{application,chef_wm},{module,chef_wm_status},{function,log_failure},{line,77},{pid,"<0.23096.41>"},{node,''}],error,{["2014",45,"07",45,"20"],["22",58,"40",58,"02",46,"080"]},{1405,896002,80894},[47,95,115,116,97,116,117,115,"\n","{{status,fail},{upstreams,{[{<<\"chef_solr\">>,<<\"fail\">>},{<<\"chef_sql\">>,<<\"pong\">>}]}}}"]}}
> >> ** When handler state ==
> >> {state,{mask,127},lager_default_formatter,["[",severity,"] ",message,"\n"],[]}
> >> ** Reason == {badarg,[{io,put_chars,[user,unicode,["[","error","] ",[47,95,115,116,97,116,117,115,"\n","{{status,fail},{upstreams,{[{<<\"chef_solr\">>,<<\"fail\">>},{<<\"chef_sql\">>,<<\"pong\">>}]}}}"],"\n"]],[]},{lager_console_backend,handle_event,2,[{file,"src/lager_console_backend.erl"},{line,95}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,504}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,486}]},{gen_event,handle_msg,5,[{file,"gen_event.erl"},{line,248}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> >> 2014-07-20 22:40:02 =ERROR REPORT====
> >> {<<"method=GET; path=/_status; status=500; ">>,"Internal Server Error"}
> >> 2014-07-20 22:40:02 =ERROR REPORT====
> >> ** gen_event handler lager_console_backend crashed.
> >> ** Was installed in lager_event
> >> ** Last event was:
> >> {log,{lager_msg,[],[{pid,<0.101.0>}],error,{["2014",45,"07",45,"20"],["22",58,"40",58,"02",46,"085"]},{1405,896002,85731},[103,101,110,95,101,118,101,110,116,32,"lager_console_backend",32,105,110,115,116,97,108,108,101,100,32,105,110,32,"lager_event",32,116,101,114,109,105,110,97,116,101,100,32,119,105,116,104,32,114,101,97,115,111,110,58,32,"bad argument in call to io:put_chars(user, unicode, [\"[\",\"error\",\"] \",[47,95,115,116,97,116,117,115,\"\\n\",\"{{status,fail},{upstreams,{[{<<\\\"chef_solr\\\">...\"],...]) in lager_console_backend:handle_event/2 line 95"]}}
> >> ** When handler state ==
> >> {state,{mask,127},lager_default_formatter,["[",severity,"] ",message,"\n"],[]}
> >> ** Reason == {badarg,[{io,put_chars,[user,unicode,["[","error","] ",[103,101,110,95,101,118,101,110,116,32,"lager_console_backend",32,105,110,115,116,97,108,108,101,100,32,105,110,32,"lager_event",32,116,101,114,109,105,110,97,116,101,100,32,119,105,116,104,32,114,101,97,115,111,110,58,32,"bad argument in call to io:put_chars(user, unicode, [\"[\",\"error\",\"] \",[47,95,115,116,97,116,117,115,\"\\n\",\"{{status,fail},{upstreams,{[{<<\\\"chef_solr\\\">...\"],...]) in lager_console_backend:handle_event/2 line 95"],"\n"]],[]},{lager_console_backend,handle_event,2,[{file,"src/lager_console_backend.erl"},{line,95}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,504}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,486}]},{gen_event,handle_msg,5,[{file,"gen_event.erl"},{line,248}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
> >>
> >> ==> /var/log/chef-server/erchef/erchef.log <==
> >> 2014-07-22 04:15:48.514 [error] Supervisor stats_hero_worker_sup had child stats_hero started with {stats_hero,start_link,undefined} at <0.30506.3> exit with reason {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"358","\n",[[["chefAPI",".application.byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".","10",".byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".application.allRequests"],58,"1866",124,"h","\n"],[["chefAPI",".","10",".allRequests"],58,"1866",124,"h","\n"],[["chefAPI",".application.byRequestType.","roles",".","GET"],58,"1866",124,"h","\n"],[["chefAPI",".upstreamRequests.","rdbms"],58,"1834",124,"h","\n"],[["chefAPI",".upstreamReques...",...],...],...]]}]}} in context child_terminated
> >> 2014-07-22 04:15:48.600 [error] Supervisor stats_hero_worker_sup had child stats_hero started with {stats_hero,start_link,undefined} at <0.30501.3> exit with reason {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"356","\n",[[["chefAPI",".application.byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".","10",".byStatusCode.","200"],58,"1",124,"m","\n"],[["chefAPI",".application.allRequests"],58,"2000",124,"h","\n"],[["chefAPI",".","10",".allRequests"],58,"2000",124,"h","\n"],[["chefAPI",".application.byRequestType.","data",".","GET"],58,"2000",124,"h","\n"],[["chefAPI",".upstreamRequests.","rdbms"],58,"1987",124,"h","\n"],[["chefAPI",".upstreamRequest...",...],...],...]]}]}} in context child_terminated
> >> 2014-07-22 04:15:48.613 [error] Supervisor stats_hero_worker_sup had child stats_hero started with {stats_hero,start_link,undefined} at <0.30527.3> exit with reason {timeout,{gen_server,call,[<0.925.0>,{send,[49,124,"110","\n",[[["chefAPI",".application.allRequests"],58,"1",124,"m","\n"],[["chefAPI",".","10",".allRequests"],58,"1",124,"m","\n"],[["chefAPI",".application.byRequestType.","data",".","GET"],58,"1",124,"m","\n"]]]}]}} in context child_terminated
> >> 2014-07-22 04:16:12.850 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
> >> 2014-07-22 04:16:13.010 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
> >> 2014-07-22 04:16:13.024 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
> >> 2014-07-22 04:17:05.852 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
> >> 2014-07-22 04:17:13.356 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
> >> 2014-07-22 04:17:13.363 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
> >> 2014-07-22 04:17:13.433 [error] {<<"method=POST; path=/environments/_default/cookbook_versions; status=503; ">>,"Service Unavailable"}
> >>
> >> Thanks
> >> Jesse Hu