- From: aaron peterson
- To:
- Subject: [chef] Re: Double converge, blocking on search, event-driven chef-client?
- Date: Thu, 2 Oct 2014 08:11:57 -0700
On Wed, Oct 1, 2014 at 12:48 PM, Justin Dossey wrote:
> With a brand-new network deployment, however, it is likely that nrpe will
> converge on one node before the nagios server's entire run list converges on
> another node. Therefore, the nrpe server recipe will have no results from
> its search until the nagios server node converges successfully once. On the
> first converge to take place after the nagios server's node data has been
> stored in solr, the nrpe server will get data to write to the nrpe
> configuration file.
>
> This is what I mean by "double converge"-- it takes at least two converges
> to complete the nrpe server installation and configuration because the
> necessary data is not available in the first converge.

You're focusing on the window right at the initial provisioning time
for the whole network of machines, asking for a bunch of servers at
once including a brand new nagios server. The general case is to add
servers over time, and that is going to naturally take 1 run on a new
server and 1 run afterwards on the nagios server, because the new
servers *didn't exist* for all the prior chef runs, and can't be said
to be ready for service or monitoring until after their first chef
run.

> One way to reduce the number of converges is to poll search until a result
> comes back. Something like this in the nrpe-server recipe code:
>
> results = []
> loop do
>   results = search(...)
>   break unless results.empty?
>   sleep(10)
> end
>
> would cause the recipe to poll the Chef server every ten seconds until a
> response came back, or until the client token expired (usually around 15
> minutes).
>
> Such behavior is not ideal (particularly because of the client token expiry
> issue) but besides supporting a second converge, it's the only way I have
> seen that will accomplish the desired result.

Every time chef runs on the nagios server is also a "poll", without
any downsides. Any reason that won't work here?

A server in a tight polling loop is not being actively managed by
chef, and this logic doesn't help you with stragglers. You could wait
for N servers, but you may not be sure N is the right number, and now
a single straggler or bad provision causes the nagios server to get
stuck, fail the run, and repeat, never getting past the poll.
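
If you do poll, it probably wants a deadline so that a straggler cannot pin
the run forever. A minimal sketch, assuming plain Ruby inside a recipe or
library file: poll_until_found, its keyword arguments, and the 'role:nagios'
query in the usage comment are hypothetical names for illustration, not part
of Chef.

```ruby
# Hypothetical helper (not a Chef API): call the given block until it returns
# a non-empty result, or give up once the next sleep would pass the deadline.
# The sleeper and clock are injectable so the logic can be exercised without
# actually waiting.
def poll_until_found(timeout:, interval:,
                     sleeper: ->(seconds) { sleep(seconds) },
                     clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  deadline = clock.call + timeout
  loop do
    results = yield
    return results unless results.empty?
    # Fall through with an empty result instead of blocking the whole run.
    return [] if clock.call + interval > deadline
    sleeper.call(interval)
  end
end

# Possible use in a recipe (the query string is an assumption):
#   nagios = poll_until_found(timeout: 120, interval: 10) do
#     search(:node, 'role:nagios')
#   end
```

On timeout the caller gets an empty array and can carry on converging
everything else, which leaves the next scheduled run to pick up the data.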

> One slightly wild idea kicking around would be to have some ability for the
> client to register for an event, and associate a set of resources with the
> event triggering. In our hypothetical, that would allow the nrpe server's
> chef client to converge everything it could (perhaps, a set of other recipes
> in the run list) and be idle until the nagios server's chef client completes
> converging its run list.

Which is what chef push (https://docs.getchef.com/push_jobs.html) can
do, and this could have the effect of shortening the lag time for a
new server to be monitored, at the cost of additional complexity.

It's a hard problem to know when a distributed system reaches a
particular global state - e.g. "all servers that I want to monitor
have run chef once".

Any way you slice it, this is a lot of work to optimize for the very
specific case of the first nagios server chef run. Do you have a goal
that makes this important or necessary? Why not just let the
infrastructure converge?
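
"Letting it converge" can be as simple as having the nrpe recipe write a
valid config from whatever the search returned on this run, empty or not. A
rough sketch, assuming the value feeds nrpe's comma-separated allowed_hosts
directive; the allowed_hosts helper, its fallback argument, and the sample
address are made up for illustration.

```ruby
# Hypothetical helper: build an allowed_hosts value from the nagios server
# addresses found by search on this run. With no results yet, the value is
# just the fallback, so the first converge still produces a usable config and
# a later converge fills in the real server.
def allowed_hosts(nagios_ips, fallback: "127.0.0.1")
  hosts = [fallback] + Array(nagios_ips).compact
  hosts.uniq.join(",")
end

puts allowed_hosts([])            # first converge: nothing found yet
puts allowed_hosts(["10.0.0.5"])  # later converge: search returned a server
```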
-Aaron Peterson