[chef] Re: Re: Double converge, blocking on search, event-driven chef-client?


  • From: Justin Dossey
  • To:
  • Subject: [chef] Re: Re: Double converge, blocking on search, event-driven chef-client?
  • Date: Thu, 2 Oct 2014 09:52:34 -0700

Thank you for focusing on the generic case rather than my hypothetical example.  This situation-- where one node needs to locate another via search in order to write its configuration correctly-- arises in many cases, not just with Nagios and NRPE. 

To review the responses (in order to create something useful for future readers):
  1. (paraphrased) "Just wait for the second converge to complete for the node to be functional." This is what we do already-- a pattern like
    matches = search(...)
    then in resources, use only_if { ! matches.empty? }
    does a decent job here.  It's a bit more complex with wrapper cookbooks, but it's doable.
  2. (paraphrased) A blocking poll is not ideal because the number of matches necessary to complete convergence may vary as the infrastructure grows, leading to said poll requiring additional parameterization over time.  This is a source of unnecessary complexity.  Note that in my hypothetical, I was describing the NRPE server polling search for the Nagios server, not the Nagios server polling for nodes to monitor.
  3. (paraphrased) It is inappropriate to expect to be able to converge an entire infrastructure in a single pass, so we should deploy nodes in a defined sequence in order to minimize the number of client runs necessary to configure the infrastructure.  I disagree with this sentiment, as delaying service availability by a multiple of the client run interval adds to the complexity of the environment and increases the time to resolution for legitimate problems.
  4. (paraphrased) Use push jobs to notify nodes when the infrastructure is ready for them to converge.  The very first sentence on the Chef Push Jobs page is "Chef push jobs is an extension of the Chef server that allows jobs to be run against nodes independently of a chef-client run." We are talking about triggering individual resources within a cookbook within a chef-client run, so push jobs don't really address this issue at all.  Also, it would appear that push jobs were meant to be triggered by administrators and not by the chef-client.  While I'm sure it would be possible to set things up in such a way that a successful initial converge could trigger a push job, such an implementation diverges considerably from the role of push jobs as designed.
  5. Chef 12 doesn't have the 15-minute client token expiry in the same way Chef 11 does (by default).
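
For future readers, the guard pattern in item 1 might look roughly like the sketch below. The role name `nagios-server`, the helper name `allowed_hosts`, and the file paths are assumptions made for illustration; none of them come from this thread. The helper is plain Ruby so it can be tried outside a chef-client run, and the recipe usage is shown in comments because `search` and `template` only exist inside a run.

```ruby
# Hypothetical helper: turn search results into a sorted list of unique IPs.
# Each match only needs to respond to ['ipaddress'], as Chef node objects do.
def allowed_hosts(matches)
  matches.map { |n| n['ipaddress'] }.compact.uniq.sort
end

# Inside a recipe (only meaningful during a chef-client run):
#
#   matches = search(:node, 'role:nagios-server')   # assumed role name
#
#   template '/etc/nagios/nrpe.cfg' do               # assumed path
#     source 'nrpe.cfg.erb'
#     variables(allowed_hosts: allowed_hosts(matches))
#     # Skipped while the search returns nothing, so the file is
#     # written on a later converge once the server is indexed.
#     only_if { !matches.empty? }
#   end
```

The guard keeps the first converge from writing a config with an empty host list; the cost is that the node is not functional until a later run finds a match.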

Did I miss anything?  While I'm not surprised by the information here, it confirms that there is a functional hole here-- there are many situations in a converged infrastructure that require multiple client runs before the infrastructure is fully functional, and this means it takes longer than absolutely necessary to bring up a new infrastructure from scratch.

I am aware that implementing callbacks, or resource hooks on remote nodes, is a huge job, but there are other ways to approach the situation.  Another solution that comes to mind (besides the search-subscriber model): make it easy for a recipe to advise the chef-client to reduce its run interval based on conditions detected in the chef-client run.  I like this one because it is useful beyond the specific situation-- it gives the chef-client a primitive way to learn about the infrastructure it manages and respond appropriately.

At its core, I believe that requiring configuration to come via application-specific config files is the real issue here, but I don't expect to see network-wide integrated config resource support (such as in CoreOS's etcd) in most *nix tools in the near term, nor do I expect to see direct support for communication with Chef servers in our tools (so, for our trusty hypothetical example, nrpe could be instructed simply to perform a Chef search directly to determine the allowed nrpe-client IPs rather than have the chef-client write the nrpe.conf as part of convergence, then notify the nrpe server of the change).

TL;DR: Now I'm thinking that the best thing would be for the recipe to be able to advise the client to reduce the interval before the next run.  This would enable code like

result = search(...)
if result.empty?
  Chef::Client.advise_interval(300) # request a chef-client run 5 minutes (+/- splay) after this converge
end
# define resources below, using not_if { result.empty? } where appropriate
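
`Chef::Client.advise_interval` is a proposal and does not exist in Chef. As a rough model of the semantics it could have, the plain-Ruby sketch below assumes that the shortest request wins, that advice can never lengthen the configured interval, and that advice is cleared after each run. The class name `IntervalAdvisor` and its methods are hypothetical, and splay is left out for brevity.

```ruby
# Hypothetical model of the proposal; not part of Chef.
class IntervalAdvisor
  def initialize(default_interval)
    @default_interval = default_interval
    @advised = nil
  end

  # A recipe may ask for a shorter interval; the shortest request wins.
  def advise_interval(seconds)
    raise ArgumentError, 'interval must be positive' unless seconds.positive?

    @advised = [@advised, seconds].compact.min
  end

  # Interval to wait before the next run; advice never lengthens the default.
  def next_interval
    [@advised, @default_interval].compact.min
  end

  # Advice applies to a single run and is cleared afterwards.
  def reset
    @advised = nil
  end
end

advisor = IntervalAdvisor.new(1800) # assumed 30-minute default interval
result = []                         # stand-in for an empty search(...) result
advisor.advise_interval(300) if result.empty?
puts advisor.next_interval
```

Taking the minimum means several recipes can give advice in the same run without coordinating with each other.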



On Thu, Oct 2, 2014 at 8:53 AM, Lamont Granquist wrote:
On 10/1/14, 12:48 PM, Justin Dossey wrote:

would cause the recipe to poll the Chef server every ten seconds until a response came back, or until the client token expired (usually around 15 minutes).

Such behavior is not ideal (particularly because of the client token expiry issue) but besides supporting a second converge, it's the only way I have seen that will accomplish the desired result.

You can set `no_lazy_load true` in client.rb to avoid cookbook_file resources failing after 15m.  This will be the default in Chef 12 client.  The Chef 12 server also uses solr 4 and is going to populate search results in seconds rather than minutes.
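
The polling approach described in the quoted message could be sketched as a bounded poll along these lines. The helper name `poll_until_found`, its parameters, and the role name in the usage comment are assumptions for illustration. The default timeout of 600 seconds is chosen only to stay under the 15-minute token expiry mentioned above.

```ruby
# Hypothetical helper: call the block every `interval` seconds until it
# returns a non-empty result or another wait would exceed `timeout` seconds.
# Returns the last result, which is empty if the poll gave up.
def poll_until_found(timeout: 600, interval: 10, sleeper: ->(s) { sleep(s) })
  waited = 0
  loop do
    result = yield
    return result unless result.empty?
    return result if waited + interval > timeout

    sleeper.call(interval)
    waited += interval
  end
end

# Inside a recipe (only meaningful during a chef-client run):
#
#   matches = poll_until_found { search(:node, 'role:nagios-server') }
```

A poll like this blocks the whole run while it waits, which is part of why a guarded second converge is usually the simpler choice.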



--
Justin Dossey
Practice Owner
New Context Services, Inc


