[chef] Re: Re: Re: Re: Re: Double converge, blocking on search, event-driven chef-client?


  • From: Tensibai Zhaoying
  • Date: Thu, 02 Oct 2014 22:46:54 +0200

My 2 cents here: I use search for MySQL clusters, and I configure replication only if my search returns a peer server. If not, the recipe assumes the server is meant to be standalone.
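
As an illustration of that decision, here is a minimal sketch in plain Ruby. The helper name `replication_plan` and the use of node names are assumptions for the example; in a real recipe the list would come from a Chef `search` call whose query is not given in this message.

```ruby
# Hypothetical helper (not from the original message): decide whether this
# server should replicate, given the node names a search returned.
# In a recipe, `found` might be built from something like
#   search(:node, "role:mysql").map(&:name)   # query is an assumption
def replication_plan(self_name, found)
  # A node may find itself in the results, so it is not counted as a peer.
  peers = found.reject { |name| name == self_name }
  if peers.empty?
    { mode: :standalone, peer: nil }
  else
    # Sort so that every run picks the same peer for the same result set.
    { mode: :replicated, peer: peers.sort.first }
  end
end
```

With an empty or self-only result the server stays standalone on this run and picks up replication on a later run once a peer is indexed.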

In any case, the server registers itself with the load balancer and with Nagios through a web service (the full stack involves GLPI, Centreon and Nagios), so it is monitored as early as possible within the run. Any delay is due to Nagios itself and not to an indexing service.

On the Nagios box (and on the load balancer) there's a recipe using search to catch any leftover service/host for which the registration did not work.
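
That catch-up recipe boils down to a set difference between what search reports and what is already registered. A minimal sketch, assuming both sides can be reduced to lists of host names (the helper name `unregistered_hosts` is hypothetical):

```ruby
# Hypothetical helper: hosts that Chef search knows about but that the
# monitoring system (or load balancer) has not registered yet.
def unregistered_hosts(searched, registered)
  (searched - registered).uniq.sort
end
```

The recipe on the Nagios box would then register only the returned hosts, so a run where self-registration worked everywhere changes nothing.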

It involves a lot of moving parts but for now it works ;)



---- Justin Dossey wrote ----

Booker, you're exactly right. 

I was trying to use Chef as a service discovery tool. It looks like Consul (which uses Serf internally) is a better fit, especially for ephemeral services.

Thanks for the tip.

On Thu, Oct 2, 2014 at 11:05 AM, Booker Bense wrote:
To me, it sounds like you're using the wrong hammer. Chef is a nice tool, but it's simply the wrong tool in this case. 
The underlying assumption in Chef and every CM system I've ever looked at is that anything that gets "registered" 
in the DB is going to be around long enough to be "stable" in some sense. 

I'm not sure if this will make sense or not, but the generalization of your example is where you have a client that needs
to find a dynamic pool of servers. I'm not aware of any CM that solves this problem well. Dynamic pool of 
clients to relatively static pool of servers is the model they were designed for. 

Something like serf might be much more appropriate to the kind of dynamic configuration you're talking about. 


- Booker C. Bense 

On Thu, Oct 2, 2014 at 9:52 AM, Justin Dossey wrote:
Thank you for focusing on the generic case rather than my hypothetical example.  This situation-- where one node needs to locate another via search in order to write its configuration correctly-- arises in many cases, not just with Nagios and NRPE. 

To review the responses (in order to create something useful for future readers)
  1. (paraphrased) Just wait for the second converge to complete for the node to be functional. This is what we do already-- a pattern like
    matches = search(...)
    then in resources, use only_if { ! matches.empty? }
    does a decent job here.  It's a bit more complex with wrapper cookbooks, but it's doable.
  2. (paraphrased) A blocking poll is not ideal because the number of matches necessary to complete convergence may vary as the infrastructure grows, leading to said poll requiring additional parameterization over time.  This is a source of unnecessary complexity.  Note that in my hypothetical, I was describing the NRPE server polling search for the Nagios server, not the Nagios server polling for nodes to monitor.
  3. (paraphrased) It is inappropriate to expect to be able to converge an entire infrastructure in a single pass, so we should deploy nodes in a defined sequence in order to minimize the number of client runs necessary to configure the infrastructure.  I disagree with this sentiment, as delaying service availability by a multiple of the client run interval adds to the complexity of the environment and increases the time to resolution for legitimate problems.
  4. (paraphrased) Use push jobs to notify nodes when the infrastructure is ready for them to converge.  The very first sentence on the Chef Push Jobs page is "Chef push jobs is an extension of the Chef server that allows jobs to be run against nodes independently of a chef-client run." We are talking about triggering individual resources within a cookbook within a chef-client run, so push jobs don't really address this issue at all.  Also, it would appear that push jobs were meant to be triggered by administrators and not by the chef-client.  While I'm sure it would be possible to set things up in such a way that a successful initial converge could trigger a push job, such an implementation diverges considerably from the role of push jobs as designed.
  5. Chef 12 doesn't have the 15-minute client token expiry in the same way Chef 11 does (by default).
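
The guard pattern from item 1 can be modelled in plain Ruby to show why the first converge skips the resource and the second one applies it. `Resource` below is a stand-in written for this example, not Chef's own class; the only behaviour it borrows is that an `only_if` block is stored when the recipe is compiled and evaluated when the resource converges.

```ruby
# Stand-in for a Chef resource with an only_if guard (hypothetical, for
# illustration only). The guard is a block that runs at converge time.
Resource = Struct.new(:name, :guard) do
  def run
    guard.call ? "configured #{name}" : "skipped #{name}"
  end
end

# `matches` plays the role of the search result captured earlier in the recipe.
def build_resources(matches)
  [Resource.new("nrpe.cfg", -> { !matches.empty? })]
end
```

On the first run `matches` is empty and the resource is skipped; once the other node is indexed, the same recipe configures it without any change to the code.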

Did I miss anything?  While I'm not surprised by the information here, it confirms that there is a functional hole here-- there are many situations in a converged infrastructure that require multiple client runs before the infrastructure is fully functional, and this means it takes longer than absolutely necessary to bring up a new infrastructure from scratch.

I am aware that implementing callbacks, or resource hooks on remote nodes, is a huge job, but there are other ways to approach the situation.  Another solution that comes to mind (besides the search-subscriber model): Make it easy for a recipe to advise the chef-client to reduce its run interval based on conditions detected in the chef-client run.  I like this one because it is useful beyond the specific situation-- it gives the chef-client a primitive way to learn about the infrastructure it manages and respond appropriately.

At its core, I believe that requiring configuration to come via application-specific config files is the real issue here, but I don't expect to see network-wide integrated config resource support (such as in CoreOS's etcd) in most *nix tools in the near term, nor do I expect to see direct support for communication with Chef servers in our tools (so, for our trusty hypothetical example, nrpe could be instructed simply to perform a Chef search directly to determine the allowed nrpe-client IPs rather than have the chef-client write the nrpe.conf as part of convergence, then notify the nrpe server of the change).

TL;DR: Now I'm thinking that the best thing would be for the recipe to be able to advise the client to reduce the interval before the next run.  This would enable code like

    result = search(...)
    if result.empty?
      Chef::Client.advise_interval(300) # request a chef-client run 5 minutes (+/- splay) after this converge
    end
    # define resources below, using not_if { result.empty? } where appropriate
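
To make the proposal concrete, here is a sketch of how a client loop might combine its configured interval with an advised one. Everything here is hypothetical, just as `advise_interval` itself is: the function name, its parameters, and the choice to add a random splay between 0 and `splay` seconds (which is how I understand chef-client's `splay` setting to behave).

```ruby
# Hypothetical: compute the delay in seconds before the next client run.
#   default - the normally configured run interval
#   advised - interval requested by a recipe during this run, or nil
#   splay   - upper bound of the random delay added on top
def next_run_delay(default:, advised: nil, splay: 0, rng: Random.new)
  # A recipe may only shorten the wait, never lengthen it.
  base = advised ? [advised, default].min : default
  base + (splay > 0 ? rng.rand(0..splay) : 0)
end
```

For example, `next_run_delay(default: 1800, advised: 300, splay: 60)` yields a delay between 300 and 360 seconds, while a run that advised nothing keeps the usual 1800-second base.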



On Thu, Oct 2, 2014 at 8:53 AM, Lamont Granquist wrote:
On 10/1/14, 12:48 PM, Justin Dossey wrote:

would cause the recipe to poll the Chef server every ten seconds until a response came back, or until the client token expired (usually around 15 minutes).

Such behavior is not ideal (particularly because of the client token expiry issue) but besides supporting a second converge, it's the only way I have seen that will accomplish the desired result.

You can set `no_lazy_load true` in client.rb to avoid cookbook_file resources failing after 15m.  This will be the default in Chef 12 client.  The Chef 12 server also uses solr 4 and is going to populate search results in seconds rather than minutes.
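
For reference, the blocking poll described in the quoted text can be sketched as a bounded loop. This is a plain-Ruby illustration, not code from the thread: `poll_until` and its parameters are hypothetical, and the block stands in for the Chef `search` call.

```ruby
# Hypothetical helper: call the block every `interval` seconds until it
# returns a non-empty result, or give up once `timeout` seconds would be
# exceeded. Returns the result, or nil on timeout.
# `sleeper` is injectable so the loop can be exercised without waiting.
def poll_until(timeout:, interval:, sleeper: ->(s) { sleep(s) })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result unless result.nil? || result.empty?
    # Stop if another sleep would carry us past the deadline.
    return nil if Process.clock_gettime(Process::CLOCK_MONOTONIC) + interval > deadline
    sleeper.call(interval)
  end
end
```

In a recipe this might look like `poll_until(timeout: 600, interval: 10) { search(...) }`, with the timeout kept below the roughly 15-minute expiry mentioned above.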



--
Justin Dossey
Practice Owner
New Context Services, Inc






