chef-dev - [[chef-dev]] Re: [[chef-dev]] Re: [[chef-dev]] CHEF-2224

Chef Developers

From: AJ Christensen
To: Daniel DeLeo
Cc: Adam Jacob, Chef Dev
Subject: [[chef-dev]] Re: [[chef-dev]] Re: [[chef-dev]] CHEF-2224
Date: Fri, 15 Apr 2011 10:42:21 +1200

Yo!

On 15 April 2011 10:36, Daniel DeLeo wrote:
>
>
>
> On Thursday, April 14, 2011 at 3:17 PM, Adam Jacob wrote:
>
> On Thu, Apr 14, 2011 at 5:09 PM, AJ Christensen wrote:
>
> We've had failed first-run (but attributes-saved) nodes show up in
> monitoring (via search), so I'm -1 on this being a bad behavior
> change; it certainly is a change of behavior, but with some testing,
> I'd totally support it.
>
> I would argue this is exactly what you want - those nodes in fact
> should be triggering alarms - the services they were supposed to be
> running (given that your intent at bootstrap time was to have a
> working system with that run list, not a non-existent one) - and they
> now fail. You don't want the situation where the now-stranded systems
> are not included in your monitoring because of a failed bootstrap, do
> you?
>
> I think that's exactly what I want. In the case that the chef run will
> succeed, I don't want chef-client to run on the load balancer and pick up a
> node that isn't running the application yet, and I don't want the monitoring
> system to expect nodes to be running a service that hasn't been configured
> yet. I only want these things to happen after the machine is in a working
> state. Currently I have to do a dance to disable alerts in the NMS before
> they start paging people when I'm bringing up a new node.

(we do the same dance; add nagios_notifications_disabled role,
converge, bring up nodes.. verify, remove
nagios_notifications_disabled role)
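
For anyone who hasn't done the dance, here is a rough sketch of the ordering in Python. It is only an illustration: bring_up, converge and verify are hypothetical names, the run list is a plain in-memory list rather than anything fetched from the Chef server, and leaving alerts silenced when verification fails is an assumption, not something our tooling necessarily does.

```python
# Hypothetical sketch of the "disable alerts, converge, verify, re-enable" dance.
# Nothing here talks to Chef or Nagios; converge and verify are caller-supplied.

SILENCE_ROLE = "role[nagios_notifications_disabled]"


def bring_up(run_list, converge, verify):
    """Return True if the node came up and alerts were re-enabled."""
    if SILENCE_ROLE not in run_list:
        run_list.append(SILENCE_ROLE)  # 1. silence alerts before anything else
    converge(run_list)                 # 2. converge / bring up the node
    if not verify():                   # 3. check the service really works
        return False                   #    assumption: stay silenced on failure
    run_list.remove(SILENCE_ROLE)      # 4. re-enable alerts
    return True
```

The point of the ordering is that the monitoring system never sees the node without the silencing role until someone (or something) has verified it.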

I replied off list, but have been thinking about a layer above chef
that our NMS confers with to determine good nodes to monitor.

I too am part of the camp that doesn't want bad nodes discovered, and
this is one of those cases.
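
A minimal sketch of what such a layer might look like, in Python. Everything here is hypothetical: the node records are plain dicts, and the "converged_ok" flag stands in for whatever signal would mark a node as having completed a successful run. It is not an existing Chef attribute or API.

```python
# Hypothetical filter sitting between node data and the NMS.
# A node is only handed to monitoring if it carries the wanted role
# AND has been marked as successfully converged.


def nodes_to_monitor(nodes, role):
    """Return names of nodes with `role` that have converged successfully."""
    return [
        node["name"]
        for node in nodes
        if role in node.get("roles", []) and node.get("converged_ok", False)
    ]


nodes = [
    {"name": "app1", "roles": ["app"], "converged_ok": True},
    {"name": "app2", "roles": ["app"], "converged_ok": False},  # failed first run
    {"name": "db1", "roles": ["db"], "converged_ok": True},
]
good = nodes_to_monitor(nodes, "app")
```

With the sample data above, only app1 would be handed to monitoring; app2 has the role but never finished a run, so it stays undiscovered.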

> In the failure case, I'm either going to be keeping an eye on the nodes and
> look at why they failed, or if I'm creating a lot of them in an automated
> way, I'll automate recovery or notification for the failure case.
>
>
> I think this could be a regression too, cause I seem to recall a time
> where the attributes weren't saved to the node prior to the
> application of the resource collection.
>
> It is a regression - there was a time when we didn't do this, and we
> put this behavior in specifically for cases like the above.
>
> In addition, this is a common early work pattern - you're tweaking
> recipes, you're testing, and then you're building new systems from
> scratch. The change away from storing the data early makes that loop
> less intuitive (I've had 3 different people today comment on it.)
>
> I've tweaked recipes and I've tested and I've built systems from scratch.
> What I have not done is inspected the attribute data on the server to debug
> recipes. Can you give an example of why this is a necessary or superior
> debugging technique? The log output has always been much more useful to me,
> and there's always the log resource if I need more.

The only times I've done this are when we've had trouble validating
that the on-disk changes are taking effect; most of the time we just
increase logging & throw debug statements around.

> The fewer decision points diagnosing node-bootstrap-failure rings true.
>
> I feel like this is a red herring - if it brings you joy to include -j
> /etc/chef/first-boot.json every time, go for it. :)
>
> The issue isn't whether you enjoy using the -j flag, it's that the way
> things are currently, you sometimes need it and sometimes don't. For
> example, if I'm bootstrapping a node, and I forget to set the node_name and
> I don't have a valid FQDN yet, chef will fail when it attempts to determine
> the node name from the FQDN, *before* creating and saving the node. In this
> case, I have to use -j after I fix the problem, because I never successfully
> contacted the server. If chef-client fails after the initial save, I don't.
> The only thing that works for every case is to re-run with -j.
>
>
> Adam
>
> --
> Opscode, Inc.
> Adam Jacob, Chief Product Officer
> T: (206) 619-7151 E:
>
>
> --
> Daniel DeLeo
> Software Design Engineer
> Opscode, Inc.
>
