- From: AJ Christensen
- To: Daniel DeLeo
- Cc: Adam Jacob, Chef Dev
- Subject: [[chef-dev]] Re: [[chef-dev]] Re: [[chef-dev]] CHEF-2224
- Date: Fri, 15 Apr 2011 10:42:21 +1200
Yo!
On 15 April 2011 10:36, Daniel DeLeo wrote:
> On Thursday, April 14, 2011 at 3:17 PM, Adam Jacob wrote:
>> On Thu, Apr 14, 2011 at 5:09 PM, AJ Christensen wrote:
>>> We've had failed first-run (but attributes-saved) nodes show up in
>>> monitoring (via search), so I'm -1 on this being a bad behavior
>>> change; it certainly is a change of behavior, but with some testing,
>>> I'd totally support it.
>>
>> I would argue this is exactly what you want - those nodes in fact
>> should be triggering alarms - the services they were supposed to be
>> running (given that your intent at bootstrap time was to have a
>> working system with that run list, not a non-existent one) - and they
>> now fail. You don't want the situation where the now-stranded systems
>> are not included in your monitoring because of a failed bootstrap, do
>> you?
>
> I think that's exactly what I want. In the case that the chef run will
> succeed, I don't want chef-client to run on the load balancer and pick up a
> node that isn't running the application yet, and I don't want the monitoring
> system to expect nodes to be running a service that hasn't been configured
> yet. I only want these things to happen after the machine is in a working
> state. Currently I have to do a dance to disable alerts in the NMS before
> they start paging people when I'm bringing up a new node.
(We do the same dance: add the nagios_notifications_disabled role,
converge, bring up nodes, verify, then remove the
nagios_notifications_disabled role.)
I replied off-list, but I have been thinking about a layer above chef
that our NMS consults to determine which nodes are good to monitor.
I too am in the camp that doesn't want bad nodes discovered, and
this is one of those cases.
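A rough sketch of what such a layer might do, purely as an illustration: the node records are plain dicts, and the `converged_once` key is a hypothetical flag (not an existing chef attribute) meaning "at least one chef run has finished successfully". Only the role name comes from the dance described above.

```python
from typing import Iterable, List

# Role name from the dance above; the "role[...]" run list syntax is chef's.
DISABLED_ROLE = "role[nagios_notifications_disabled]"


def monitorable_nodes(nodes: Iterable[dict]) -> List[str]:
    """Return the names of nodes the NMS should alert on.

    Each node is a plain dict with hypothetical keys:
      name: node name
      run_list: list of run list entries
      converged_once: True once any chef run has finished successfully
    """
    good = []
    for node in nodes:
        if not node.get("converged_once", False):
            continue  # failed or still-running first run: keep it hidden
        if DISABLED_ROLE in node.get("run_list", []):
            continue  # an operator has explicitly muted this node
        good.append(node["name"])
    return good


nodes = [
    {"name": "app1", "run_list": ["role[app]"], "converged_once": True},
    {"name": "app2", "run_list": ["role[app]"], "converged_once": False},
    {"name": "app3", "run_list": ["role[app]", DISABLED_ROLE], "converged_once": True},
]
print(monitorable_nodes(nodes))  # ['app1']
```

The NMS would then build its host list from this filter rather than from a raw search, so a node whose first run failed never gets picked up.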
> In the failure case, I'm either going to be keeping an eye on the nodes and
> look at why they failed, or if I'm creating a lot of them in an automated
> way, I'll automate recovery or notification for the failure case.
>
>>> I think this could be a regression too, cause I seem to recall a time
>>> where the attributes weren't saved to the node prior to the
>>> application of the resource collection.
>>
>> It is a regression - there was a time when we didn't do this, and we
>> put this behavior in specifically for cases like the above.
>>
>> In addition, this is a common early work pattern - you're tweaking
>> recipes, you're testing, and then you're building new systems from
>> scratch. The change away from storing the data early makes that loop
>> less intuitive (I've had 3 different people today comment on it.)
>
> I've tweaked recipes and I've tested and I've built systems from scratch.
> What I have not done is inspected the attribute data on the server to debug
> recipes. Can you give an example of why this is a necessary or superior
> debugging technique? The log output has always been much more useful to me,
> and there's always the log resource if I need more.
The only times I've done this are when we're having trouble validating
that the on-disk changes are taking effect; most of the time we just
use increased logging and throw debug statements around.
>>> The fewer decision points diagnosing node-bootstrap-failure rings true.
>>
>> I feel like this is a red herring - if it brings you joy to include -j
>> /etc/chef/first-boot.json every time, go for it. :)
>
> The issue isn't whether you enjoy using the -j flag, it's that the way
> things are currently, you sometimes need it and sometimes don't. For
> example, if I'm bootstrapping a node, and I forget to set the node_name and
> I don't have a valid FQDN yet, chef will fail when it attempts to determine
> the node name from the FQDN, *before* creating and saving the node. In this
> case, I have to use -j after I fix the problem, because I never successfully
> contacted the server. If chef-client fails after the initial save, I don't.
> The only thing that works for every case is to re-run with -j.
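The two cases described above can be written down as a small decision function. This is only a sketch of the reasoning, not chef code: the `failure_point` labels and the `node_saved_before_converge` flag are hypothetical names, with the flag modelling whether the node is saved to the server before the resource collection is applied.

```python
def rerun_needs_j(failure_point: str, node_saved_before_converge: bool = True) -> bool:
    """Return True if the retry has to pass -j to pick up the first-boot JSON.

    failure_point is a hypothetical label:
      "before_save": chef failed before creating and saving the node,
                     e.g. it could not derive the node name from the FQDN
      "after_save":  chef failed later, while converging
    """
    if failure_point == "before_save":
        # The server never saw the node, so the run list has to be supplied again.
        return True
    if failure_point == "after_save":
        # Only avoidable if the node (and its run list) was saved early.
        return not node_saved_before_converge
    raise ValueError("unknown failure point: %r" % failure_point)


for point in ("before_save", "after_save"):
    print(point, rerun_needs_j(point))
```

With the early save in place the answer depends on where the run failed; with `node_saved_before_converge=False` both cases return True, which is the "always re-run with -j" rule.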
>>
>> Adam
>>
>> --
>> Opscode, Inc.
>> Adam Jacob, Chief Product Officer
>> T: (206) 619-7151 E:
>
> --
> Daniel DeLeo
> Software Design Engineer
> Opscode, Inc.