[chef] Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: chef-client (still) randomly failing


  • From: Daniel DeLeo < >
  • To:
  • Subject: [chef] Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: Re: chef-client (still) randomly failing
  • Date: Fri, 22 Nov 2013 10:16:47 -0800


On Friday, November 22, 2013 at 10:09 AM, Phil Cryer wrote:

So nothing I can see is locking yum, assuming it's something chef is kicking off. What I do (when things aren't working) is log in, stop chef-client, kill any chef processes hanging around, and make sure there's no /var/run/yum.pid. Then I manually run chef-client and it works fine. I run it again to be sure it still works, it does, and then I log out. Hours later it's stuck in this condition again. This is across 4 VMware VMs; each one just runs Apache and PHP and has 2 GB of RAM, all on CentOS 6.4. I want to get this fixed because I want chef-client to check in regularly so I see all green when I run knife status. Anything else you can think of to debug what's making it fail (seemingly) randomly? Thanks

But the constant here is that there’s _always_ a /var/run/yum.pid, and either no process matches that pid or the pid has been recycled? If that’s the case, the next step is to figure out what process wrote that file, and why it’s dying without cleaning up after itself.
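
If it helps, here’s a rough, untested sketch of that check in plain Ruby: read the pid out of the file and ask the kernel whether that process still exists (signal 0 does an existence check without actually signalling anything):

pid_file = '/var/run/yum.pid'

if File.exist?(pid_file)
  pid = File.read(pid_file).to_i
  begin
    # signal 0 tests for existence without delivering a signal
    Process.kill(0, pid)
    puts "pid #{pid} is alive; try `ps -fp #{pid}` to see what holds the yum lock"
  rescue Errno::ESRCH
    puts "pid #{pid} is gone, so #{pid_file} is stale"
  rescue Errno::EPERM
    puts "pid #{pid} exists but is owned by another user"
  end
else
  puts "no #{pid_file}, so yum isn't locked"
end

If that says the pid is alive, `ps -fp <pid>` tells you who the culprit is; if it’s gone, the interesting question becomes what wrote the file and then died.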

-- 
Daniel DeLeo
 


On Thu, Nov 21, 2013 at 8:02 PM, Daniel DeLeo < > wrote:

On Thursday, November 21, 2013 at 5:47 PM, Phil Cryer wrote:

-> I threw this at the end of one of my recipes:
# report which mixlib-shellout version the client actually loaded
require 'mixlib/shellout/version'
log "the shellout version is #{Mixlib::ShellOut::VERSION}" do
  level :warn
end

-> ran it, and the log says:
  * log[the shellout version is 1.3.0.rc.0] action write
[2013-11-21T19:42:11-06:00] WARN: the shellout version is 1.3.0.rc.0

happy to test it again, or another gem, whatever - just let me know. Thanks

On Thu, Nov 21, 2013 at 1:51 PM, Daniel DeLeo < > wrote:
On Thursday, November 21, 2013 at 11:35 AM, Phil Cryer wrote:
Daniel, sorry to say this still fails with the new mixlib-shellout. It fails once, then keeps failing on every subsequent chef-client run and never recovers. I think it may still be an issue with the yum.pid getting in the way. To fix it, I ssh in and do:

-> stop chef-client
# /etc/init.d/chef-client stop
Stopping chef-client:                                      [  OK  ]

-> make sure all chef processes are done
# ps -fe|grep chef
root     14303 14279  0 13:29 pts/0    00:00:00 grep chef

-> remove the stale yum pid
# rm /var/run/yum.pid

-> run chef-client
# chef-client
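
(For reference, the same steps as a rough, untested Ruby script; it assumes the init script path above and only removes the pid file after confirming no chef process is left:)

# stop the chef-client daemon via its init script
system('/etc/init.d/chef-client stop')

# bail out if any chef-client process is still hanging around
leftovers = `ps -fe`.lines.grep(/chef-client/)
abort("chef-client still running:\n#{leftovers.join}") unless leftovers.empty?

# remove the stale yum pid file, if present
pid_file = '/var/run/yum.pid'
File.delete(pid_file) if File.exist?(pid_file)

# run chef-client once in the foreground
system('chef-client')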

And it just works. So here are the logs from when it was failing; notice it kept trying every hour...

So is there a live process that has the yum lock? Or is it truly stale, with the pid in the file pointing at a dead process? Also, is this machine tight on RAM, and is there anything in the OOM killer log? At first glance I can’t think of what chef could do automatically for that case. Perhaps it could read the pid file and check whether the process is alive, but that seems fraught with peril and prone to race conditions.
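
(On CentOS the kernel’s OOM messages land in /var/log/messages by default, so a quick, untested scan like this would surface them; adjust the path if your syslog config differs:)

# scan the kernel log for OOM killer activity (needs read access, so run as root)
File.foreach('/var/log/messages') do |line|
  puts line if line =~ /oom-killer|Out of memory/i
end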

In any case, if there aren’t any stale `yum-dump.py` processes hanging around, then this behavior isn’t caused by MIXLIB-16, so I may have sent you on a wild goose chase.
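
A quick way to check for those, e.g. this untested, Linux-only /proc walk:

# list leftover yum-dump.py processes by scanning /proc cmdlines
Dir.glob('/proc/[0-9]*/cmdline') do |path|
  begin
    cmdline = File.read(path).tr("\0", ' ')
  rescue Errno::ENOENT, Errno::EACCES
    next  # process exited (or is unreadable) between glob and read
  end
  pid = File.basename(File.dirname(path))
  puts "#{pid}: #{cmdline}" if cmdline.include?('yum-dump.py')
end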




