Thanks Bryan,

Logging on DEBUG now. Results from stalls last night:

[Tue, 24 Jan 2012 00:23:25 +1300] DEBUG: Processing package[update-notifier-common] on slowbro
[Tue, 24 Jan 2012 00:23:25 +1300] INFO: Processing package[update-notifier-common] action install (apt::default line 37)
[Tue, 24 Jan 2012 00:23:25 +1300] DEBUG: package[update-notifier-common] checking package status for update-notifier-common
[Tue, 24 Jan 2012 00:23:25 +1300] DEBUG: sh(apt-cache policy update-notifier-common)

(Note the stall here for 2 minutes)

[Tue, 24 Jan 2012 00:25:29 +1300] DEBUG: package[update-notifier-common] current version is 0.99.3ubuntu0.1
[Tue, 24 Jan 2012 00:25:29 +1300] DEBUG: package[update-notifier-common] candidate version is 0.99.3ubuntu0.1
[Tue, 24 Jan 2012 00:25:29 +1300] DEBUG: package[update-notifier-common] is already installed - nothing to do

There were 2 more runs with stalls in the same place, for different packages. So the stall appears to happen after apt-cache policy is executed, but before any output is returned? Running apt-cache policy by hand for the 3 packages produces nothing that looks strange to me.

Investigating a bit further (taking a guess that the issue is in shell_out) leads me to:
https://github.com/opscode/mixlib-shellout/blob/master/lib/mixlib/shellout/unix.rb

This box is on Ubuntu 10.04 LTS with ruby 1.8:

ruby -v
ruby 1.8.7 (2010-01-10 patchlevel 249) [x86_64-linux]

Would it be worth upgrading ruby? (schisamo, you appear to have authored shell_out; any idea if those hacks are required for that version of ruby?) Could we change the GC.disable to only happen on the affected versions of ruby?
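For the sake of discussion, a version-gated workaround could look something like the sketch below. This is a hypothetical illustration, not mixlib-shellout's actual code; the helper name and the "affected versions are 1.8.x" check are my assumptions.

```ruby
# Hypothetical sketch (NOT the real mixlib-shellout code): only apply the
# GC.disable workaround on the Ruby versions known to need it, so garbage
# collection stays enabled everywhere else.

# Assumption for illustration: MRI 1.8.x is the affected interpreter.
AFFECTED_RUBY = RUBY_VERSION.start_with?("1.8")

def with_gc_workaround
  GC.disable if AFFECTED_RUBY
  yield
ensure
  GC.enable if AFFECTED_RUBY
end

# Example: run a command inside the guarded block.
output = with_gc_workaround { `echo hello`.strip }
puts output
```

On a non-1.8 interpreter both branches are no-ops, so the guard costs nothing there.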
(I could remove it and do some more test runs.)

And finally, I ran chef-client in a loop until the issue occurred. The results were as follows (after quite a few runs):

[Tue, 24 Jan 2012 09:32:12 +1300] DEBUG: Processing package[iftop] on slowbro
[Tue, 24 Jan 2012 09:32:12 +1300] INFO: Processing package[iftop] action install (iftop::default line 10)
[Tue, 24 Jan 2012 09:32:12 +1300] DEBUG: package[iftop] checking package status for iftop
[Tue, 24 Jan 2012 09:32:12 +1300] DEBUG: sh(apt-cache policy iftop)
iftop:
  Installed: 0.17-16
  Candidate: 0.17-16
  Version table:
 *** 0.17-16 0
        500 http://nz.archive.ubuntu.com/ubuntu/ lucid/universe Packages
        100 /var/lib/dpkg/status

Output stopped here, memory usage shot up to 16GB+ (the machine has 32GB), CPU sat at 0, and nothing more happened. I left it running for 10 minutes before killing it with kill -9 (ctrl-c wasn't responding).

Any other suggestions are very much appreciated.

Cheers,

From: Bryan McLellan:
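The run-in-a-loop reproduction above can be scripted; a minimal sketch of that approach is below. The command string, log redirection, and 3-minute threshold are assumptions (the report says normal runs take ~30s), and the command is parameterized so the loop itself can be exercised with anything.

```ruby
# Sketch of "run chef-client in a loop until it stalls": re-run the
# command, time each run, and stop once a run blows past a threshold so
# the slow run can be inspected.
require 'benchmark'

def run_until_slow(cmd, threshold_seconds)
  loop do
    elapsed = Benchmark.realtime { system(cmd) }
    puts "run took #{elapsed.round}s"
    if elapsed > threshold_seconds
      puts "slow run detected"
      break
    end
  end
end

# In practice this would be invoked with something like (assumed flags):
#   run_until_slow("chef-client -l debug >> /tmp/chef-loop.log 2>&1", 180)
```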
It doesn't look like you have logging set to DEBUG either. Try that, and maybe check your log rotation to protect your disk space.

Bryan

On Jan 19, 2012 5:41 PM, "David Leaver" wrote:

From "knife node show --long" right now, on both nodes the attributes look fine. It spits out a 140KB file for one node and 98KB for the other. (Most of that is the list of CPUs and capabilities.) Will keep that in mind and run it immediately after it breaks next time.

Cheers,

From: Bryan McLellan:
I wonder if a node attribute is growing out of control. Have you looked at the node attributes on the server after this run finishes?

Bryan

On Jan 19, 2012 3:40 PM, "David Leaver" wrote:

We are running Ubuntu 10.04 LTS, x64. chef-client is running with the default service installed with the chef .deb packages. On two of our machines, when chef-client runs and does a check it randomly (once a day to once a week) takes longer than normal (30s becomes 3 minutes+) and takes up a lot more memory
than chef should. 5GB memory usage: http://pastebin.com/YKXXFRaR (Taken from a machine running right now which had the issue earlier this morning; I've left chef-client running at 5GB in case there is anything that can be pulled from it.)

For each of the machines having this problem we have an identically configured machine, same chef roles etc. (secondary for failover), which is not exhibiting any of the issues. (These machines also have almost no load, whereas the ones having issues have plenty of load.) I really don't have any idea where to start with investigating this, so any help will be much appreciated!

The runlist for each machine has apt (currently 1.1.1, will update that) which installs a few packages; however none of these have changed recently, so no packages should be being installed. Each machine has been running with the exact same runlist for quite a while (6 months+), and only started exhibiting this problem in the last month.

The chef logs when these memory/time issues occur appear as follows:

[Fri, 20 Jan 2012 07:48:06 +1300] INFO: Processing easy_install_package[jonpy] action install (mapnik::default line 43)
[Fri, 20 Jan 2012 07:49:57 +1300] INFO: Processing package[php5-gd] action install (mapscript::default line 10)

(jonpy is already installed, yet it took 2 minutes to check?)

Or more commonly like this:

[Mon, 09 Jan 2012 01:05:31 +1300] INFO: Processing package[snmpd] action install (snmpd::default line 10)
[Mon, 09 Jan 2012 01:08:12 +1300] INFO: Processing service[snmpd] action nothing (snmpd::default line 14)
....
[Mon, 09 Jan 2012 01:08:27 +1300] INFO: Processing package[apache2] action install (apache2::default line 10)
[Mon, 09 Jan 2012 01:10:05 +1300] INFO: Processing service[apache2] action nothing (apache2::default line 14)

(Both packages already installed and happily running, yet they took a long time.)

One machine is on chef 10.8, the other on 10.0. Runlist from machines: If there are any further details that may help investigate this, I will happily supply them.

Cheers,
Dave.
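For context on the "checking package status" step that stalls in the logs above: for an apt package, the provider shells out to `apt-cache policy` and compares the installed version against the candidate version, doing nothing when they match. A simplified Ruby sketch of that parsing (an illustration, not Chef's actual provider code):

```ruby
# Simplified sketch of deciding whether a package install has anything to
# do, based on `apt-cache policy` output. Not Chef's real implementation.
def parse_policy(output)
  installed = output[/^\s*Installed:\s*(\S+)/, 1]
  candidate = output[/^\s*Candidate:\s*(\S+)/, 1]
  installed = nil if installed == "(none)"
  { installed: installed, candidate: candidate }
end

# Sample output in the same shape as the iftop excerpt above.
sample = <<-EOS
iftop:
  Installed: 0.17-16
  Candidate: 0.17-16
  Version table:
 *** 0.17-16 0
        500 http://nz.archive.ubuntu.com/ubuntu/ lucid/universe Packages
EOS

info = parse_policy(sample)
# When installed == candidate, the resource is up to date: nothing to do.
puts "nothing to do" if info[:installed] == info[:candidate]
```

The point is that the parsing itself is trivial; the stall in the logs happens before any of this output even arrives, which is what points the suspicion at shell_out rather than at apt.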