[chef] Re: Re: File created from template has broken (binary) and incorrect data in it after Chef run


Chronological Thread 
  • From: Ian Marlier < >
  • To:
  • Subject: [chef] Re: Re: File created from template has broken (binary) and incorrect data in it after Chef run
  • Date: Tue, 5 Jun 2012 21:24:38 -0400

I don't have debug logs for the time that this issue happened -- I don't run with debug logging enabled for the most part (though if this happens again, I might start!). 

Here's the INFO level logs from the period where the issue happened, trimmed a bit.  I've included the log lines from initialization (showing that no changed cookbooks/recipes were downloaded), and the log lines that show the operations on this particular resource.  The actions at 17:53 are a good copy of the file being moved out of the way and a broken copy being put into place.  The actions at 18:24 are the broken file being moved out of the way, and a good copy being put back in place.

[Mon, 04 Jun 2012 17:52:56 -0700] INFO: *** Chef 0.10.10 ***
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List is [role[falkland], role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Run List expands to [chef_handler, chef_handler::register_handlers, chef-client::service, bc_ec2, bc_ec2::pullkeys, openldap::client, users::root, users::ubuntu, openssh, dns::hosts, dns::resolv.conf, dns::nsupdate, hostname, motd, bash, yum::client, apt::brightcove, apt, python, diamond, zip, sudoers, nagios::nrpe, nagios::send_nsca, ntpd, genders, sysstat, syslog, xml, postfix, tmpreaper, xen::guest, mysql::default, mysql::nrpe, pacemaker::default, mysql::pacemaker]
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Starting Chef Run for fijidb01.fal
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 17:53:05 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 17:53:06 -0700] INFO: Loading cookbooks [apache2, apt, aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler, chef, chef-client, chef_handler, couchdb, daemontools, dell, diamond, dns, erlang, fiji, genders, git, gluster, haproxy, hostname, ios_resque_endpoint, java, logrotate, modprobe, mongodb, motd, mysql, nagios, newrelic, nginx, node, ntpd, openldap, openssh, openssl, openvpn, pacemaker, postfix, python, rabbitmq_chef, redis, resque, ruby-shadow, runit, stunnel, sudoers, sysctl, syslog, sysstat, tmpreaper, ucspi-tcp, unicorn, users, varnish, xen, xinetd, xml, yum, zip, zlib]
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNodejs light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeServer light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: NodeNpm light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonPip light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: PythonVirtualenv light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: ChefHandler light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AptRepository light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: BluepillService light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsElasticIp light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: AwsEbsVolume light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: DaemontoolsService light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Chef Handlers will be at: /var/chef/handlers
[Mon, 04 Jun 2012 17:53:07 -0700] INFO: Processing remote_directory[/var/chef/handlers] action create (chef_handler::default line 23)
<snip>
[Mon, 04 Jun 2012 17:53:24 -0700] INFO: Processing template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 17:53:24 -0700] INFO: template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] backed up to /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324
[Mon, 04 Jun 2012 17:53:24 -0700] INFO: template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] mode changed to 644
[Mon, 04 Jun 2012 17:53:24 -0700] INFO: template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] updated content
<snip>
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List is [role[falkland], role[base], role[xen_guest], role[fiji_db_server]]
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Run List expands to [chef_handler, chef_handler::register_handlers, chef-client::service, bc_ec2, bc_ec2::pullkeys, openldap::client,
 users::root, users::ubuntu, openssh, dns::hosts, dns::resolv.conf, dns::nsupdate, hostname, motd, bash, yum::client, apt::brightcove, apt, python, diamond, zip, sudoers, nagios::nrpe, nagios::send_nsca, ntpd, genders, sysstat, syslog, xml, postfix, tmpreaper, xen::guest, mysql::default, mysql::nrpe, pacemaker::default, mysql::pacemaker][Mon, 04 Jun 2012 18:23:39 -0700] INFO: Starting Chef Run for fijidb01.fal[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Running start handlers
[Mon, 04 Jun 2012 18:23:39 -0700] INFO: Start handlers complete.
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Loading cookbooks [apache2, apt, aws, bash, bc_ec2, benson, bind9, bluepill, build-essential, bundler, chef, chef-client, chef_handle
r, couchdb, daemontools, dell, diamond, dns, erlang, fiji, genders, git, gluster, haproxy, hostname, ios_resque_endpoint, java, logrotate, modprobe, mongodb, motd, mysql, na
gios, newrelic, nginx, node, ntpd, openldap, openssh, openssl, openvpn, pacemaker, postfix, python, rabbitmq_chef, redis, resque, ruby-shadow, runit, stunnel, sudoers, sysct
l, syslog, sysstat, tmpreaper, ucspi-tcp, unicorn, users, varnish, xen, xinetd, xml, yum, zip, zlib]
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight provider already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNodejs light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeServer light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: NodeNpm light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonPip light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: PythonVirtualenv light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: ChefHandler light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AptRepository light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: BluepillService light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsElasticIp light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: AwsEbsVolume light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: DaemontoolsService light-weight resource already initialized -- overriding!
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Chef Handlers will be at: /var/chef/handlers
[Mon, 04 Jun 2012 18:23:44 -0700] INFO: Processing remote_directory[/var/chef/handlers] action create (chef_handler::default line 23)
<snip>
[Mon, 04 Jun 2012 18:24:16 -0700] INFO: Processing template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 18:24:16 -0700] INFO: template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] backed up to /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416
[Mon, 04 Jun 2012 18:24:16 -0700] INFO: template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] mode changed to 644
[Mon, 04 Jun 2012 18:24:16 -0700] INFO: template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] updated content
[Mon, 04 Jun 2012 18:54:52 -0700] INFO: Processing template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 19:56:16 -0700] INFO: Processing template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:27:07 -0700] INFO: Processing template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] action create (couchdb::nrpe line 16)
[Mon, 04 Jun 2012 20:58:04 -0700] INFO: Processing template[/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg] action create (couchdb::nrpe line 16)


Copies of this file that exist in the Chef file backup cache:
[imarlier ~]$ find /var/chef/backup/ -type f -name "check_couch_two_way*" -ls
11108655    4 -rw-r--r--   1 root     root           82 Jun  4 17:53 /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604182416
11108654    4 -rw-r--r--   1 root     root          128 Mar 17 13:19 /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324
[imarlier ~]$


The ticket that I created at http://tickets.opscode.com/browse/CHEF-3179 has copies of the broken file (as moved out of the way at 18:24) and the good file that was created at 18:24.  I can also upload a copy of the file that existed prior to the 17:53 chef-client run, but, well:

[imarlier ~]$ diff /var/chef/backup/usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg.chef-20120604175324 /usr/local/nagios/etc/nrpe.d/check_couch_two_way_replication.cfg
[imarlier ~]$


I'm like you in having flashbacks to other bugs in the config management world, which is why I think that this is a pretty critical issue to track down.  This is really way too reminiscent of 1010.  There appears to be some scenario that leads to data corruption within the context of a chef-client run, and it appears to be independent of chef-server.  If the data that was being corrupted was node attribute data, that would be one thing.  But this is resource attributes, and should have been set at the time that the template resource was created and added to the run list.  Shoot, the template knew it's name when the resource was created (because it put the broken file at the right path), but it didn't know it's name when it _wrote the template contents_ (because the command name is @params[:name])!




On Tue, Jun 5, 2012 at 6:26 PM, AJ Christensen < " target="_blank"> > wrote:
Please show full debug logs for the period and/or the contents of the
Chef File Backup Cache and/or logging statements regarding moving
files into the Backup Cache

This is pretty heavy stuff.. reminds me of a bug long ago, on a planet
far away in another configuration management universe.

Regards,

AJ

On 6 June 2012 02:13, Ian Marlier < "> > wrote:
> Hi, Chefs --
>
> After a Chef run on one of my machines last night, a text file on the
> system, created from a template, had binary data in it.  On the next Chef
> run, everything was fine.  I'm wondering if this is something that anyone
> else has experienced.  (This is also filed as
> http://tickets.opscode.com/browse/CHEF-3179, but I wanted to cast a wide net
> since it's not an easily searchable issue.)
>
> Basic details:
> Chef 0.10.10, Ruby 1.9.2p180, Ohai 0.6.12
>
>
> Many more details: read on!
>
> I have a resource definition called "nrpe_service".  It's used to define
> services for NRPE (Nagios Remote Plugin framework).  In this case, it's
> defined like this:
> nrpe_service "check_couch_two_way_replication" do
>   command "check_couch_replication.py --database $ARG1$ --source --dest"
> end
>
> The definition does some sanity checking (making sure that needed attributes
> are defined, etc), then does this if the sanity check works out:
>         template "#{node[:nrpe][:confdir]}/#{params[:name]}.cfg" do
>             cookbook "nagios"
>             source "nrpe_service.cfg.erb"
>             owner "root"
>             group "root"
>             mode "0644"
>             variables({:params => params})
>         end
>
>
> So far, so good.
>
> The template itself is pretty simple (though, wow, what was I thinking when
> I put some of the modification logic in the template instead of the resource
> definiton?  That was silly...):
> <% if @params[:command][0] == '/' %>
> <% cmd = @params[:command] %>
> <% else %>
> <% cmd = " :nagios][:plugin_path]}/ :command]}" %>
> <% end %>
> command[<%= @params[:name] -%>]=<%= cmd %>
>
>
> The resulting service definition looks like this:
> [imarlier nrpe.d]$ cat check_couch_two_way_replication.cfg
> command[check_couch_two_way_replication]=/usr/lib64/nagios/plugins/check_couch_replication.py
> --database $ARG1$ --source --dest
> [imarlier nrpe.d]$
>
>
> This has been in place, and has been working just fine, for months now.
> Then last night, chef-client ran (a scheduled run) and the file turned into
> this:
> [imarlier nrpe.d]$ cat
> check_couch_two_way_replication.cfg.chef-20120604182416
> command[1?en
>             way_replication]=/usr/lib64/nagios/plugins/["uid", 10003]
> [imarlier nrpe.d]$
>
>
> This is...wrong.
>
> The next time that Chef ran, everything went back to how it should be.
> There were no alterations to the Chef server at this time, and no local
> modifications on the client either.
>
> It's interesting to me that the things that appear to be corrupted are
> parameter values that should be derived entirely locally -- they're elements
> of the resource definition, and not parameters that are being pulled down
> from the server or anything like that.
>
> Has anyone seen similar behavior?  Given that this isn't something that
> seems to happen with any frequency -- and it's not something that I can
> reproduce at will -- anyone have ideas as to what I might do to debug this?
> Are there any known data corruption bugs in Chef?  In Ruby?  I'm kind of at
> a loss, but seeing something like this is pretty terrifying.
>
> Thanks,
>
> - Ian
>
>
> --
> Ian Marlier | Senior Systems Engineer
> Brightcove, Inc.
> 290 Congress Street, 4th Floor, Boston, MA 02110
> ">
>



--
Ian Marlier | Senior Systems Engineer
Brightcove, Inc.
290 Congress Street, 4th Floor, Boston, MA 02110
" target="_blank">




Archive powered by MHonArc 2.6.16.

§