[chef-dev] Fwd: How do I know if my application has really been "provisioned"? a suggestion


  • From: Erik Hollensbe < >
  • To: Chef Dev < >, " " < >
  • Subject: [chef-dev] Fwd: How do I know if my application has really been "provisioned"? a suggestion
  • Date: Sun, 9 Dec 2012 10:28:01 -0800

Sorry for breaking the thread -- when I first signed up I used a plus hack address and your list software is stricter than I thought. :)

Anyhow my reply is included here.

Begin forwarded message:

Subject: Re: How do I know if my application has really been "provisioned"? a suggestion
Date: December 9, 2012 10:21:21 AM PST


On Dec 9, 2012, at 4:22 AM, Bryan Berry < > wrote:
Erik Hollensbe is doing some freaking awesome work on workflow
orchestration w/ chef-workflow and I think it illustrates the problem
here

require 'chef-workflow/helper'

class MyTest < MiniTest::Unit::VagrantTestCase
  def before
    @json_msg = '{ "id": "dumb message json msg" }'
  end

  def setup
    provision('elasticsearch')
    provision('logstash')
    wait_for('elasticsearch')
    wait_for('logstash')
    inject_logstash_message(@json_msg)
  end

  def test_message_indexed_elasticsearch
    assert es_has_message?(@json_msg)
  end
end

If I understand Erik's code correctly, the `wait_for('elasticsearch')`
only waits for the vagrant provisioner to return. The vagrant
provisioner in turn only waits for `service elasticsearch start` to
return a zero exit code.

Not exactly. It doesn't matter for the purposes of this discussion, but I feel compelled to explain anyway: chef-workflow's provisioner is multithreaded and dependency-based out of the box. When you ask for something to be provisioned, it gets scheduled, and a scheduler in the background provisions it as soon as all of its dependencies are satisfied -- the call itself waits only for the message to reach the scheduler. In the meantime the scheduler may provision other machines needed to satisfy the requirements of that machine or group of machines.

The wait_for statement is simply a way to say "I can't continue until this machine actually exists," and it is not coupled to a provision statement at all. The behavior you're seeing is partly a side effect of being unable to multithread vagrant and VirtualBox during provisioning (the knife side of this is already multithreaded, and the gains are huge when you provision more than one machine at a time for a specific role).

This is relevant because my current task is supporting EC2 as a first-class provisioner, which means that in your test this would actually be quite a bit faster:

def setup
  provision('elasticsearch')
  provision('logstash', 1, %w[elasticsearch]) # logstash depends on ES here
  wait_for('logstash')
  inject_logstash_message(@json_msg)
end

Because the scheduler now knows that logstash depends on ES. If you needed to provision other machines, you could put wait_for statements in the unit tests themselves and literally have your tests provision tons of machines in the background, halting the test process only until the specific machine you care about has been provisioned. That's a significant gain over time, because they're all provisioning at once whether or not your test suite has reached the point where they matter yet.
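To make the decoupling concrete, here's a toy sketch of the idea -- not chef-workflow's actual code, just the shape of it: provision() only enqueues work, a background thread brings the machine up once its dependencies exist, and wait_for() is the only call that blocks.

```ruby
require 'thread'

# Toy dependency-aware provisioner. provision() returns immediately;
# each machine comes up in a background thread once its deps are up.
class Scheduler
  def initialize
    @done = {}                  # name => true once "provisioned"
    @lock = Mutex.new
    @cond = ConditionVariable.new
  end

  # Enqueue a machine; its worker thread waits for deps, then provisions.
  def provision(name, deps = [])
    Thread.new do
      deps.each { |d| wait_for(d) }
      sleep 0.1                 # stand-in for the real provisioning work
      @lock.synchronize { @done[name] = true; @cond.broadcast }
    end
  end

  # Block until the named machine exists; decoupled from provision().
  def wait_for(name)
    @lock.synchronize { @cond.wait(@lock) until @done[name] }
  end
end

s = Scheduler.new
s.provision('elasticsearch')
s.provision('logstash', %w[elasticsearch]) # logstash depends on ES
s.wait_for('logstash')                     # returns only after both are up
```

The key property is that wait_for('logstash') is the only blocking point: everything else runs concurrently and in dependency order behind it.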

Anyhow, this is important to point out because I think this dependency system and parallelism code can be adapted to chef converges -- resource converge lists and this scheduler are extremely similar from a conceptual standpoint -- and I'm about to suggest an alternative that would solve this problem in a way that lets that happen, should the actual patches be written. Please raise your hand if you'd like chef to try to parallelize as much of your converge as it can. :P

We need an optional way to determine whether a server has been
completely provisioned, or whether all of its resources have entered a
"done" state. The only way I know that elasticsearch has started
successfully is if I see "Elasticsearch has started" in the log with a
timestamp more recent than when I started the service.

The before block would run before the service is actually actioned.
Now Chef would need some additional machinery to collect all the done
:after blocks and the related @before_results. This could be done by
chef_handler but may be better as part of chef itself. Let's call it
the done_handler for now. This done_handler would mark the time before
it starts handling any done_after blocks, then loop through the
collected done_after blocks for the specified timeout. Once all blocks
are complete it would continue onto other handlers, such as the
minitest_handler.
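The polling loop described above can be sketched in a few lines. This is a toy with invented names (the @before_results machinery from the quoted proposal is omitted): collect the done_after blocks, then poll them all until each reports done or the timeout expires.

```ruby
# Toy sketch of the proposed "done_handler": registered blocks are polled
# until every one returns true, or the deadline passes.
class DoneHandler
  def initialize(timeout: 5, interval: 0.1)
    @timeout, @interval, @blocks = timeout, interval, []
  end

  def done_after(&blk)
    @blocks << blk
  end

  # Returns true if every block reported done within the timeout.
  def run
    deadline = Time.now + @timeout
    pending = @blocks.dup
    until pending.empty?
      pending.reject! { |b| b.call }  # drop blocks that are now "done"
      return false if Time.now > deadline
      sleep @interval unless pending.empty?
    end
    true
  end
end

h = DoneHandler.new(timeout: 2)
started = Time.now
# Stand-in for "log line newer than when I started the service":
h.done_after { Time.now - started > 0.3 }
h.run # true once the condition holds, false if the timeout expires first
```

Once run returns true, control would continue on to the other handlers, such as the minitest handler.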

I think I have a more general suggestion that takes its cue from typical exception handling schemes in languages, but not exactly.

When you have an exception in ruby, the program aborts unless you catch it. Here's an example:

def foo
  something_that_might_raise
rescue
  $stderr.puts "omg! we raised"
end

This is a common problem in writing routines like 'foo':

def foo
  create_a_file
  something_that_might_raise
  delete_that_file
rescue
  $stderr.puts "omg! we raised"
end

The problem is that if "something_that_might_raise" does indeed raise an exception, no amount of error handling will get "delete_that_file" called.

Luckily ruby (like other languages that use exceptions) provides us with "ensure", which lets us specify a bit of code that always runs no matter what happens. The right way to write the last example:

def foo
  create_a_file
  something_that_might_raise
rescue
  $stderr.puts "omg! we raised"
ensure
  delete_that_file if file_exists
end

Saying that a chef converge is exceptions-as-flow-control isn't exactly a leap of logic -- a resource application breaks, chef blows up, and that's the end of the story. Your job is to write your cookbooks and recipes in a way that's tolerant of these issues.

Our ensure block can raise a clearer error, but it can also clean up, and it can verify that some side effect actually worked. You can see above that it checks whether the file exists before attempting to delete it -- if the create_a_file call failed, it does nothing.

Anyhow, this long-winded explanation more or less amounts to a simplification of what you're asking for -- an ensure block that spans all resource classes.

service "foo" do
  action :start
  ensure do
    sleep 10
    # ensures the socket the service foo created is open
    TCPSocket.new('localhost', 8675309)
  end
end
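Nothing like this exists in Chef today, so here's a toy sketch of the semantics with invented names: the runner wraps the action in Ruby's own ensure, so the block fires whether the action succeeds or raises. (`ensure` is a Ruby keyword, so the sketch calls the DSL method ensure_block.)

```ruby
# Toy resource showing how a resource-level "ensure" could behave.
class ToyResource
  def initialize(name, &block)
    @name = name
    instance_eval(&block)
  end

  def action(&blk)
    @action = blk
  end

  # Stand-in for the proposed `ensure do ... end` (ensure is a keyword)
  def ensure_block(&blk)
    @ensure = blk
  end

  def run
    @action.call
  ensure
    @ensure && @ensure.call     # fires on success AND on failure
  end
end

log = []
r = ToyResource.new("foo") do
  action       { log << :action; raise "service failed to start" }
  ensure_block { log << :cleanup }  # always runs, like delete_that_file
end

begin
  r.run
rescue RuntimeError
  log << :rescued
end
# log is now [:action, :cleanup, :rescued]
```

The point is the ordering: the cleanup block runs before the exception propagates out of the resource, exactly as Ruby's method-level ensure behaves.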

But it's also general enough to gracefully handle failures:

cookbook_file "foo.tar.gz" do
  action :create
end

bash "untar foo.tar.gz" do  # `bash`/`script` take `code`; `execute` takes `command`
  code <<-EOF
    tar xzf foo.tar.gz
  EOF
  ensure do
    FileUtils.rm('foo.tar.gz') # always runs, even if the untar above fails
  end
end

A corollary feature would be some kind of state predicate to determine whether the ensure block is firing due to success or failure. These could be implemented as supersets of ensure:

service "foo" do
  action :start
  success do
    # check socket
  end

  failure do
    # maybe kill process if it got started anyway?
  end
end

(Which is the way things like jquery's ajax tooling work)

Or with a simple predicate you check yourself:

service "foo" do
  action :start
  ensure do
    if success
      # check socket
    else
      # maybe kill process?
    end
  end
end
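The success-predicate variant is easy to sketch too (again, all names invented): the runner records whether the action raised and exposes that to the ensure block.

```ruby
# Toy resource whose ensure block can branch on a success predicate.
class ToyResource
  attr_reader :success

  def initialize(&action)
    @action = action
  end

  def run(&ensure_blk)
    @success = false
    @action.call
    @success = true
  ensure
    instance_eval(&ensure_blk) if ensure_blk  # block sees `success`
  end
end

events = []

ToyResource.new { }.run do                    # action that succeeds
  events << (success ? :check_socket : :kill_process)
end

begin
  ToyResource.new { raise "boom" }.run do     # action that fails
    events << (success ? :check_socket : :kill_process)
  end
rescue RuntimeError
end
# events == [:check_socket, :kill_process]
```

Both branches of the ensure block get exercised: the succeeding resource checks its socket, the failing one takes the cleanup path before the exception propagates.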

It'd be nice if notifies worked here too -- so you could signal another resource to run depending on what happened.

Anyhow, I think this is considerably more general and solves many more use cases than this specific problem, while still handling it. Back to my hole. :)

-Erik



