[chef] Re: Push jobs vs SSH


  • From: Lamont Granquist < >
  • To:
  • Subject: [chef] Re: Push jobs vs SSH
  • Date: Fri, 06 Feb 2015 09:18:27 -0800

On 2/6/15 5:34 AM, Eric Horne wrote:
What's the difference between chef push jobs and knife ssh?

I'm researching a use case for chef in which we want highly controlled deployments to orchestrate across several different systems. The traditional "pull" doesn't fit the bill well because it is difficult to get that to coordinate properly across systems.

From what I've read so far, chef push jobs are essentially a daemon running on the remote servers that allows arbitrary (but pre-defined/whitelisted) execution of commands. Aside from perhaps a cleaner whitelisting concept (over forced-ssh), how is this different from (or better than) just using knife ssh?

I'm failing to see the benefits or use cases of chef push jobs over knife ssh. The documentation is lacking in terms of how it is intended to be used. Are push jobs better suited for different situations (and if so, what are those situations)?

Thanks for the help!

-Eric

Scaling is a big problem with ssh. Doing jobs over ssh really buckles when you start hitting a thousand servers as the target for one job. The protocol is slow and CPU hungry. At scale it can take hours to hit all the servers just from the ssh client overhead on the central box, which leads to having to build fan-out to distribute ssh connections over multiple source servers.

It's also unreliable. You have to wrap it with your own failure checks and timeouts, and then sometimes it just fails for no reason, because it's designed as an interactive login protocol first and foremost and its reliability as an "RPC" mechanism is poor -- so you need to detect that and retry individual failed hosts, or else just re-run the job you're doing multiple times.

The way that ssh trust tends to grow in an organization also tends to result in really bad security holes. I've seen bidirectional full meshes of ssh trust constructed across 25,000 servers so that any compromise on any one system would lead to login access on all the other servers (at a company with otherwise really good security -- but ssh trust was way too 'easy' and 'useful').
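To make the "failure checks and timeouts" point concrete, here is a rough sketch of the kind of wrapper you end up writing around plain ssh. It is not how knife ssh or push jobs work internally. The names `ssh_run` and `run_with_retries` are made up for illustration, and the 10-second connect timeout and three attempts are arbitrary assumptions. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess
from typing import Callable, Dict, Iterable


def ssh_run(host: str, command: str, timeout_s: int = 30) -> bool:
    """Run one command on one host over ssh; True only on exit status 0.

    Hypothetical helper: BatchMode=yes stops ssh from prompting for a
    password, ConnectTimeout bounds the connection attempt, and the
    subprocess timeout bounds the whole run.
    """
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
             host, command],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung session or a missing ssh binary both count as a failure.
        return False
    return proc.returncode == 0


def run_with_retries(
    hosts: Iterable[str],
    run_one: Callable[[str], bool],
    max_attempts: int = 3,
) -> Dict[str, bool]:
    """Try every host, then retry only the hosts that failed."""
    results = {host: False for host in hosts}
    pending = list(results)
    for _ in range(max_attempts):
        if not pending:
            break
        still_failing = []
        for host in pending:
            if run_one(host):
                results[host] = True
            else:
                still_failing.append(host)
        pending = still_failing
    return results
```

Something like `run_with_retries(hosts, lambda h: ssh_run(h, "uptime"))` would drive it. The loop is deliberately serial to keep the sketch short, which is exactly where the central-box overhead shows up once the host list gets long.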

At a small scale of 400 servers or so it can work fine: you don't see the anomalous failures often enough, the runs are short enough (and probably faster with current horsepower and lots of cores), and it's generally contained enough that you can stay on top of the security issues. Scale it out, though, and you'll eventually find the point where it's really not designed to do that. 'knife ssh' would also need a bunch more work to make it more reliable (I don't know of any other tool that wraps ssh and does it any better, though, since mostly people find that inflection point where ssh starts to be a really poor tool for the job and then ditch the protocol).

All that said, I don't know enough about what we've built to comment on the layers we've added on top. The ability to set a job to run and report back success if a quorum of servers succeeds (but failure if not enough servers succeed) is a vital higher layer that knife ssh certainly doesn't provide. It becomes important when you can just about guarantee that one or two of your servers may be down, but that's fine (if you're pushing software to 2,500 targets all at once, you'll find that Linux itself often just isn't reliable enough to guarantee that they're all up), yet you still want to know if half your deployments fail, because that's really bad.
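The quorum idea can be sketched in a few lines. This is only an illustration of the concept, not the push jobs implementation or its API. The function name `job_succeeded` and the threshold of 2,400 out of 2,500 are assumptions picked for the example.

```python
from typing import Dict


def job_succeeded(results: Dict[str, bool], min_successes: int) -> bool:
    """Report overall success if at least min_successes hosts succeeded.

    results maps each target host to whether its run succeeded.
    min_successes is the quorum, chosen by whoever starts the job.
    """
    succeeded = sum(1 for ok in results.values() if ok)
    return succeeded >= min_successes


# 2,500 targets with two hosts down: still a success with a quorum of 2,400.
mostly_up = {f"host{i}": True for i in range(2500)}
mostly_up["host7"] = False
mostly_up["host42"] = False

# Half of the deployments failing: that must be reported as a failure.
half_down = {f"host{i}": i % 2 == 0 for i in range(2500)}

print(job_succeeded(mostly_up, 2400))
print(job_succeeded(half_down, 2400))
```

The quorum is an absolute count here rather than a percentage, so the check needs no rounding. A percentage threshold would work the same way once converted to a count.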



