Richard Crowley’s blog

Developing Operability

I had the pleasure of being invited to speak at SuperConf in Miami about devops.  This is what I intended to say and, fortunately, approximately what I did say.  The slides that accompanied the talk are at http://rcrowley.org/talks/superconf-2012/index.html.

Hi, my name is Richard Crowley.  I’m a production engineer at Square.  My team’s responsible for the datacenter environment, network, hardware provisioning, operating system configuration, databases, backups, and monitoring.  I work there because I’m interested in software systems that work well in production.  These systems are operable — it’s easy for humans to control, change, and grow them as needed.  It takes a lot of work to turn your average prototype into an operable system.  It’s not a one-man job and it’s never the same from company to company, so what follows draws heavily from my experience and opinions, which have shown me over and over that developers should own the operability of their systems.  It’s time to get our hands dirty.

I consider myself both a developer and an operator.  I help design and implement software systems and care deeply about how they’re configured, deployed, and debugged in production.  After all, the vast majority of an application’s life is spent being maintained and operated in production.  We’re talking about years, here.  It’s foolish to ignore that future during the relatively shorter design and implementation phases that are typically measured in weeks or months.  Alternatively, I may just be OCD, but I think that’s less likely.  Let’s talk about how I arrived here.

I founded a company called DevStructure with Matt Tanase two years ago.  We started with a simple thesis: configuring servers was too hard.  With that, we went to work building configuration management tools optimized for software developers.  In a way, we were luring developers into caring about operations.  We quickly decided that a prerequisite to solving the configuration management problem was reducing the distance between development and production environments.  While a Mac laptop is technically a form of UNIX, it’s a far cry from the Linux servers we all use in production.  I see this distance as a huge problem for the operability of most of our production systems.  Matt and I built some tools and embraced others, Vagrant especially, that brought Linux cloud instances and virtual machines into the development workflow.  From that base, we launched our Blueprint configuration management tools.  Operator- and production-focused configuration management tools like Puppet and Chef were always on our radar and we sought to interoperate with both of them.  As is common for bootstrapped startups, I consulted with several companies on developing their configuration management strategy, sometimes using Blueprint, other times using Puppet.

Square just happened to be one of those companies and I’ve been there full-time since January.  Reliability is paramount at Square.  It’s way more important than performance, within reason.  If a merchant can’t accept a payment, we’ve failed.  Historically, environments that value reliability fear change.  We’re different.  We move fast — there are many engineering teams working on many products — but despite that it’s critical we don’t break things.  Matthew O’Connor, our director of engineering, says it’s rude to break the site.  I couldn’t agree more.  Let’s not be rude to our customers.

There’s a buzzword out there — devops — that I don’t care for too much.  It’s helpful, though, because finding a name that stuck has accelerated the idea that reducing the distance between development and production makes it possible to build more reliable systems.  The key, though, is that it’s really about reducing the distance between developers and operators.  It’s about people — developers and operators.  As the agile mafia said, we value “individuals and interactions over processes and tools.”

Perhaps it’s because developers have already blazed the agile trail, but to me the devops movement has always felt lopsided — it’s very operator-focused.  Case in point: its most visible sea change is the degree to which systems administrators are taking to high-level programming languages, learning to design software instead of “writing scripts,” and embracing agile development methods.  We’re focusing on people, on working software, on customers, and on our response to change.

The goal, after all, is to create business value.  I’m as guilty as anyone here in getting too absorbed in compiler flags, sysctls, or coding style and losing sight of the reality that I work for a company.  I have to remind myself that customers never have and never will care about load average.  We all need to focus on creating business value, and as developers and operators...

...we should measure that value.  Someone, I think either John Vincent or Theo Schlossnagle, likes to say, “If you don’t monitor it, it doesn’t exist.”  And if it doesn’t exist, well, what are you working on?  A recent conversation with one of the dozen or so Eri[ck]s at Square gave us a good interview question: if you could monitor only one thing, what would it be?  We decided we would monitor whether the site is on the Internet.  An oversimplification, to be sure, and one that doesn’t help much with debugging, but it’s a start.  Undoubtedly, you’ll expand from there and when you do, try to create metrics that answer questions.  Remember that, by and large, these metrics are for human consumption.  Responding to changes in metrics should be part of every company’s culture.
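
Answering that interview question with a computer is about as simple as monitoring gets.  Here’s a minimal sketch in Python, assuming a hypothetical health-check URL; in practice the check would live in whatever already pages a human:

    import sys
    import urllib.request

    URL = "https://example.com/health"  # hypothetical health-check endpoint
    TIMEOUT = 5                         # seconds before we call it "down"

    def site_is_on_the_internet(url=URL, timeout=TIMEOUT):
        """Return True only if the site answers with a 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except OSError:
            # DNS failures, refused connections, timeouts: all count as "off".
            return False

    if __name__ == "__main__":
        sys.exit(0 if site_is_on_the_internet() else 2)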

You see, devops is a cultural thing.  Reducing the distance between developers and operators requires moving past the historic us-versus-them attitude.  Traditional systems administration consumes less time these days, thanks to innovations in managed hosting, virtualization, and provisioning APIs like EC2.  But operators aren’t taking more vacation; we’re contributing to the software development process.  On the flip side, we as developers are hopefully doing more than checking in code and going home.  Helping each other out leads to trust.  Operators have to trust that developers aren’t rolling out new features just to ruin their sleep.  Developers also have to trust that operators don’t say “no” because they hate them.

DevOps is certainly first and foremost a cultural solution to a cultural problem between two teams with different incentives.  But we are all engineers, we’re makers of tools, and basically we can’t help ourselves.  Thus, the tools we write codify the culture we create.  Configuration management, deployment, and monitoring tools are the most visibly affected software genres but it extends further to collaboration and project management, documentation, and even our office jukebox.  Much like writing clarifies the author’s beliefs, working software makes fuzzy human interactions concrete and unambiguous.

If there’s a common theme to all these tools, it’s automation.  We automate to reduce the number of errors and inconsistencies present in our infrastructures.  We automate to make ourselves more efficient.  We can’t leverage the raw power of thousands of computers if it takes us six months to configure them all.  My rule of thumb here is that you should never perform a manual process a third time.  This works at many layers of abstraction: it’s OK to manually install an operating system once to get your feet wet, but the second time you should be using PXE boot and Kickstart or preseed files to automate the installation.  Higher up the stack in the realm of developers, it’s fine to manually craft a software build process that’s just right, but subsequent builds should be performed automatically, perhaps through a continuous integration tool like Jenkins or Buildbot.

Through all this automation, every layer of our infrastructure is being defined by code instead of wiki pages full of prose and syntax errors.  But it’s more than just working code.  There’s a subtle distinction here I want to make explicit: we’re not talking about “scripts” anymore; we’re not talking about cat-ing Bash history into doit.sh; we’re talking about well-designed software systems that provide abstractions that make developers and operators more powerful.

The arrival of these distinctly computer-science concepts in operations — abstraction, encapsulation, and polymorphism, to name but a few — stands in stark contrast to what I see as a distinct lack of traditional software developers getting serious about operations.  Systems administration is no longer a cowboy culture of Perl and shell.  One-liners grown out-of-control are being tamed.  Operations is being driven by well-designed software systems built by systems administrators looking for a better way.  So I have to ask: where are my fellow developers?

This actually brings us back to DevStructure and the Blueprint project.  Matt and I ask developers to work in realistic development environments — a real operating system, real web servers, and real database servers.  No toys.  No substitutes.  In return Blueprint can, in one command, enumerate all the packages installed by all the package managers, tar up software installed from source code, collect system configuration files that have changed, and learn which services are running.  In short, Blueprint makes ad-hoc configuration repeatable.  But there’s another benefit to developers working in realistic development environments: when we do, we start to think just a little more like operators and the distance between the two camps shrinks.

As the distance shrinks, we start to think about the code we write in terms of the service it provides.  And that service has to survive the harsh realities of production — rapidly growing traffic, out-of-memory errors, limited disk and network I/O bandwidth, and all the possible failures other services in the infrastructure may experience.

This is service-oriented architecture.  All sufficiently large engineering organizations go through the transformation from a monolithic application to SOA at some point.  Don’t be intimidated, though, because if you squint just right, everyone starts with SOA, too: your application is a service which relies on your database service.  I’m not suggesting that everyone in the audience with a monolithic Rails app should immediately start breaking it apart.  Premature refactoring is a sure way to factor out the wrong functionality and drive yourself crazy.  Instead, design software that’s easy to refactor.

One strategy which I think accomplishes this is to begin with a library-oriented approach within your monolithic application.  Package logical components of your application independently — literally as separate gems, eggs, RPMs, or whatever — and maintain them as internal open-source projects.  This strategy works fantastically for GitHub and I can report it has made a difference at Square, too.  This approach combats the tightly-coupled spaghetti so often lurking in big codebases by giving everything a Right Place in which to exist.  It encourages reuse within your application and leaves the door open for reuse across services in the future.  But perhaps the biggest benefit, and one which I didn’t fully appreciate until very recently, is the ease with which a coworker can familiarize themselves with small libraries and begin to contribute.
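
For what it’s worth, here’s roughly what that looks like on the Python side of the gem/egg divide: a minimal, hypothetical setup.py for one component carved out of a monolith.  The names are made up; the point is that the component gets its own version, its own tests, and its own README, just like an open-source project:

    # setup.py for a hypothetical internal library extracted from a monolith.
    from setuptools import setup, find_packages

    setup(
        name="payments-ledger",              # hypothetical component name
        version="0.3.1",                     # versioned and released independently
        packages=find_packages(exclude=["tests"]),
        install_requires=[
            "requests>=2.0",                 # dependencies declared explicitly
        ],
    )

The application then lists payments-ledger in its requirements like any other dependency, and a coworker can read, test, and patch it in isolation.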

Let’s talk about deployment.  Whether you’re deploying a monolithic application, a set of libraries, or a set of services, the act of deploying new software can be frightening.  It doesn’t have to be.  Conventional wisdom says operators fear change, so we block deploys for as long as we can.  Eventually, though, the buildup of new features and improvements is so compelling that the suits demand a deploy.  Calamity frequently ensues because the team frankly isn’t very good at doing deploys.  And how do we get better at something?  Practice.  Automate your deploy process.  Make deploys cheap so you can have lots of small deploys.  Now operators don’t fear the armageddon following every deploy and developers aren’t frustrated that their feature is languishing in staging.

Remember, this is still about creating business value and even though the lines have blurred, operators still increase value by providing a high-performance, highly-available service to customers and developers still increase value by releasing new features.  Of course...

...not one of those features creates any business value until it’s deployed.  As always, when we suspect we’re creating business value, it’s our job, and usually our first instinct as engineers, to measure it.

We know the most important metric is whether your site is on the Internet.  Beyond that, monitor what makes your business tick: signups, uploads, downloads, 400s, 500s, transactions, cancellations, cash money, and so on.  Note well that I haven’t mentioned CPU, load average, or any other system metric.  System metrics are invaluable given the appropriate context, but without that context you can’t say that 70% CPU or a load of 47 is actually a bad thing.  If, however, these correlate strongly with problems experienced by users, you’ve got a promising avenue to take in debugging.  Developers should be intimately familiar with all of these metrics and use them as their eyes in production when diagnosing problems.
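
To make that concrete, here’s a minimal sketch of counting business events using StatsD’s counter protocol, which is nothing more than a small UDP datagram.  The daemon address and metric names are assumptions; a real application would more likely use an existing StatsD client library:

    import socket

    STATSD = ("127.0.0.1", 8125)  # wherever your StatsD daemon listens
    _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def count(metric, value=1):
        """Increment a counter; StatsD aggregates and forwards it to Graphite."""
        _sock.sendto("{0}:{1}|c".format(metric, value).encode(), STATSD)

    # Sprinkle these next to the business events themselves.
    count("signups")
    count("payments.captured")
    count("http.responses.500")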

Let’s talk again about developers getting involved with operations.  What follows is a short and highly unscientific selection of operational concerns most developers hand-wave away.  I think developers should have answers to these questions before services reach production.

How is this deployed?  There are really two questions here: the cultural and the mechanical.  The cultural part is about defining the process you use to deploy software.  How often?  Who initiates the deploy?  How can you tell it was successful?  The mechanical part should be a reflection of that culture.  If you have test-driven development tattooed on your chest, then you may think deploying every green CI build automatically is the way to go.  Many teams decouple testing from deployment and prefer to use SSH-based tools like Capistrano and Fabric directly to drive their deploys.  Some companies want their deploy history to be auditable, sometimes due to government or industry regulations, and they tend to tag deploys in version control and deploy with RPMs or Debian packages.
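
As a sketch of the SSH-based approach, here’s roughly what a deploy task looks like using the Fabric 1.x API.  The host names, paths, and commands are all hypothetical; the shape, not the specifics, is the point:

    from fabric.api import cd, env, run, sudo, task

    env.hosts = ["app1.example.com", "app2.example.com"]  # hypothetical hosts

    @task
    def deploy(ref="origin/master"):
        """Fetch the requested ref, install dependencies, restart the app."""
        with cd("/srv/app/current"):              # hypothetical install path
            run("git fetch --all")
            run("git checkout --force %s" % ref)
            run("bundle install --deployment")    # or pip install -r requirements.txt
            sudo("service app restart")           # hand off to the process supervisor

Run as fab deploy or fab deploy:ref=origin/some-branch, it does the same thing on every host, which is exactly what makes deploys boring enough to do often.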

How is this rolled back?  All software has bugs.  All test suites miss bugs.  These are facts of life.  John Allspaw suggests, then, that we minimize our mean time to respond to these failures even at the expense of our mean time between failures.  As always, there are several common strategies.  Many companies prefer to have a fast-path to redeploy the previously-deployed version of the application, thus recovering from an incident.  Tools like Capistrano all but assume this is what you’re doing and deploy into timestamped directories with a symbolic link to the current release so rollbacks are easy.  Other companies prefer only rolling forward, meaning a new deploy must be made to address the failure.  Sometimes this is as easy as reverting the offending commit and deploying again.  Other times, especially when database migrations are involved, rolling forward is the only way.  Having the ability to disable features entirely without doing a full deploy, a technique known as feature flagging, takes a bit of the time pressure off of the developers charged with finding a permanent solution in a roll-forward situation.
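
Here’s a minimal sketch of that timestamped-releases layout and the symlink flip that makes rolling back cheap.  The paths are hypothetical, and a real tool would restart the application after moving the link:

    import os

    RELEASES = "/srv/app/releases"   # one timestamped directory per deploy
    CURRENT = "/srv/app/current"     # symlink to whichever release is live

    def rollback():
        """Point the 'current' symlink at the release deployed before the live one."""
        releases = sorted(os.listdir(RELEASES))         # timestamps sort lexically
        live = os.path.basename(os.path.realpath(CURRENT))
        index = releases.index(live)
        if index == 0:
            raise RuntimeError("nothing older to roll back to")
        previous = os.path.join(RELEASES, releases[index - 1])
        temporary = CURRENT + ".tmp"
        os.symlink(previous, temporary)
        os.rename(temporary, CURRENT)                   # atomically replace the old link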

How is this process (re)started?  A special concern both when deploying and rolling back is how quickly old code can be swapped out for new code.  You can’t deploy frequently if each deploy brings along a 30-second outage.  In such cases, or when the deploy process takes a long time, rolling deploys of one or a few servers at a time amortize the cost and hide it from users.  Deploys may be expensive in terms of database connections or briefly elevated network traffic, so deploying one server at a time can lessen the impact.  Regardless of whether you perform all-at-once or rolling deploys, take care to gracefully finish requests that are in flight so you aren’t guaranteed to serve a few 500s on every deploy.
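
The finish-what’s-in-flight half of that deserves a sketch.  Here’s a minimal Python worker, with an assumed port and a trivial handler, that stops accepting new requests when it receives SIGTERM but lets the request it’s already serving complete:

    import signal
    from http.server import BaseHTTPRequestHandler, HTTPServer

    shutting_down = False

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok\n")

    def handle_sigterm(signum, frame):
        global shutting_down
        shutting_down = True              # finish the current request, then stop

    if __name__ == "__main__":
        signal.signal(signal.SIGTERM, handle_sigterm)
        server = HTTPServer(("0.0.0.0", 8080), Handler)
        server.timeout = 1                # wake up regularly to check the flag
        while not shutting_down:
            server.handle_request()       # serves at most one request per iteration
        server.server_close()             # any in-flight request already finished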

How is this process supervised?  There are two high-level options here: direct parent supervision and periodic supervision.  Direct parent supervision comes via tools like daemontools, Runit, and Upstart.  They operate by forking and execing your process as a child.  Then the parent blocks waiting on the child to exit, at which point the cycle begins again with the parent forking.  The Achilles’ heel of direct parent supervision is that it can’t be used to supervise processes like Nginx or Unicorn that perform a certain style of zero-downtime restart.  These processes restart by forking and then executing a new copy of themselves; thanks to good ol’ UNIX semantics, the child inherits the listening file descriptor and resumes accepting connections on it.  Once operational, the child signals the parent and the parent exits.  The child is then reparented to init and the direct parent supervision relationship is broken.  The periodic variety, like Monit, checks the system’s process table or listening sockets every few seconds, starts the process anew if necessary, and goes back to sleep.  These tools can incur significantly more downtime in case of failure because they may sleep several seconds before recognizing a process has exited unexpectedly.
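
The direct-parent style is simple enough to sketch in a few lines; the command being supervised here is just a stand-in:

    import os
    import sys
    import time

    COMMAND = ["sleep", "30"]                # stand-in for your actual service

    def supervise(argv):
        """Fork, exec the child, block until it exits, and start it again."""
        while True:
            pid = os.fork()
            if pid == 0:                     # in the child: become the service
                os.execvp(argv[0], argv)
            _, status = os.waitpid(pid, 0)   # in the parent: block on the child
            print("child exited with status", status, file=sys.stderr)
            time.sleep(1)                    # throttle a crash loop

    if __name__ == "__main__":
        supervise(COMMAND)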

What if two versions are live at once?  This may seem like a fringe concern but unless you take a complete outage for every deploy, you have this problem.  Solving it means ensuring new versions of the code can speak to old versions and vice versa.  The greater your tolerance here, the more you can do with A/B testing, canary deploys, and beta launches.  Database schema migrations present by far the most common challenge to running two versions at once.  It takes only three steps to get it right.  First, you must deploy a version of your application which tolerates both the new and the old schema.  Then you’re free to do the database migration and any data backfills that are necessary.  Finally, you’re free to remove special cases for the old schema.
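
A minimal sketch of that first step, with hypothetical column names, might look like this: application code that reads and writes both representations, so either version can be live while the migration and backfill run:

    # Imagine we're splitting a single "name" column into "first_name" and
    # "last_name".  Rows here are plain dicts standing in for ORM objects.

    def full_name(row):
        """Read from whichever schema version this row carries."""
        if row.get("first_name") is not None:                # new schema
            return "{0} {1}".format(row["first_name"], row.get("last_name", "")).strip()
        return row["name"]                                    # old schema, still tolerated

    def set_name(row, first, last):
        """Write both representations so old and new readers stay in agreement."""
        row["first_name"], row["last_name"] = first, last     # new schema
        row["name"] = "{0} {1}".format(first, last)           # keep old code working

Once the backfill is complete and no old code is live, the fallbacks and then the old column can go.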

What metrics are important?  I don’t know your answer to this question because I don’t know what your business values.  You know, though, and once you’ve settled on the “what,” it’s time to determine the “how.”  I can’t recommend Graphite highly enough as the final destination for all your time-series metrics, but it leaves collection as an exercise for the reader.  Metrics that naturally map to regular intervals can be sent to Graphite via a plaintext protocol.  Metrics that don’t emit on a regular interval, such as page views and their HTTP status codes, database or other service query times, and so on, may be aggregated into Graphite’s time series by Etsy’s StatsD.  And system metrics like CPU, load average, memory use, and I/O are easily gathered by collectd and its bajillion plugins, and forwarded to Graphite.  Outside of the mechanics of gathering system metrics, reflect on and document how you expect them to change as your business grows because they’ll be key to your capacity planning.
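
For reference, here’s a minimal sketch of that plaintext protocol: one line of “metric value timestamp” per sample, sent to the carbon listener, which conventionally lives on TCP port 2003.  The host and metric name are assumptions:

    import socket
    import time

    CARBON = ("graphite.example.com", 2003)   # hypothetical carbon-cache address

    def send_to_graphite(metric, value, timestamp=None):
        """Send a single sample using Graphite's plaintext protocol."""
        line = "%s %s %d\n" % (metric, value, int(timestamp or time.time()))
        with socket.create_connection(CARBON, timeout=5) as sock:
            sock.sendall(line.encode())

    # For example, a metric that maps naturally to a regular interval, sent from cron:
    send_to_graphite("shop.signups.daily", 42)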

The theme that runs through all of these operational concerns, and the myriad I didn’t address, is that the best solutions tend to involve collaboration between development and operations, which in turn tends to produce a specific contract between application and platform.  Standards like CGI, Rack, and WSGI paved the way but don’t extend far enough.  Heroku provides a standard operating environment for all sorts of applications built from environment variables, file descriptors, and a healthy dose of documentation that informs an application of the addresses and credentials of related services.  At Square we’re standardizing how we deploy services through a library called Jetpack, which is available on GitHub, that deploys and supervises JRuby-based services in a consistent manner.
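
A minimal sketch of the environment-variable half of such a contract, assuming Heroku-ish variable names: the platform sets the addresses and credentials, and the application reads them instead of hard-coding them:

    import os
    from urllib.parse import urlparse

    def database_config():
        """Parse a DATABASE_URL like postgres://user:secret@db.internal:5432/app."""
        url = urlparse(os.environ["DATABASE_URL"])   # provided by the platform
        return {
            "host": url.hostname,
            "port": url.port or 5432,
            "user": url.username,
            "password": url.password,
            "database": url.path.lstrip("/"),
        }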

We’ve covered a lot of ground here but these tools and these questions should be very familiar territory to the systems administrators you work with.  Talk to them.  Simply getting to know the other guys is often enough to set developer and operator culture on a path towards mutual trust that’s critical to a business’ success.  No one is better suited than the developers to answer the questions we’ve asked, to build deploy tools, and to collect metrics on all parts of your business.  The operability of your systems depends on it.

If you want to help build infrastructure to revolutionize payments, talk to me or head to squareup.com/jobs.

That’s all.  Thank you.