Fourth-generation configuration management

2021-08-03

A large portion of my career has been spent chasing ever-better configuration management. It started innocently enough: If I was going to run one of a thing in production, I was damn sure going to be able to run two of them. I became enamoured with Puppet and even briefly had the commit bit. I couldn’t start a project without writing the Puppet module I’d need to manage its production environment.

It was in that era I heard John Willis refer to Puppet and Chef as third-generation configuration management tools. At the time, I didn’t know what that meant; I don’t know why it stuck with me. I didn’t even know what he saw as the first two generations until I asked him. This week. The first generation was shell scripting and the second generation was Tivoli. Then CFEngine kicked off the third generation before Puppet and Chef took it mainstream.

Since then we’ve entered the glorious era of the container. I could spend the balance of this article celebrating the arrival of an agreeable packaging format or lamenting that it awkwardly mixes resource constraints with said packaging format. I won’t. Instead, we’ll be celebrating the filesystem image, a humble technology as important to modern containers as cgroups and the various Linux kernel namespaces. Filesystem images, some dressed as containers, some not, are how folks ship software in Kubernetes, ECS, Lambda, and most any cloud-native hosting environment.

But what of the servers that host those containers? At a lot of companies they’re still being managed by those very same third-generation configuration management tools. That’s a shame, and not because those tools are bad. Rather, we simply don’t have the problems those tools were meant to solve anymore.

Datacenter providers have mostly caught up to cloud providers, making servers everywhere transient and replaceable. When we have the option to start from a known state every time, idempotent convergence on a desired state from “any” state begins to feel needlessly risky, especially when we realize just how narrow the usable definition of “any” can be in practice. To add insult to injury, that convergence step is slow.
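To make that narrowness concrete, here's a toy sketch in Python. It is not how Puppet or Chef work internally; the server is modeled as a plain dictionary of paths to contents, and converge, replace, and the example paths are all names I made up for illustration. Convergence only touches the resources it knows about, so drift outside that set survives; replacement can't inherit drift at all.

```python
def converge(current: dict[str, str], desired: dict[str, str]) -> dict[str, str]:
    """Convergence, simplified: correct only the resources we manage."""
    result = dict(current)
    result.update(desired)  # managed paths are fixed; everything else is left alone
    return result


def replace(desired: dict[str, str]) -> dict[str, str]:
    """Replacement: throw the old server away and start from the known state."""
    return dict(desired)


# Hypothetical desired state and a server that has drifted over its lifetime.
desired = {"/etc/app.conf": "port=8080\n"}
drifted = {
    "/etc/app.conf": "port=9090\n",
    "/etc/cron.d/hotfix": "* * * * * root /tmp/fix.sh\n",  # nobody manages this
}
```

Running converge(drifted, desired) repairs /etc/app.conf and is safe to repeat, but the unmanaged cron entry is still there afterwards. replace(desired) never sees it.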

Fortunately, a fourth generation of configuration management is standing at the ready. Yes, the filesystem image.

Third-generation configuration management tools made it too easy to couple build-time concerns like which packages are installed with runtime concerns like service discovery. A filesystem image creates a hard separation between build time and runtime, forcing a more disciplined architecture. The filesystem image is acting as the “thin waist” or spanning layer of an hourglass model of configuration management. In “On The Hourglass Model” in Communications of the ACM, Micah Beck visualizes the Internet protocol suite with IP as the spanning layer. Below, Ethernet, WiFi, DOCSIS, and other protocols shepherd bits from here to there. Above, UDP, TCP, and every imaginable application-layer protocol run wild. Likewise, with the filesystem image as configuration management’s spanning layer, build time and runtime are decoupled.

At build time, we get to start from an empty directory (or at least a well-known state) every time so there’s no more need to accommodate the infinite ways real production systems may have mutated over their service lifetime. Therefore we no longer truly need the idempotent convergence properties of third-generation configuration management. Instead, dust off a Dockerfile, write a shell program, make some directories and fill them with files, snapshot a volume abstraction from our favorite cloud provider. Maybe run Puppet or Chef, but just this once. It’s no matter. The result is a filesystem image and that’s that.
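As a minimal sketch of “make some directories and fill them with files,” here’s a Python version that treats a tar archive as the filesystem image. That’s an assumption made for brevity; a real image might be a container layer, a disk snapshot, or something else entirely. The build_image and image_digest functions and the file paths are hypothetical. The point is that the build starts from nothing and normalizes ordering, timestamps, and ownership, so the same inputs always yield the same bytes.

```python
import hashlib
import io
import tarfile


def build_image(files: dict[str, bytes]) -> bytes:
    """Pack files into a tar archive, starting from nothing every time."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for path in sorted(files):  # fixed order, independent of the caller
            data = files[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            info.mtime = 0  # no build timestamps leak into the image
            info.mode = 0o644
            info.uid = info.gid = 0
            info.uname = info.gname = "root"
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def image_digest(image: bytes) -> str:
    """Content address for the image, handy for verifying copies."""
    return hashlib.sha256(image).hexdigest()
```

Building twice from the same inputs gives byte-identical archives, so the digest can stand in for the image when checking that a copy arrived intact.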

We’re probably only building our filesystem image once (well, once each time the build changes) but we’re going to use it potentially thousands of times. Copying filesystem images is fast and deterministic, two points scored relative to third-generation configuration management, which means adding capacity and replacing servers can be fast and deterministic, too. At scale, the time it takes to add capacity or replace servers is often the difference between a storm weathered and an outage that makes the evening news.

At runtime, after booting Linux from our filesystem image, we’re on the other side of the spanning layer. Here we can make entirely different technology choices tailored to dependency injection and service discovery. The tools we choose here can be fancy like those in the xDS family, familiar like the DNS, or simple like downloading configuration files from S3. They’re entirely unrelated to the technologies we choose to use at build time.
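Here’s a deliberately simple sketch of the runtime side, again in Python. It assumes runtime settings arrive as environment variables with an APP_ prefix, which is just one of many possible mechanisms and not something the tools named above require; BAKED_DEFAULTS, runtime_config, and the variable names are hypothetical. What matters is the direction of the dependency: the image carries defaults fixed at build time, and whatever is discovered at boot is layered on top without touching the image.

```python
import os
from typing import Mapping

# Hypothetical defaults that were written into the image at build time.
BAKED_DEFAULTS = {"log_level": "info"}


def runtime_config(
    environ: Mapping[str, str] = os.environ, prefix: str = "APP_"
) -> dict[str, str]:
    """Overlay settings discovered at boot onto the defaults baked into the image."""
    config = dict(BAKED_DEFAULTS)  # copy, so the baked defaults are never mutated
    for key, value in environ.items():
        if key.startswith(prefix):
            config[key[len(prefix):].lower()] = value
    return config
```

Swapping environment variables for a DNS lookup or a file fetched from S3 would change the body of this function and nothing about how the image is built.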

Earlier generations of configuration management lacked a spanning layer like the filesystem image and so were left without any separation between build time and runtime. Without that separation, and without transient and replaceable servers, we were forced to deal with all our build-time concerns repeatedly, at runtime.

The herds of Linux boxen beneath the most modern cloud-native container orchestration platforms and the most conservative monolithic applications need configuration management. It may not be necessary to continue thinking of these tools in generations but I do think it’s worthwhile to recognize that a lot of the complexity of previous-generation tools can be jettisoned by embracing an hourglass model with the filesystem image as the “thin waist” or spanning layer. I’m happy to be living in the fourth generation.