Apprentices of Scale

By Richard Crowley

I had the pleasure of being invited to close out the PagerDuty Summit on September 7, 2017.  This is what I intended to say.  This is a video of what I said.

Hi, my name is Richard Crowley and I work at Slack.  In the past three and a half years at the company I’ve grown and managed our Operations team from two to thirty-five, helped handle several orders of magnitude growth in traffic, and commanded my fair share of incidents.  It’s a pleasure to be here to reflect on our craft with all of you.

Slack is a messaging product that brings together all the people and pieces you need to get work done.  It can replace internal email, build bridges between teams that don’t communicate well, and so much more.  And whether it’s engineering, incident response, sales, support, marketing, strategy, or goofing off, it’s business-critical.  We’re hiring and you should come join us.

Enough about that, though.  I came here today to talk about all our apprentices.  The craft of operations engineering is learned to a remarkable degree through apprenticeship.  And yet, we don’t often use that word to describe how we turn junior engineers into senior engineers.  It’s true, though:  The tools and techniques of our trade are taught and learned on the job by practitioners whose primary responsibility is to their customers.  We owe it to our businesses to make these apprentices successful.

My very first full-time job was working for Cal Henderson, now the CTO of Slack, at Flickr.  My responsibility to our customers was to deliver a high-quality upload experience for Windows and Mac users with lots of photos to upload.  Delivering this required learning a lot of technologies that were brand-new to me, armed with only the most general college education.  Fortunately for me, Cal and the rest of the team there were willing and able to guide me through my time as their apprentice.

Note that none of these things featured in the college curricula of that time because the tools and techniques that are relevant to our businesses change very rapidly, and that predisposes us to build a foundation in college and to organize around apprenticeship in our companies.  Consider the banking industry for contrast.  College curricula and certifications combine to reliably train folks who enter the field.  Organized continuing education designed around predictably-timed changes to regulations ensures that training remains up-to-date.  Neat and tidy and entirely inapplicable to web operations.

While the rate of change pushes us towards learning on the job, we organize our companies for and around rapid growth and that would seem to dampen the impact of apprenticeship by stretching available mentors too thinly.  And in a medieval or Darth Vader-esque one master, one apprentice form, that’s true.  But apprenticeship itself is already scaling, quietly and sometimes coincidentally, within all our companies.

When you strip away the historical economics of apprenticeship, what you’re left with is a symbiotic relationship between someone having skills and someone desiring those skills.  This is about junior engineers becoming senior engineers and about senior engineers consciously and thoughtfully helping them do so.  I’d like to spend the rest of my time today offering you my thoughts on how to give the most to our apprentices.  I’ll start by recognizing latent scalable learning opportunities hiding in well-established operations engineering practice.  I’ll talk about how training can be added to build a stronger foundation. I’ll address the need to specialize as our companies scale.  Finally, I’ll talk about some future work I’ll be doing at Slack.

No matter how informal, the macro features of all our engineering processes are likely pretty similar.  There’s a design phase wherein we take advantage of talk being cheap.  Then we actually develop software.  We probably test it in various ways as it’s being developed.  We test “at scale” and hopefully practice responding to incidents we can anticipate.  We write runbooks and prepare for production service.  Later, we encounter defects, respond to the incidents they cause, and have postmortems in an effort to learn and make new, better mistakes next time.

Each of these activities offers an apprentice an opportunity to learn if they happen to participate.  But with the tiniest of formalities, each of these activities can offer that same opportunity to many apprentices for perhaps years to come.  That’s the kind of scale we need if we’re going to consistently turn junior engineers into senior engineers.

Slack’s Chief Architect hosts a design workshop each week open to anyone with a half-baked idea.  It’s a standing meeting with no end date and often no agenda until the day before or even the morning of; it’s all of the things they say meetings shouldn’t be.  And yet it’s regularly attended by engineers of all levels with all sorts of specialities week in and week out.  It is, to me, equal parts opportunity to learn about unfamiliar parts of our product and opportunity to share patterns I can match, counsel caution, or encourage boldness.  In just the last few months this group has gone deep with search and sort implementations across Slack clients, role-based access controls, graph-structured databases, the architecture of our next-generation voice and video calls server, message order guarantees made by our API, and cache coherence protocols.  These discussions, the design documents circulated beforehand, and the notes, photographs, and recordings that survive after offer an apprentice the opportunity to revisit the design process of a piece of software from its very beginnings.

Throughout the development process, whether creating the new or maintaining the old, code review standards, linters, and test suites should serve not only to ensure changes are safe but to constantly and clearly reinforce what quality means to your company.  Reviewers should critique code too clever to be maintained later.  Linters should ensure legacy APIs are phased out.  These small, automated course corrections help us be a little better every day.  Most importantly, they show our apprentices what our companies value in the smallest details.
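To make that concrete, here’s a minimal sketch - hypothetical function names, not Slack’s actual linter - of the kind of automated check that flags calls to a legacy API so reviewers don’t have to catch them by hand:

```go
// A contrived lint check: parse a file and report calls to a deprecated
// function.  legacyFetchUser and fetchUser are made-up names for illustration.
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

const src = `package demo

func handler() {
	legacyFetchUser(42) // should be flagged
	fetchUser(42)
}`

func main() {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "demo.go", src, 0)
	if err != nil {
		panic(err)
	}
	// Walk the syntax tree and flag every call to the legacy API.
	ast.Inspect(f, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		if ident, ok := call.Fun.(*ast.Ident); ok && ident.Name == "legacyFetchUser" {
			fmt.Printf("%s: legacyFetchUser is deprecated; use fetchUser\n", fset.Position(call.Pos()))
		}
		return true
	})
}
```

Wired into continuous integration, a check like this turns a migration from tribal knowledge into something every apprentice sees on every diff.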

As changes near their production debut, we write runbooks that cover anticipated operational tasks.  The most crucial thing about runbooks is that the entire team buys into the necessity of reading and following the runbook every time they execute it, no matter how well they think they know it.  That is the only way to ensure that changes to the runbooks will be reflected in all future executions.  The downside of putting such faith in runbooks is the pressure it places on them to be airtight.  Early in my tenure at Slack I wrote our first MySQL runbook and you can tell where this story is going already.  The night after the first time someone else followed my runbook, we were paged because replication between the two sides of shard 7’s master-master pair had broken and could not recover by itself.  We discovered what looked like rampant data corruption that caused replication to break repeatedly, leaving thousands of records inconsistent between the two sides.  After several rounds of recovery and relapse, we discovered the egg on my face.  Restoring a MySQL backup requires placing all the InnoDB files, starting replication, and, most importantly, waiting for replication to “catch up.”  If that new database enters service without catching up, INSERT statements routed to it that should receive duplicate key errors won’t, and replication will break later when the logically earlier INSERT statement is replicated.  It’s chaos!  And all thanks to my careless assumption that folks would know to wait for Seconds_Behind_Master = 0.  How would they know?  Needless to say, I improved the runbook.  The lesson is that in order for our apprentices to succeed with and learn from our runbooks we have to remember what it’s like to not know when we’re writing them.
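To spell out the fix: a restored replica shouldn’t enter service until SHOW SLAVE STATUS reports Seconds_Behind_Master = 0.  Here’s a minimal sketch of that wait in Go - a hypothetical helper, not our actual runbook tooling, with a placeholder connection string and polling interval:

```go
// Poll SHOW SLAVE STATUS until Seconds_Behind_Master reaches 0 before letting
// a freshly restored replica take traffic.  The DSN and interval are placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strconv"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// secondsBehindMaster returns the replica's reported lag, or an error if
// replication isn't configured or the SQL thread isn't running (NULL lag).
func secondsBehindMaster(db *sql.DB) (int64, error) {
	rows, err := db.Query("SHOW SLAVE STATUS")
	if err != nil {
		return 0, err
	}
	defer rows.Close()
	if !rows.Next() {
		return 0, fmt.Errorf("replication is not configured on this host")
	}
	cols, err := rows.Columns()
	if err != nil {
		return 0, err
	}
	// SHOW SLAVE STATUS has dozens of columns; scan them all as raw bytes.
	vals := make([]sql.RawBytes, len(cols))
	ptrs := make([]interface{}, len(cols))
	for i := range vals {
		ptrs[i] = &vals[i]
	}
	if err := rows.Scan(ptrs...); err != nil {
		return 0, err
	}
	for i, col := range cols {
		if col != "Seconds_Behind_Master" {
			continue
		}
		if len(vals[i]) == 0 { // NULL: the replication SQL thread isn't running
			return 0, fmt.Errorf("replication is broken or still starting")
		}
		return strconv.ParseInt(string(vals[i]), 10, 64)
	}
	return 0, fmt.Errorf("Seconds_Behind_Master not reported")
}

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(127.0.0.1:3306)/") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	for {
		lag, err := secondsBehindMaster(db)
		if err != nil {
			log.Fatal(err)
		}
		if lag == 0 {
			fmt.Println("caught up; safe to enter service")
			return
		}
		fmt.Printf("%d seconds behind master; waiting\n", lag)
		time.Sleep(5 * time.Second) // placeholder interval
	}
}
```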

And you better believe we had a postmortem for that incident.  I’m not going to spend a lot of time talking about how Slack runs postmortems.  This is, after all, PagerDuty’s conference and they’ve written and published excellent documentation of their incident response process, including how they run postmortems.  Our processes are similar except at Slack more of the administrative responsibilities belong to the Incident Commander and we use Slack a whole lot more.  Postmortems are gold to engineers new to a service, new to the team or company, or new to operations engineering as a whole.  In fact, the most effective way we’ve found to give interns and folks testing the waters from other engineering teams a glimpse of what operations engineering is like is to whiteboard from a postmortem back to the design and architecture of the services involved.

Just doing what we were doing anyway with our engineering and incident response processes and calling it apprenticeship doesn’t cut it, though.  We’ve talked about scaling these existing artifacts of apprenticeship and now we need to address training available on the tools and techniques of our trade.  Training programs that scale and endure are a crucial support to our apprentices, not to mention a far more efficient strategy than letting everyone figure it out for themselves.

Slack’s formal training curriculum in a classroom setting evolved from two independent histories.  The first began as a productive way to spend a few down minutes during onboarding - why not start explaining how Slack worked right then and there?  It was a hit.  An engineer named Cyrus ran with this and formalized it into an hour-long class he called Slack Mostly Works.  This training, by surveying the life of a message, a notification, a file, and more, provides a lot of context, begins to demystify a lot of jargon, and makes new folks feel like they belong.  It has been wildly successful, now has a rotation of several teachers, and this “101” class has over the past couple of years spawned a whole series of deeper “200-level” classes that are as useful to veterans as to new hires.  The curriculum now covers everything from Slack’s use of specific technologies like MySQL to specific parts of our product offering like Enterprise Grid or our Platform.

The other history of Slack’s formal training curriculum starts with our huge, global Customer Experience team.  I spoke with Slack’s head of Learning & Development about her strategy in creating the curriculum that trains hundreds of agents around the world and how to apply that strategy to any department.  To her, the content of the training is somewhat secondary to learning the social norms of the team and company.  That’s called enculturation and when you’re hiring quickly, it’s critical, whether you’re growing from two to ten or hundreds to thousands.  Job one is to instill a baseline of belonging and pride in working at Slack and we’re not going to get that by letting everyone figure it out for themselves.  With that, the next priority is to ensure the explicit curriculum - the content of the classes - matches the implicit curriculum - the practice, conventional wisdom, and gossip informally disseminated throughout the company.  What we teach must be reinforced by the experience folks have on the job with their coworkers.  In other words, training material must be descriptive, not aspirational.

When they’re aligned with reality, classes are a scalable, controllable way to give folks a foundation.  That foundation starts with context like why certain technologies were chosen, product constraints that long ago influenced system designs, and ideas that didn’t pan out and would otherwise be lost to history.  To really be useful, though, the apprentice needs to learn the jargon; it’s not enough to know everything about MySQL if you’re lost as soon as I start talking about the Slack-specific cases of “mains” versus “shards.”  And finally, by virtue of being instructed by a real human being, classes create at least one point of contact that the apprentice can use to continue their education once class is out.  Altogether, that’s a foundation we can build on.

So now our apprentices are trained.  It’s time to put that training into practice.  In high school, I was introduced to the concept of mastery learning.  The chemistry teacher, who’d also taught my parents, insisted that every student achieve a perfect score on every homework assignment.  If you made a mistake, you had to do the problem over again to correct it.  He also allowed students to retake tests until they achieved a score they were happy with.  Repetition led to mastery.  Practice makes perfect.  This is as true with the tools of our trade as it is riding a bike or learning stoichiometry.

The thing is, at a growing company in a rapidly changing industry, there isn’t a lot of time for repetition.  In fact, there’s rarely time for any engineer to remain proficient in all areas of the infrastructure.  So we specialize.  I don’t think I’m the right person to talk about the profound effects of specialization on the places we work, the products we make, or the services we provide, in general, but I do want to discuss one form of specialization particularly relevant to operations engineering.

As Slack’s traffic and engineering team grew we witnessed - well, caused - a Cambrian explosion of software.  We added a cache tier, decomposed the WebSocket stack to support shared channels and channels with millions of members, built a “WebSocket CDN” to cache user objects at the edge, and a lot more.  All this performance- and reliability-enhancing change has been overwhelmingly positive and yet it has a downside, too, and it’s that Slack breaks in fundamentally different ways than it used to.  That demands specialized knowledge to debug.

Case in point:  Slack’s job queue infrastructure is in a transitional period and uses Redis and Kafka to broker jobs.  This new system involves some Go.  Go’s and PHP’s JSON encoders disagree on which characters to escape by default.  This difference of opinion meant that the same job encoded by Go and by PHP could string-compare as not equal.  Our job queue uses equality comparison when removing completed jobs from the list of processing jobs.  And that list is really a list thanks to a compromise we made with Redis.  The imprecise equality comparison caused jobs to leak and remain forever in the processing list.  Eventually, there were quite a few such jobs and the O(n) search to complete a job ate up enough CPU time to slow enqueues, since they didn’t flow exclusively through Kafka at the time.  When enqueues are slow, some web requests are slow.  And when some web requests are slow, HHVM threads become hot commodities.  And when HHVM threads become hot commodities, every request gets slow.  So if you ever wondered why Slack was slow on July 25, that’s why.
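To make the encoder disagreement concrete, here’s a contrived Go-only sketch - the payload is made up and this isn’t our job format.  Go’s encoding/json escapes <, >, and & by default while PHP’s json_encode instead escapes / by default, so the same logical job can serialize to two different strings, and a byte-for-byte equality check treats them as different jobs:

```go
// Demonstrate that one logical payload serializes to two different strings
// depending on escaping choices, so string equality fails even though the
// jobs are "the same."  The payload here is purely illustrative.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	job := map[string]string{"url": "https://example.com/run?a=1&b=<2>"}

	// Default Go behavior: HTML-safe escaping of <, >, and &.
	escaped, _ := json.Marshal(job)

	// The same job with HTML escaping disabled.
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	_ = enc.Encode(job)
	unescaped := strings.TrimSpace(buf.String()) // Encode appends a newline

	fmt.Println(string(escaped))              // {"url":"https://example.com/run?a=1\u0026b=\u003c2\u003e"}
	fmt.Println(unescaped)                    // {"url":"https://example.com/run?a=1&b=<2>"}
	fmt.Println(string(escaped) == unescaped) // false: equal jobs, unequal strings
	// PHP's json_encode would differ again by escaping "/" as "\/" by default.
}
```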

That took specialized knowledge to debug.  Lots of things nowadays take specialized knowledge to debug.  Specialization creates space and time for the repetition that leads to mastery.  And when you’re the specialist, or apprenticing to be the specialist, it’s only fitting for you to be on-call to support the services you seek to master.

Slack first dipped a toe into specialized on-call rotations by creating a long rotation for developers to complement the long rotation of operators.  I looked at what data we had about past escalations and decided that the focus of this rotation should be on taking over incident command for long-running incidents, which tended to impact one or a few customers in extremely application-specific ways and which typically involved a lot of communication directly with the affected customers.

With that in mind, I developed some training for Incident Commanders and presented it in several live sessions to the folks who were about to enter this rotation.  Designing this training required formalizing incident severity, how the Incident Commander communicates with the Customer Experience team, how the Incident Commander reaches others to assist in the response, the authority the Incident Commander has in doing so, regular cadences of communication both internally and externally, and so on.  But the most valuable component of this training was probably ending with live support in configuring PagerDuty, sending sample alerts, and role-playing incident response.

Now, all of this information was documented beforehand, albeit less formally.  The earliest drafts, written for the Operations team, were filled with winks and nods to prior experience that weren’t terribly supportive of our apprentices as they learned to be effective incident responders.  More troublingly, I mispredicted the disposition of the average escalation to the developer on-call rotation.  Far from needing Incident Commanders for long-running incidents, in reality, we frequently paged someone from, say, the Calls team to respond to a search incident that was resolved in less than an hour.

In this case, the past didn’t predict the future and the form of specialization needed to change.  In retrospect, the long rotation offered few opportunities for repetition, and those few were usually undermined by how incredibly different each incident was and how far outside the on-call Incident Commander’s wheelhouse they were.  So we took the next step of breaking down both of our big on-call rotations.  In June we broke the big operations rotation down by team, creating rotations for App Ops, Build & Release, Storage Ops, and Visibility.  In August, we broke down the big developers rotation into rotations for Calls, Data, Internal Tools, search, webapp client, webapp server, and WebSocket infrastructure.

We expect these seven developer on-call rotations to be called into service much more often for their subject matter expertise than their ability to command an incident.  Thus, while the original Incident Commander training is not a loss, we’ve created a new gap in which our apprentices - the dozens of developers new to being on-call - aren’t well-supported.  And this one’s where the curse of knowledge really hit me hard.  Think back to your first on-call shifts.  How were you trained?  I don’t think I ever was.

By its very nature, an escalation path like these on-call rotations doesn’t get used when there’s a runbook that covers the situation.  So everything that comes to these folks is already “off-script.”  We’ve never been there before.  And I realized that a new on-call engineer has that same feeling about incident response itself; they’ve never been there before.  The training, then, is all about what you can count on when you know you can’t count on anything about the incident.  So in the training we walk through how the Incident Commander operates, how you as a responder work with them, the progression from triage to mitigation to resolution, how to escalate still further or hand off to a teammate, and the relative urgency of remediation action items.

Just as before, learning the social norms of the team is the most important part of training and, in this case, it’s really the whole thing.  Confidence under pressure comes to a great extent from practice, but starting from a foundation of knowing how the pressure’s going to come at you makes that steep learning curve a little shallower.

That brings us to now.  What’s next after apprentice-friendly processes, training, and specialized practice?  This fall I’m planning to add formal drills to our repertoire covering “what happens when X fails”.  Historically, we’ve approached this sort of exercise informally, just before production release of new systems, and haven’t preserved much documentation of our findings or our methodology.  We also haven’t yet made a habit of repeating these small, controlled failures as systems mature.  I hope to create a forum in which many more engineers can learn to think critically about how systems fail, plan and execute drills featuring controlled failures of development and production systems, and channel the lessons we learn into more robust system designs and a more reliable Slack.  These are spiritually similar to the disaster recovery drills we already do but much smaller in scope, which, I hope, allows for a higher frequency that increases towards “chaos engineering” over time.

The work of supporting our apprentices is never finished but this is where we are at Slack now.  Before I wrap up, I want to make a meta-observation.  Most of the time I gravitate towards technology almost to the exclusion of everything else.  I even expected, as I began to write this talk, that I would spend a lot more time talking about database replication or queue backpressure than I did.  But after three and a half years in management at Slack, even as I’m stepping back from those responsibilities to focus once again on being an engineering practitioner, a profound sense of the importance of people lingers.  Without taking care of our apprentices, people will be the cause of, rather than the solution to, all our scaling problems.

I know that I haven’t described any truly groundbreaking engineering practices today.  The way we approach apprenticing engineers doesn’t have to be novel to be effective; it just has to be rigorous so no one is left behind.  Our operational processes should themselves be educative.  We should own the responsibility to train our people.  And we should give them the opportunity to put that training into practice.  When we can turn junior engineers into senior engineers, and those senior engineers into ones immersed in all the context and jargon of our companies, we are doing right by our businesses and especially right by those people.

I hope these observations can take some of the chance out of apprenticing in our industry.  I hope we can leverage every lesson we learn, pass it on to our apprentices, and make all new mistakes tomorrow.  Thanks for listening.  Good afternoon.