MySQL at Slack

tl;dr  It’s the evolution of the Flickr way summarized in Building Scalable Web Sites.

Architecture

The main database is how you look up what team you’re on.

The auxiliary databases are vertically partitioned places to put non-team-scoped data.

The shards are where the real action happens.  Each team is assigned a shard and most everything about a team is stored there.
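As a concrete sketch of that flow (the table and column names here are hypothetical, not our actual schema), a request does something like:

-- on the main database (hypothetical schema): find the shard for a team
SELECT shard_id FROM teams WHERE team_id = 12345;

-- then on that shard: everything scoped to the team lives there
SELECT * FROM channels WHERE team_id = 12345;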

With few exceptions, we run Percona Server 5.5.something.

Most of our databases are on i2.xlarge instances.  We use the ephemeral SSDs for InnoDB logs, InnoDB data, and binary logs.
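In my.cnf terms that amounts to something like the following; the mount point and layout are made up, only the option names are standard:

# hypothetical my.cnf fragment; /mnt/ssd is a made-up mount point for the ephemeral SSDs
[mysqld]
datadir                   = /mnt/ssd/mysql/data
innodb_log_group_home_dir = /mnt/ssd/mysql/logs
log_bin                   = /mnt/ssd/mysql/binlogs/mysql-bin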

We let a few shards get too big, so they’re on i2.2xlarge instances, but we try to avoid that because we can’t fit twice as many teams on those instances despite them being twice as big.

Master-master pairs

The main database, each auxiliary database, and each shard is a master-master pair.  One side is in our us-east-1d and the other is in our us-east-1e.

We write to both sides of the pair all the time.  We ensure consistency by preferring the odd side for odd-numbered teams and the even side for even-numbered teams.

This cuts the replication load in half and makes it easier for each side to keep up.
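The routing rule itself is just parity on the team ID.  The application makes this choice, but purely as an illustration, in SQL with made-up team IDs:

-- hypothetical illustration of the write-preference rule: odd-numbered teams
-- prefer the odd-side master, even-numbered teams prefer the even side
SELECT team_id,
       IF(team_id % 2 = 1, 'odd-side master', 'even-side master') AS preferred_side
FROM (SELECT 12345 AS team_id UNION ALL SELECT 12346) AS example_teams;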

If the preferred side isn’t reachable we try the other and rely on replication to restore consistency.  This usually works.

To add insult to injury, you can’t run pt-table-checksum against a master-master pair without moving all your real traffic to one side.

Tickets

None of the servers in those master-master pairs generate any identifiers themselves.  Instead they get identifiers ahead of time from ticket servers.

Each ticket sequence is a single-row table against which we repeatedly run SQL like this:

REPLACE INTO tickets (dummy) VALUES ('');
SELECT LAST_INSERT_ID();

There are two ticket servers in each sequence, an odd one and an even one.  We don’t make much effort to keep them in sync, though I bet the tickets being roughly in order helps InnoDB a great deal.
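For reference, the table behind that REPLACE follows the Flickr ticket-server pattern; this DDL and the odd/even offsets are a sketch of that pattern, not necessarily our exact definitions:

-- hypothetical ticket-sequence table, following the Flickr pattern: the unique
-- key on dummy makes each REPLACE delete and reinsert the single row, which
-- bumps the auto-increment id
CREATE TABLE tickets (
  id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  dummy CHAR(1) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (dummy)
) ENGINE=InnoDB;

-- the odd server hands out odd ids, the even server even ids
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset    = 1;  -- 2 on the even server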

We’re experimenting with MySQL 5.6 on some ticket servers as a potential fix for a bug in which a lock is held, connections are all consumed, it waits for a few dozen seconds, and then everything returns to normal.

The ticket servers probably shouldn’t be MySQL and maybe shouldn’t even be stateful.

Monitoring

We monitor all the normal things in Ganglia.

We also attempt to monitor how many bytes each slave’s relay log is behind the master’s binary log, because this is a better measure of durability.
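A rough sketch of how that can be measured with stock SHOW statements, comparing what the master has written to its binary log with what the slave’s I/O thread has fetched into its relay log:

-- on the master: current binary log file and position
SHOW MASTER STATUS;

-- on the slave: Master_Log_File / Read_Master_Log_Pos say how far the I/O
-- thread has fetched; when the file names match, the master's position minus
-- Read_Master_Log_Pos is roughly how many bytes the relay log is behind
SHOW SLAVE STATUS\G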

SQL replication lag isn’t a big deal on shards because teams prefer one side or the other.

SQL replication lag is a big deal on the main database because we have no team from which to derive a preference.  I’m considering defining the preference by table name.

Our typical response to main database replication lag is to send all writes to the other side.

Backup and restore

We take full nightly backups at 10:00pm Pacific using xtrabackup.  We haven’t experimented at all with incrementals because recovery time’s very important to us.

Each server has an EBS volume attached for staging backups on their way to S3.

Pay very close attention to what version of xtrabackup you’re using because slight mismatches will produce cryptic errors or infinite loops that fill your disks.

Our goal is to reprovision every server, even databases, every 90 days to encourage ourselves to be really good at it.  Our provisioner runs knife ec2 server create, restores the most recent backup, and starts replication in one command.  As soon as replication is caught up we can put it into service.
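The replication step at the end is plain MySQL.  A sketch, assuming the coordinates come from the backup’s xtrabackup_binlog_info file and using placeholder host and credentials:

-- point the freshly restored server at its master using the coordinates
-- recorded by xtrabackup (host, user, password, and coordinates are placeholders)
CHANGE MASTER TO
  MASTER_HOST     = 'db-shard-123a.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = '...',
  MASTER_LOG_FILE = 'mysql-bin.004242',
  MASTER_LOG_POS  = 107;
START SLAVE;

-- once the slave has caught up, the pair can go back into service
SHOW SLAVE STATUS\G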

If the most recent backup was taken more than a few hours ago, it’s actually faster to take a fresh backup first so there’s less replication catch-up to do.

Miscellaneous unsolicited advice

Beware:  Percona’s packages always start MySQL no matter what you do.

Host your own packages lest you get wildly different versions running together in production.

Move OLAP workloads to their own clusters early and often.  We can handle more than three times our current traffic excluding top-of-the-hour stats.