A State Machine Datastore in the Wild

Richard Crowley

Betable operations

Betable

Gambling-as-a-Service
We do the math and the money
We’re licensed and regulated so game developers don’t have to be

Then: games of chance

Single-player
Single-event
Against the house
Easy to model as a REST API

Now

Multi-player
Multi-event
Still against the house
Not RESTful
Most events are asynchronous from a player’s perspective

WebSockets

www.ietf.org/rfc/rfc6455.txt
Not completely busted anymore!

Storage

All Cassandra, all the time
N=3 R=2 W=2 (quorum reads and writes)
Naturally, we tried this first

Strawman use-case

Sell 1000 bingo cards “instantly”
Don’t sell more than 1000
Don’t sell more than 4 to any player
Don't sell two identical cards
Don’t sell any after the first ball is drawn

Cassandra 1

Doubly-linked list
Optimistic concurrency control
Two column families: items and pointers (with LongType columns)
Death by roundtrip

Cassandra 1 algorithm

Write new tail to items and include the previous tail’s row key
Write to the previous tail’s row in pointers with a timestamp and the new tail’s row key
Read the previous tail’s row in pointers to see if our write won the append race

Cassandra 2

Consistent hashing of entities to processes
Locked read-modify-write over the network
Also death by roundtrip
At this point, why even use Cassandra?

RDBMS

Lacks expressiveness
(read: Turing-completeness)
SELECT FOR UPDATE
Also also death by roundtrip

Technology-agnostic design goals

Enforce complex constraints on data
Efficient in terms of network I/O
Scale horizontally
1000 bingo card strawman

Custom DB, NBD

Nothing more permanent
than a temporary solution
Small scale makes a custom
solution plausible

Entities

Arbitrary size and structure
Unit of atomicity
Unit of distribution
Examples: RouletteTable, RouletteRound, but not RouletteBet

Entity schema

type Table struct {
    ID             string
    GameID         string
    RoundID        string
    Players        []Player
    NextTransition time.Time
    // ...
}

Transitions

Mechanism of all reads and writes
Analogous to stored procedures
Apply to (at most) one entity
Examples: CreateRouletteTable, CreateRouletteRound, CreateRouletteBet, and ResolveRouletteRound

Transition API

type CreateTable struct {
    TableID string
    // ...
}

func (ct *CreateTable) Prepare(pc PrepareContext) error {
    return nil
}

func (ct *CreateTable) Apply(ac ApplyContext) (*Table, error) {
    t := &Table{
        ID: ct.TableID,
        // ...
    }
    if err := ac.Store(t.ID, t); nil != err {
        return nil, err
    }
    return t, nil
}

Prepare a transition

Dispatch other transitions
Generate data
Mutate the transition
PrepareContext has Dispatch, Load, and more but not Store, LoadExcl,
or StoreExcl

Dispatched transitions

func (ct *CreateTable) Prepare(pc PrepareContext) error {
    r, err := pc.Dispatch(&CreateRound{
        // ...
    })
    if nil != err {
        return err
    }
    ct.RoundID = r.ID
    return nil
}

Cross-entity operations
No transactional guarantees
Efficient for humans and networks

Apply a transition

Transition written to write-ahead log
Enforce domain-specific constraints
Update a stored entity
ApplyContext has Load, Store, LoadExcl, StoreExcl, and more

Generic transitions

GetEntity reads an entity given a type and a key
GetIndex reads a secondary index given its name and a secondary key
UpdateEntity could theoretically update fields in any entity but we haven’t needed it

ACID properties

Atomicity: entity writes are as atomic as you want
Consistency: weak guarantees since transitions contain arbitrary code
Isolation: within an entity when requested
Durability: fsync(2) the write-ahead log

CAP properties

Consistency: transitions always applied by their coordinator
Availability: no second-choice coordinators
Partition tolerance: unreachable coordinators don’t make reachable coordinators unavailable

Guarantees not made

Atomicity of entity-spanning operations
Locality of any two entities
See Pat Helland’s Life beyond Distributed Transactions: an Apostate’s Opinion in CIDR 2007

Coordinators

Transitions declare a distribution key
Distribution key conventionally identifies
an entity
Consistent hash function maps transitions to a coordinator process

Peers

Command-line flags easily configured
by Puppet
Declare peers on your consistent hash ring
Declare masters from which to replicate

Flexible topology

Master-slave
Master-master
Next N-1 peers

Network transparency

One listening socket per process
Clients connect to any process
Peers proxy transitions to their coordinator
Clients can embed the consistent hashing algorithm as an optimization

Replication

Asynchronous
Loosely inspired by MySQL
Replicate transitions, which are small, not entities, which may be very large
Prepare not called when replicating

Replication

Slave sends a BeginReplication transition with an Index and Offset
Master streams transitions from the initial Index and Offset forward
Index and Offset logged with replicated transitions and read from the log to resume

Secondary indexes

type Table struct {
    ID     string `index:"-"` // Primary key
    GameID string `index:"GameID_TableID"` // Secondary key
    // ...
}

Declare a primary key for the entity
Declare fields to be indexed

Secondary indexes

Automatically updated during
Store and StoreExcl
Eventually consistent via replication
Not atomic with entity storage

Storage engine

Objects and blobs
ObjectStore and BlobStore interfaces
Load, Store, LoadExcl, and StoreExcl

Storage engine

BSONObjectStore serializes and stores objects in a BlobStore
The key is prefixed with the object’s type

Storage engine

FileBlobStore stores blobs
on the filesystem
Store and StoreExcl are atomic
via rename(2)
LoadExcl and StoreExcl use flock(2)

Disk protocol

+---------+----------+----------+---------+
|         |          |          |         |
| version |  flags   |  length  |  type   |
| (uint8) | (uint16) | (uint16) | (UTF-8) |
|         |          |          |         |
+---------+----------+----------+---------+

+----------+----------------+
|          | +------------+ |
|  length  | | transition | |
| (uint32) | |   (BSON)   | |
|          | +------------+ |
+----------+----------------+

Type prefix facilitates deserialization
Naive and expensive

BSON over the alternatives

It’s easy to create protocol buffer-like versioning and extensibility by being careful with BSON
go-bson nicely mirrors encoding/json
go-bson actually feels more naturally Go than goprotobuf

Reasons to choose Go

Brevity
Static type system
Compiles to x86, ARM, etc.
CSP
Thompson, Pike, Cox, Griesemer
Brad Fitz

One service, two Git repositories

Entities and transitions are declared
in the datastore
Awkward development workflow

Testing

Can test each side of distribution/replication as units
Start 3 services in one Go process and sleep a bit to test together
Jump through hoops to test network errors

Replicating dispatched transitions

Used to allow Dispatch in Apply methods
Prevent double-applying in replication
Non-deterministic replication
Eliot Moss’s Nested Transactions:
An Approach to Reliable
Distributed Computing

Would we do it again?

We’re backing more and more services with this storage engine every day
It’s less of a brain teaser each time

Future work

Replication improvements
Next-generation secondary indexes
Declaring entities and transitions outside the datastore
High-availability coordinators

A word from my sponsors

jobs@betable.com

A State Machine Datastore in the Wild

Hi, I’m Richard

Betable

Then: games of chance

Now

WebSockets

Storage

Strawman use-case

Cassandra 1

Cassandra 1 algorithm

Cassandra 2

RDBMS

Technology-agnostic design goals

Custom DB, NBD

Design

Entities

Entity schema

Transitions

Transition API

Prepare a transition

Dispatched transitions

Apply a transition

Generic transitions

ACID properties

CAP properties

Guarantees not made

Coordinators

Peers

Flexible topology

Network transparency

Replication

Replication

Secondary indexes

Secondary indexes

Storage engine

Storage engine

Storage engine

Disk protocol

BSON over the alternatives

Go

Reasons to choose Go

Errata

One service, two Git repositories

Testing

Replicating dispatched transitions

Would we do it again?

Would we do it again?

Future work

A word from my sponsors

Thank you