A State Machine Datastore in the Wild
Richard Crowley
Betable operations
Gambling-as-a-service
Licensed and regulated so game developers don’t have to be
Single-player
Single-event
Against the house
Easy to model as a REST API
Multi-player
Single-event, sort of
Against the house
Time dimension didn’t fit into REST
Roulette is but the first of a new class of game mechanic
Most events are asynchronous from a player’s perspective
Most events are temporally localized from the house’s perspective
Player presence
Server-to-client broadcasts and messages
Timeouts in case of player inaction
www.ietf.org/rfc/rfc6455.txt
Beginning to see widely-deployed support
Providing useful semantics on top of WebSockets would be an interesting talk in its own right
WebSockets out front
Internal REST APIs in between
Storage underneath
All Cassandra, all the time
N=3, R=2, W=2 (quorum reads and writes)
Naturally, we tried this first
Correctly enforced domain-specific constraints
Serializable updates
Audit trail
High concurrency
Sell 1000 bingo cards “instantly”
Don’t sell more than 1000
Don’t sell more than 4 to any player
Don’t sell any after the first ball is drawn
Conservatively 200 kB, mutated slightly 1000 times
Cards only semi-independent due to domain-specific constraints
Enough thinking, let’s hack!
Doubly-linked list
Optimistic concurrency control
Two column families: entries and pointers (with LongType columns)
Death by roundtrip
Write new tail to entries and include the previous tail’s row key
Write to the previous tail’s row in pointers with a timestamp and the new tail’s row key
Read the previous tail’s row in pointers to see if our write won the append race
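Roughly, the append protocol looks like this in Go; the thin Cassandra client interface here is hypothetical (the deck doesn’t name the real client), and imports are elided as in the other samples:

// Hypothetical thin client: one column write or read per call.
type columnFamily interface {
    Write(rowKey, column string, value []byte) error
    Read(rowKey, column string) ([]byte, error)
}

// appendEntry sketches the optimistic append: write the new tail,
// point the old tail at it, then read the pointer back to learn
// whether this write won the race. Three network roundtrips per
// append.
func appendEntry(entries, pointers columnFamily, prevTail, newTail string) (won bool, err error) {
    // Roundtrip 1: write the new tail to entries, including the
    // previous tail’s row key (plus the entry’s payload columns).
    if err = entries.Write(newTail, "prev", []byte(prevTail)); nil != err {
        return
    }
    // Roundtrip 2: write to the previous tail’s row in pointers;
    // the column timestamp arbitrates concurrent appends.
    if err = pointers.Write(prevTail, "next", []byte(newTail)); nil != err {
        return
    }
    // Roundtrip 3: read the previous tail’s row in pointers to see
    // if our write won the append race.
    var next []byte
    if next, err = pointers.Read(prevTail, "next"); nil != err {
        return
    }
    return string(next) == newTail, nil
}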
Consistent hashing of entities to processes
Locked read-modify-write over the network
Also death by roundtrip
At this point, why even use Cassandra?
Lacks expressiveness (read: Turing-completeness)
Basically used it like a filesystem
Also also death by roundtrip
Enforce complex constraints on data
Efficient in terms of network I/O
Distributed, for some value of distributed
Scale horizontally
1000 bingo card strawman
Existing storage engines lack expressiveness
Small scale makes a custom solution plausible
YAGNI versus “nothing is more permanent than a temporary solution”
Arbitrary size and structure
Unit of atomicity
Unit of distribution
Examples: RouletteTable, RouletteRound, but not RouletteBet
type Table struct {
    ID             string
    GameID         string
    RoundID        string
    Players        []Player
    NextTransition time.Time
    // ...
}
Mechanism of all reads and writes
Analogous to stored procedures
Apply to (at most) one entity
Examples: CreateRouletteTable, CreateRouletteRound, CreateRouletteBet, and ResolveRouletteRound
type CreateTable struct {
    TableID string
    // ...
}

func (ct *CreateTable) Prepare(pc PrepareContext) error {
    return nil
}

func (ct *CreateTable) Apply(ac ApplyContext) (*Table, error) {
    t := &Table{
        ID: ct.TableID,
        // ...
    }
    if err := ac.Store(t.ID, t); nil != err {
        return nil, err
    }
    return t, nil
}
Dispatch other transitions
Generate data
Mutate the transition
PrepareContext has Dispatch, Load, and more but not Store, LoadExcl, or StoreExcl
func (ct *CreateTable) Prepare(pc PrepareContext) error {
    r, err := pc.Dispatch(&CreateRound{
        // ...
    })
    if nil != err {
        return err
    }
    ct.RoundID = r.ID
    return nil
}
Cross-entity operations
No transactional guarantees
Efficient for humans and networks
Transition written to write-ahead log
Enforce domain-specific constraints
Update a stored entity
ApplyContext has Load, Store, LoadExcl, StoreExcl, and more
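To make the earlier bingo-card constraints concrete, here’s a sketch of a transition’s Apply method; BuyCards and BingoGame are hypothetical, the LoadExcl and StoreExcl signatures are assumptions, and imports are elided as in the other samples:

type BingoGame struct {
    ID        string
    Started   bool
    CardsSold int
    SoldTo    map[string]int // player ID to cards sold
}

type BuyCards struct {
    GameID   string
    PlayerID string
    Count    int
}

func (bc *BuyCards) Apply(ac ApplyContext) (*BingoGame, error) {
    g := &BingoGame{}
    // LoadExcl locks the entity so this read-modify-write is
    // isolated from concurrent BuyCards transitions.
    if err := ac.LoadExcl(bc.GameID, g); nil != err {
        return nil, err
    }
    switch {
    case g.Started:
        return nil, errors.New("no sales after the first ball is drawn")
    case g.CardsSold+bc.Count > 1000:
        return nil, errors.New("no more than 1000 cards per game")
    case g.SoldTo[bc.PlayerID]+bc.Count > 4:
        return nil, errors.New("no more than 4 cards per player")
    }
    g.CardsSold += bc.Count
    g.SoldTo[bc.PlayerID] += bc.Count
    // StoreExcl writes the entity back and releases the lock.
    return g, ac.StoreExcl(bc.GameID, g)
}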
GetEntity reads an entity given a type and a key
GetIndex reads a secondary index given its name and a secondary key
UpdateEntity could theoretically update fields in any entity but we haven’t needed it
Atomicity: entity writes are as atomic as you want
Consistency: weak guarantees since transitions contain arbitrary code
Isolation: within an entity when requested
Durability: fsync(2) the write-ahead log
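The durability path is simple to sketch; the length-prefixed framing here is an assumption, and imports are elided:

type WAL struct {
    f *os.File
}

// Append writes one length-prefixed transition frame and doesn’t
// return (so the transition isn’t acknowledged) until fsync(2)
// says it’s on disk.
func (w *WAL) Append(frame []byte) error {
    var length [4]byte
    binary.BigEndian.PutUint32(length[:], uint32(len(frame)))
    if _, err := w.f.Write(length[:]); nil != err {
        return err
    }
    if _, err := w.f.Write(frame); nil != err {
        return err
    }
    return w.f.Sync() // fsync(2)
}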
Consistency: no stale reads
Availability: requests hashed to non-failing coordinators get responses, but there are no second-choice coordinators
Partition tolerance: unreachable coordinators don’t make reachable coordinators unavailable
Atomicity of entity-spanning operations
Locality of any two entities
Pat Helland’s Life beyond Distributed Transactions: an Apostate’s Opinion in CIDR 2007
Transitions declare a distribution key
Distribution key conventionally identifies an entity
Consistent hash function maps transitions to a coordinator process
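Something like this, though the ring layout and hash function here are assumptions, not the production details:

// coordinator maps a transition’s distribution key to the peer
// that coordinates it: hash the key onto the ring, then take the
// first peer clockwise from that point.
func coordinator(peers []string, distributionKey string) string {
    h := fnv.New32a()
    h.Write([]byte(distributionKey))
    key := h.Sum32()
    var best string
    var bestDistance uint32
    for i, peer := range peers {
        h := fnv.New32a()
        h.Write([]byte(peer))
        d := h.Sum32() - key // clockwise distance; uint32 math wraps
        if 0 == i || d < bestDistance {
            best, bestDistance = peer, d
        }
    }
    return best
}

Every process (and any client that embeds the algorithm) computes the same answer, so there’s no lookup service to consult.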
Command-line flags easily configured by Puppet
Declare peers on your consistent hash ring
Declare masters from which to replicate
Master-slave
Master-master
Next N-1 peers
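Hypothetically, the flags might look like this (the real flag names aren’t in the deck):

var (
    listen  = flag.String("listen", "127.0.0.1:6000", "address to listen on")
    peers   = flag.String("peers", "", "comma-separated peers on the consistent hash ring")
    masters = flag.String("masters", "", "comma-separated masters to replicate from")
)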
One listening socket per process
Clients connect to any process
Peers proxy transitions to their coordinator
Clients can embed the consistent hashing algorithm as an optimization
Asynchronous
Loosely inspired by MySQL
Replicate transitions, which are small, not entities, which may be very large
Prepare not called when replicating
Slave sends a BeginReplication transition with an Index and Offset
Master streams transitions from the initial Index and Offset forward
Slave atomically logs the cursor with each replicated transition
Addr of master, Index of and Offset into master’s write-ahead log
Log rotation copies the last cursor in the previous log file into the next log file
Recovery requires reading the most recent log and accepting the last cursor found
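A sketch of the cursor and its recovery; the field types and the LogEntry shape are assumptions:

type Cursor struct {
    Addr   string // master this slave replicates from
    Index  uint64 // which of the master’s write-ahead log files
    Offset uint64 // byte offset into that log file
}

type LogEntry struct {
    Cursor     *Cursor // nil for transitions coordinated locally
    Transition []byte
}

// lastCursor scans the most recent log file and accepts the last
// cursor found; replication resumes from there.
func lastCursor(entries []LogEntry) (c Cursor, ok bool) {
    for _, e := range entries {
        if nil != e.Cursor {
            c, ok = *e.Cursor, true
        }
    }
    return
}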
type Table struct {
    ID     string `index:"-"`
    GameID string `index:"GameID_TableID"`
    // ...
}
Declare a primary key for the entity
(does not have to be the distribution key)
Declare entity fields to be indexed
An index maps a secondary key to a list of primary keys
Secondary indexes automatically updated during Store and StoreExcl
Eventually consistent via asynchronous replication
Optimistic: not atomic with entity storage
Query(index, key) with a PrepareContext or an ApplyContext
False positives and negatives possible
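Usage looks roughly like this; ListTables is hypothetical and the exact Query signature is an assumption:

type ListTables struct {
    GameID   string
    TableIDs []string
}

func (lt *ListTables) Prepare(pc PrepareContext) error {
    // GameID_TableID is the index declared in the struct tag
    // above; it maps a GameID to the primary keys of its tables.
    tableIDs, err := pc.Query("GameID_TableID", lt.GameID)
    if nil != err {
        return err
    }
    // The index is optimistic, so verify each hit on Load and
    // tolerate misses.
    lt.TableIDs = tableIDs
    return nil
}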
Objects and blobs
ObjectStore and BlobStore interfaces
Load, Store, LoadExcl, and StoreExcl
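A sketch of the two interfaces, using the method names above; the exact signatures are assumptions:

// BlobStore moves opaque bytes; the Excl variants take the key’s
// lock for read-modify-write sequences.
type BlobStore interface {
    Load(key string) ([]byte, error)
    Store(key string, blob []byte) error
    LoadExcl(key string) ([]byte, error)
    StoreExcl(key string, blob []byte) error
}

// ObjectStore does the same for (de)serialized Go values.
type ObjectStore interface {
    Load(key string, v interface{}) error
    Store(key string, v interface{}) error
    LoadExcl(key string, v interface{}) error
    StoreExcl(key string, v interface{}) error
}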
Layered implementations
BSONObjectStore serializes and stores objects in a BlobStore
The key is prefixed with the object’s type
FileBlobStore stores blobs on the filesystem
Store and StoreExcl are atomic via rename(2)
LoadExcl and StoreExcl use flock(2)
Pathnames contain an epoch to support snapshot backups
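A sketch of the atomic write; the pathname layout, including where the epoch lands, is an assumption, and imports are elided:

type FileBlobStore struct {
    dir, epoch string
}

func (fbs *FileBlobStore) Store(key string, blob []byte) error {
    final := filepath.Join(fbs.dir, fbs.epoch, key)
    tmp := final + ".tmp"
    if err := ioutil.WriteFile(tmp, blob, 0644); nil != err {
        return err
    }
    // rename(2) is atomic within a filesystem: readers see the old
    // blob or the new one, never a torn write.
    return os.Rename(tmp, final)
}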
An IndexingObjectStore has an ObjectIndex and an ObjectStore
Calls Index and then Store or StoreExcl
Delegates Query, Load, and LoadExcl
DirObjectIndex builds secondary indexes on the filesystem
+---------+----------+----------+---------+
|         |          |          |         |
| version |  flags   |  length  |  type   |
| (uint8) | (uint16) | (uint16) | (UTF-8) |
|         |          |          |         |
+---------+----------+----------+---------+

+----------+----------------+
|          | +------------+ |
|  length  | | transition | |
| (uint32) | |   (BSON)   | |
|          | +------------+ |
+----------+----------------+
Naming the type outside of the BSON facilitates deserialization
Naive and expensive
It’s easy to create protocol buffer-like versioning and extensibility by being careful with BSON
go-bson nicely mirrors encoding/json
go-bson actually feels more naturally Go than goprotobuf
+---------+----------+----------+----------+---------+
|         |          |          |          |         |
| version |  nonce   |  flags   |  length  |  type   |
| (uint8) | (uint32) | (uint16) | (uint16) | (UTF-8) |
|         |          |          |          |         |
+---------+----------+----------+----------+---------+

+----------+----------------+
|          | +------------+ |
|  length  | | transition | |
| (uint32) | |   (JSON)   | |
|          | +------------+ |
+----------+----------------+
Naming the type outside of the JSON facilitates deserialization
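Encoding the first frame format above might look like this; big-endian byte order is an assumption, and imports are elided:

func encodeFrame(version uint8, flags uint16, typ string, transition []byte) []byte {
    buf := &bytes.Buffer{}
    buf.WriteByte(version)
    binary.Write(buf, binary.BigEndian, flags)
    binary.Write(buf, binary.BigEndian, uint16(len(typ)))
    buf.WriteString(typ) // type name travels outside the payload
    binary.Write(buf, binary.BigEndian, uint32(len(transition)))
    buf.Write(transition) // BSON-encoded transition
    return buf.Bytes()
}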
Java: even fluent Java tastes bad to us
Scala: scary, C++-like amount of syntax
C: it’s a nice hammer to have in a pinch
Go: all but garbage collector maturity
Always google it as “golang”
Approaching four years old
Thompson, Pike, Cox, Griesemer
Statically typed
Not classically object-oriented
CSP primitives as syntax
One service spread across two Git repositories
No one had ever worked with stored procedures, much less ones with this much power
Both top-down and bottom-up
Neither top-down nor bottom-up
I tried to preserve engineers’ artistic freedom
I needed a more formal process
Protocol designed to allow non-request-oriented workloads
That never panned out
If it were HTTP, I think people would use it as an application platform, not a database
Can test each side of distribution/replication as units
Start 3 services in one Go process and sleep a bit to test together
Jump through hoops to test network errors
Sporadic timeouts that didn’t show up in metrics
Very busy lock in a metrics.Registry
Fixed the bug in go-metrics
Used to allow Dispatch in Apply methods
Prevent double-applying in replication
Non-deterministic replication
Eliot Moss’s Nested Transactions: An Approach to Reliable Distributed Computing
go-metrics and Graphite
apply-transition and get-entity
hd(1)
Budget half your time for admin tools
(not kidding)
We’re backing more and more services with this storage engine every day
High-availability coordinators
Next-generation secondary indexes
Native HTTP via soon-to-be open-sourced Go HTTP service framework