Regarding retries and restarts

Richard Crowley

At the margin

We can choose to
accept another connection,
read another request,
or do nothing

Suppose we accept

We also accept the risk of and
responsibility for failure

All failures are not created equal

Some are atomic
Some are idempotent
Some leave leases or locks to expire
Maybe these can be retried

Retry policy

Retry reads
Retry atomic writes that we know failed
Retry two-phase writes we know will expire
Do not retry timeouts
Retry at most once
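
The nginx configuration below enforces this at the proxy. For comparison, here is a client-side sketch of the same policy in Go. It is not from the talk: the URL and the getWithOneRetry helper are made up, and it assumes a plain GET, i.e. a read that is safe to repeat.

package main

import (
    "errors"
    "log"
    "net"
    "net/http"
    "time"
)

// getWithOneRetry is a hypothetical helper: retry a failed GET exactly once,
// and never retry a timeout, because we can't know whether the other side
// finished the work.
func getWithOneRetry(client *http.Client, url string) (*http.Response, error) {
    resp, err := client.Get(url)
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return nil, err // do not retry timeouts
    }
    if err == nil && (resp.StatusCode < 500 || resp.StatusCode == http.StatusGatewayTimeout) {
        // 504 is the proxy reporting a timeout, so we don't retry it either.
        return resp, nil
    }
    if resp != nil {
        resp.Body.Close()
    }
    return client.Get(url) // retry at most once
}

func main() {
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := getWithOneRetry(client, "http://127.0.0.1:8080/")
    if err != nil {
        log.Fatalln(err)
    }
    defer resp.Body.Close()
    log.Println(resp.Status)
}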

Nginx configuration

location / {
    # First try: 500, 502, and 503 from the app are retried once via the
    # @retry location; 504 is a timeout, so it is never retried.
    error_page 500 502 503 = @retry;
    error_page 504 /504.json;
    proxy_intercept_errors on;
    proxy_next_upstream off;
    proxy_pass http://app;
}
location @retry {
    # Second and last try: every error becomes a static document, so no
    # request is retried more than once.
    error_page 500 /500.json;
    error_page 502 /502.json;
    error_page 503 /503.json;
    error_page 504 /504.json;
    proxy_intercept_errors on;
    proxy_next_upstream off;
    proxy_pass http://app;
}
upstream app {
    # ...
}

Self-inflicted failures

Deploys, for example

Deploys are for the future

It’s too late to change the past

At the margin

We can choose to
accept another connection,
read another request,
or do nothing

Suppose we do nothing

We can’t do nothing for very long

The past

Don’t interrupt requests you can’t retry

Graceful stop

package main

import (
    "log"
    "net"
    "os"
    "os/signal"
    "sync"
    "syscall"
)

func main() {
    l, err := net.Listen("tcp", "127.0.0.1:48879")
    if nil != err {
        log.Fatalln(err)
    }
    wg := &sync.WaitGroup{}
    wg.Add(1)
    go func() {
        defer wg.Done()
        for {
            // (on the next slide)
        }
    }()
    ch := make(chan os.Signal, 1) // signal.Notify requires a buffered channel
    signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
    log.Println(<-ch) // block until asked to stop
    l.Close()         // stop accepting; the accept loop returns
    wg.Wait()         // wait for in-flight requests to finish
}

Graceful stop, continued

for {
    c, err := l.Accept()
    if nil != err {
        log.Println(err) // Accept fails once l is closed; time to return
        return
    }
    wg.Add(1)
    go func() {
        defer wg.Done()
        defer c.Close() // don't leak the connection
        p := make([]byte, 4096)
        n, err := c.Read(p)
        if nil != err {
            log.Println(err)
            return
        }
        if _, err := c.Write(p[:n]); nil != err {
            log.Println(err)
        }
    }()
}

The future

It may already be in
our socket’s listen queue

Zero-downtime restart

Parent listen(2), accept(2), etc.
Parent receive SIGUSR2
Parent setenv(3) listening file descriptor
Parent fork(2) and child execve(2)
Child getenv(3) instead of listen(2)
Child accept(2) just like before
Child send SIGQUIT to parent
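
A rough Go sketch of that handoff, not the talk's implementation: the LISTEN_FD environment variable is made up, and the listening socket is passed to the child as file descriptor 3 via exec.Cmd's ExtraFiles. The accept loop itself is the one from the graceful-stop slides.

package main

import (
    "log"
    "net"
    "os"
    "os/exec"
    "os/signal"
    "strconv"
    "syscall"
)

const envKey = "LISTEN_FD" // hypothetical; any agreed-upon name will do

func main() {
    var l net.Listener
    var err error
    if s := os.Getenv(envKey); s != "" {
        // Child: getenv(3) instead of listen(2); the parent told us which
        // descriptor number the listening socket arrived on.
        fd, err2 := strconv.Atoi(s)
        if err2 != nil {
            log.Fatalln(err2)
        }
        l, err = net.FileListener(os.NewFile(uintptr(fd), "listener"))
    } else {
        // Parent: listen(2) as usual.
        l, err = net.Listen("tcp", "127.0.0.1:48879")
    }
    if err != nil {
        log.Fatalln(err)
    }
    go serve(l) // accept(2) just like before (the echo loop from earlier)

    if os.Getenv(envKey) != "" {
        // Child: the parent can stop accepting now.
        syscall.Kill(os.Getppid(), syscall.SIGQUIT)
    }

    ch := make(chan os.Signal, 1)
    signal.Notify(ch, syscall.SIGUSR2, syscall.SIGQUIT)
    for sig := range ch {
        switch sig {
        case syscall.SIGUSR2:
            // Parent: setenv(3), then fork(2) and execve(2) a child that
            // inherits the listening descriptor as fd 3.
            f, err := l.(*net.TCPListener).File() // dup(2) of the socket
            if err != nil {
                log.Fatalln(err)
            }
            cmd := exec.Command(os.Args[0], os.Args[1:]...)
            cmd.Env = append(os.Environ(), envKey+"=3")
            cmd.ExtraFiles = []*os.File{f} // ExtraFiles start at fd 3
            cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
            if err := cmd.Start(); err != nil {
                log.Println(err)
            }
        case syscall.SIGQUIT:
            // The child has taken over; stop accepting, then drain
            // in-flight requests as in the graceful stop.
            l.Close()
            return
        }
    }
}

// serve stands in for the accept loop from the graceful-stop slides.
func serve(l net.Listener) { /* ... */ }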

Ugh, Upstart

Upstart thinks our service died
and tries to start it again

If at first you don’t succeed, add more system calls

This is experimental
(I finished the code Saturday)

Zero-downtime double restart

Parent listen(2), accept(2), etc.
Parent receive SIGUSR2
Parent setenv(3) listening file descriptor
Parent fork(2) and child execve(2)
Child getenv(3) instead of listen(2)
Child accept(2) just like before

Zero-downtime double restart, continued

Child send SIGUSR2 to parent
Parent execve(2)
Parent getenv(3) instead of listen(2)
Parent accept(2) just like before
Parent send SIGQUIT to child
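
Hedged fragments, again not the talk's code, that would slot into the sketch above. They assume the same made-up LISTEN_FD variable, a Unix-style fcntl(2) via syscall.Syscall, and that the child's PID also rides along (say, in another environment variable) so the re-exec'd parent knows whom to SIGQUIT.

// Child: once accept(2) is working, hand control back to the parent.
syscall.Kill(os.Getppid(), syscall.SIGUSR2)

// Parent, on that second SIGUSR2: no fork(2) this time, just execve(2)
// ourselves.  The listening descriptor has to outlive execve(2), so clear
// FD_CLOEXEC on a dup(2) and advertise its number first.
f, err := l.(*net.TCPListener).File()
if err != nil {
    log.Fatalln(err)
}
if _, _, errno := syscall.Syscall(syscall.SYS_FCNTL, f.Fd(), syscall.F_SETFD, 0); errno != 0 {
    log.Fatalln(errno)
}
os.Setenv(envKey, strconv.Itoa(int(f.Fd())))
if err := syscall.Exec(os.Args[0], os.Args, os.Environ()); err != nil {
    log.Fatalln(err) // execve(2) only returns on error
}
// The parent starts over, finds LISTEN_FD, accepts just like before, and
// finishes by sending SIGQUIT to the child.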

Cliff’s notes

Think about reliability per-request
Use retries to mask some failures
Stop gracefully and restart without downtime to prevent self-inflicted failures

Thank you