Regarding retries and restarts
Richard Crowley
Richard Crowley
We can choose to
accept another connection,
read another request,
or do nothing
We also accept the risk of and
responsibility for failure
Some are atomic
Some are idempotent
Some leave leases or locks to expire
Maybe these can be retried
Retry reads
Retry atomic writes that we know failed
Retry two-phase writes we know will expire
Do not retry timeouts
Retry at most once
location / { error_page 500 502 503 = @retry; error_page 504 /504.json; proxy_intercept_errors on; proxy_next_upstream off; proxy_pass http://app; } location @retry { error_page 500 /500.json; error_page 502 /502.json; error_page 503 /503.json; error_page 504 /504.json; proxy_intercept_errors on; proxy_next_upstream off; proxy_pass http://app; } upstream app { # ... }
Deploys, for example
It’s too late to change the past
We can choose to
accept another connection,
read another request,
or do nothing
We can’t do nothing for very long
Don’t interrupt requests you can’t retry
func main() { l, err := net.Listen("tcp", "127.0.0.1:48879") if nil != err { log.Fatalln(err) } wg := &sync.WaitGroup{} wg.Add(1) go func() { defer wg.Done() for { // (on the next slide) } }() ch := make(chan os.Signal) signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM) log.Println(<-ch) l.Close() wg.Wait() }
for { c, err := l.Accept() if nil != err { log.Println(err) return } wg.Add(1) go func() { defer wg.Done() p := make([]byte, 4096) n, err := c.Read(p) if nil != err { log.Println(err) } if _, err := c.Write(p[:n]); nil != err { log.Println(err) } }() }
It may already be in
our socket’s listen queue
Parent listen(2)
, accept(2)
, etc.
Parent receive SIGUSR2
Parent setenv(3)
listening file descriptor
Parent fork(2)
and child execve(2)
Child getenv(3)
instead of listen(2)
Child accept(2)
just like before
Child send SIGQUIT
to parent
Upstart thinks our service died
and tries to start it again
This is experimental
(I finished the code Saturday)
Parent listen(2)
, accept(2)
, etc.
Parent receive SIGUSR2
Parent setenv(3)
listening file descriptor
Parent fork(2)
and child execve(2)
Child getenv(3)
instead of listen(2)
Child accept(2)
just like before
Child send SIGUSR2
to parent
Parent execve(2)
Parent getenv(3)
instead of listen(2)
Parent accept(2)
just like before
Parent send SIGQUIT
to child
Think about reliability per-request
Use retries to mask some failures
Stop gracefully and restart without downtime to prevent self-inflicted failures