Timeouts, retries with backoff, jitter

A really great article in the AWS builders library about making service to service messaging more reliable with tradeoffs!

First ask is this request retry-able? The work must be idempotent!


  • Without thoughtful timeouts, clients can wait for long periods of time tying up limited server resources (eg Request threads of which there are often vanishlingly few) for a response that might not come back (It’s hard to tell the difference between slow, and down)
  • He talks about setting a reasonable timeout using percentiles. The 99.9th for eg. Forces developers to ask the question, how many false positive timeouts is ok so that we can set a timeout that is reasonable for an endpoint


  • Selfish. It says your request is worth tying up resources for repeatedly until it succeeds
  • Have to be careful here
    • Did a request fail because of load? If yes, retrying might prolong a bad situation
    • Did it fail with a client error? (4xx) Don’t retry because it will never succeed
    • Is it a part of a larger batch of work that becomes a thundering herd retrying in lockstop with eachother prolonging a bad situation
  • Retrying is a keystone of resilience. But there are dragons
  • Exponential backoff can help a struggling service recover by having clients wait longer between retries when they find out a service is struggling
    • Some talk of circuit breakers but it didn’t sound particularly favourable. Adds a different mode in the system that makes testing more challenging
  • Think about max retries + error reporting
  • Jitter can help quite a bit. Not just for retries but also with the initial arrival of work. Add a tiny bit of random delay (+/-) in the arrival rate can smooth over excessive load

Other good concerns

  • Retries between layers amplify. eg Controller > svc > data access | external apicall > … If each layer adds 3 retries, the work may stay in the system and be responsible for dozens or even hundreds of calls. Something to keep in mind

Note to self : Re-read Release It! (Nygaard)




  • The Document Culture of Amazon: Team member’s can cancel meetings that don’t have a document. The first 5-10 or more sometimes minutes of every meeting are spent by having everyone in the room reading about the issue under discussion so everyone starts with similar context and can participate. 1-pagers, press releases, FAQs, or 6-pages are the different formats. I’m going to try doing this in my meetings. Let’s see what people think 🙂