- Taking control of complex systems is what we do. Whether or not we are “engineers” isn’t really the right question. The work is dynamic, complex systems: controlling processes we don’t understand.
A really great article in the Amazon Builders’ Library about making service-to-service messaging more reliable, tradeoffs included!
First ask: is this request retryable? To be retried safely, the work must be idempotent!
- Without thoughtful timeouts, clients can wait for long periods, tying up limited server resources (e.g. request threads, of which there are often vanishingly few) for a response that might never come back. (It’s hard to tell the difference between slow and down.)
- He talks about setting a reasonable timeout using latency percentiles, e.g. the 99.9th. This forces developers to ask how many false-positive timeouts are acceptable, so that the timeout chosen for an endpoint is a reasonable one
- Retries are selfish. A retry says your request is worth tying up resources for, repeatedly, until it succeeds
- Have to be careful here:
- Did a request fail because of load? If yes, retrying might prolong a bad situation
- Did it fail with a client error (4xx)? Don’t retry, because it will never succeed
- Is it part of a larger batch of work that becomes a thundering herd, with clients retrying in lockstep with each other and prolonging a bad situation?
- Retrying is a keystone of resilience. But there are dragons
- Exponential backoff can help a struggling service recover by having clients wait longer between retries when they find out a service is struggling
- Some talk of circuit breakers, but it didn’t sound particularly favourable: they add a different mode to the system, which makes testing more challenging
- Think about max retries + error reporting
- Jitter can help quite a bit, not just for retries but also with the initial arrival of work. Adding a tiny bit of random delay (+/-) to the arrival of work can smooth over excessive load
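
To make the timeout and retry notes above concrete for myself, here is a minimal Python sketch. It is not code from the article. `ClientError`, `timeout_from_latencies`, `backoff_delay` and `call_with_retries` are names I made up, and the base delay, cap and attempt count are arbitrary assumptions. It picks a timeout at a latency percentile, refuses to retry 4xx-style errors, caps the number of attempts, and sleeps for a capped exponential backoff with full jitter between tries.

```python
import math
import random
import time


class ClientError(Exception):
    """Hypothetical stand-in for a 4xx response: retrying will not help."""


def timeout_from_latencies(samples_s, percentile=99.9):
    """Pick a timeout at the given latency percentile (nearest-rank method)."""
    ordered = sorted(samples_s)
    # Nearest rank: the smallest sample with at least `percentile` percent
    # of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def backoff_delay(attempt, base_s=0.1, cap_s=5.0, rng=random):
    """Capped exponential backoff with full jitter.

    Returns a delay drawn uniformly from [0, min(cap_s, base_s * 2**attempt)].
    """
    return rng.uniform(0, min(cap_s, base_s * 2 ** attempt))


def call_with_retries(operation, max_attempts=3, sleep=time.sleep, rng=random):
    """Run an idempotent operation, retrying only failures that might be transient."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ClientError:
            # 4xx-style failure: it will never succeed, so do not retry.
            raise
        except Exception:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the error so it can be reported.
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

Passing `sleep` and `rng` in as parameters is only there so the behaviour can be tested without waiting on real delays. The percentile choice is the "how many false-positive timeouts is OK" question in code form: at the 99.9th percentile, roughly 1 in 1000 otherwise-healthy requests would time out.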
Other good concerns
- Retries between layers amplify, e.g. Controller > svc > data access | external API call > … If each layer adds 3 retries, the work may stay in the system and be responsible for dozens or even hundreds of calls. Something to keep in mind
Note to self: Re-read Release It! (Nygard)
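
A quick back-of-the-envelope check of that amplification, as a Python sketch. The assumption (mine, not the article's) is the worst case: every layer makes one original attempt plus its retries, and every attempt fails all the way down. `worst_case_calls` is a made-up helper name.

```python
def worst_case_calls(layers, retries_per_layer):
    """Worst-case number of calls hitting the deepest dependency
    for a single top-level request, when every attempt fails."""
    attempts_per_layer = 1 + retries_per_layer  # original try + retries
    return attempts_per_layer ** layers


for depth in range(1, 5):
    print(depth, "layer(s):", worst_case_calls(depth, retries_per_layer=3))
# 1 layer(s): 4
# 2 layer(s): 16
# 3 layer(s): 64
# 4 layer(s): 256
```

So three layers with 3 retries each is already 64 calls, and a fourth layer pushes it to 256, which matches the "dozens or even hundreds" above.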
- Configuration errors seem to come up a lot in postmortems: No answers on this point, but good thoughts around why this might be true, if it is. (Salesforce recently had a major DNS outage that was blamed on an operations engineer rather than on the system that allowed the change to be made.) One point that resonated with me is that we’re more likely to invest in multiple stages for code than for config. It’s a topic we’ve discussed on my team on and off. It does seem like it would be hard to do the work that would make multi-stage config deploys possible (certain kinds of them, anyway; some seem by their nature to be global)
- Post-incident report from Twilio for the Feb 26, 2021 incident: Nice writeup with aggressive, hopefully impactful action items. After the incident response, a critical-path service was found to have insufficient capacity and autoscaling behaviours. When it went down, dependent services followed. Dependencies were built to handle a failure from this upstream service