- Just giving everyone a pager isn’t the goal. We have to ensure the system isn’t generating so much noise that it burns people out.
- Sustainable ops
- Think about who’s carrying the pager, and when. Maybe the person with a young child shouldn’t be carrying the pager at night but can during the day.
I need to rewatch this one. I was only half paying attention. Lots to think about here.
Marc Brooker from AWS talks about availability. 20m, very relevant stuff.
- Availability is personal
- Correlated failure limits availability
- Redundancy isn’t always perfect (e.g. single points of failure)
- Blast radius is critical to availability
- My availability depends on the availability of my dependencies
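To make the dependency and correlated-failure points concrete, here is a rough sketch of the arithmetic. It is my own illustration, not something from the talk. The function names are made up, and the model is deliberately simple: dependencies fail independently, and a single shared cause (`correlated_failure_prob`, a hypothetical parameter) takes out every replica at once.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a service that needs ALL of its hard dependencies up.

    Assumes failures are independent, so the availabilities multiply.
    Pass the service's own availability as one of the entries.
    """
    return prod(availabilities)


def redundant_availability(replica_availability, replicas, correlated_failure_prob=0.0):
    """Availability of N redundant replicas where any one replica is enough.

    Simplified model (an assumption, not a real-world formula):
    - with probability `correlated_failure_prob` a shared cause (bad deploy,
      shared power, shared config) takes down every replica together;
    - otherwise replicas fail independently of each other.
    """
    all_fail_independently = (1.0 - replica_availability) ** replicas
    return (1.0 - correlated_failure_prob) * (1.0 - all_fail_independently)


# A 99.95% service with two 99.9% hard dependencies lands at roughly 99.75%.
print(serial_availability([0.9995, 0.999, 0.999]))

# Two independent 99% replicas give roughly 99.99%...
print(redundant_availability(0.99, 2))

# ...but a 0.1% chance of a shared failure caps us just under 99.9%,
# no matter how many replicas we add.
print(redundant_availability(0.99, 5, correlated_failure_prob=0.001))
```

The last call is the interesting one: once failures are correlated, adding replicas stops helping, which is why blast radius and shared single points of failure matter so much.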
The purpose of our system is not to hit an availability goal (99.95% uptime). It’s to serve our customers (people!). An uptime goal is a proxy for this.
20m on how to think about defining SLIs and SLOs. Very practical and well researched. He references this:
Cross posting notes from my production page:
SLIs, SLOs, SLAs, Error Budgets … Oh My!
Want to be able to answer the question “When should we slow down a bit to work on making our application more reliable, or performant?”
This is a framework to define availability in concrete terms, determine acceptable levels of it, and then come to a consensus with product, development, and the business about what we do when we don’t have enough of it.
- SLI – Service level indicator. A system metric that can be used to classify a request as either good or bad. e.g. 95th percentile latency should be < 100ms for requests to our web application
- SLO – Service level objective. A measure of how well we’re doing over time with respect to an agreement we make internally with ourselves about what good / bad even means. e.g. In the previous 30 days, we should hit our stated latency goals 99% of the time. Note: A corollary is that we don’t want to exceed our goals by much either. That’s time + energy that could have been put into making a better product
- SLA – Service level agreement. A promise we’re comfortable making to customers about how much availability we’re willing to guarantee. There may even be financial penalties associated with failure to meet goals
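A minimal sketch of how the SLI and SLO examples above could be computed, plus the error budget that falls out of them. This is my own illustration under some assumptions: latencies are grouped into fixed time windows (say, one list per minute), a window is "good" when its 95th percentile is under 100ms, and the SLO is that 99% of windows are good. The function names and the nearest-rank percentile method are my choices, not anything prescribed by the talk.

```python
import math


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


def window_is_good(latencies_ms, threshold_ms=100.0):
    """SLI: a window counts as good when its p95 latency is under the threshold."""
    return p95(latencies_ms) < threshold_ms


def slo_report(windows, slo_target=0.99, threshold_ms=100.0):
    """Summarise SLO compliance and remaining error budget over a set of windows.

    `windows` is a non-empty list of per-window latency lists, e.g. one list
    per minute for the previous 30 days.
    """
    total = len(windows)
    good = sum(window_is_good(w, threshold_ms) for w in windows)
    bad = total - good
    allowed_bad = (1.0 - slo_target) * total  # the error budget, in windows
    return {
        "compliance": good / total,
        "slo_met": good / total >= slo_target,
        # 1.0 = budget untouched, 0.0 = fully spent, negative = overspent
        "budget_remaining": 1.0 - bad / allowed_bad if allowed_bad else 0.0,
    }


# Hypothetical data: 995 fast windows and 5 windows with a slow tail.
fast = [50.0] * 10
slow = [50.0] * 9 + [500.0]
report = slo_report([fast] * 995 + [slow] * 5)
print(report)
```

With this made-up data the SLO is met with about half of the budget left. That remaining budget is the number to bring to the "should we slow down and work on reliability?" conversation: plenty left means keep shipping, nearly spent means it is time to prioritise reliability work.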
Aspects of production excellence
- monitor to see what’s happening
- debug, investigate to be able to ask questions about what’s happening
- together with the team
- eliminate + reduce complexity because complexity increases the probability that something bad will happen and makes it harder to understand exactly what
Minimal documentation for services
- What’s it for?
- Why is your service important? How important is it?
- How do we mitigate issues?
- What other services does it talk to?
We’re planning to put a CDN out in front of our web application at work for well-known, good reasons (performance, security, availability, etc). Here’s a tech talk from AWS about how CloudFront works:
Such a great talk. “Don’t think because you have an alert for something that you’re protected.” “The more alerts you have, the more information overload you may have for the operator.”
- Respect distributed systems
- Debuggability in production
- Debugging == new knowledge about the way a system works