Marc Brooker from AWS talks about availability. 20m, very relevant stuff.
Availability is personal
Correlated failure limits availability
Redundancy isn’t always perfect (e.g. single points of failure)
Blast radius is critical to availability
My availability depends on the availability of my dependencies
The purpose of our system is not to hit an availability goal (e.g. 99.95% uptime). It’s to serve our customers (people!). An uptime goal is a proxy for this.
20m on how to think about defining SLIs and SLOs. Very practical and well researched. He references this:
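To put rough numbers on the points above, here is a minimal back-of-the-envelope sketch. It is not from the talk. The function names are made up for illustration, and the `shared_failure` parameter is a toy stand-in for correlated failure (the probability that something takes out every replica at once). The serial and redundancy formulas assume failures are otherwise independent, which real systems rarely honor.

```python
from math import prod


def serial_availability(dependencies):
    """Availability when every dependency must be up (assumes independent failures)."""
    return prod(dependencies)


def redundant_availability(replica, n, shared_failure=0.0):
    """Availability of n redundant replicas.

    shared_failure is a toy model of correlated failure: the probability
    that one event takes down all replicas together. It caps the result
    at (1 - shared_failure) no matter how many replicas are added.
    """
    independent = 1 - (1 - replica) ** n
    return (1 - shared_failure) * independent


def downtime_minutes(availability, days=30):
    """Expected minutes of downtime over a window of the given length."""
    return (1 - availability) * days * 24 * 60


if __name__ == "__main__":
    # Three dependencies at 99.9% each leave us below 99.9% overall.
    print(serial_availability([0.999, 0.999, 0.999]))
    # Two 99% replicas look great until a shared failure mode is included.
    print(redundant_availability(0.99, 2))
    print(redundant_availability(0.99, 2, shared_failure=0.001))
    # What a 99.95% goal allows over 30 days.
    print(downtime_minutes(0.9995))
```

The useful takeaway is the shape of the result rather than the exact figures: chaining dependencies multiplies availability down, and adding replicas stops helping once the shared failure mode dominates.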
Cross-posting notes from my production page:
SLIs, SLOs, SLAs, Error Budgets … Oh My!
Want to be able to answer the question “When should we slow down a bit to work on making our application more reliable, or performant?”
This is a framework to define availability in concrete terms, determine acceptable levels of it, and then come to a consensus with product, development, and the business about what we do when we don’t have enough of it.
SLI – Service level indicator. A system metric that can be used to classify a request as either good or bad. e.g. 95th percentile latency should be < 100ms for requests to our web application.
SLO – Service level objective. A measure of how well we’re doing over time with respect to an agreement we make internally about what good / bad even means. e.g. In the previous 30d, 99% of the time we should hit our stated latency goals. Note: A corollary is that we don’t want to exceed our goals either. That’s time + energy taken away from work that could be put into making a better product.
SLA – Service level agreement. A promise we’re comfortable making to customers about how much availability we’re willing to guarantee. There may even be financial penalties associated with failure to meet goals.
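A small sketch of how these pieces fit together, using the example numbers above (100ms threshold, 99% objective). Everything here is an illustrative assumption: the `Request` type and function names are hypothetical, and it simplifies the SLI to a per-request good / bad check instead of a percentile. The error budget is just the amount of "bad" the SLO allows, which is what helps answer the "when should we slow down?" question.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float


def is_good(request, threshold_ms=100.0):
    """SLI: classify a single request as good (True) or bad (False)."""
    return request.latency_ms < threshold_ms


def good_ratio(requests, threshold_ms=100.0):
    """Fraction of requests in the window that met the SLI."""
    if not requests:
        raise ValueError("need at least one request in the window")
    good = sum(is_good(r, threshold_ms) for r in requests)
    return good / len(requests)


def error_budget_remaining(requests, slo=0.99, threshold_ms=100.0):
    """Bad requests we can still afford in this window before missing the SLO.

    A negative result means the budget is spent and the SLO is missed.
    """
    if not requests:
        raise ValueError("need at least one request in the window")
    allowed_bad = (1 - slo) * len(requests)
    actual_bad = sum(not is_good(r, threshold_ms) for r in requests)
    return allowed_bad - actual_bad


if __name__ == "__main__":
    # Hypothetical window: 995 fast requests and 5 slow ones.
    window = [Request(50.0)] * 995 + [Request(150.0)] * 5
    print(good_ratio(window))
    print(error_budget_remaining(window))
```

When the remaining budget trends toward zero, that is the signal to shift effort toward reliability. When there is plenty left, per the corollary above, we may be over-investing.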
Debug and investigate so we can ask questions about what’s happening
Do it together with the team
Eliminate + reduce complexity, because complexity increases the probability that something bad will happen and makes it harder to understand exactly what
Minimal documentation for services
What’s it for?
Why is your service important? How important is it?
We’re planning to put a CDN in front of our web application at work for well-known, good reasons (performance, security, availability, etc). Here’s a tech talk from AWS about how CloudFront works:
Such a great talk. “Don’t think because you have an alert for something that you’re protected.” “The more alerts you have, the more information overload you may have for the operator.”
Ideas
Respect distributed systems
Debuggability in production
Debugging == new knowledge about the way a system works