SLIs, SLOs, SLAs, Error Budgets … Oh My!
We want to be able to answer the question: “When should we slow down a bit to work on making our application more reliable or performant?”
This is a framework to define availability in concrete terms, determine acceptable levels of it, and then come to a consensus with product, development, and the business about what we do when we don’t have enough of it.
- SLI – Service level indicator. A system metric that can be used to classify a request as either good or bad, e.g. the 95th-percentile latency of requests to our web application should be < 100 ms
- SLO – Service level objective. A measure of how well we’re doing over time against an agreement we make internally about what good and bad even mean, e.g. over the previous 30 days, we should hit our stated latency goal 99% of the time. Note: as a corollary, we don’t want to far exceed our goals either. That is time + energy taken away from work that could be put into making a better product
- SLA – Service level agreement. A promise we’re comfortable making to customers about how much availability we’re willing to guarantee. There may even be financial penalties associated with failure to meet goals
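To make the three definitions above a bit more concrete, here is a minimal sketch of how an error budget could be computed from raw request latencies. It assumes a simplified per-request SLI (a request is "good" if it finishes in under 100 ms) rather than the percentile form, and an SLO target of 99% good requests over the window. The names `SloReport`, `is_good`, and `evaluate` are made up for illustration and don't come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    total: int          # requests seen in the window (e.g. the previous 30 days)
    good: int           # requests the SLI classified as good
    slo_target: float   # assumed target, e.g. 0.99 means 99% must be good

    @property
    def compliance(self) -> float:
        """Fraction of requests that were good."""
        return self.good / self.total if self.total else 1.0

    @property
    def allowed_bad(self) -> float:
        """The error budget: how many bad requests the SLO tolerates."""
        return self.total * (1.0 - self.slo_target)

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left (negative means overspent)."""
        bad = self.total - self.good
        allowed = self.allowed_bad
        return 1.0 - bad / allowed if allowed else 0.0


def is_good(latency_ms: float, threshold_ms: float = 100.0) -> bool:
    """The SLI: classify a single request as good or bad."""
    return latency_ms < threshold_ms


def evaluate(latencies_ms, slo_target: float = 0.99,
             threshold_ms: float = 100.0) -> SloReport:
    """Apply the SLI to every request and summarize against the SLO."""
    good = sum(1 for ms in latencies_ms if is_good(ms, threshold_ms))
    return SloReport(total=len(latencies_ms), good=good, slo_target=slo_target)


if __name__ == "__main__":
    # Hypothetical window: 995 fast requests and 5 slow ones.
    report = evaluate([50.0] * 995 + [250.0] * 5)
    print(f"compliance: {report.compliance:.3f}")
    print(f"budget remaining: {report.budget_remaining:.2f}")
    if report.budget_remaining <= 0:
        print("Budget spent: time to slow down and work on reliability.")
```

In this made-up window, 5 bad requests out of 1,000 against a 99% target uses about half of the budget (10 bad requests are allowed). A budget at or below zero is the signal to answer the opening question with "now"; a budget that is barely touched suggests we may be over-investing in reliability.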
- A great playlist from Liz Fong-Jones and Seth Vargo @ Google. I love it!
- Fred Moyer talking about this stuff (20 mins @ USENIX 2020. Great presentation)
- GitHub’s deployment toolchain. ChatOps + an automated state machine sitting on top of existing smaller, special-purpose tools
All platforms must have a story around these functional areas. Even if you don’t think of your system as a platform (maybe you didn’t intentionally design it as such), you have one!
- Service discovery
- Release management / deployment
- Internal / external routing
- Config / secrets management
- Run code
From the System Design Primer
Start with use cases and constraints
- Who is going to use it?
- How are they going to use it?
- How many users are there?
- What does the system do?
- What are the inputs and outputs of the system?
- How much data do we expect to handle?
- How many requests per second do we expect?
- What is the expected read to write ratio?
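The last few questions above (users, data volume, requests per second, read to write ratio) lend themselves to a quick back-of-the-envelope calculation. Here is a rough sketch, assuming traffic is spread evenly over the day and that only writes add to storage. The function name `estimate_load` and all of the input numbers are invented for illustration; real traffic is bursty, so treat the output as an average, not a peak.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate_load(daily_active_users: int,
                  requests_per_user_per_day: float,
                  read_write_ratio: float,
                  bytes_per_write: int) -> dict:
    """Turn use-case answers into average load and storage growth.

    read_write_ratio is reads per write, e.g. 9 means 9 reads for every write.
    """
    total_per_day = daily_active_users * requests_per_user_per_day
    avg_rps = total_per_day / SECONDS_PER_DAY

    # With R reads per write, writes are 1 / (R + 1) of all requests.
    write_fraction = 1 / (read_write_ratio + 1)
    writes_per_day = total_per_day * write_fraction

    return {
        "avg_rps": avg_rps,
        "read_rps": avg_rps * (1 - write_fraction),
        "write_rps": avg_rps * write_fraction,
        "storage_bytes_per_day": writes_per_day * bytes_per_write,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 1M users, 10 requests each per day,
    # 9 reads per write, 1 KiB stored per write.
    estimate = estimate_load(1_000_000, 10, 9, 1024)
    for name, value in estimate.items():
        print(f"{name}: {value:,.1f}")
```

With these made-up inputs the system averages roughly 116 requests per second, about a tenth of them writes, and grows by around 1 GB of new data per day.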
- List from the CNCF of licenses that are generally safe to include as dependencies. No copyleft-like provisions encumbering derivative works