20m on how to think about defining SLIs and SLOs. Very practical and well researched. He references this:

Cross posting notes from my production page:

SLIs, SLOs, SLAs, Error Budgets … Oh My!

Want to be able to answer the question “When should we slow down a bit to work on making our application more reliable or performant?”

This is a framework to define availability in concrete terms, determine acceptable levels of it, and then come to a consensus with product, development, and the business about what we do when we don’t have enough of it.

  • SLI – Service level indicator. A system metric that can be used to classify a request as either good or bad. e.g. 95th-percentile latency should be < 100ms for requests to our web application
  • SLO – Service level objective. A measure of how well we’re doing over time with respect to an agreement we make internally about what good / bad even means. e.g. over the previous 30d, we should hit our stated latency goals 99% of the time. Note: a corollary is that we don’t want to exceed our goals either. That’s time + energy taken away from work that could be put into making a better product
  • SLA – Service level agreement. A promise we’re comfortable making to customers about how much availability we’re willing to guarantee. There may even be financial penalties associated with failure to meet goals
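
A rough sketch of how the SLI / SLO / error budget pieces fit together, not taken from the talk. It assumes a request-based SLI (a request is good if it finishes in under 100ms) and a 99% target over whatever window of requests you pass in. The names is_good, evaluate_slo and SloReport are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    total: int               # requests seen in the window
    good: int                # requests the SLI classified as good
    compliance: float        # fraction of good requests
    budget_allowed: float    # bad requests the SLO tolerates in this window
    budget_remaining: float  # tolerated bad requests not yet spent


def is_good(latency_ms: float, threshold_ms: float = 100.0) -> bool:
    """SLI: classify a single request as good or bad by its latency."""
    return latency_ms < threshold_ms


def evaluate_slo(latencies_ms, target: float = 0.99,
                 threshold_ms: float = 100.0) -> SloReport:
    """Compare a window of request latencies against an SLO target.

    The error budget is the share of requests allowed to be bad,
    i.e. (1 - target) * total. A negative budget_remaining means
    the SLO was missed for this window.
    """
    total = len(latencies_ms)
    if total == 0:
        raise ValueError("no requests in window")
    good = sum(1 for ms in latencies_ms if is_good(ms, threshold_ms))
    bad = total - good
    allowed = (1.0 - target) * total
    return SloReport(
        total=total,
        good=good,
        compliance=good / total,
        budget_allowed=allowed,
        budget_remaining=allowed - bad,
    )


if __name__ == "__main__":
    # Hypothetical window: 995 fast requests, 5 slow ones.
    window = [50.0] * 995 + [250.0] * 5
    report = evaluate_slo(window)
    print(f"compliance: {report.compliance:.3%}")
    print(f"error budget remaining: {report.budget_remaining:.1f} requests")
```

When budget_remaining gets close to zero (or goes negative), that is the signal to slow down and spend time on reliability. When plenty is left, the budget can be spent on shipping.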

Cultivating Production Excellence

Aspects of production excellence

  • monitor to see what’s happening
  • debug, investigate to be able to ask questions about what’s happening
  • together with the team
  • eliminate + reduce complexity, because complexity increases the probability that something bad will happen and makes it harder to understand exactly what went wrong

Minimal documentation for services

  • What’s it for?
  • Why is your service important? How important is it?
  • How do we mitigate issues?
  • What other services does it talk to?