Liz Fong-Jones on Production Excellence

LFJ talks about production excellence

  • Just giving everyone a pager isn’t the goal. We have to ensure the system isn’t generating a lot of noise in a way that will burn people out
  • Metrics
  • SLIs
  • SLO
  • Sustainable ops
  • Think about who’s carrying the pager, and when. Maybe the person with a young child shouldn’t be carrying the pager at night but can during the day

I need to rewatch this one. I was only half paying attention. Lots to think about here.



  • AWS network load balancers @ Ably: Ably is a platform other developers can use to provide realtime push notifications at scale to their users. They have to handle lots of persistent connections, and a variable connection rate that can spike dramatically. Sounds like the NLB isn’t quite delivering the extreme levels of service it claims to be able to. Note: It’s an amazing box for the rest of us running applications without those constraints (Probably the vast majority of us?!)
  • Devops practice @ Algolia: Nice write up about what the team does and their process for getting things done. Work buckets: projects, operations, on call. Meetings: Once weekly Production Meetup discussing what happened last week in on-call + project statuses. Priorities: Answer customer questions, answer internal team questions, incident response, infra provisioning + management



20m on how to think about defining SLIs and SLOs. Very practical and well researched. He references this:

Cross posting notes from my production page:

SLIs, SLOs, SLAs, Error Budgets … Oh My!

Want to be able to answer the question “When should we slow down a bit to work on making our application more reliable, or performant?”

This is a framework to define availability in concrete terms, determine acceptable levels of it, and then come to a consensus with product, development, and the business about what we do when we don’t have enough of it.

  • SLI – Service level indicator. A system metric that can be used to classify a request as either good or bad. e.g. 95th percentile latency should be < 100ms for requests to our web application
  • SLO – Service level objective. A measure of how well we’re doing over time with respect to an agreement we make internally with ourselves about what good / bad even means. e.g. Over the previous 30d, we should hit our stated latency goals 99% of the time. Note: A corollary is that we don’t want to exceed our goals by much either. That’s time + energy taken away from work that could be put into making a better product
  • SLA – Service level agreement. A promise we’re comfortable making to customers about how much availability we’re willing to guarantee. There may even be financial penalties associated with failure to meet goals
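
A rough sketch of how the SLI and SLO definitions above could turn into numbers. It is not taken from the talk. The 100ms threshold and 99% target reuse the examples above; the names (is_good, evaluate, SloReport) are made up for illustration. Two simplifications: it counts requests rather than time windows, and it judges each request against the threshold instead of computing a percentile. The error budget here is the usual definition: the share of bad requests the SLO still allows.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumed values, borrowed from the examples in these notes.
LATENCY_THRESHOLD_MS = 100.0  # SLI: a request is "good" if faster than this
SLO_TARGET = 0.99             # SLO: fraction of requests that must be good


def is_good(latency_ms: float, threshold_ms: float = LATENCY_THRESHOLD_MS) -> bool:
    """SLI: classify a single request as good (True) or bad (False)."""
    return latency_ms < threshold_ms


@dataclass
class SloReport:
    """Hypothetical summary of one evaluation window (e.g. the last 30d)."""
    total: int
    good: int
    target: float

    @property
    def compliance(self) -> float:
        # Fraction of good requests; an empty window counts as compliant.
        return self.good / self.total if self.total else 1.0

    @property
    def met(self) -> bool:
        return self.compliance >= self.target

    @property
    def budget_remaining(self) -> float:
        # Error budget: bad requests the SLO allows, minus those already seen.
        # A negative value means the budget is spent.
        allowed_bad = (1.0 - self.target) * self.total
        return allowed_bad - (self.total - self.good)


def evaluate(latencies_ms: Sequence[float],
             target: float = SLO_TARGET,
             threshold_ms: float = LATENCY_THRESHOLD_MS) -> SloReport:
    """Apply the SLI to every request in the window and summarize."""
    good = sum(1 for latency in latencies_ms if is_good(latency, threshold_ms))
    return SloReport(total=len(latencies_ms), good=good, target=target)


if __name__ == "__main__":
    window = [50.0] * 995 + [250.0] * 5  # synthetic latencies, not real data
    report = evaluate(window)
    print(report.compliance, report.met, report.budget_remaining)
```

A negative budget_remaining is one possible trigger for the "slow down and work on reliability" conversation. A large positive one suggests there is room to ship faster.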

Cultivating Production Excellence

Aspects of production excellence

  • monitor to see what’s happening
  • debug, investigate to be able to ask questions about what’s happening
  • together with the team
  • eliminate + reduce complexity because complexity increases the probability that something bad will happen and makes it harder to understand exactly what went wrong

Minimal documentation for services

  • What’s it for?
  • Why is your service important? How important is it?
  • How do we mitigate issues?
  • What other services does it talk to?