20m on how to think about defining SLIs and SLOs. Very practical and well researched. He references this:

Cross posting notes from my production page:

SLIs, SLOs, SLAs, Error Budgets … Oh My!

Want to be able to answer the question “When should we slow down a bit to work on making our application more reliable or performant?”

This is a framework to define availability in concrete terms, determine acceptable levels of it, and then come to a consensus with product, development, and the business about what we do when we don’t have enough of it.

  • SLI – Service level indicator. A system metric that can be used to classify a request as either good or bad. e.g. 95th percentile latency should be < 100ms for requests to our web application
  • SLO – Service level objective. A measure of how well we’re doing over time with respect to an agreement we make internally with ourselves about what good / bad even means. e.g. in the previous 30d, 99% of the time we should hit our stated latency goals. Note: a corollary is that we don’t want to exceed our goals either. That’s time + energy taken away from work that could be put into making a better product
  • SLA – Service level agreement. A promise we’re comfortable making to customers about how much availability we’re willing to guarantee. There may even be financial penalties associated with failure to meet goals
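The SLO is what makes the error budget concrete: the slack the objective leaves you is the amount of “bad” time you’re allowed to spend. A quick sketch of the arithmetic, using the 99%-over-30d numbers from the example above:

```python
# Error budget implied by a 99% SLO over a rolling 30-day window
slo_target = 0.99
window_minutes = 30 * 24 * 60  # 43,200 minutes in 30 days

# The budget is the fraction of the window the SLO lets you miss
budget_minutes = window_minutes * (1 - slo_target)  # ~432 minutes, about 7.2 hours
```

If you’ve burned the budget, slow down and work on reliability; if you have plenty left, that’s your signal it’s fine to keep shipping features.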

Cultivating Production Excellence

Aspects of production excellence

  • monitor to see what’s happening
  • debug, investigate to be able to ask questions about what’s happening
  • together with the team
  • eliminate + reduce complexity because complexity increases the probability that something bad will happen and makes it harder to understand exactly what went wrong

Minimal documentation for services

  • What’s it for?
  • Why is your service important? How important is it?
  • How do we mitigate issues?
  • What other services does it talk to?




We’re planning to put a CDN in front of our web application at work for well-known, good reasons (performance, security, availability, etc). Here’s a tech talk from AWS about how CloudFront works:


Links: 2020-12-27

  • Scalability, and load testing VALORANT: Nice discussion of how to set up a load-testing harness. “Simulated player”, “scenario”, and “player pool” are the basic abstractions they settled on. Architectural concerns for the game server they thought about up front were microservices, sharding their data store, and caching
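Those three abstractions compose nicely. A rough sketch of how they might fit together; every class and method name here is invented for illustration, not Riot’s actual code:

```python
class Scenario:
    """A named sequence of actions a simulated player performs."""
    def __init__(self, name, steps):
        self.name, self.steps = name, steps

class SimulatedPlayer:
    """One scripted client that exercises the server like a real player would."""
    def __init__(self, player_id):
        self.player_id = player_id

    def run(self, scenario):
        # A real harness would issue network calls per step; we just record them
        return [f"player {self.player_id}: {step}" for step in scenario.steps]

class PlayerPool:
    """A population of simulated players used to generate load."""
    def __init__(self, size):
        self.players = [SimulatedPlayer(i) for i in range(size)]

    def run_scenario(self, scenario):
        # Serial for clarity; a real harness would run these concurrently
        return [p.run(scenario) for p in self.players]

pool = PlayerPool(size=3)
results = pool.run_scenario(Scenario("login-and-queue", ["login", "join_queue"]))
```

Scaling load then just means growing the pool, and varying traffic shape means swapping scenarios, which is presumably the point of choosing these abstractions.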


Linkroll: 2020-12-19


Linkroll: 2020-12-15

  • Strava, The Boring Option: A story about a schema design decision (width of an id field in one of their tables) that worked great from 2009 – 2020 but then needed to change. A 32-bit unsigned, monotonically increasing id field is good for ~4 billion unique values before it wraps around. Depending on how many of these you’re using, it could last a long time. It did for Strava. The COVID-19 pandemic meant all their users were using their service way more than normal, which accelerated the need for re-work here. They were pragmatic about what they did. They considered different datastores to store this data (huge table, lots of read/write activity on it) but in the end decided they knew mysql and were comfortable with it. They found a way using their current datastore (and reserved the right to consider different ones in the future; they had a problem to solve today). Great story!
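The arithmetic behind that 32-bit ceiling is worth doing explicitly. The write rates below are made-up illustrations, not Strava’s numbers:

```python
# A 32-bit unsigned id column can hold 2**32 distinct values before wrapping
max_ids = 2 ** 32  # 4,294,967,296

# Back-of-envelope lifetimes at different (hypothetical) insert rates:
seconds_per_year = 365 * 24 * 3600
years_at_1_per_sec = max_ids / seconds_per_year   # ~136 years at 1 row/sec
days_at_1k_per_sec = max_ids / 1_000 / 86_400     # ~50 days at 1,000 rows/sec
```

Which is the whole story in two lines: the same id space lasts a century or a couple of months depending entirely on write rate, so a growth spike like the pandemic can turn a non-problem into an urgent one.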


Bryan Cantrill on automation, complexity, microservices, human-machine systems

Such a great talk. “Don’t think because you have an alert for something that you’re protected.” “The more alerts you have, the more information overload you may have for the operator.”

  • Ideas
    • Respect distributed systems
    • Debuggability in production
    • Debugging == new knowledge about the way a system works

Rewatched James Hamilton’s keynote from re:Invent 2016

This may very well be my favourite conference talk ever. So many big ideas presented in 1.5 hours and articulated well and with enthusiasm!


Advent of Code 2020

Day 1


Watching LizTheGrey’s stream. She’s fantastic. She’s going through computer science principles that factor into the choices she’s making

Today I learned

  • Thinking about how much work we’re doing is an important exercise. Avoiding repeatedly computing the same thing makes a lot of sense (relates to algorithmic complexity)
  • If you are making an assumption about input (eg no negative values, no zeros) you can make this explicit by adding an assertion so that your program fails fast (with hopefully a helpful error) when that assumption is invalidated
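A minimal sketch of that fail-fast pattern; the function and inputs are invented for illustration:

```python
def smallest_positive(values):
    # Make the "no zeros, no negatives" assumption explicit so the program
    # fails fast with a clear error instead of silently computing nonsense
    assert all(v > 0 for v in values), f"expected strictly positive inputs, got {values}"
    return min(values)

smallest_positive([3, 1, 7])    # -> 1
# smallest_positive([3, -1, 7]) would raise AssertionError immediately
```

The assertion documents the assumption as well as enforcing it, which is most of its value when you come back to the code later.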

Day 5

  • This one is a bit of a bugger. My solution uses a binary search strategy with low and high pointers that shift as you get closer to your goal. Not hard to write, but it was a bit fiddly
  • Liz (and others I can see) did something much simpler. Set ‘B’ -> 1 and ‘R’ -> 1 in the input, ignored ‘F’ and ‘L’ (or set these to zero) and somehow with only a little more energy got the answer. What the hell is going on here?

Have to think about this one a bit more
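One way to see the B/R-as-1 trick, sketched out, assuming the standard day 5 input format: the seat ID is row * 8 + col, the row is 7 bits and the column is 3 bits, and multiplying by 8 is just a 3-bit left shift. So mapping B/R to 1 and F/L to 0 and reading the whole 10-character spec as one binary number produces the seat ID directly, with no search needed:

```python
def seat_id(spec):
    # F/B partition rows (0/1) and L/R partition columns (0/1); concatenated,
    # the spec is already the binary representation of row * 8 + col
    bits = spec.translate(str.maketrans("FBLR", "0101"))
    return int(bits, 2)

seat_id("FBFBBFFRLR")  # 357, the puzzle's worked example
```

The binary-search-with-pointers approach computes the same thing, just by narrowing an interval one bit at a time instead of reading the bits off directly.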

Day 7

Had to remind myself today to not try to overly anticipate what part b will ask me to do. More often than not the complexity of part a goes up and what I think is coming next doesn’t.

Keeping it simple is an important principle in systems design!

Day 13

Alright so part 2 I don’t understand the answer for. I have to think about this one a bit more. It’s only a few lines long …


  • Revisit day 13, part b



Linkroll: 2020-11-30

  • DNS load balancing: This company is using DNS load balancing to good effect for some of its traffic. Not machine-to-machine API traffic, it sounds like. (Works ok for human clients that honour TTLs.) 2 big problems with DNS load balancing are 1) uneven distribution of load (a problem for load balancers too, but you at least have some say in how requests are forwarded), and 2) how are failed servers removed from the pool?
  • Cloudflare postmortem (Byzantine failure in etcd cluster): Interesting. A few distributed systems bolted together to create a bigger one. Each individual component is “fault tolerant” on its own, but new kinds of failures emerge when they are connected to each other. Keep it boring for as long as you possibly can! This is usually a lot longer than you think
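A toy simulation of the DNS note’s first problem, uneven load: a client that honours the TTL pins itself to whichever record it first resolved until the cache expires, so load moves in coarse per-client chunks rather than per request. Everything here (addresses, TTL, counts) is made up for illustration:

```python
import random
from collections import Counter

RECORDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical A records

def resolve_with_cache(cache, client_id, now, ttl=60):
    """Pick a record at random, but reuse the cached answer until the TTL expires,
    which is roughly what a well-behaved stub resolver does."""
    entry = cache.get(client_id)
    if entry is None or now - entry[1] >= ttl:
        entry = (random.choice(RECORDS), now)
        cache[client_id] = entry
    return entry[0]

random.seed(1)
cache = {}
# 10 clients each making 100 requests inside a single TTL window
hits = Counter(resolve_with_cache(cache, c, now=0) for c in range(10) for _ in range(100))
# Each server's load is a multiple of 100: balance is per-client, not per-request
```

With few clients the skew can be severe; it only averages out with a large client population, which is why this works better for human browser traffic than for a handful of chatty machine-to-machine callers.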