Lightning talk by our very own Jon Fishbein


JUnit testing with Selenide and IntelliJ


Breaking up with your test suite


Which shot should I get?


Liz Fong-Jones on Production Excellence

LFJ talks about production excellence

  • Just giving everyone a pager isn’t the goal. We have to ensure the system isn’t generating a lot of noise in a way that will burn people out
  • Metrics
  • SLIs
  • SLO
  • Sustainable ops
  • Think about who’s carrying the pager, and when. Maybe the person with a young child shouldn’t carry the pager at night but can during the day

I need to rewatch this one. I was only half paying attention. Lots to think about here.


An Availability Story

Marc Brooker from AWS talks about availability. 20m, very relevant stuff.

  • Availability is personal
  • Correlated failure limits availability
    • Redundancy isn’t always perfect (e.g., single points of failure)
  • Blast radius is critical to availability
  • My availability depends on the availability of my dependencies
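
To make the last point concrete for myself, here is a rough sketch of the arithmetic. It assumes failures are independent, which is exactly the assumption that correlated failure breaks, so treat the redundancy number as a best case. The function names and the example of three dependencies are mine, not from the talk.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a service that needs every dependency to be up.

    Assumes dependency failures are independent of each other.
    """
    return prod(availabilities)


def redundant_availability(availability, replicas):
    """Availability of N replicas where any single healthy replica is enough.

    Assumes replica failures are fully independent. Correlated failure
    (shared power, shared deploys, shared bugs) makes the real number worse.
    """
    return 1 - (1 - availability) ** replicas


# Hypothetical example: three dependencies, each at 99.95%.
deps = [0.9995, 0.9995, 0.9995]
print(f"serial: {serial_availability(deps):.4%}")

# Hypothetical example: two replicas, each at 99%.
print(f"redundant: {redundant_availability(0.99, 2):.4%}")
```

The serial result is always lower than the weakest dependency, which is why I can't promise more availability than the things I depend on.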

The purpose of our system is not to hit an availability goal (99.95% uptime). It’s to serve our customers (people!). An uptime goal is a proxy for this.




20m on how to think about defining SLIs and SLOs. Very practical and well researched. He references this:

Cross-posting notes from my production page:

SLIs, SLOs, SLAs, Error Budgets … Oh My!

We want to be able to answer the question “When should we slow down a bit to work on making our application more reliable or performant?”

This is a framework to define availability in concrete terms, determine acceptable levels of it, and then come to a consensus with product, development, and the business about what we do when we don’t have enough of it.

  • SLI – Service level indicator. A system metric that can be used to classify a request as either good or bad. e.g., 95th percentile latency should be < 100ms for requests to our web application
  • SLO – Service level objective. A measure of how well we’re doing over time against an agreement we make internally about what good / bad even means. e.g., over the previous 30d, we should hit our stated latency goals 99% of the time. Note: A corollary is that we don’t want to exceed our goals either. That’s time + energy taken away from work that could go into making a better product
  • SLA – Service level agreement. A promise we’re comfortable making to customers about how much availability we’re willing to guarantee. There may even be financial penalties associated with failure to meet goals
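
A minimal sketch of how the SLI and SLO examples above could be evaluated, mostly to check my own understanding. It assumes latencies are grouped into fixed time windows (say, one list per minute), that a window is "good" when its 95th percentile is under 100ms, and that the SLO is met when at least 99% of windows are good. The window grouping, the nearest-rank percentile, and all names here are my assumptions, not something from the talk.

```python
LATENCY_THRESHOLD_MS = 100  # SLI threshold from the example above
SLO_TARGET = 0.99           # fraction of windows that must be good


def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), to avoid floating-point surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def window_is_good(latencies_ms):
    """SLI: classify one time window as good or bad."""
    return p95(latencies_ms) < LATENCY_THRESHOLD_MS


def slo_report(windows, target=SLO_TARGET):
    """SLO: compare the fraction of good windows against the target.

    The error budget is the number of bad windows we are allowed
    before the SLO is violated; a negative remainder means it is spent.
    """
    total = len(windows)
    good = sum(1 for w in windows if window_is_good(w))
    bad = total - good
    allowed_bad = (1 - target) * total
    return {
        "compliance": good / total,
        "slo_met": good / total >= target,
        "error_budget_left": allowed_bad - bad,
    }


# Hypothetical data: 99 fast windows and 1 slow window.
windows = [[50] * 20 for _ in range(99)] + [[200] * 20]
print(slo_report(windows))
```

When the error budget runs out, that is the signal to slow down and work on reliability. When plenty is left, we can keep shipping.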

Cultivating Production Excellence

Aspects of production excellence

  • monitor to see what’s happening
  • debug, investigate to be able to ask questions about what’s happening
  • together with the team
  • eliminate + reduce complexity, because complexity increases the probability that something bad will happen and makes it harder to understand exactly what

Minimal documentation for services

  • What’s it for?
  • Why is your service important? How important is it?
  • How do we mitigate issues?
  • What other services does it talk to?




We’re planning to put a CDN out in front of our web application at work for well-known, good reasons (performance, security, availability, etc). Here’s a tech talk from AWS about how CloudFront works:


Bryan Cantrill on automation, complexity, microservices, human-machine systems

Such a great talk. “Don’t think because you have an alert for something that you’re protected.” “The more alerts you have, the more information overload you may have for the operator.”

  • Ideas
    • Respect distributed systems
    • Debuggability in production
    • Debugging == new knowledge about the way a system works