Mikey Dickerson Hierarchy

Service reliability hierarchy

Resilient Systems

Books

These books have influenced my thinking immensely. Thanks so much to their respective authors.

Agile

Stories should be written to fit INVEST criteria:

  • Independent: One story doesn’t depend on another (this relates to the rule of no technical stories!). Any one story could be the next one picked up. The customer has the final say.
  • Negotiable: A story is a promise to communicate, not something that is set in stone.
  • Valuable: Stories are small enough to be delivered to customers at a reasonable cadence. The customer can see that there is momentum, and they can affect outcomes by choosing what gets done when. This creates engagement.
  • Estimable: You should have some sense of how long a thing will take. If you’re not confident, research is required.
  • Small: Much less than a full sprint. I’ve found a day’s worth of effort to be good.
  • Testable: How do you know you’re done?

source

Alerting

Interrupting humans is expensive. Urgent things are stressful, and while we handle them we’re not doing the work we actually want done. Ideally we avoid poking people unless a human really should get involved, and when we do, we provide as much context as we can to get them started on the problem (e.g. here’s the detected condition, here’s the alarm that fired, here’s how to go about investigating …).

Every alert should:

  • Have rich Context
  • Be Actionable
  • Be Symptom related
  • Be regularly Evaluated. Is this alert still relevant?

The initials spell CASE: Context, Actionable, Symptom, Evaluated.
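As a rough illustration, the four criteria above could be checked mechanically against an alert definition. This is only a sketch: the `Alert` class, its field names, the `case_problems` helper, and the 180-day review window are all assumptions made for the example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Alert:
    """Hypothetical alert definition; the field names are illustrative only."""
    name: str
    symptom: str   # the user-visible effect, rather than an internal cause
    context: str   # the detected condition and the alarm that fired
    runbook: List[str] = field(default_factory=list)  # first investigation steps
    last_reviewed: Optional[date] = None               # when relevance was last checked


def case_problems(alert: Alert, today: date, max_review_age_days: int = 180) -> List[str]:
    """Return the CASE criteria this alert fails (empty list if none)."""
    problems = []
    if not alert.context.strip():
        problems.append("context: no detected condition or alarm details")
    if not alert.runbook:
        problems.append("actionable: no investigation steps")
    if not alert.symptom.strip():
        problems.append("symptom: no user-visible symptom described")
    # The review window is an assumed value; pick whatever cadence suits the team.
    max_age = timedelta(days=max_review_age_days)
    if alert.last_reviewed is None or today - alert.last_reviewed > max_age:
        problems.append("evaluated: review is missing or stale")
    return problems


checkout_alert = Alert(
    name="checkout-error-rate",
    symptom="Customers see errors when paying",
    context="5xx rate on the checkout endpoint above threshold for 10 minutes",
    runbook=["Check the latest deploy", "Check payment provider status"],
    last_reviewed=date(2024, 1, 1),
)
print(case_problems(checkout_alert, today=date(2024, 3, 1)))
```

A check like this could run in CI over alert definitions, so that an alert without context or a runbook is caught before it ever pages anyone.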

source

Production Readiness

A few things I’ve found that help guide a conversation with developers around what operating a service in production will be like:

  • What is the service criticality? If it’s down, should we wake someone up? Who?
  • What are the service key metrics? Which indicators tell us something interesting is happening? Are these metrics being collected?
  • Is it well tested? Is a CI job set up and running builds and tests often? (Automated tests: unit, functional, integration, performance, security, visual.)
  • Are logs and exceptions collected? Do we need a rotation policy?
  • Is there data that should be backed up?
  • Has a security review happened? Threat modelling?
  • Should we think about ETL?
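One way to keep such a conversation honest is to record the answers and see which questions are still open. The sketch below assumes a plain dictionary of answers; the keys, the `READINESS_QUESTIONS` name, and the `open_questions` helper are hypothetical and only mirror the list above.

```python
from typing import Dict, List

# Hypothetical keys; the question text paraphrases the list above.
READINESS_QUESTIONS: Dict[str, str] = {
    "criticality": "If it is down, should we wake someone up? Who?",
    "key_metrics": "Which indicators matter, and are they being collected?",
    "testing": "Is a CI job building and running automated tests often?",
    "logs": "Are logs and exceptions collected? Is a rotation policy needed?",
    "backups": "Is there data that should be backed up?",
    "security_review": "Has a security review or threat modelling happened?",
    "etl": "Should we think about ETL?",
}


def open_questions(answers: Dict[str, str]) -> List[str]:
    """Return the keys of questions with no answer, or only a blank one."""
    return [key for key in READINESS_QUESTIONS if not answers.get(key, "").strip()]


answers = {
    "criticality": "Yes, page the payments on-call",
    "backups": "  ",  # a blank answer still counts as open
}
for key in open_questions(answers):
    print(f"{key}: {READINESS_QUESTIONS[key]}")
```

Treating a blank answer as open is a deliberate choice here: "we talked about it" is not the same as having written down a decision.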

State of Devops Report

source

2019

software delivery practices

Delivery practices that contribute to better safety, speed

Psychological safety

Productivity

The Critical Path

Identify those bits of your system that make your company money. Those are the things that need the most care in monitoring, alerting, and reliability engineering: anything where, if it goes down, you’re no longer a company or you start losing huge amounts of money.