Interrupting humans is expensive. (Urgent things are stressful, and while we’re handling them we’re not doing the work we actually want done.) Ideally we avoid poking people unless a human actually should get involved, and when we do, we provide as much context as we can to help them start looking into the problem. (e.g. Here’s the detected condition, here’s the alarm that fired, here’s how to go about investigating …)
Every alert should:
- Have rich Context
- Be Actionable
- Be Symptom related
- Be regularly Evaluated. Is this alert still relevant?
“CASE” / [source]
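One way to make the four CASE properties concrete is to treat them as fields on an alert definition and flag whichever are missing. The following is a minimal sketch, not any monitoring tool's API: the `AlertRule` class, the `case_gaps` function, the field names, and the 90-day review interval are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional

# Assumed review cadence; pick whatever suits your team.
REVIEW_INTERVAL = timedelta(days=90)


@dataclass
class AlertRule:
    """Hypothetical alert definition carrying the four CASE properties."""
    name: str
    symptom: str                                   # Symptom: what users experience
    context: Dict[str, str] = field(default_factory=dict)  # Context: condition, dashboards
    runbook: str = ""                              # Actionable: first investigation steps
    last_reviewed: Optional[date] = None           # Evaluated: when relevance was last checked


def case_gaps(rule: AlertRule, today: date) -> List[str]:
    """Return the CASE properties this alert is missing or has let lapse."""
    gaps = []
    if not rule.context:
        gaps.append("context")
    if not rule.runbook:
        gaps.append("actionable")
    if not rule.symptom:
        gaps.append("symptom")
    # An alert that was never reviewed, or not recently, needs re-evaluation.
    if rule.last_reviewed is None or today - rule.last_reviewed > REVIEW_INTERVAL:
        gaps.append("evaluated")
    return gaps


checkout_alert = AlertRule(
    name="checkout-errors",
    symptom="Checkout error rate is above 5%",
    context={"condition": "5xx ratio > 0.05 for 10 minutes"},
    runbook="Check recent deploys, then the payment provider status page.",
    last_reviewed=date(2024, 1, 1),
)
```

Calling `case_gaps(checkout_alert, date(2024, 2, 1))` returns an empty list, while the same call with `date(2024, 6, 1)` returns `["evaluated"]` because the last review is more than 90 days old. A check like this could run in CI against alert definitions so that alerts without context or a runbook never reach a pager.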
Production Readiness Checklist
A few things I’ve found that help guide a conversation with developers around what operating a service in production will be like:
- What is the service criticality? If it’s down, should we wake someone up?
- What are the service’s key metrics? Which indicators tell us something interesting is happening? Are these metrics being collected?
- Is it well tested? Is a CI job set up and running builds and tests often? (Automated tests: unit, functional, integration, performance, security, visual.)
- Are logs and exceptions collected? Do we need a rotation policy?
- Is there data that should be backed up?
- Has a security review happened? Threat modelling?
- Should we think about ETL?
- The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2: https://www.amazon.ca/Practice-Cloud-System-Administration-Practices/dp/032194318X/. This book has influenced much of my current thinking around good software development & delivery practice. I can’t recommend it enough.