Putting a service into production is among the most important things SREs do, and it should always involve careful planning and preparation. (Otherwise it probably isn’t that important, which raises the question: why are we putting it into production in the first place?)
Operating complex, distributed systems in production is a topic I’m especially interested in these days, and on occasion, I collect notes about it.
- Does your new service come with a health dashboard?
- Link, please!
- Some things to show: host metrics, application metrics, and any metrics extracted from log events
- Have notifications been set up around the key metrics that affect customer experience, so that they fire when performance thresholds are crossed? (Alerts come in 3 levels: observed; important, with a 24hr resolution SLO; and emergency, fix now because the SLA is broken.)
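One way to picture the three alert levels is as ordered thresholds on a single metric. The sketch below is only an illustration: the metric (p99 latency), the threshold values, and the names `AlertLevel`, `P99_LATENCY_THRESHOLDS_MS` and `classify` are all assumptions made up for this example, not values from any real service.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    OBSERVED = "observed"      # worth watching, nobody is paged
    IMPORTANT = "important"    # resolve within the 24hr SLO
    EMERGENCY = "emergency"    # SLA is broken, fix now


# Hypothetical thresholds for a p99 latency metric, in milliseconds.
# Each entry is (lower bound, level); the highest bound reached wins.
P99_LATENCY_THRESHOLDS_MS = [
    (250.0, AlertLevel.OBSERVED),
    (500.0, AlertLevel.IMPORTANT),
    (1000.0, AlertLevel.EMERGENCY),
]


def classify(value, thresholds):
    """Return the most severe level whose lower bound the value has reached."""
    level = AlertLevel.OK
    for bound, candidate in sorted(thresholds, key=lambda t: t[0]):
        if value >= bound:
            level = candidate
    return level
```

In practice the thresholds would live in whatever alerting tool you use; the point is that each level maps to a different response, so each needs its own explicit boundary.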
Documenting a service
Who owns it, and who knows the most about it, if these aren’t one and the same person? Include contact info (email, phone, etc.) and any alternate contacts.
- Include an architecture diagram, and
- Any services it depends on (e.g. if my service depends on service A and A fails, my service fails), as well as
- Instructions detailing how to stop and start it, and finally
- Is there any special performance-related config? (e.g. load shedding, function disabling, …)
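To make "load shedding" concrete, here is a minimal sketch of one common form: cap the number of requests in flight and reject the rest quickly. Everything named here is an assumption for illustration, including the `LoadShedder` class, the `handle` helper, the `MAX_IN_FLIGHT` environment variable and its default of 100; a real service might shed on queue depth or latency instead.

```python
import os
import threading


class LoadShedder:
    """Rejects work once too many requests are already in flight."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Take a slot if one is free; never block the caller.
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder, work):
    """Run work() if there is capacity, else return a 503-style response."""
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, work()
    finally:
        shedder.release()


# MAX_IN_FLIGHT is a hypothetical setting name; 100 is an arbitrary default.
shedder = LoadShedder(int(os.environ.get("MAX_IN_FLIGHT", "100")))
```

Whatever the mechanism, the limit and the switch to turn it on belong in the service documentation, since they are exactly what an on-call engineer reaches for under load.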
Documenting a service – Logs
- Where are they?
- Are they being rotated?
- Are they being collected centrally?
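As an illustration of the rotation question, the sketch below uses `RotatingFileHandler` from Python's standard library to rotate a log file by size. The function name `build_logger`, the size limit and the backup count are assumptions chosen for the example. Central collection is outside its scope; it assumes some separate shipper reads the files.

```python
import logging
import logging.handlers


def build_logger(name, path, max_bytes=1_000_000, backup_count=5):
    """Return a logger that writes to `path` and rotates it by size."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # When the file would exceed max_bytes it is renamed to path.1,
    # older backups shift up, and at most backup_count backups are kept.
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

If rotation is handled outside the process (for example by the host's own log rotation tooling), the documentation should say so, because that changes where to look when a disk fills up.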
Documenting a service – Runbooks
- List of standard operational tasks (daily, weekly, monthly)
- How to recover from failures we can predict
- A process to document the ones we didn’t predict
- Unit, Integration, UAT
- Load, performance, stress
What do you do if…
- You get 10x more traffic than expected on day 1? (Think about a scaling strategy.)
The 12 Factor Criteria
JA Micro (Sixt)
- Are delivered to prod in containers (or fat jars for Java)
- Get configuration values from the process environment
- Log using a standard log format (JSON, structured, machine-readable)
- Report application state and computation progress as metrics (counters, timers, gauges)
- Provide an external HTTP endpoint for health checks (Am I healthy right now?)
- Provide test components
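Several items in this list are easier to picture in code. The sketch below shows, under stated assumptions, what they might look like with only the Python standard library: configuration read from the environment, one JSON object per log line, counters and gauges, and a health check result that an HTTP handler could return. All names (`load_config`, `JsonFormatter`, `Metrics`, `health`) and the variables `SERVICE_NAME` and `PORT` are hypothetical; it is not how JA Micro or any specific framework does it.

```python
import json
import logging
import os


def load_config(env=os.environ):
    """Read settings from the process environment, with defaults."""
    # SERVICE_NAME and PORT are hypothetical variable names.
    return {
        "service_name": env.get("SERVICE_NAME", "example-service"),
        "port": int(env.get("PORT", "8080")),
    }


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "ts": round(record.created, 3),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


class Metrics:
    """A tiny in-process registry of counters and gauges."""

    def __init__(self):
        self.counters = {}
        self.gauges = {}

    def incr(self, name, amount=1):
        self.counters[name] = self.counters.get(name, 0) + amount

    def set_gauge(self, name, value):
        self.gauges[name] = value


def health(checks):
    """Run each named check and return an (HTTP status, JSON body) pair."""
    results = {name: bool(check()) for name, check in checks.items()}
    status = 200 if all(results.values()) else 503
    return status, json.dumps({"healthy": status == 200, "checks": results})
```

A health handler would call something like `health({"db": can_reach_db})`, where `can_reach_db` is a function you supply, and return the status and body as-is. Keeping the checks cheap matters, since the endpoint is polled often.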