practice, learning, culture, automation, toil, observability, SLO, SLI, what (DevOps), how (SRE)
From The Practice of Cloud System Administration
- Workflow: the process of delivering value to production from start to finish (from the idea or problem identified through deployment to production) and the steps along the way. Understanding how changes are made at the lowest level lets us examine how we work and look for opportunities to do better
- Don’t pass defects to the next step
- Don’t let local optimizations degrade performance globally
- Once tasks are repeatable, increase the flow of work by speeding them up (automating) or eliminating them
- Feedback: amplifying information in a process, both forwards and backwards
- Continuous learning and experimentation: creating a culture where it’s OK to try new things and learn from the experience. Failure is not stigmatized; in fact, there is much to be learned when things don’t work the way you expect, and your mental model of the world may need a small nudge.
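One way to make the "local optimizations" point under Workflow concrete is to treat delivery as a serial pipeline whose end-to-end throughput is capped by its slowest stage. The sketch below is illustrative only: the stage names, the capacity numbers, and the `pipeline_throughput` helper are all hypothetical and do not come from the book.

```python
# Hypothetical per-stage capacities (changes per day); illustrative numbers only.
stage_capacity = {
    "code review": 12,
    "build and test": 30,
    "deploy to production": 8,
}


def pipeline_throughput(capacities):
    """Return (bottleneck stage, end-to-end throughput) for a serial pipeline.

    In a serial pipeline, overall throughput cannot exceed the capacity of
    the slowest stage.
    """
    bottleneck = min(capacities, key=capacities.get)
    return bottleneck, capacities[bottleneck]


# Baseline: find the stage that limits the whole workflow.
print(pipeline_throughput(stage_capacity))

# Local optimization: doubling a non-bottleneck stage changes nothing overall.
faster_build = {**stage_capacity, "build and test": 60}
print(pipeline_throughput(faster_build))

# Global optimization: improving the bottleneck raises end-to-end throughput
# and moves the constraint to a different stage.
faster_deploy = {**stage_capacity, "deploy to production": 20}
print(pipeline_throughput(faster_deploy))
```

Speeding up "build and test" alone leaves the overall number unchanged, which is why it helps to look at the whole workflow before choosing what to automate or eliminate.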
Culture of …
“You write it, you run it”
Tools, ideas, cultural aspects that help teams do their best work
SRE team goals
- Ensure our work connects to organizational goals.
- Partner with Engineering stakeholders to define a supportable and performant service architecture (paved road).
- Continuously strive to improve the customer experience: full lifecycle support (creation, development, deployment, retirement), observability, flexible connectivity, and monitoring.
- Favor managed, commercially supported, or industry-accepted solutions over systems built in-house.
- Proactively notify the organization of any significant infrastructure changes.
- Measure success through adoption.
- Revisit design choices and components that have become obsolete, and determine what can be replaced with managed or off-the-shelf parts or substantially simplified.
- Share SRE expertise in service to the entire PagerDuty organization.
- Factor operational costs in architectural and platform decision-making.