Checklist: Prod Readiness

Things I think about when I’m getting a new application ready to run in prod. Each of them has caused me pain in one way or another at some point … 🙂

  • How critical is it to the business? (Should an engineer be woken up in the middle of the night if it goes down?)
  • Monitoring
    • Metrics for a webapp: traffic volume, latency, and errors
    • Logs are being shipped to a central place where we can set up filters and alerts on them
      • Are logs being rotated? Should they be?
    • Exceptions are being captured and reviewed by somebody
    • Is there a /healthCheck url we can use to determine readiness?
      • This check should exercise the service itself and its dependencies in a meaningful way, so a passing result tells us it is ready to do work
      • It should be fast and return 200 if ok, 500 otherwise
      • This is the URL we’ll set up a load balancer to ping
  • Is there data to be backed up? (If we are taking backups we should be verifying we can restore them)
  • Do we have environments including develop, staging, and production, and a process to promote changes through them?
    • How do we deploy new versions of this?
  • Is it well documented?
    • Service pages are nice (Who owns the service, an architecture diagram, links to runbooks, links to dashboards)
  • Show me the tests! (Unit, end-to-end, and others. They should be automated and able to run all the time)
  • Have we gone through a threat modelling exercise with it? (Talk about principals, goals, adversities, invariants)
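
To make the /healthCheck item a bit more concrete, here is a rough sketch of the shape I have in mind, using only the Python standard library. Everything named here is an assumption for illustration: `check_database`, `check_cache`, the `CHECKS` table, and port 8080 are placeholders, and a real service would swap in cheap, real probes of its own dependencies.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Hypothetical check; replace with a cheap real query such as SELECT 1."""
    return True


def check_cache():
    """Hypothetical check; replace with a real ping to the cache."""
    return True


# Placeholder dependency table, not part of any real service.
CHECKS = {"database": check_database, "cache": check_cache}


def run_health_checks(checks):
    """Run every check and return (http_status, report).

    Status is 200 only if all checks pass; a failed check or an
    exception raised by a check yields 500.
    """
    report = {}
    healthy = True
    for name, check in checks.items():
        started = time.monotonic()
        try:
            ok = bool(check())
        except Exception:
            ok = False
        report[name] = {"ok": ok, "seconds": round(time.monotonic() - started, 3)}
        healthy = healthy and ok
    return (200 if healthy else 500), report


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthCheck":
            self.send_error(404)
            return
        status, report = run_health_checks(CHECKS)
        body = json.dumps(report).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for this sketch.
    HTTPServer(("127.0.0.1", 8080), HealthHandler).serve_forever()
```

The per-check timing in the report is there because the endpoint is supposed to be fast: if one dependency starts dragging, it shows up in the response body before the load balancer starts timing out. This sketch does not enforce a timeout on each check, which is something I would probably want in a real service.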