SREcon 2019: Building a scalable monitoring system

Molly Struve

https://www.usenix.org/conference/srecon19emea/presentation/struve

The monitoring platform grew organically over time, to the point where the number of different tools was overwhelming: New Relic, Honeybadger (exception reporting), PagerDuty, cron, dashboards, ElastAlert

How alerts were delivered to engineers: Slack notifications, SMS, email, phone

Alerts were inconsistent:

  • some reported data but didn’t suggest action
  • some needed immediate action

Eventually the system was overhauled. Goals of the new alerting system:

  • consolidate monitoring to a single place (what does this mean?)
    • Kenna used Datadog for this. It hooks into all the other tools.
    • does she mean this in the alert-manager sense, while still using different tools for logs, metrics, traces, etc.?
  • alerts are actionable (no alert should be allowed to be ignored). Non-actionable information can be kept somewhere separate from actionable alerts.
  • alerts are mutable, i.e. they can be muted (turn them off when needed … eg when we’ve already acknowledged a problem)
    • for a set period of time (they should come back on afterwards)
  • track alert history. Does this condition happen regularly?
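A minimal sketch of the "mute for a set period" and "track alert history" goals above. This is not how Datadog or the talk implements it; the `Alert` class and its method names are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Alert:
    """Hypothetical alert record: muting always expires, every firing is kept."""
    name: str
    muted_until: Optional[datetime] = None
    history: List[datetime] = field(default_factory=list)

    def mute(self, now: datetime, duration: timedelta) -> None:
        # A mute always carries an end time, so the alert comes back on by itself.
        self.muted_until = now + duration

    def is_muted(self, now: datetime) -> bool:
        return self.muted_until is not None and now < self.muted_until

    def fire(self, now: datetime) -> bool:
        """Record the firing; return True if someone should be notified."""
        # History is recorded even while muted, so recurring conditions stay visible.
        self.history.append(now)
        return not self.is_muted(now)

    def fired_since(self, since: datetime) -> int:
        """How often did this condition happen since the given time?"""
        return sum(1 for t in self.history if t >= since)
```

Recording history even while muted is a design choice in this sketch: it lets "does this happen regularly?" be answered without unmuting anything.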

Behaviours:

  • if an alert goes off, you have to acknowledge it
  • here’s how you mute, and when, and for how long, and how to drop alerts that aren’t helpful
    • CASE: context-heavy, actionable, symptom-based, evaluated
  • here’s where the monitoring tool is and how to use it
  • developers should help make monitoring better

Links

From redeploy2019: Ways we can deal with overload in a system. Note: some of these choices can be made by people, or agents, or both
From James Hamilton: Don’t let agents make choices you wouldn’t give to a junior engineer (the notion of what things a junior eng can do is possibly shifting as we get better at creating safety :))

20190923

Read

  • How Did Things Go Right? Learning More from Incidents at Netflix: Ryan Kitchens: https://www.infoq.com/news/2019/07/netflix-learn-from-incidents/. Good ideas. Often no root causes, there’s much to be learned from successful system interactions / migrations, keep an eye on how hard people have to work to keep the system up. Many pointers for further reading.

Jenkins

I’m learning more and more about Jenkins pipelines. Really cool. Today I created a job whose purpose in life is to trigger other jobs with certain parameters set, as part of an end-of-sprint delivery pipeline. The script {…} and build {…} steps were particularly handy. The trick is to keep everything as simple as you possibly can.

My current convention is to name jobs that should only be run by other jobs (not people) using lower-case-words-separated-by-hyphens.

20190918

Learn

Read

  • Property-based testing: https://increment.com/testing/in-praise-of-property-based-testing/. An interesting idea. Generalize some of the tests we write to describe sets of inputs, not specific examples. (A specific example passing doesn’t necessarily tell you whether your behaviour is right. You may have missed a case that would produce an error.) Is this fuzz testing?
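A hand-rolled sketch of the idea, using only the standard library: generate many random inputs and assert properties that must hold for all of them, instead of checking one example. `my_sort` is a hypothetical stand-in for the code under test. A real library such as Hypothesis adds smarter generation and shrinking of failing inputs, which this sketch does not attempt.

```python
import random
from collections import Counter


def my_sort(xs):
    """Hypothetical function under test; stands in for real code."""
    return sorted(xs)


def check_sort_properties(trials=200, seed=0):
    """Check properties over a set of generated inputs, not fixed examples."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        out = my_sort(xs)
        # Property 1: the output is in non-decreasing order.
        assert all(a <= b for a, b in zip(out, out[1:])), xs
        # Property 2: the output is a permutation of the input.
        assert Counter(out) == Counter(xs), xs
    return trials
```

On the fuzz-testing question: the two look related. As I understand it, fuzzing mostly throws random input at code looking for crashes, while here the random input is checked against stated properties.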

20190916

Read

  • Bit of a memory refresh working with Docker. Containers run as root by default. Mounting a dir into a tools container to generate output on the host file system leaves the output files owned by root. This was making Jenkins unhappy. 🙂
  • I create env variables in Jenkins to be able to use values across all stages (I can compute their values in a stage script {} block)
  • I can pass parameters around from the Jenkinsfiles. Sometimes pragmatism trumps security and other concerns.
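One way around the root-owned output files from the Docker note above is to run the container as the calling user with `docker run --user`. A minimal sketch that only builds the command line; the function name, the `/work` mount point and the image and command in the usage line are assumptions for illustration, not what I actually ran.

```python
import os


def docker_run_as_current_user(image, host_dir, command, container_dir="/work"):
    """Build a `docker run` argv whose output files belong to the calling user."""
    return [
        "docker", "run", "--rm",
        # Run as the host user's uid:gid instead of the default root (Unix only).
        "--user", f"{os.getuid()}:{os.getgid()}",
        # Mount the host directory where the tool writes its output.
        "-v", f"{os.path.abspath(host_dir)}:{container_dir}",
        "-w", container_dir,
        image,
        *command,
    ]
```

Usage would be something like `subprocess.run(docker_run_as_current_user("tools:latest", ".", ["make", "report"]), check=True)`. This assumes the tool inside the image works without root privileges.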

Learn

20190915

Learn

  • Mounting an already primed Maven cache into a Maven build container makes a build much faster. It skips the dependency fetch step because the dependencies are already there. It does require that you have a stable container host that builds run on … (Appropriate for EC2 and ECS, but not Fargate. Spot or not probably doesn’t matter. The local .m2 cache can be rebuilt by the first build run on an instance.)
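A sketch of what the cache mount looks like as a command line. The image tag, the `/usr/src/app` path and the `mvn -B package` goal are assumptions for illustration; `/root/.m2` is where Maven keeps its cache when the container runs as root.

```python
import os


def maven_build_command(project_dir, m2_dir="~/.m2",
                        image="maven:3-eclipse-temurin-17"):
    """Build a `docker run` argv that reuses the host's Maven cache."""
    m2 = os.path.abspath(os.path.expanduser(m2_dir))
    src = os.path.abspath(project_dir)
    return [
        "docker", "run", "--rm",
        # Primed cache from the host, so dependencies are not fetched again.
        "-v", f"{m2}:/root/.m2",
        # Project sources to build.
        "-v", f"{src}:/usr/src/app",
        "-w", "/usr/src/app",
        image,
        "mvn", "-B", "package",
    ]
```

If the host cache directory starts out empty, the first build fills it and later builds on the same instance get the speed-up.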

Read

  • https://www.hostedgraphite.com/blog/incident-postmortem-template : Summary, what happened, what went right, what went wrong, how are we going to improve. I like this framework!