A handy, easy-to-understand and easy-to-explain way of thinking about monitoring a system: there are application concerns and infrastructure concerns. Many applications would probably be monitored pretty well by just considering these aspects.
- Rate: How often something happens can be telling. We size our systems based on some expectation of traffic. Sometimes those guesses are just wrong, or wrong only during special periods (e.g. retail on Black Friday).
- Errors: For a webapp, 500s are usually a good thing to keep an eye on, but really an error is any place in the system where an assumption you made has been invalidated by the world (e.g. bad input, false preconditions). These sometimes manifest as exceptions, which you should catch.
- Duration (Latency): How long something takes can be the best metric we have for indicating what a user experiences.
- Infra: How well are the compute resources in a system being used?
- Utilization: Each of CPU, disk, network, and RAM has a notion of the amount used in the last monitored period. Resource usage on its own isn’t a bad thing, but when you’re close to the limit of one, this can manifest in unpredictable ways in the rest of the system, such as queries taking a bit longer than normal to execute.
- Saturation: When you run out of slots, or resources, or whatever you have a finite amount of, sometimes the thing to do is wait until one becomes available, which can mean queuing. Queued threads waiting to make progress because of shared resources might be problematic.
- Errors: Infrastructure requests fail too. 🙂 I never look at these as much as I should, and I haven’t personally had an issue where something like this was involved, but it happens. To be kept in mind!
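
As a rough illustration of the signals above, here is a minimal Python sketch that records rate, errors, and duration for one operation, plus utilization and saturation for one finite resource. The names `RedMetrics` and `ResourceSnapshot` and their fields are hypothetical, invented for this example. A real system would more likely use a metrics library than hand-rolled counters.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RedMetrics:
    """Rate, errors, and duration for one operation over one monitoring window.

    Hypothetical helper for illustration only.
    """
    window_seconds: float
    durations: list = field(default_factory=list)
    errors: int = 0

    def observe(self, func, *args, **kwargs):
        """Run func, recording how long it took and whether it raised."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise  # still let the caller see the failure
        finally:
            # Recorded for successes and failures alike.
            self.durations.append(time.perf_counter() - start)

    def rate(self):
        """Calls per second over the window."""
        return len(self.durations) / self.window_seconds

    def error_ratio(self):
        """Fraction of calls that raised."""
        return self.errors / len(self.durations) if self.durations else 0.0

    def max_duration(self):
        """Slowest call seen in the window, in seconds."""
        return max(self.durations, default=0.0)


@dataclass
class ResourceSnapshot:
    """One reading of a finite resource (e.g. a connection pool). Hypothetical."""
    used: int
    capacity: int
    queued: int  # callers waiting for a free slot

    def utilization(self):
        """Fraction of the resource in use."""
        return self.used / self.capacity

    def is_saturated(self):
        """True when work is queuing for the resource."""
        return self.queued > 0
```

Wrapping a request handler in `observe` gives all three application signals from a single call site, while a `ResourceSnapshot` would be filled in from whatever the pool or operating system reports.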
Sending an alert from production to an operator, or a developer, or a customer support person is an expensive proposition. That person has to put down what they’re doing to try to figure out what’s going on and doing that can sometimes be really hard where distributed systems are concerned.
CASE is a minimum bar we set for all system alerts that helps in this regard. Note: This is a hard problem that never goes away and requires constant vigilance!
- Context heavy
- System based
Important note 1: Alerts that satisfy these criteria today may fail them tomorrow. Re-evaluate alerts periodically to ensure that they are adding value to your practice and your team.