- Definitions for logs vs. events, traces, and spans. A good high-level overview of concepts that developers and production support people talk about a lot in the current generation of observability tools
- A log line is an unstructured or semi-structured string of characters emitted by an application; events are similar but structured (e.g. JSON); spans are particular events that represent a duration of time tied to a step in an application flow
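A minimal sketch of the three shapes side by side (the field names and values are made up for illustration; real tracing libraries define their own schemas):

```python
import json
import time
import uuid

# A log line: unstructured (or lightly structured) text.
log_line = "2021-03-04 10:15:02 ERROR payment failed for order 1234"

# An event: the same information, but structured (e.g. JSON).
event = json.dumps({
    "timestamp": "2021-03-04T10:15:02Z",
    "level": "error",
    "message": "payment failed",
    "order_id": 1234,
})

# A span: an event that records a duration, with IDs tying it to a trace.
start = time.time()
# ... do some work that is part of handling a request ...
span = {
    "trace_id": str(uuid.uuid4()),  # shared by every span in one request
    "span_id": str(uuid.uuid4()),   # unique to this unit of work
    "name": "charge_card",
    "start": start,
    "duration_ms": (time.time() - start) * 1000,
}
```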
- Taking control of complex systems is what we do. "Are we engineers or not?" isn't really the right question. Dynamic, complex systems; controlling processes we don't fully understand.
- The author describes debugging an issue that caused CPU utilization on a webserver to increase continuously and eventually crash it. A great story about homing in on a bug, and the reward after perseverance. And learning!
Annoyingly, QCon links aren't embeddable. This was a great talk about an internal tool built at Netflix that developers and production support engineers (SRE, operations, customer support) use to learn about errors.
Tracing becomes especially important when many services are involved in processing a single request. Piecing together a picture of what happened when logs and metrics are scattered across log categories and dashboards (potentially one per service, in the worst case) is hard.
Edgar has a global view. It was important that all telemetry sources were fed into Edgar; it wouldn't have been a tool people could rely on if there were gaps.
Another important design decision was the sampling rate. Collecting traces is hard (i.e. resource-intensive, particularly in terms of RAM), but less than 100% tracing means that when you go looking for a trace, there's a chance it won't be there. The suggestion was to collect 100% for a small, critical subset of traffic (e.g. /checkout).
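A rough sketch of that sampling idea, assuming a path-based rule (the path set and default rate here are made up, not from the talk):

```python
import random

# Always trace the critical subset; sample everything else.
CRITICAL_PATHS = {"/checkout"}
DEFAULT_SAMPLE_RATE = 0.01  # 1% of non-critical traffic (illustrative)

def should_trace(path: str) -> bool:
    if path in CRITICAL_PATHS:
        return True  # 100% tracing for critical traffic
    return random.random() < DEFAULT_SAMPLE_RATE
```

The point is that for the traffic you care most about, the trace is guaranteed to be there when you go looking for it.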
- Martin Fowler talks about his fear of public speaking. I would not have guessed this from listening to him speak, which I have done many times. I know exactly how he feels; seeing him write about it brings me right back to my own feelings about it. At some point I decided that following all those people I grew up in my career listening to and watching give amazing conference talks … probably wasn't the right path for me. There's other work to be done 🙂
- The Legends of Runeterra build pipeline: A great end to end description of how new code (and other work) gets from a creator’s workstation to a test or release build. Pretty neat!
A really great article in the AWS Builders' Library about making service-to-service messaging more reliable, with an honest look at the tradeoffs!
First, ask: is this request retryable? The work must be idempotent!
- Without thoughtful timeouts, clients can wait for long periods, tying up limited server resources (e.g. request threads, of which there are often vanishingly few) for a response that might never come back. (It's hard to tell the difference between slow and down.)
- He talks about setting a reasonable timeout using percentiles, e.g. the 99.9th. This forces developers to ask: how many false-positive timeouts are OK, so that we can set a timeout that is reasonable for an endpoint?
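A sketch of the percentile idea, assuming you have observed latency samples for the endpoint (the numbers below are invented). Setting the timeout at the 99.9th percentile means roughly 1 in 1000 healthy requests will falsely time out; that's the tradeoff being chosen explicitly:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = int(round(p / 100 * len(ordered)))
    index = max(0, min(len(ordered) - 1, rank - 1))
    return ordered[index]

# Made-up latency samples for one endpoint, in milliseconds.
latencies_ms = [12, 15, 14, 18, 22, 16, 400, 13, 17, 19]

timeout_ms = percentile(latencies_ms, 99.9)  # covers the slow outlier
```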
- Retrying is selfish: it says your request is worth tying up resources for, repeatedly, until it succeeds
- Have to be careful here
- Did a request fail because of load? If yes, retrying might prolong a bad situation
- Did it fail with a client error? (4xx) Don’t retry because it will never succeed
- Is it part of a larger batch of work that becomes a thundering herd, retrying in lockstep with each other and prolonging a bad situation?
- Retrying is a keystone of resilience. But there are dragons
- Exponential backoff can help a struggling service recover by having clients wait progressively longer between retries once they detect trouble
- Some talk of circuit breakers, but it didn't sound particularly favourable: they add another mode to the system, which makes testing more challenging
- Think about max retries + error reporting
- Jitter can help quite a bit, not just for retries but also with the initial arrival of work. Adding a tiny bit of random delay (+/-) to the arrival rate can smooth out excessive load
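Backoff and jitter combine naturally. A minimal sketch, assuming "full jitter" (picking a random delay up to the exponential cap); the base and cap values are made up:

```python
import random

def backoff_with_jitter(attempt: int, base_ms: int = 100, cap_ms: int = 10_000) -> float:
    """Delay before retry `attempt` (0-based), in milliseconds."""
    exp = min(cap_ms, base_ms * (2 ** attempt))  # 100, 200, 400, 800, ...
    return random.uniform(0, exp)                # full jitter: 0..exp
```

Spreading the delays randomly keeps a batch of failed clients from retrying in lockstep.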
Other good concerns
- Retries between layers amplify. e.g. controller > service > data access / external API call > … If each layer adds 3 retries, the work may stay in the system and be responsible for dozens or even hundreds of calls. Something to keep in mind
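The amplification arithmetic is worth making concrete. Assuming each layer makes 1 initial attempt plus its retries of the layer below, the worst-case call count multiplies through the stack:

```python
def worst_case_calls(layers: int, retries_per_layer: int) -> int:
    """Worst-case calls reaching the bottom layer if every attempt fails."""
    attempts = 1 + retries_per_layer  # initial try plus retries
    return attempts ** layers

# Controller -> service -> data access, each adding 3 retries:
worst_case_calls(3, 3)  # returns 64 — dozens of calls from one request
```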
Note to self: re-read Release It! (Nygard)