- Definitions for logs vs events, traces and spans. A good high level overview of these things that are being talked about a lot by developers and production support people in the current generation of observability tools
- A log line is an unstructured or semi-structured string of characters emitted by an application, events are similar but structured (eg json), spans are particular events that represent a particular duration of time related to an application flow
Annoyingly qcon links aren’t embeddable. This was a great talk about an internal tool made at Netflix that is used by developers and production support engineers (sre, operations, customer support) to learn about errors.
Tracing becomes especially important when you have many services involved in processing a single request. Putting together a picture of what happened when logs and metrics are scattered across log categories and dashboards (could be 1 per service in the worst case) is hard.
Edgar has a global view. It was important that all telemetry sources were fed into edgar. It wouldn’t have been a tool people could rely on if there were gaps.
Another important design decision was the sampling rate. Collecting traces is hard. (aka Resource intensive in a system in terms of ram) But less than 100% tracing means when you go to look for one, there’s a chance it won’t be there. The suggestion was to collect 100% for a small, critical subset of traffic. (eg /checkout)
- Another great tracing post from Slack: This one is from before the last and describes internal project requirements, the problem they were trying to solve, and limitations of Zipkin and Jaeger. One key point was the tracing system should be useful for non-backend use cases
- Client tracing at slack: Talks about how slack is able to visualize what happens when a requests is sent from a client (browser, application) to the backend. Really neat. Mentions Honeycomb
- Lightstep distributed tracing guide: High level guide speaks to tracing, sampling, when you need to be think about this stuff. Head-based sampling (ie. Decision made up front in a request that you’re going to start tracing – which can use a non-trivial amount of server resources – vs. tail-based where you’ve done the buffering and can decide to keep or throw away data based on testing whether there’s anything interesting contained there-in)