monitoring systems

Netflix Edgar: Tracing with correlated logs

Annoyingly, QCon links aren’t embeddable. This was a great talk about an internal tool built at Netflix that developers and production support engineers (SRE, operations, customer support) use to investigate errors.

Tracing becomes especially important when many services are involved in processing a single request. Piecing together what happened is hard when logs and metrics are scattered across log categories and dashboards (in the worst case, one per service).
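To make the idea of correlated logs concrete, here is a minimal sketch of stitching log lines from several services into one timeline using a shared request identifier. This is not how Edgar is implemented; the field names (`trace_id`, `service`, `ts`, `msg`), the service names, and the sample records are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical log records from different services. The field names and
# values are assumptions for illustration, not Edgar's actual schema.
logs = [
    {"trace_id": "abc", "service": "api-gateway", "ts": 1, "msg": "request received"},
    {"trace_id": "xyz", "service": "api-gateway", "ts": 2, "msg": "request received"},
    {"trace_id": "abc", "service": "payments", "ts": 3, "msg": "card declined"},
    {"trace_id": "abc", "service": "api-gateway", "ts": 4, "msg": "responded 402"},
]


def group_by_trace(records):
    """Group log records by trace_id and sort each group by timestamp."""
    traces = defaultdict(list)
    for record in records:
        traces[record["trace_id"]].append(record)
    for events in traces.values():
        events.sort(key=lambda record: record["ts"])
    return dict(traces)


# One request's story across services, in order.
timeline = group_by_trace(logs)["abc"]
for event in timeline:
    print(event["ts"], event["service"], event["msg"])
```

The point is only that a single ID carried through every service turns scattered log lines into one ordered view of a request.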

Edgar has a global view. It was important that all telemetry sources were fed into Edgar; with gaps, it wouldn’t have been a tool people could rely on.

Another important design decision was the sampling rate. Collecting traces is expensive (resource intensive, particularly in terms of RAM). But tracing less than 100% of requests means that when you go to look for a trace, there’s a chance it won’t be there. The suggestion was to collect 100% for a small, critical subset of traffic (e.g. /checkout).
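A rough sketch of that sampling policy might look like the following. It always traces a short list of critical paths and samples everything else at a low rate. The names `CRITICAL_PREFIXES`, `DEFAULT_RATE`, and `should_trace` are hypothetical. The 1% default and the hash-based decision are my own assumptions, not something from the talk.

```python
import hashlib

# Assumption: paths considered critical enough to trace at 100%.
CRITICAL_PREFIXES = ("/checkout",)
# Assumption: sampling rate for all other traffic (1%).
DEFAULT_RATE = 0.01


def should_trace(path, trace_id, default_rate=DEFAULT_RATE):
    """Return True if this request should be traced.

    Critical paths are always traced. Other requests are sampled by hashing
    the trace ID, so every service that sees the same ID makes the same
    decision and a trace is either complete or absent.
    """
    if path.startswith(CRITICAL_PREFIXES):
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < default_rate * 2**64


print(should_trace("/checkout/confirm", "trace-1"))
print(should_trace("/browse", "trace-1"))
```

Deciding from the trace ID rather than a random number is a design choice in this sketch: it avoids partially collected traces when several services make the sampling decision independently.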