Monitoring systems

Netflix Edgar: Tracing with correlated logs

Annoyingly, QCon links aren’t embeddable. This was a great talk about an internal tool built at Netflix that developers and production support engineers (SRE, operations, customer support) use to learn about errors.

Tracing becomes especially important when many services are involved in processing a single request. It is hard to put together a picture of what happened when logs and metrics are scattered across log categories and dashboards (in the worst case, one per service).
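To make the idea of correlated logs concrete, here is a minimal sketch of stitching scattered log lines back into one per-request view. It is not how Edgar works internally; the record fields (`service`, `request_id`, `ts`, `msg`) and the function name `build_trace_view` are made up for illustration, and it assumes every service stamps its logs with a shared request ID.

```python
from collections import defaultdict

# Hypothetical log records from three services. Field names are illustrative,
# and every service is assumed to log the same request_id for a given request.
logs = [
    {"service": "payments", "request_id": "r-1", "ts": 3, "msg": "card declined"},
    {"service": "api-gateway", "request_id": "r-1", "ts": 1, "msg": "GET /checkout"},
    {"service": "catalog", "request_id": "r-2", "ts": 2, "msg": "list titles"},
    {"service": "cart", "request_id": "r-1", "ts": 2, "msg": "loaded cart"},
]


def build_trace_view(records):
    """Group log records by request ID and order each group by timestamp."""
    by_request = defaultdict(list)
    for record in records:
        by_request[record["request_id"]].append(record)
    for events in by_request.values():
        events.sort(key=lambda r: r["ts"])
    return dict(by_request)


if __name__ == "__main__":
    for event in build_trace_view(logs)["r-1"]:
        print(event["ts"], event["service"], event["msg"])
```

The point of the sketch is that one shared ID turns "check every dashboard" into a single lookup, which is only possible if every service actually propagates and logs that ID.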

Edgar has a global view. It was important that all telemetry sources were fed into Edgar. It wouldn’t have been a tool people could rely on if there were gaps.

Another important design decision was the sampling rate. Collecting traces is expensive (resource intensive, particularly in terms of RAM). But tracing less than 100% of requests means that when you go to look for a trace, there’s a chance it won’t be there. The suggestion was to collect 100% for a small, critical subset of traffic (e.g. /checkout).
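A rough sketch of that sampling rule, assuming the decision is made per request from its path. The names `ALWAYS_TRACE_PREFIXES`, `DEFAULT_SAMPLE_RATE`, and `should_trace`, and the 1% default rate, are my own placeholders and not anything from the talk.

```python
import random

# Assumption: critical routes are identified by path prefix.
ALWAYS_TRACE_PREFIXES = ("/checkout",)
# Assumption: everything else is sampled at a low rate (1% here, chosen arbitrarily).
DEFAULT_SAMPLE_RATE = 0.01


def should_trace(path, rng=random.random):
    """Return True if this request should be traced.

    Critical paths are always traced; other paths are traced with
    probability DEFAULT_SAMPLE_RATE. `rng` is injectable for testing.
    """
    if path.startswith(ALWAYS_TRACE_PREFIXES):
        return True
    return rng() < DEFAULT_SAMPLE_RATE


if __name__ == "__main__":
    for path in ("/checkout/cart", "/browse", "/search"):
        print(path, should_trace(path))
```

With a rule like this, anyone looking for a /checkout trace can rely on it being there, while the memory cost for the rest of the traffic stays bounded.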




  • Configuration errors seem to come up a lot in postmortems: No answers on this point, but good thoughts around why this might be true, if it is. (Salesforce recently had a major DNS outage that was blamed on an operations engineer rather than on the system that allowed the change to be made.) One point that resonated with me is that we’re more likely to invest in multiple stages for code than for config. It’s a topic we’ve discussed on my team on and off. It does seem like it would be hard to do the work that would make multi-stage config deploys possible (certain kinds of them, anyway; some seem by their nature to be global).