Categories
links monitoring systems

Links

Categories
links

Links

Categories
monitoring systems

Netflix edgar : Tracing with correlated logs

Annoyingly qcon links aren’t embeddable. This was a great talk about an internal tool made at Netflix that is used by developers and production support engineers (sre, operations, customer support) to learn about errors.

Tracing becomes especially important when you have many services involved in processing a single request. Putting together a picture of what happened when logs and metrics are scattered across log categories and dashboards (could be 1 per service in the worst case) is hard.

Edgar has a global view. It was important that all telemetry sources were fed into edgar. It wouldn’t have been a tool people could rely on if there were gaps.

Another important design decision was the sampling rate. Collecting traces is hard. (aka Resource intensive in a system in terms of ram) But less than 100% tracing means when you go to look for one, there’s a chance it won’t be there. The suggestion was to collect 100% for a small, critical subset of traffic. (eg /checkout)

source

Categories
continuousintegration

Links

Categories
systems

First 60 Seconds

# load averages
uptime

# kernel messages can be helpful sometimes
dmesg | tail

# rolling process, memory stats
vmstat 1

# rolling cpu states on multicore systems
mpstat -P ALL 1

# rolling ps aux (only shows non-idle processes)
pidstat 1

# rolling disk performance metrics
iostat -xz 1

# available memory, used, swap
free -m

# network visibility
sar -n DEV 1
sar -n TCP,ETCP 1

# running processes in a system
top

Resources

Categories
links

Links

  • Cloudflare waf: The waf has been redesigned and rewritten from the ground up. A very capable piece of tech my workplace is currently evaluating for use with our software (among other Cf products :))
Categories
links

Links

Categories
links

Links

  • Garbage collection in jdk16: ZGC enhancements reduces gc time. More efficient memory relocation on heap collections and heap root object set scanning is avoided entirely.
  • Name your thread pools: Being able to trace back to the origin of work in a system doesn’t happen on its own. You have to plan for it. So important.
  • Serverless app: Lenskart built a system with simple components that performs well given the current feature set at a reasonable cost

Categories
links

Links

  • Client tracing at slack: Talks about how slack is able to visualize what happens when a requests is sent from a client (browser, application) to the backend. Really neat. Mentions Honeycomb
  • Lightstep distributed tracing guide: High level guide speaks to tracing, sampling, when you need to be think about this stuff. Head-based sampling (ie. Decision made up front in a request that you’re going to start tracing – which can use a non-trivial amount of server resources – vs. tail-based where you’ve done the buffering and can decide to keep or throw away data based on testing whether there’s anything interesting contained there-in)

Categories
systems

SLIs, SLOs, SLAs

20m on how to think about defining slis, and slos. Very practicle and well researched. He references this:

Cross posting notes from my production page:

SLIs, SLOs, SLAs, Error Budgets … Oh My!

Want to be able to answer the question “When should we slow down a bit to work on making our application more reliable, or performant?”

This is a framework to define availability in concrete terms, determine acceptable levels of it, and then come to a consensus with product, development, and the business about what we do when we don’t have enough of it.

  • SLI – Service level indicator. A system metric that can be used to classify a request as either good or bad. eg 95th should be < 100ms for requests to our web application
  • SLO – Service level objective. A measure of how well we’re doing over time with respect to an agreement we make internally with ourselves of what good / bad even means. eg In the previous 30d, 99% of the time we should hit our stated latency goals. Note: A corollary is we don’t want to exceed our goals either. It’s time + energy taken away from work that could be put into making a better product
  • SLA – Service level agreement. A promise we’re comfortable making to customers about how much availability we’re willing to guarantee. There may even be financial penalties associated with failure to meet goals