- Definitions of logs vs. events, traces, and spans. A good high-level overview of the concepts that developers and production support people talk about so much in the current generation of observability tools
- A log line is an unstructured or semi-structured string of characters emitted by an application; events are similar but structured (e.g. JSON); spans are events that represent a particular duration of time tied to a step in an application flow
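As a rough sketch of the three shapes (field names here are illustrative, not from any particular tracing system):

```python
import json
import time
import uuid

# A log line: just a string, parseable only by convention.
log_line = "2024-05-01T12:00:00Z ERROR payment failed for order 123"

# An event: the same information, but structured.
event = {
    "timestamp": "2024-05-01T12:00:00Z",
    "level": "ERROR",
    "message": "payment failed",
    "order_id": 123,
}

# A span: an event that covers a duration, linked into a trace.
start = time.time()
time.sleep(0.01)  # the work being measured
span = {
    "trace_id": uuid.uuid4().hex,  # shared by every span in one request
    "span_id": uuid.uuid4().hex,
    "name": "charge_card",
    "start": start,
    "duration_s": time.time() - start,
}

print(json.dumps(event))
```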
- Taking control of complex systems is what we do. "Are we an engineer or not?" isn't really the right question. Dynamic, complex systems; controlling processes we don't fully understand.
Annoyingly, QCon links aren't embeddable. This was a great talk about an internal tool built at Netflix that developers and production support engineers (SRE, operations, customer support) use to investigate errors.
Tracing becomes especially important when you have many services involved in processing a single request. Putting together a picture of what happened when logs and metrics are scattered across log categories and dashboards (in the worst case, one per service) is hard.
Edgar has a global view. It was important that all telemetry sources were fed into Edgar; it wouldn't have been a tool people could rely on if there were gaps.
Another important design decision was the sampling rate. Collecting traces is hard (i.e. resource-intensive, particularly in terms of RAM), but less than 100% tracing means that when you go looking for a trace, there's a chance it won't be there. The suggestion was to collect 100% for a small, critical subset of traffic (e.g. /checkout).
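A minimal sketch of that idea as a head-based sampling decision (route names and rates below are made up, not from the talk):

```python
import random

# Trace 100% of critical routes, a small fraction everywhere else.
CRITICAL_ROUTES = {"/checkout", "/payment"}
DEFAULT_RATE = 0.01  # 1% of everything else

def should_trace(route: str) -> bool:
    if route in CRITICAL_ROUTES:
        return True
    return random.random() < DEFAULT_RATE

# The decision is made once at the edge and propagated with the request,
# so every service in the call chain agrees on whether to record spans.
print(should_trace("/checkout"))
```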
A really great article in the AWS Builders' Library about making service-to-service messaging more reliable, with tradeoffs!
First, ask: is this request retryable? The work must be idempotent!
- Without thoughtful timeouts, clients can wait for long periods of time, tying up limited server resources (e.g. request threads, of which there are often vanishingly few) for a response that might never come back (it's hard to tell the difference between slow and down)
- He talks about setting a reasonable timeout using percentiles, e.g. the 99.9th. This forces developers to ask: how many false-positive timeouts are acceptable, so that we can set a timeout that is reasonable for this endpoint?
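Picking that percentile from observed latencies is a one-liner with the stdlib (the sample data here is a stand-in for real measurements):

```python
import statistics

# Hypothetical latency samples for one endpoint, in milliseconds.
latencies_ms = sorted(range(1, 1001))  # stand-in for real measurements

# quantiles(n=1000) returns the 999 cut points between 1000 buckets;
# the last one is the 99.9th percentile.
p999 = statistics.quantiles(latencies_ms, n=1000)[-1]

# Setting the timeout at p99.9 accepts that ~0.1% of healthy requests
# will be cut off as false positives -- that's the trade-off.
timeout_ms = p999
print(timeout_ms)
```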
- Retrying is selfish: it says your request is worth tying up resources for repeatedly until it succeeds
- Have to be careful here
- Did a request fail because of load? If yes, retrying might prolong a bad situation
- Did it fail with a client error? (4xx) Don’t retry because it will never succeed
- Is it part of a larger batch of work that becomes a thundering herd, retrying in lockstep with each other and prolonging a bad situation?
- Retrying is a keystone of resilience. But there are dragons
- Exponential backoff can help a struggling service recover by having clients wait longer between retries when they find out a service is struggling
- Some talk of circuit breakers, but it didn't sound particularly favourable. They add another mode to the system, which makes testing more challenging
- Think about max retries + error reporting
- Jitter can help quite a bit, not just for retries but also with the initial arrival of work. Adding a tiny bit of random delay (+/-) to the arrival rate can smooth out excessive load
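The retry ideas above combine into a small helper; this is a sketch, not the article's code, and the function name, attempt counts, and delays are made up:

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1, cap=5.0):
    """Retry an idempotent operation with exponential backoff + full jitter.

    `operation` raises on failure and returns a value on success.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: report the error upward
            # exponential backoff: base, 2x, 4x, ... capped at `cap` seconds
            backoff = min(cap, base_delay * (2 ** attempt))
            # full jitter: sleep a random amount in [0, backoff] so a herd
            # of clients doesn't retry in lockstep
            time.sleep(random.uniform(0, backoff))
```

Capping the backoff and bounding the attempt count keeps the "selfish" cost of a retrying client finite.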
Other good concerns
- Retries between layers amplify. e.g. controller > svc > data access | external API call > … If each layer adds 3 retries, the work may stay in the system and be responsible for dozens or even hundreds of calls. Something to keep in mind
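The amplification is multiplicative rather than additive; a couple of lines make the worst case concrete (layer and retry counts here are hypothetical):

```python
# Each layer makes up to 3 attempts, and every attempt at one layer can
# drive up to 3 attempts at the layer below it.
attempts_per_layer = 3
layers = 3
worst_case_calls = attempts_per_layer ** layers
print(worst_case_calls)  # 27 calls can reach the deepest dependency
```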
Note to self: re-read Release It! (Nygard)
- Configuration errors seem to come up a lot in postmortems. No answers in this piece, but good thoughts on why that might be true. (Salesforce recently had a major DNS-related outage that was blamed on an operations engineer rather than on the system that allowed the change to be made.) One point that resonated with me: we're more likely to invest in multiple deploy stages for code than for config. It's a topic we've discussed on my team on and off. It does seem like it would be hard to do the work that would make multi-stage config deploys possible (certain kinds of them, anyway; some seem by their nature to be global)
- Postfix architecture (input handlers, queues, output handlers): Fairly important for understanding how postfix works
- SPF records: qualifiers, mechanisms, oh my! This is a way to validate the sender of a message: that the connecting IP is permitted to send email on behalf of the domain in the from address. Works with DNS
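For reference, a published SPF policy is just a DNS TXT record (the domain and addresses below are made up; 203.0.113.0/24 is a documentation range):

```
example.com.   IN TXT   "v=spf1 ip4:203.0.113.25 include:_spf.example.net ~all"
```

`ip4` and `include` are mechanisms (which senders match); the prefix on `all` is the qualifier: `+` pass (the default), `-` hard fail, `~` soft fail, `?` neutral. So `~all` says mail from any IP not matched above should soft-fail.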
- Basic postfix config: Good high level guidance for setting up postfix for specific use cases
- Extend postfix smtpd input filtering with custom code: we were looking for a way to apply backpressure to clients based on the health of the active and deferred queues (don't accept new messages addressed to email service providers we are currently having delivery trouble with, e.g. ones with a large number of delayed messages). This may be a way to do that
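One concrete hook for this is Postfix's access policy delegation protocol (`check_policy_service`): smtpd streams `name=value` attribute lines ending with a blank line, and the policy service replies with `action=...` followed by a blank line. A sketch of just the decision logic (the `TROUBLED_DOMAINS` set and the defer text are made up; the socket-handling loop is omitted):

```python
# Hypothetical set of destinations we're currently struggling to deliver
# to, e.g. derived from deferred-queue stats.
TROUBLED_DOMAINS = {"example.net"}

def policy_response(request_lines):
    """Given one policy request (a list of 'name=value' strings), pick an action."""
    attrs = dict(line.split("=", 1) for line in request_lines if "=" in line)
    recipient = attrs.get("recipient", "")
    domain = recipient.rpartition("@")[2].lower()
    if domain in TROUBLED_DOMAINS:
        # tempfail: ask the client to come back later (backpressure)
        return "action=defer_if_permit 4.7.1 Try again later\n\n"
    # dunno: no opinion, let the rest of the restrictions decide
    return "action=dunno\n\n"
```

A service like this would be wired in under `smtpd_recipient_restrictions` with something like `check_policy_service inet:127.0.0.1:10040` (port made up).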
- On destination rate delays: if you are relaying directly to email service providers, the rate delay applies per recipient domain. If you relay indirectly, "destination" means the SMTP nexthop instead. So if you only have one nexthop (i.e. you're sending messages to an internal SMTP server that relays through another before external delivery), the "destination" is NOT the recipient address domain; it is the relay server, and email will go out one message at a time at the defined interval
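In main.cf this is the `default_destination_rate_delay` parameter, with per-transport overrides of the form `<transport>_destination_rate_delay` (the 2s value below is illustrative):

```
# main.cf -- pause between deliveries to the same destination
default_destination_rate_delay = 2s

# or only for a specific transport, e.g. smtp
smtp_destination_rate_delay = 2s
```

With a single nexthop and a 2s delay, throughput is capped at roughly one message every two seconds, which is the pitfall the note above describes.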
Reverse DNS (ptr) records
Mail servers will cross-check your SMTP server’s advertised HELO hostname against the PTR record for the connecting IP address, and then check that the returned name has an address record matching the connecting IP address. If any of these checks fail, then your outgoing mail may be rejected or marked as spam.
So, you need to set all three consistently: The server’s hostname and the name in the PTR record must match, and that name must resolve to the same IP address.
Note that these do not have to be the same as the domain names for which you are sending mail, and it's common that they are not.

Reverse DNS records (PTR): a discussion of how they're used. The first comment is the most helpful (included here for posterity :))
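The cross-check described above (forward-confirmed reverse DNS) can be sketched with the lookups injected as functions so the logic is testable offline; a real version would use `socket.gethostbyaddr` / `socket.gethostbyname`, and the records below are made up:

```python
def fcrdns_ok(ip, helo_name, reverse_lookup, forward_lookup):
    """Return True if ip -> PTR name -> address record all line up with helo_name.

    reverse_lookup(ip) -> hostname; forward_lookup(hostname) -> ip.
    Sketch only; real resolvers can return multiple records.
    """
    ptr_name = reverse_lookup(ip)
    if ptr_name is None or ptr_name.lower() != helo_name.lower():
        return False  # HELO name and PTR record disagree
    return forward_lookup(ptr_name) == ip  # name must resolve back to the ip

# Hypothetical records: 203.0.113.25 <-> mail.example.com
reverse = {"203.0.113.25": "mail.example.com"}.get
forward = {"mail.example.com": "203.0.113.25"}.get
print(fcrdns_ok("203.0.113.25", "mail.example.com", reverse, forward))  # True
```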
A tool for analyzing messages in Postfix's various queues, e.g. what domains they're going to and how long they've been there
```
# list all messages, by recipient domain
sudo qshape

                 T  5 10  20 40  80 160 320  640 1280 1280+
        TOTAL 1714  4  5 141  4 256  22  56 1218    7     1
    gmail.com 1714  4  5 141  4 256  22  56 1218    7     1

# only show messages in the deferred queue
sudo qshape deferred
# .. looks like the above, but filtered
```
Get me a list of messages currently in the active queue (and possibly other queues eg hold)
```
# what version am I running?
cat /etc/system-release

# is selinux enabled?
getenforce

# put selinux into permissive mode
setenforce 0
```
```
# list installed packages
yum list installed

# list package updates
yum check-update

# list package updates (security only)
yum check-update --security

# apply updates
yum update

# apply security updates
yum update --security

# get info about a package
yum info python3
```
Backup a specific mongodb database. This is generally slower than physical backups, but good for grabbing very specific subsets of data from a big db. *This version doesn't include db users in the dump file
```
mongodump --archive=a.gz --gzip --db <dbname>
```
If you want users too, don’t specify the db (the admin db will be included in the dump and it has users)
```
mongodump --archive=a.gz --gzip
```
Restoring a dumpfile
```
mongorestore --archive=a.gz --gzip
```
```
# load averages
uptime

# kernel messages can be helpful sometimes
dmesg | tail

# rolling process, memory stats
vmstat 1

# rolling cpu states on multicore systems
mpstat -P ALL 1

# rolling ps aux (only shows non-idle processes)
pidstat 1

# rolling disk performance metrics
iostat -xz 1

# available memory, used, swap free
free -m

# network visibility
sar -n DEV 1
sar -n TCP,ETCP 1

# running processes in a system
top
```