• Configuration errors seem to come up a lot in postmortems: No answers here, but good thoughts on why this might be true, if it is. (Salesforce recently had a major DNS outage that was blamed on an operations engineer rather than on the system that allowed the change to be made.) One point that resonated with me is that we’re more likely to invest in multiple stages for code than for config. It’s a topic we’ve discussed on my team on and off. It does seem like it would be hard to do the work that would make multi-stage config deploys possible. (Certain kinds of them, anyway. Some seem by their nature to be global.)




  • Postfix architecture (input handlers, queues, output handlers): Fairly important for understanding how postfix works
  • SPF records: Qualifiers, mechanisms, oh my! This is a way to validate the FROM address of a message (strictly, the envelope sender given in MAIL FROM): it checks that the sending IP is permitted to send email on behalf of that domain. The policy is published as a DNS record for the domain
  • Basic postfix config: Good high level guidance for setting up postfix for specific use cases
  • Extend postfix smtpd input filtering with custom code: We were looking for a way to apply backpressure to clients based on the health of the active and deferred queues (don’t accept new messages addressed to email service providers we’re currently having delivery trouble with, e.g. ones with a large number of delayed messages). This may be a way to do that
  • On destination rate delays: If you are relaying directly to email service providers, the rate delay applies per recipient domain, so 1 message per domain per delay period. If you relay indirectly, domain == ‘smtp nexthop’ instead. For example, if you’re sending messages to an internal smtp server that relays through another before external delivery, the domain in this case is NOT the recipient address domain. It is the relay server. If you only have 1 relay, then email will go out 1 at a time at the defined period
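To make the SPF bullet above a bit more concrete, here is a minimal sketch that splits a record into its terms and labels each one with its qualifier. The function name `spf_terms` and the example record are made up for illustration; the qualifier meanings (`+` pass, `-` fail, `~` softfail, `?` neutral, with `+` as the default) come from RFC 7208. It only labels terms and does not evaluate them against a sending IP.

```shell
# Sketch only: label each term of an SPF record with its qualifier.
# spf_terms is a hypothetical helper name, not part of any tool.
spf_terms() {
    # $1: the TXT record value, e.g. "v=spf1 ip4:192.0.2.0/24 -all"
    printf '%s\n' "$1" | tr -s ' ' '\n' | while read -r term; do
        [ -n "$term" ] || continue
        case "$term" in
            v=spf1) continue ;;                          # version tag
            *=*)    echo "modifier $term"; continue ;;   # e.g. redirect=...
            -*)     qualifier=fail;     mechanism=${term#-} ;;
            "~"*)   qualifier=softfail; mechanism=${term#"~"} ;;
            "?"*)   qualifier=neutral;  mechanism=${term#"?"} ;;
            +*)     qualifier=pass;     mechanism=${term#+} ;;
            *)      qualifier=pass;     mechanism=$term ;;  # no qualifier means "+"
        esac
        echo "$qualifier $mechanism"
    done
}
```

Something like `spf_terms "$(dig +short TXT example.com | grep spf1 | tr -d '"')"` would feed it a live record, assuming the record fits in a single TXT string.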

Reverse DNS (ptr) records

Mail servers will cross-check your SMTP server’s advertised HELO hostname against the PTR record for the connecting IP address, and then check that the returned name has an address record matching the connecting IP address. If any of these checks fail, then your outgoing mail may be rejected or marked as spam.

So, you need to set all three consistently: The server’s hostname and the name in the PTR record must match, and that name must resolve to the same IP address.

Note that these do not have to be the same as the domain names for which you are sending mail, and it’s common that they are not.

Reverse DNS records (PTR): A discussion of how they’re used. The first comment is the most helpful (included here for posterity :))
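The three-way consistency check described in that comment can be sketched as a small function over values you have already looked up. The names `normalize_name` and `fcrdns_check` are hypothetical, and the argument layout is just one way to pass the data in; nothing here does a DNS query itself.

```shell
# Sketch only: compare HELO name, PTR name and forward lookup results.
# Hypothetical helpers; the inputs are assumed to be looked up beforehand.
normalize_name() {
    # lowercase and drop a trailing dot so "Mail.Example.com." == "mail.example.com"
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed 's/\.$//'
}

fcrdns_check() {
    ip=$1                              # connecting IP address
    helo=$(normalize_name "$2")        # hostname the server announces in HELO
    ptr=$(normalize_name "$3")         # name returned by the PTR lookup for $ip
    forward=$4                         # space-separated addresses $ptr resolves to
    if [ "$helo" != "$ptr" ]; then
        echo "mismatch: HELO $helo vs PTR $ptr"
        return 1
    fi
    for addr in $forward; do
        if [ "$addr" = "$ip" ]; then
            echo "ok"
            return 0
        fi
    done
    echo "mismatch: $ptr does not resolve to $ip"
    return 1
}
```

The inputs could come from `dig +short -x <ip>` for the PTR name and `dig +short <name> A` for the forward addresses.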


qshape: A tool for analyzing messages in postfix’s various queues, e.g. which domains they’re going to and how long they’ve been there

# List all messages by domain
# sudo qshape

                                      T  5 10  20 40  80 160 320  640 1280 1280+
                             TOTAL 1714  4  5 141  4 256  22  56 1218    7     1
                1714  4  5 141  4 256  22  56 1218    7     1

# Only shows messages in the deferred queue
# sudo qshape deferred
#  .. Looks like above but filtered
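One way this output could be put to work (for example as an input to the backpressure idea in the postfix notes) is to pull out the domains whose totals are over some limit. A rough sketch, assuming the layout shown above: a header row starting with `T`, a `TOTAL` row, then one row per domain with the total in the second column. `busy_domains` and the threshold are made-up names and values.

```shell
# Sketch only: print domains whose total message count exceeds a limit.
# Reads qshape output on stdin; $1 is a hypothetical threshold.
busy_domains() {
    awk -v limit="$1" '
        $1 == "T" || $1 == "TOTAL" { next }        # skip header and summary rows
        NF >= 2 && $2 + 0 > limit { print $1, $2 } # domain and its total
    '
}

# usage (not run here): sudo qshape deferred | busy_domains 100
```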



Get me a list of messages currently in the active queue (and possibly other queues eg hold)
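A possible sketch for this, assuming the listing comes from `postqueue -p` (same output as `mailq`): postfix marks the queue ID with `*` when the message is in the active queue and `!` when it is in the hold queue, so those flags can be filtered on. The function name `queue_ids_by_flag` is hypothetical.

```shell
# Sketch only: filter `postqueue -p` output by the status flag after the queue ID.
#   "*" = message is in the active queue, "!" = message is in the hold queue
queue_ids_by_flag() {
    # $1: the flag to match; reads `postqueue -p` output on stdin
    awk -v flag="$1" '
        $1 ~ /^[0-9A-Za-z]+[*!]$/ && substr($1, length($1)) == flag {
            # print the queue ID without the flag, plus the sender (last field)
            print substr($1, 1, length($1) - 1), $NF
        }
    '
}

# usage (not run here):
#   postqueue -p | queue_ids_by_flag '*'    # active queue
#   postqueue -p | queue_ids_by_flag '!'    # hold queue
```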



# what version am I running?
cat /etc/system-release

# is selinux enabled?
getenforce

# put selinux into permissive mode
setenforce 0


# list installed packages
yum list installed

# list package updates
yum check-update

# list package updates (security only)
yum check-update --security

# apply updates
yum update

# apply security updates
yum update --security

# get info about a package
yum info python3

MongoDB logical backups


Back up a specific mongodb database. This is generally slower than physical backups, but good for grabbing very specific subsets of data from a big db. Note: this version doesn’t include db users in the dump file

mongodump --archive=a.gz --gzip --db <dbname>

If you want users too, don’t specify the db (the admin db will be included in the dump, and that is where the users live)

mongodump --archive=a.gz --gzip

Restoring a dumpfile

mongorestore --archive=a.gz --gzip 
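If these get run regularly, a small wrapper that builds the command with a dated archive name might be handy. This is only a sketch: `mongodump_cmd` and the file naming scheme are made up, and it just prints the command rather than running it. It uses the same `--archive`, `--gzip` and `--db` flags as above.

```shell
# Sketch only: print a mongodump command with a dated archive name.
# mongodump_cmd is a hypothetical helper; it does not run mongodump.
mongodump_cmd() {
    stamp=$1    # date stamp, e.g. "$(date +%Y%m%d)"
    db=$2       # optional database name; empty means dump everything (users included)
    if [ -n "$db" ]; then
        echo "mongodump --archive=${db}-${stamp}.gz --gzip --db $db"
    else
        echo "mongodump --archive=all-${stamp}.gz --gzip"
    fi
}

# usage (not run here): eval "$(mongodump_cmd "$(date +%Y%m%d)" mydb)"
```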

First 60 Seconds

# load averages
uptime

# kernel messages can be helpful sometimes
dmesg | tail

# rolling process, memory stats
vmstat 1

# rolling cpu states on multicore systems
mpstat -P ALL 1

# rolling ps aux (only shows non-idle processes)
pidstat 1

# rolling disk performance metrics
iostat -xz 1

# available memory, used, swap
free -m

# network visibility
sar -n DEV 1
sar -n TCP,ETCP 1

# running processes in a system
top
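Load averages are easier to read relative to the number of CPUs. A rough sketch, assuming the first field of /proc/loadavg (the 1-minute average) and a CPU count greater than zero; `load_per_cpu` is a made-up name. On Linux the load average also counts tasks in uninterruptible sleep, so a ratio well above 1 can mean disk waits as well as CPU demand.

```shell
# Sketch only: 1-minute load average divided by the CPU count.
# $1: contents of /proc/loadavg, $2: number of CPUs (assumed > 0)
load_per_cpu() {
    echo "$1" | awk -v cpus="$2" '{ printf "%.2f\n", $1 / cpus }'
}

# usage (not run here): load_per_cpu "$(cat /proc/loadavg)" "$(nproc)"
```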




  • Cloudflare firewall rules: Nice writeup about how Cloudflare evolved their firewall rules product. This is something we’re looking to put in place at work


  • Marc Brooker talking about how multi-threaded programs can run more slowly than single-threaded ones. (They certainly often don’t behave the way we might initially expect.) Some good usage of perf as well. The slowdown comes down to context switching and tasks being serialized on access to a lock (synchronization?). Eliminating shared (global) state and the need for coordination is helpful when you’re parallelizing programs



  • How big technology changes happen at Slack: Explore, expand, migrate. They’ve chosen a practice with 3 distinct phases where anyone (or nearly anyone?) can advocate for a new technology, but they must convince their peers of its value, and they do that by getting other people in the org to use it.
    • Most experiments fail fast, which is something they like. The ones that do achieve widespread adoption make it to the migration phase, where the company actively rolls it out across all things.
    • It sounds great, but my question would be: how do you stop a proliferation of technologies from being put to use in different spots? The maintainability of a system in such a state seems monstrous. If you bake something new that no one else uses deeply into a service, you have to learn that new thing in order to properly support and enhance that service. Does every service have 1 or 3 things like this uniquely theirs at Slack? How does this shake out? How do experiments work?
    • There has to be some friction to get to phase 1 (along with a bunch of communication across the immediate team). You always start with a real problem you need to solve. Can you find a few other people who are also concerned about your problem and talk through it with them?

Honeycomb’s switch to m6g.x instances (arm64)

A short talk by Shelby Spees on the transition from Intel to arm64 instance types at AWS. She talks about how they did it safely and why. Having clearly defined goals up front was important. (For Honeycomb this included consistently low latency for users and reliable, fast storage of data coming from collectors.) For any change you’re making, there should be measurable value for the business or customer

Here’s a nice slide reminding me that taking care of people is super important