Categories
links

Links

  • How big technology changes happen at Slack: Explore, expand, migrate. They’ve settled on a practice with three distinct phases in which anyone (or nearly anyone?) can advocate for a new technology, but they must convince their peers of its value, and they do that by getting other people in the org to actually use it.
    • Most experiments fail fast, which is something they like. The ones that do achieve widespread adoption make it to the migration phase, where the company actively rolls it out everywhere.
    • It sounds great, but my question would be: how do you stop a proliferation of technologies from being put to use in different spots? The maintainability of a system in that state seems monstrous. If you bake something new that no one else uses deep into a service, you have to learn that new thing in order to properly support and enhance that service. Does every service at Slack end up with one or three things like this that are uniquely its own? How does this shake out? How do experiments work?
    • There has to be some friction to get to phase 1 (along with a bunch of communication across the immediate team). You always start with a real problem you need to solve. Can you find a few other people who are also concerned about your problem and talk it through with them?
Categories
links

Links

  • Batch operations in REST APIs: Interesting. I have never considered the implications of trying to support batch operations with a RESTful style of interface. When I use REST APIs, I’m usually doing things one at a time (make a new thing, delete this thing), and the URLs I poke at largely represent a single entity. Not so, I guess, when you’re thinking about batch operations. The other interesting side of this is how you indicate response status (200 if everything’s ok, but what about partial failure?). Now clients are probably going to have to parse the response body and try to figure out what happened …
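
One common way to report partial failure in a batch endpoint (not from the article; the shapes and names below are my own) is to return an overall 207 Multi-Status and list a status per sub-operation in the body, roughly like this Java sketch:

    // Hypothetical per-item result shapes for a batch response (my own names).
    // The endpoint could return HTTP 207 (Multi-Status) overall and report each
    // sub-operation's outcome in the body so clients can handle partial failure.
    import java.util.List;

    public class BatchResponseSketch {
        record ItemResult(String id, int status, String error) {}

        record BatchResponse(List<ItemResult> results) {
            boolean fullySucceeded() {
                return results.stream().allMatch(r -> r.status() >= 200 && r.status() < 300);
            }
        }

        public static void main(String[] args) {
            var response = new BatchResponse(List.of(
                    new ItemResult("order-1", 200, null),
                    new ItemResult("order-2", 409, "already exists")));
            System.out.println("all ok? " + response.fullySucceeded()); // false -> partial failure
        }
    }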
Categories
links

Links

  • The majestic monolith: A post from DHH about monoliths and microservices that resonates with me quite a bit. Side effects of system complexity that come to mind: failures, emergent behaviour, dev + operator cognitive load, trickier production support, … Many, many applications don’t need the extra complexity now and never will
  • Humble objects: Making a class easier to test by factoring smaller bits out into classes that are easily tested on their own. I’ve heard the term “sprouting” recently referring to the same concept
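
A toy illustration of the humble object / sprouting idea (my own example, not from the article): the decision logic is sprouted into a small class that is trivial to unit test, while the class touching the real clock and the console stays humble.

    import java.time.LocalTime;

    public class HumbleObjectSketch {
        // Easily tested: pure logic, no clock, no IO.
        static class DiscountPolicy {
            boolean applies(LocalTime now, double orderTotal) {
                return now.getHour() < 9 && orderTotal > 100.0;
            }
        }

        // Humble: only wires the hard-to-test bits (real clock, printing) to the policy.
        static class CheckoutService {
            private final DiscountPolicy policy = new DiscountPolicy();

            void checkout(double orderTotal) {
                if (policy.applies(LocalTime.now(), orderTotal)) {
                    System.out.println("early-bird discount applied");
                }
            }
        }

        public static void main(String[] args) {
            // A unit test would exercise DiscountPolicy directly with a fixed time.
            System.out.println(new DiscountPolicy().applies(LocalTime.of(8, 30), 150.0)); // true
        }
    }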
Categories
links

Links

  • Garbage collection in JDK 16: ZGC enhancements reduce GC time. Memory relocation during heap collections is more efficient, and scanning of the heap root object set is avoided entirely.
  • Name your thread pools: Being able to trace work back to its origin in a system doesn’t happen on its own. You have to plan for it. So important. (A quick sketch follows this list.)
  • Serverless app: Lenskart built a system with simple components that performs well given the current feature set at a reasonable cost
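
On naming thread pools, a minimal sketch (my own code, not from the post): a ThreadFactory that stamps a pool name on each worker thread, so thread dumps and log lines point back to the origin of the work.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;
    import java.util.concurrent.atomic.AtomicInteger;

    public class NamedThreadPoolSketch {
        // Produces threads named like "invoice-export-1", "invoice-export-2", ...
        static ThreadFactory named(String poolName) {
            AtomicInteger counter = new AtomicInteger();
            return runnable -> new Thread(runnable, poolName + "-" + counter.incrementAndGet());
        }

        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(4, named("invoice-export"));
            pool.submit(() -> System.out.println("running on " + Thread.currentThread().getName()));
            pool.shutdown();
        }
    }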

Categories
links

Links

  • Async task framework design doc from dropbox: Nice discussion of the design of their job scheduler service. At least once execution, priorities, no concurrency, guaranteed start times for most jobs at a scale of 10,000 jobs per sec (at least at time of writing)
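
At-least-once execution means the same task can be delivered more than once, so callbacks generally have to tolerate duplicates. A toy sketch of that idea (mine, not Dropbox’s; a real system would persist the completed-task record rather than keep it in memory):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class IdempotentTaskSketch {
        private final Set<String> completed = ConcurrentHashMap.newKeySet();

        // add() returns false if this task id was already processed, so a redelivered
        // task becomes a no-op instead of a duplicated side effect.
        void handle(String taskId, Runnable work) {
            if (completed.add(taskId)) {
                work.run();
            }
        }

        public static void main(String[] args) {
            IdempotentTaskSketch handler = new IdempotentTaskSketch();
            handler.handle("task-42", () -> System.out.println("charging the card once"));
            handler.handle("task-42", () -> System.out.println("never printed: duplicate delivery skipped"));
        }
    }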

Categories
systems

An Availability Story

Marc Brooker from AWS talks about availability. 20m, very relevant stuff.

  • Availability is personal
  • Correlated failure limits availability
    • Redundancy isn’t always perfect (e.g. single points of failure)
  • Blast radius is critical to availability
  • My availability depends on the availability of my dependencies
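
A quick worked example of that last point (my numbers, not from the talk): with independent, hard (serial) dependencies, a system’s availability is bounded by the product of its dependencies’ availabilities, so ten dependencies at 99.95% each already cap you at roughly 99.5%.

    A_{\text{system}} \le \prod_{i=1}^{n} A_i, \qquad 0.9995^{10} \approx 0.995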

The purpose of our system is not to hit an availability goal (99.95% uptime). It’s to serve our customers (people!). An uptime goal is a proxy for that.

Source

Categories
links

Links

  • Post incident report from Twilio for the Feb 26, 2021 incident: Nice writeup with aggressive, hopefully impactful action items. A critical-path service was discovered, post incident, to have insufficient capacity and autoscaling behaviours; when it went down, dependent services followed. Dependencies weren’t built to handle a failure of this upstream service.
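
A toy sketch of the hardening idea (mine, not Twilio’s): give the call to a critical upstream a timeout and a degraded fallback so its failure doesn’t cascade into every dependent service.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    public class UpstreamFallbackSketch {
        static List<String> fetchRecommendations() {
            // Stand-in for a slow or unavailable upstream call.
            throw new RuntimeException("upstream unavailable");
        }

        public static void main(String[] args) {
            List<String> result = CompletableFuture
                    .supplyAsync(UpstreamFallbackSketch::fetchRecommendations)
                    .completeOnTimeout(List.of(), 200, TimeUnit.MILLISECONDS) // degrade if it's slow
                    .exceptionally(ex -> List.of())                           // degrade if it fails
                    .join();
            System.out.println("served with " + result.size() + " recommendations (possibly degraded)");
        }
    }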