Categories
systems

Honeycomb’s switch to m6g.x instances (arm64)

A short talk by Shelby Spees on Honeycomb’s transition from Intel to arm64 instance types at AWS. She talks about how they did it safely and why. Having clearly defined goals up front was important. (For Honeycomb this included consistently low latency for users and reliable, fast storage of data coming from collectors.) For any change you’re making, there should be measurable value for the business or customer.

Here’s a nice slide reminding me that taking care of people is super important

Categories
links

Links

  • AWS network load balancers @ Ably: Ably is a platform other developers can use to provide realtime push notifications at scale to their users. They have to handle lots of persistent connections, and a variable connection rate that can spike dramatically. Sounds like the NLB isn’t quite delivering the extreme levels of service it claims. Note: It’s an amazing box for the rest of us running applications without those constraints. (Probably the vast majority of us?!)
  • DevOps practice @ Algolia: Nice write-up about what the team does and their process for getting things done. Work buckets: projects, operations, on call. Meetings: a weekly Production Meetup discussing what happened last week in on-call, plus project statuses. Priorities: answer customer questions, answer internal team questions, incident response, infra provisioning + management.

Categories
links

Links

  • The Document Culture of Amazon: Team members can cancel meetings that don’t have a document. The first 5-10 minutes (sometimes more) of every meeting are spent with everyone in the room reading about the issue under discussion, so everyone starts with similar context and can participate. 1-pagers, press releases, FAQs, and 6-pagers are the different formats. I’m going to try doing this in my meetings. Let’s see what people think 🙂

Categories
links

Links

  • Constant work pattern (Amazon Builders’ Library): Certain aspects of Route 53 and the ELB control plane are designed so that they always do the amount of work that would handle peak load. This reduces variance in the system. (They also have to understand limits well and use cells to partition traffic, keeping individual clusters within those limits.)
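
To make the constant work idea concrete for myself, here is a minimal Python sketch. It is not how Route 53 or ELB are actually implemented; `MAX_ENTRIES`, `HealthEntry`, `build_snapshot`, and `push_cycle` are all hypothetical names I made up. The point is just that every cycle pushes a full, fixed-size snapshot, so the work done is the same whether nothing changed or everything did.

```python
from dataclasses import dataclass

# Hypothetical per-cell limit; a real one would come from capacity planning.
MAX_ENTRIES = 8


@dataclass(frozen=True)
class HealthEntry:
    target: str
    healthy: bool


def build_snapshot(statuses: dict[str, bool]) -> list[HealthEntry]:
    """Return a snapshot that always has exactly MAX_ENTRIES rows."""
    if len(statuses) > MAX_ENTRIES:
        # Never exceed the known limit; extra load belongs in another cell.
        raise ValueError(f"cell is limited to {MAX_ENTRIES} targets")
    entries = [HealthEntry(t, ok) for t, ok in sorted(statuses.items())]
    # Pad with placeholder rows so the snapshot size never varies.
    padding = [HealthEntry("", False)] * (MAX_ENTRIES - len(entries))
    return entries + padding


def push_cycle(statuses: dict[str, bool]) -> int:
    """One fixed-interval cycle: process the whole snapshot, never a diff.

    Returns the number of rows processed, which is constant by design.
    """
    rows_processed = 0
    for _entry in build_snapshot(statuses):
        # A real system would send the row to its consumers here.
        rows_processed += 1
    return rows_processed
```

A quiet cycle (`push_cycle({})`) and a busy one (`push_cycle({"a": True, "b": False})`) both process `MAX_ENTRIES` rows, which is the whole trick: there is no "spike" mode for the system to fall over in.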

Categories
links

Links

  • The Case for and Against Cognito: Building a user management and authentication system can be hard (user directory, identity provider federation (SAML), …). Third parties can help out quite a bit here. Discusses Cognito pros, cons, and sources of confusion.
  • Stuck? Do Something!: Timely post by Jamis Buck. I feel anxious when I’m asked to do something I haven’t done before, or solve a problem that’s new to me. A reminder to take a breath, and pick something to start on. It’s ok if what I try won’t work. I’ve learned something and probably have another experiment waiting in the wings because of it

Categories
systems

An Availability Story

Marc Brooker from AWS talks about availability. 20m, very relevant stuff.

  • Availability is personal
  • Correlated failure limits availability
    • Redundancy isn’t always perfect (e.g. single points of failure)
  • Blast radius is critical to availability
  • My availability depends on the availability of my dependencies
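
A bit of arithmetic helps me with the last few points. This is my own sketch, not something from the talk: the function names are made up, and the formulas assume failures are independent, which is exactly the assumption correlated failure breaks.

```python
from math import prod


def serial_availability(dependencies: list[float]) -> float:
    """Availability when every dependency must be up (independent failures assumed)."""
    return prod(dependencies)


def redundant_availability(replica: float, copies: int) -> float:
    """Availability of N replicas when any one suffices (independent failures assumed)."""
    return 1 - (1 - replica) ** copies


def redundant_with_shared_failure(replica: float, copies: int, shared: float) -> float:
    """Same replicas, but all of them sit behind one shared component.

    The shared component is a simple stand-in for a correlated failure:
    when it goes down, every replica goes down with it.
    """
    return shared * redundant_availability(replica, copies)
```

Three dependencies at 99.9% each in series give `serial_availability([0.999, 0.999, 0.999])`, which is about 99.7%. And `redundant_with_shared_failure(0.99, 3, 0.999)` can never beat 99.9%, no matter how many replicas are added behind the shared component.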

The purpose of our system is not to hit an availability goal (99.95% uptime). It’s to serve our customers (people!). An uptime goal is a proxy for this.

Source

Categories
systems

CloudFront

We’re planning to put a CDN in front of our web application at work for well-known, good reasons (performance, security, availability, etc). Here’s a tech talk from AWS about how CloudFront works: