SREcon 2019: Growing Infrastructure @ Stripe

Things we care about : scaling people + processes, avoiding burnout, working with highly skilled people with lots of autonomy

eg Infra

  • developer tools : build, tests, ci, cd
  • data infra
  • core libraries & frameworks
  • model training and evaluation
  • “Tools used by 3+ teams that is business critical”

A couple of dimensions we care about

Forced work vs discretionary (continuum): Forced : scaling mongo, lower costs, gdpr. Discretionary : server to service (no containers), deep learning

Short term and long term (another continuum): Short : critical remediation, support launch. Long : QoS strategy, “bend the cost curve”?, rewrite the monolith

Ideal : towards long term, towards discretionary (not fire fighting). BUT NOT TOO MUCH!

Hacks

Reduce WIP: Doing lots of things == not finishing any one thing. Not deriving any value. FINISH SOMETHING USEFUL!

If you’re never doing anything but firefighting, you have to hire. Once there’s progress, stay the course. Lol, don’t fall in love with firefighting. 🙂

What do you do? How do you learn?

tl;dr. Listen (talk?) to your users more.

Discovery tools

  • Benchmark with peer companies (similarly sized, what are they working on / struggling with)
  • Coffee chats with users
  • SLOs
  • Developer surveys

An eg from Stripe Sorbet

(A project Stripe invested in to improve life)

Invested in static typing for Ruby to create more stability, safety, speed

THIS: Learning a new piece of tech is literally never what the business wants from you. Might be incidental to something more important but not the first thing.

Pull a user into the room when you’re talking about priorities

Innovation problem: right opportunity, wrong solution. Guh.

eg “Let’s rewrite it?” Need context

Approach validation : start by violently trying to disprove that your thing will work. (… Like Google moonshot programs). Try the hardest cases early.

Embed with teams (people who will use your thing)

Investment lanes for infra @stripe

  • Security: invest in better security
  • Reliability: and make the platforms and tools around it better. Deployment, environments, monitoring
  • Usability: lead time and cycle time for features, from idea to getting data in prod
  • Efficiency
  • Latency
  • *in priority order

eg Investment weights, completely arbitrary, change from cycle to cycle (eg sprint, quarter, half, …)

  • 40% user asks <-Whatever they want. If we don’t understand the ask, that’s our fault
  • 30% platform quality
  • 30% key initiatives
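
A minimal sketch of how the weights above turn into a plan for one cycle. The 40/30/30 split is from the talk; the `allocate` helper and the 60 engineer-week capacity are hypothetical, made up here for illustration.

```python
from fractions import Fraction

# Weights from the notes; arbitrary, and expected to change from cycle to cycle.
WEIGHTS = {
    "user asks": 40,
    "platform quality": 30,
    "key initiatives": 30,
}


def allocate(capacity_weeks, weights):
    """Split a capacity budget across lanes in proportion to the weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Fractions keep the split exact, so the parts always add back up to the budget.
    return {lane: Fraction(capacity_weeks * w, total) for lane, w in weights.items()}


# Hypothetical capacity: 60 engineer-weeks in the cycle.
plan = allocate(60, WEIGHTS)
for lane, weeks in plan.items():
    print(f"{lane}: {float(weeks):.1f} engineer-weeks")
```

Changing the weights for the next cycle only means editing `WEIGHTS`; the weights do not have to sum to 100 because the split is proportional.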

The container operator’s manual

  • good for
    • good for reproducing perfectly certain environments
    • good for stateless apps
    • good for test environments because they’re easy to clean up after
    • good for upgrades <-very easy to do
  • bad for stateful apps … don’t containerize your database <-this is hard
    • you’re not google
    • don’t
  • takes longer than you’d expect … containerization is hard
  • give it to ops … LOL! they already do deployment && config management && incident response && monitoring && infrastructure management && security … why not give the container project to them

SREcon 2019: Building a scalable monitoring system

Molly Struve

https://www.usenix.org/conference/srecon19emea/presentation/struve

The monitoring platform grew organically over time (you can be overwhelmed by the number of different tools): New Relic, Honeybadger (exception reporting), PagerDuty, cron, dashboards, ElastAlert

How alerts were delivered to engineers: slack notifications, sms, email, phone

Alerts inconsistent

  • some reported data but didn’t suggest action
  • some needed immediate action

Eventually overhauled. Goals of alerting system:

  • consolidate monitoring to a single place (what does this mean?)
    • Kenna used Datadog for this. It hooks into all the other tools.
    • does she mean this in the alert manager sense? ie still using different tools for logs, metrics, traces, etc
  • alerts are actionable (no alert should be allowed to be ignored). Non-actionable things can be kept away from actionable things.
  • alerts are muteable (turn them off when needed … eg when we’ve already acknowledged a problem)
    • for a set period of time (should come back on)
  • track alert history. does this condition happen regularly?
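
A rough sketch of the last two goals (mute for a set period, track history), not how Kenna or Datadog actually implement them. The `AlertTracker` class, its method names, the `disk_full` alert, and the timestamps are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


class AlertTracker:
    """Toy model: mutes expire on their own, and every firing is recorded."""

    def __init__(self):
        self._muted_until = {}              # alert name -> time the mute expires
        self._history = defaultdict(list)   # alert name -> times it fired

    def mute(self, name, duration, now):
        # A mute always has an end time, so the alert comes back on by itself.
        self._muted_until[name] = now + duration

    def fire(self, name, now):
        """Record the firing; return True if someone should be notified."""
        self._history[name].append(now)     # history is kept even while muted
        expiry = self._muted_until.get(name)
        if expiry is not None and now < expiry:
            return False
        self._muted_until.pop(name, None)   # mute has expired (or never existed)
        return True

    def count_since(self, name, since):
        # "Does this condition happen regularly?"
        return sum(1 for t in self._history[name] if t >= since)


t0 = datetime(2019, 10, 2, 9, 0)            # arbitrary example timestamp
tracker = AlertTracker()
first = tracker.fire("disk_full", t0)
tracker.mute("disk_full", timedelta(hours=1), now=t0)
during_mute = tracker.fire("disk_full", t0 + timedelta(minutes=30))
after_mute = tracker.fire("disk_full", t0 + timedelta(hours=2))
total = tracker.count_since("disk_full", t0)
print(first, during_mute, after_mute, total)
```

Recording the firing before checking the mute is deliberate: a muted alert still counts toward the history, so a recurring condition is visible even if nobody was paged for it.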

Behaviours:

  • if an alert goes off you have to acknowledge
  • here’s how you mute, and when, and how long, and how to dump alerts that aren’t helpful
    • CASE: context heavy, actionable, symptom based, evaluated
  • here’s where the monitoring tool is and how to use it
  • developers should help make monitoring better

Michael Smith’s sweet potato vegetarian chili with cinnamon sour cream

INGREDIENTS

  • 2 tablespoons vegetable oil
  • 1 large onion, chopped
  • 1 green bell pepper, seeded and chopped
  • 8 garlic cloves, thinly sliced
  • 1 tablespoon cumin seeds
  • 1 tablespoon chili powder
  • 1 tablespoon dried oregano
  • 2 cups fresh or frozen corn
  • 1 398-millilitre can black beans, rinsed and drained
  • 1 398-millilitre can kidney beans, rinsed and drained
  • 1 796-millilitre can whole tomatoes
  • 1 sweet potato, peeled and finely diced
  • 1 tablespoon canned chipotle chilies in adobo sauce, chopped
  • 1/2 cup sour cream
  • 1 teaspoon cinnamon
  • 1 or 2 sprinkles salt
  • 1 cup tender cilantro sprigs
  • 2 green onions, thinly sliced

METHOD

Heat the oil in a soup pot over medium-high heat. Toss in the onions and green pepper and sauté, stirring frequently, until the vegetables begin to brown (6 to 8 minutes). Stir in the garlic, cumin seeds, chili powder and oregano. Reduce the heat to medium and cook, stirring, until the spices are very fragrant (another 2 minutes or so).

Stir in the corn, black beans and kidney beans. Add the juice from the canned tomatoes, then coarsely chop the tomatoes and add them as well. Add the sweet potatoes and chipotle chilies. Bring to a boil, then reduce the heat so the liquid is just barely simmering. Simmer, stirring frequently, until the sweet potatoes are tender and the chili begins to thicken (20 to 25 minutes).

Meanwhile, stir together the sour cream and cinnamon. Just before serving, season the chili to your taste with salt. Ladle into serving bowls and top with the sour cream and a tangle of cilantro and green onions.