SREcon 2019: Growing Infrastructure @ Stripe

Things we care about : scaling people + processes, avoiding burnout, working with highly skilled people with lots of autonomy

What counts as infra, e.g.

  • developer tools : build, tests, ci, cd
  • data infra
  • core libraries & frameworks
  • model training and evaluation
  • “Tools used by 3+ teams that are business critical”

A couple of dimensions we care about

Forced work vs discretionary (continuum): Forced: scaling Mongo, lowering costs, GDPR. Discretionary: server to service (no containers), deep learning

Short term and long term (another continuum): Short : critical remediation, support launch. Long : QoS strategy, “bend the cost curve”?, rewrite the monolith

Ideal : towards long term, towards discretionary (not fire fighting). BUT NOT TOO MUCH!


Reduce WIP: Doing lots of things == not finishing any one thing. Not deriving any value. FINISH SOMETHING USEFUL!

If you’re never doing anything but firefighting, you have to hire. Once there’s progress, stay the course. Lol, don’t fall in love with firefighting. 🙂

What do you do? How do you learn?

tl;dr. Listen (talk?) to your users more.

Discovery tools

  • Benchmark with peer companies (similarly sized, what are they working on / struggling with)
  • Coffee chats with users
  • SLOs
  • Developer surveys

An example from Stripe: Sorbet

(A project Stripe invested in to improve developers' lives)

Invested in static typing for Ruby to create more stability, safety, speed

THIS: Learning a new piece of tech is literally never what the business wants from you. Might be incidental to something more important but not the first thing.

Pull a user into the room when you’re talking about priorities

Innovation problem: right opportunity, wrong solution. Guh.

eg “Let’s rewrite it?” Need context

Approach validation: start by violently trying to disprove that your thing will work (like Google's moonshot programs). Try the hardest cases early.

Embed with teams (people who will use your thing)

Investment lanes for infra @stripe

  • Security: invest in better security
  • Reliability: make the platforms and the tools around them better. Deployment, environments, monitoring
  • Usability: lead time and cycle time for features, from idea to getting data in prod
  • Efficiency
  • Latency
  (lanes listed in priority order)

e.g. investment weights. Completely arbitrary, and they change from cycle to cycle (e.g. sprint, quarter, half, …)

  • 40% user asks: whatever they want. If we don’t understand the ask, that’s our fault
  • 30% platform quality
  • 30% key initiatives
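To make the arithmetic concrete, here is a minimal sketch of turning those weights into a per-cycle budget. The function name, the "engineer-weeks" unit, and the 60-week capacity figure are all assumptions for illustration; the talk only gave the percentages.

```python
# Weights from the notes above; the talk called them arbitrary and
# said they change from cycle to cycle.
WEIGHTS = {"user asks": 40, "platform quality": 30, "key initiatives": 30}


def allocate(capacity_weeks, weights=WEIGHTS):
    """Split a cycle's capacity (hypothetical unit: engineer-weeks) by lane."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100%")
    return {lane: capacity_weeks * pct / 100 for lane, pct in weights.items()}


# Assumed example: a team with 60 engineer-weeks in the coming cycle.
plan = allocate(60)
for lane, weeks in plan.items():
    print(f"{lane}: {weeks:g} engineer-weeks")
```

Writing the split down like this mostly helps with the "reduce WIP" point: once a lane's budget is spent, new work in that lane waits for the next cycle.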