• Data engineering at Ada: Nice evolutionary story of a data usage analytics pipeline at Ada. Start with transactional DBs + ad-hoc queries, then introduce recurring jobs, then a more sophisticated scheduler, then the ability to ask questions in a more ad-hoc way (a SQL-based interface to the data vs. having to write a program), up to a frontend tool for visualizing queries
    • They specifically said they weren’t going to do real-time. I like this avoidance of extra complexity!


Starting a bit of a focused read this week into ETL pipelines – what they are, the abstractions, common practice around organizing logic, etc – as the topic is starting to come up at work. (We have recently hired a business analyst / revenue management type person who’s keen on being able to ask ad-hoc, deep questions about our users and how they’re using the system.)

Here goes!

ETL stands for extract, transform, load:

  • Extract: the first step in a pipeline is always collection. It can work at different granularities (individual events, or batches) and involves ingesting data up front (and possibly filtering it) for use by later stages
  • Transform: sometimes data needs to be cleaned up, enriched, de-duped, etc. before it is delivered to a final destination (which may itself be only the beginning of another pipeline!). An example: taking a record formatted for the origin system and converting it into something compatible with the destination system
  • Load: concerned with delivery to systems where data will rest, either for analysis or to become input into downstream pipelines
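To make the three stages concrete, here’s a minimal sketch in Python. The event shape and the in-memory “warehouse” are invented for illustration – a real pipeline would read from a transactional DB and load into an actual warehouse:

```python
# Minimal ETL sketch: batch-extract raw events, clean and de-dupe them,
# then load into an in-memory "warehouse" (a stand-in for a real store).

def extract(source):
    """Extract: ingest raw records, filtering obvious junk up front."""
    return [r for r in source if r.get("user_id") is not None]

def transform(records):
    """Transform: normalize origin-system formatting, de-dupe by event id."""
    seen, out = set(), []
    for r in records:
        if r["event_id"] in seen:
            continue
        seen.add(r["event_id"])
        out.append({
            "event_id": r["event_id"],
            "user_id": r["user_id"],
            "action": r["action"].lower(),  # destination expects lowercase
        })
    return out

def load(records, warehouse):
    """Load: deliver to the resting place (analysis, or the next pipeline)."""
    warehouse.extend(records)

raw = [
    {"event_id": 1, "user_id": 7, "action": "LOGIN"},
    {"event_id": 1, "user_id": 7, "action": "LOGIN"},     # duplicate
    {"event_id": 2, "user_id": None, "action": "click"},  # junk record
    {"event_id": 3, "user_id": 9, "action": "Search"},
]
warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)  # two clean rows: events 1 and 3
```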

Word Cloud

extract // transform // load // data partitioning // ingest // enrich // airflow // batch // event driven // event stream // idempotent // workflow engine // real-time //


  • Airflow’s creator talks about it like it’s a workflow engine
  • ETL type processing
  • Data warehousing. Lots of data, clean and well structured, to support ad-hoc querying
  • This is an apache project
  • It’s a python application
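Since Airflow is pitched as a workflow engine, the core idea is a DAG of tasks executed in dependency order. A toy sketch of that idea using the stdlib’s topological sort – this is not Airflow’s actual API, just the scheduling concept:

```python
# Toy workflow engine: tasks form a DAG, and the engine runs each task
# only after all of its upstream dependencies have completed.
# (Illustrates the core idea behind Airflow; not Airflow's real API.)

from graphlib import TopologicalSorter

def run_dag(tasks, deps):
    """tasks: name -> callable; deps: name -> set of upstream task names."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()
    return order

log = []
tasks = {
    "extract":   lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
}
deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
run_dag(tasks, deps)
print(log)  # ['extract', 'transform', 'load']
```

A real scheduler adds retries, backfills, and per-run state on top, but dependency-ordered execution is the kernel.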


Random Thoughts

  • When you’re small, direct integrations to specific data sinks / destinations are fine
  • Real-time can be more complex than batch ETL pipelines (but is it really?)
    • One consideration here is what happens if / when the transactional db and data warehouse db (destination) drift …
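One common guard against drift is making the load idempotent (a word-cloud term above): re-running the same batch converges the destination to the source instead of duplicating rows. A sketch using SQLite’s upsert – the table and columns are invented for illustration:

```python
import sqlite3

# Idempotent load: replaying a batch upserts rather than inserting twice,
# so re-running the pipeline pulls the warehouse back in line with the
# transactional source. (Table name and columns are hypothetical.)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT)")

def load_batch(rows):
    conn.executemany(
        "INSERT INTO users (id, plan) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET plan = excluded.plan",
        rows,
    )

load_batch([(1, "free"), (2, "pro")])
load_batch([(1, "pro"), (2, "pro")])  # replay with updated data: no dupes

print(conn.execute("SELECT id, plan FROM users ORDER BY id").fetchall())
# [(1, 'pro'), (2, 'pro')]
```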