Mechanics of Data Pipelines

Who Am I * Data Science & Engineering * Machine Learning
* Distributed Systems * Computation Art

Outline * Problem motivation @ SC * Proposed properties *
Mechanics * Open problems

SoundCloud * ~300 employees * Most working with data *
Product driven * Microservices

Conway's law * "Organisations which design systems ... are constrained to produce designs
which are copies of the communication structures of these organisations" (M. Conway)

Counting service * 10^6 reads/sec * 10^4 writes/sec * Maintains
counts for all time

Challenges * Counts are subject to spam * Deleting data
is painful * Can’t do full lambda

Royalties Reporting * Complex calculation * Interdisciplinary effort * Low
frequency

Challenges * Wide dependencies * Multiplicative error * Reports require
auditing

Internal Analytics * AB Testing * Ad hoc queries *
Dashboards

Challenges * Lots of small ETL * Trans-datastore * Deep
dependencies

TODO It’s a mess Emergent Data Pipelines * Different runtimes
* Untouchable Legacy * Different processing steps * Cross dependencies

Failure * Transient Failure * Input set variance * Hard
to test new code * Misunderstood dependency

TODO It’s a mess Consequence * Blocking Failures * Risk
Aversion * Manual Intervention * Low coherence

Desired * Minimal human intervention * High coherence * Risk
Tolerant * Incremental

Convergent * Incremental improvement * Automatic propagation * Low human
intervention
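One way to picture "automatic propagation" is as a walk over the dependency graph: when a dataset is improved, everything that reads from it, directly or transitively, is scheduled for a rebuild without anyone deciding by hand. The following is a minimal sketch under that assumption; the dataset names and the `downstream_of` helper are hypothetical and not taken from the talk.

```python
from collections import deque

def downstream_of(changed, deps):
    """Return every dataset that becomes stale when `changed` is updated.

    `deps` maps a dataset name to the list of datasets it reads from.
    """
    # Invert the edges so we can walk from an input to its consumers.
    consumers = {}
    for dataset, inputs in deps.items():
        for source in inputs:
            consumers.setdefault(source, []).append(dataset)

    stale = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer not in stale:
                stale.add(consumer)
                queue.append(consumer)
    return stale

# Hypothetical dependency graph, for illustration only.
DEPS = {
    "plays_clean": ["plays_raw"],
    "play_counts": ["plays_clean"],
    "royalty_report": ["play_counts", "rights"],
    "dashboard": ["play_counts"],
}

if __name__ == "__main__":
    print(sorted(downstream_of("plays_clean", DEPS)))
```

A scheduler built on this idea would rebuild the stale set in dependency order, which is what keeps human intervention low.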

Retroactive * Incremental re-computation * Time independence * Map input
-> output
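"Map input -> output" can be read as: each processing step is a pure function of its input, so the same input always yields the same output regardless of when the step runs. Under that reading, a fingerprint of the input is enough to decide whether a re-computation is needed. The sketch below is one possible illustration; `fingerprint`, `count_plays`, and `Recomputer` are hypothetical names, and hashing a JSON dump is an assumed choice rather than the speaker's method.

```python
import hashlib
import json

def fingerprint(records):
    """Stable digest of the input; record order is part of the identity."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def count_plays(records):
    """Pure step: depends only on its input, never on the wall clock."""
    counts = {}
    for record in records:
        counts[record["track"]] = counts.get(record["track"], 0) + 1
    return counts

class Recomputer:
    """Runs `fn` only for inputs it has not seen before."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}  # input fingerprint -> output
        self.runs = 0    # number of real computations performed

    def compute(self, records):
        key = fingerprint(records)
        if key not in self.cache:
            self.cache[key] = self.fn(records)
            self.runs += 1
        return self.cache[key]
```

If spam plays are later removed from an input, its fingerprint changes and only that input is re-computed; untouched inputs are served from the cache.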

Persistent * Keep old computations * Comparable version * Don’t
fear change
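Keeping old computations makes versions comparable: a new result can be published next to the previous one and the two can be diffed before anyone depends on it. The following is a small in-memory sketch of that idea; `VersionedOutputs` and its methods are hypothetical, and a real store would presumably persist versions rather than hold them in a dictionary.

```python
class VersionedOutputs:
    """Append-only store: publishing never overwrites an earlier version."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of outputs, oldest first

    def publish(self, name, output):
        """Store a copy of `output` and return its 1-based version number."""
        self._versions.setdefault(name, []).append(dict(output))
        return len(self._versions[name])

    def get(self, name, version):
        return self._versions[name][version - 1]

    def diff(self, name, old, new):
        """Map each key whose value differs to an (old, new) pair."""
        a, b = self.get(name, old), self.get(name, new)
        keys = sorted(set(a) | set(b))
        return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Because version 1 is still readable after version 2 is published, a change in the computation can be inspected or rolled back rather than feared.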

Entities

Datasource * Lambda architecture * Low complexity repairs * Inefficient

Partition * Date/time buckets * Time dependencies * Moderate efficiency
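As an illustration of date/time buckets, the sketch below groups events into one partition per UTC day, so that a repair only needs to touch the days whose events changed. The event shape (a dict with a `ts` field holding Unix seconds) and the `bucket_by_day` helper are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_by_day(events):
    """Group events into partitions keyed by UTC date (YYYY-MM-DD)."""
    buckets = defaultdict(list)
    for event in events:
        # Interpret the timestamp in UTC so bucket boundaries do not
        # depend on the machine's local time zone.
        day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
        buckets[day].append(event)
    return dict(buckets)
```

The "time dependencies" bullet applies here: a step that needs a window of several days has to read several such buckets, which is where partition-level repair becomes less than fully incremental.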

Chunk * Individual files * High incrementality * High complexity
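At chunk granularity, the unit of repair is a single file. One plausible way to find out which files need reprocessing is to compare content digests between two snapshots. This is a sketch under that assumption; `manifest` and `changed_chunks` are hypothetical helpers, and the chunks are passed in as in-memory bytes to keep the example self-contained.

```python
import hashlib

def manifest(chunks):
    """Map each chunk name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in chunks.items()}

def changed_chunks(old, new):
    """Compare two manifests and report which chunks need attention."""
    return {
        "added": sorted(n for n in new if n not in old),
        "removed": sorted(n for n in old if n not in new),
        "modified": sorted(n for n in new if n in old and old[n] != new[n]),
    }
```

Only the added and modified chunks have to be reprocessed, which is the high incrementality; the bookkeeping needed to track every file is the high complexity.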

Entities

Redshift

Cassandra

Persistent

Retroactive

Convergent

Open Problems * Acceptance Testing * Materialising views * Data
Discovery

Acceptance Testing * Interface focused * Risk minimisation * Continuous
delivery

Materialised Views * Ephemeral representation * Non durable * Asynchronous

Data Discovery * Consumers/Producers * Service which coordinates jobs *
Movable data sets

Conclusion * Conway's law is real * Design as data structure
* Abstract and apply

Thank You * Emily Green * Omid Aladini * Sebastian Ohm * Fronx Wurmus * Matthias Georgi
* David Whiting * Lorand Kasler * Gavin Bell * Jon Glover * Erik Bartels

More Decks by Sean Braithwaite