Slide 1

Slide 1 text

Behavioural Tracking Architecture for data pipelines

Slide 2

Slide 2 text

Outline
• Motivation
• Technical requirements
• Architecture
• Example: Stitch
• Event sourcing
• Alternative implementation
• Open problems
• Q and A

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Text

Slide 5

Slide 5 text

Tracking Pixels
• Sending data from browser
• Backbone of Ad-tech
• Ubiquitous throughout the web

Slide 6

Slide 6 text

Motivation for tracking
Query
Operational
• Logs
• Payment reports
Insight
• Analytics
• Research
Product
• Creator analytics
• Recommendation

Slide 7

Slide 7 text

Technical Requirements
• Collect and transmit data
• Partition, process, filter
• Make available to people

Slide 8

Slide 8 text

Architecture Overview
Production → Collection → Transmission → Augmentation → Storage → Query

Slide 9

Slide 9 text

Client side implementation
• Instrument application logic/state
• Schedule + plan for HTTP Calls
• Handle Failure

Slide 10

Slide 10 text

Platforms Matter
• Battery life on mobile is an issue
• Keep alive, Scheduling, Batching
• Device ID and local storage
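The client-side points on the two slides above (schedule HTTP calls, batch, handle failure) can be sketched as a small buffer that sends events in batches and keeps them when a send fails. This is a minimal sketch, not the speaker's implementation: `EventBuffer`, `send`, and `batch_size` are hypothetical names, and the transport is left abstract because the slides do not specify one.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


class EventBuffer:
    """Buffers events and flushes them in batches; keeps them on failure."""

    def __init__(self, send: Callable[[List[Event]], bool], batch_size: int = 3):
        self.send = send              # hypothetical transport, True on success
        self.batch_size = batch_size  # assumed tuning knob, not from the slides
        self.pending: List[Event] = []

    def track(self, event: Event) -> None:
        """Queue an event and flush once a full batch has accumulated."""
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> bool:
        """Try to send everything queued; on failure nothing is dropped."""
        if not self.pending:
            return True
        batch = list(self.pending)
        try:
            ok = self.send(batch)
        except OSError:
            ok = False
        if ok:
            # Remove only what was sent.
            self.pending = self.pending[len(batch):]
        return ok
```

Batching trades latency for fewer radio wake-ups on mobile. A failed batch is retried on the next `track` call that reaches the threshold, so the receiving side has to tolerate late or repeated events.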

Slide 11

Slide 11 text

Collection
• Front door of the pipeline
• Highly available
• Consistent application protocol

Slide 12

Slide 12 text

Application Protocol
• Schemas matter
• Must communicate failure
• Evolutionary path
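One way to read the three protocol points together: every event names its schema and version, the collector answers with an explicit status, and new versions are added next to old ones. The sketch below assumes this reading. The `play` event, its fields, and the status codes chosen are illustrative, not taken from the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical schemas: required field names per (event type, version).
SCHEMAS: Dict[Tuple[str, int], List[str]] = {
    ("play", 1): ["user_id", "track_id"],
    ("play", 2): ["user_id", "track_id", "client"],  # v2 adds a field
}


def validate(event: Dict[str, object]) -> Tuple[int, str]:
    """Return an HTTP-style (status, message) so clients learn about failures."""
    key = (event.get("type"), event.get("version"))
    required = SCHEMAS.get(key)
    if required is None:
        return 400, f"unknown schema {key}"
    missing = [field for field in required if field not in event]
    if missing:
        return 422, f"missing fields: {', '.join(missing)}"
    return 202, "accepted"
```

Because version 1 stays registered after version 2 is added, old clients keep working while new ones migrate.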

Slide 13

Slide 13 text

HA HTTP services
• DNS offers high level routing
• Load Balancers
• Circuit Breakers
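The circuit breaker named on the slide above can be illustrated with a minimal sketch: after a number of consecutive failures, calls are rejected immediately for a cool-down period, then a single trial call is allowed through. The class name, thresholds, and injectable `clock` are assumptions for illustration.

```python
import time
from typing import Callable, Optional


class CircuitOpen(Exception):
    """Raised when calls are rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("backend assumed unhealthy")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Failing fast protects a struggling collector from retry storms and lets the caller fall back to buffering locally.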

Slide 14

Slide 14 text

DNS Record: foo.com
Load Balancers: 10.0.1.1, 10.0.1.2
Application Servers: collector-0.foo.com, collector-1.foo.com, collector-2.foo.com, collector-3.foo.com

Slide 15

Slide 15 text

Transmission
• Delivery guarantees
• Highly available
• Queuing behaviour

Slide 16

Slide 16 text

Ideal Case
Diagram: Sender, Receiver, Message, ACK, Valid Data

Slide 17

Slide 17 text

At most once
Diagram: Sender, Receiver, Message, ACK, Failure, Unknown

Slide 18

Slide 18 text

Duplicate Case: at least once
Diagram: Sender, Receiver, Message, ACK, Failure, Resend Message

Slide 19

Slide 19 text

Exactly Once*: De-duplication
Diagram: Sender, Receiver, Message, ACK, Failure, Duplicate Message, Filter
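The delivery cases on the preceding slides can be simulated in a few lines: a sender that resends until it sees an ACK (at least once), and a receiver that filters duplicates by message ID. This is a sketch under assumptions. The `id` field, the `acks_lost` parameter, and the in-memory `seen` set are hypothetical.

```python
from typing import Dict, List, Set


class DedupReceiver:
    """Receiver that filters duplicates by message ID (hypothetical field `id`)."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.accepted: List[Dict[str, str]] = []

    def receive(self, message: Dict[str, str]) -> bool:
        """Store the message unless its ID was seen before; always ACK (True)."""
        if message["id"] not in self.seen:
            self.seen.add(message["id"])
            self.accepted.append(message)
        return True


def send_at_least_once(receiver: DedupReceiver, message: Dict[str, str],
                       acks_lost: int) -> int:
    """Resend until an ACK arrives; the first `acks_lost` ACKs are dropped."""
    attempts = 0
    while True:
        attempts += 1
        ack = receiver.receive(message)
        if ack and attempts > acks_lost:
            return attempts
```

The `seen` set grows without bound here. A real filter would need a bounded window or a persistent store, which is an assumption beyond what the slides state.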

Slide 20

Slide 20 text

RabbitMQ
• Master/Slave
• Complex Topology
• Short lived queues
• At least once
Kafka
• Masterless
• Zookeeper
• Long lived logs
• At least once

Slide 21

Slide 21 text

Augmentation
• Transformation
• Fan-out
• Fan-in

Slide 22

Slide 22 text

Diagram: Collection, Transmission, Augmentation, Storage

Slide 23

Slide 23 text

Diagram: Collection, Transmission, Classifier, Validated, Storage

Slide 24

Slide 24 text

Fan-out
Diagram: Collection, Transmission, Storage; a, b, c

Slide 25

Slide 25 text

Fan-in
Diagram: Collection, Transmission, Storage; a, b, c; Composite event
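Fan-out and fan-in from the two slides above can be sketched as a pair of pure functions: one splits an event into parts, the other combines parts into a composite event. The event shape (`user_id`, `items`, `item`) is a hypothetical example, not the schema used in the talk.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def fan_out(event: Event) -> List[Event]:
    """Split one event carrying several items into one event per item."""
    return [{"user_id": event["user_id"], "item": item}
            for item in event["items"]]


def fan_in(events: List[Event]) -> Event:
    """Combine events for the same user into one composite event."""
    user_ids = {e["user_id"] for e in events}
    if len(user_ids) != 1:
        raise ValueError("fan_in expects events for exactly one user")
    return {"user_id": user_ids.pop(), "items": [e["item"] for e in events]}
```

In a running pipeline the fan-in step also has to decide how long to wait for missing parts. This sketch sidesteps that by taking the full list as input.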

Slide 26

Slide 26 text

Diagram: Transmission, Writer, Storage
/data/2014/08/01.seq
/data/2014/08/02.seq
/data/2014/08/03.seq
/data/2014/08/03.seq
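The paths on the slide above suggest one file per day. A writer could derive the target path from each event's timestamp, as in this minimal sketch. Partitioning by UTC day and the `partition_path` helper are assumptions; only the `/data/YYYY/MM/DD.seq` layout comes from the slide.

```python
from datetime import datetime, timezone


def partition_path(timestamp: float, root: str = "/data") -> str:
    """Map a Unix timestamp (seconds) to a daily file path, using UTC days."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return f"{root}/{day:%Y/%m/%d}.seq"
```

Partitioning by day keeps batch jobs cheap: a job for one date reads one file and never scans the whole dataset.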

Slide 27

Slide 27 text

Storage
• Source of truth
• Scalable
• Replicated
• Available

Slide 28

Slide 28 text

HDFS
• Self hosted
• Cost effective at scale
• Can run multi tenant
• Non trivial operational cost
S3
• Managed
• Cost prohibitive at scale
• Network to Map Reduce
• EMR cost

Slide 29

Slide 29 text

Query
• Database
• Low vs. high latency
• Common vs. custom operations

Slide 30

Slide 30 text

Columnar Store
• Redshift/Vertica
• SQL
• Expensive
Key Value
• Cassandra/Riak
• Simple queries
• Complex
Hadoop Based
• Pig/Hive/Spark
• Shared tenancy
• Highly Scalable

Slide 31

Slide 31 text

Architecture Summary
• Connected set of components
• Different failure modes
• The devil is in the details

Slide 32

Slide 32 text

Example: Stitch
• Thousands of writes/sec
• 100k Cassandra reads/sec
• Billions of counts

Slide 33

Slide 33 text

Diagram: Clients, Stitch, Cassandra, Aggregation, HDFS, RabbitMQ, RoR

Slide 34

Slide 34 text

Event Sourcing
• Share data not state
• Decoupling producers/consumers
• Materialize State
• Scale with volume / complexity

Slide 35

Slide 35 text

Diagram: Application A, Application B, Log, State, Immutable State; read, append, Materialize
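The event-sourcing idea above (share data, not state) can be shown with a toy log: producers append immutable events, and each application materializes its own state by reading the same log. The event tuple layout and both view functions are hypothetical examples.

```python
from typing import Dict, List, Tuple

# An append-only log of immutable events: (event type, user, track).
Log = List[Tuple[str, str, str]]


def append(log: Log, event: Tuple[str, str, str]) -> None:
    """Producers only ever append; existing entries are never modified."""
    log.append(event)


def play_counts(log: Log) -> Dict[str, int]:
    """Application A's state: plays per track."""
    counts: Dict[str, int] = {}
    for kind, _user, track in log:
        if kind == "play":
            counts[track] = counts.get(track, 0) + 1
    return counts


def likes_by_user(log: Log) -> Dict[str, List[str]]:
    """Application B's state: liked tracks per user, from the same log."""
    likes: Dict[str, List[str]] = {}
    for kind, user, track in log:
        if kind == "like":
            likes.setdefault(user, []).append(track)
    return likes
```

Neither application reads the other's state. If a view has a bug, it can be fixed and rebuilt by replaying the log from the start.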

Slide 36

Slide 36 text

Lambda Architecture
• Developed by Nathan Marz
• Real time and batch
• Scalable and reliable
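The "real time and batch" point can be illustrated by how a query is answered in a Lambda-style system: a batch view recomputed from the full dataset is combined with a real-time view covering events since the last batch run. The sketch below assumes simple per-key counts. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def batch_view(events: Iterable[str]) -> Counter:
    """Recomputed from the full master dataset on each batch run."""
    return Counter(events)


def merged_count(key: str, batch: Mapping[str, int],
                 realtime: Mapping[str, int]) -> int:
    """Serve a query by adding the batch view and the real-time delta."""
    return batch.get(key, 0) + realtime.get(key, 0)
```

For example, with `batch_view(["t1", "t1", "t2"])` and a real-time view holding one more `"t1"`, `merged_count("t1", ...)` returns 3. When the next batch run absorbs the recent events, the real-time view is reset.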

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

Conclusion
• Motivated tracking
• Stream of immutable data
• Decouple producers / consumers
• Mature ideas
• Commoditized in the cloud

Slide 40

Slide 40 text

Thank you