Architecture for Behavioural Tracking

This talk covers the architecture of a data pipeline, from front-end integration to report generation.

Sean Braithwaite

November 20, 2014

Transcript

  1. Outline • Motivation • Technical requirements • Architecture • Example: Stitch • Event sourcing • Alternative implementation • Open problems • Q and A
  2. Tracking Pixels • Sending data from browser • Backbone of Ad-tech • Ubiquitous throughout the web
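
As a rough illustration of how a tracking pixel sends data from the browser, the sketch below encodes an event into the query string of an image URL and decodes it again on the receiving side. The host `collector.example.com`, the `/pixel.gif` path, and the field names are hypothetical and not taken from the talk.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical collector endpoint; not from the talk.
PIXEL_BASE = "https://collector.example.com/pixel.gif"


def build_pixel_url(event: dict) -> str:
    """Encode an event as query parameters on a 1x1 image URL."""
    return f"{PIXEL_BASE}?{urlencode(event)}"


def parse_pixel_request(url: str) -> dict:
    """Recover the event on the server from the requested URL."""
    query = urlsplit(url).query
    # parse_qs returns a list per field; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


url = build_pixel_url({"event": "page_view", "page": "/pricing", "user": "u-42"})
event = parse_pixel_request(url)
```

The browser only has to request the image for the event to reach the server, which is one reason the technique is so widespread.
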
  3. Motivation for tracking • Query • Operational: logs, payment reports • Insight: analytics, research • Product: creator analytics, recommendation
  4. Client side implementation • Instrument application logic/state • Schedule + plan for HTTP calls • Handle failure • Pipeline stages (diagram labels): Production, Transmission, Augmentation, Storage, Query, Collection
  5. Platforms Matter • Battery life on mobile is an issue • Keep alive, Scheduling, Batching • Device ID and local storage
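
The client-side concerns above (scheduling HTTP calls, batching, handling failure) can be sketched as a small buffer that sends events in batches and keeps them when a send fails. The `EventBuffer` class, the `send` callback, and the batch size are assumptions made for illustration; the talk does not show an implementation.

```python
from typing import Callable, List


class EventBuffer:
    """Buffers events and sends them in batches; keeps them on failure."""

    def __init__(self, send: Callable[[List[dict]], bool], batch_size: int = 3):
        self._send = send              # hypothetical transport; True means success
        self._batch_size = batch_size
        self._pending: List[dict] = []

    def track(self, event: dict) -> None:
        self._pending.append(event)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> bool:
        """Try to send everything pending; drop events only after success."""
        if not self._pending:
            return True
        batch = list(self._pending)
        try:
            ok = self._send(batch)
        except OSError:
            ok = False                 # network failure: retry on the next flush
        if ok:
            del self._pending[:len(batch)]
        return ok

    @property
    def pending(self) -> int:
        return len(self._pending)


# Simulated transport that fails while the (fake) network is down.
network = {"up": False}
sent = []


def fake_send(batch):
    if not network["up"]:
        raise OSError("offline")
    sent.append(batch)
    return True


buf = EventBuffer(fake_send, batch_size=2)
buf.track({"n": 1})
buf.track({"n": 2})    # triggers a flush, which fails while offline
network["up"] = True
buf.track({"n": 3})    # the next flush sends all three buffered events
```

Batching reduces the number of radio wake-ups on mobile, at the cost of holding events in memory (or local storage) until a send succeeds.
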
  6. Collection • Front door of the pipeline • Highly available • Consistent application protocol
  7. Application Protocol • Schemas matter • Must communicate failure • Evolutionary path
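
One possible reading of the three protocol points is a collector that validates each message against a versioned schema and returns an explicit status and reason. The schema contents, the `schema_version` field, and the status codes chosen here are assumptions for the sketch, not the protocol from the talk.

```python
# Hypothetical schemas keyed by version. Version 2 only adds a field,
# so clients still sending version 1 keep working (the evolutionary path).
SCHEMAS = {
    1: {"event": str, "timestamp": int},
    2: {"event": str, "timestamp": int, "device_id": str},
}


def validate(message: dict) -> tuple:
    """Return (status_code, reason) so the client learns why a message failed."""
    version = message.get("schema_version")
    schema = SCHEMAS.get(version)
    if schema is None:
        return 400, f"unknown schema_version: {version!r}"
    for field, expected_type in schema.items():
        if field not in message:
            return 422, f"missing field: {field}"
        if not isinstance(message[field], expected_type):
            return 422, f"wrong type for field: {field}"
    return 202, "accepted"


accepted = validate({"schema_version": 1, "event": "play", "timestamp": 1416441600})
missing = validate({"schema_version": 2, "event": "play", "timestamp": 1416441600})
unknown = validate({"event": "play"})
```
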
  8. HA HTTP services • DNS offers high level routing • Load Balancers • Circuit Breakers
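
For readers unfamiliar with circuit breakers, here is a minimal sketch of the general pattern: after a number of consecutive failures the breaker stops calling the backend until a cool-down has passed. The class, its thresholds, and the injected clock are illustrative assumptions, not code from the talk.

```python
import time


class CircuitBreaker:
    """Stops calling a failing backend until a cool-down has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None      # None means the circuit is closed

    def call(self, func, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise RuntimeError("circuit open: call skipped")
            self._opened_at = None  # cool-down over: allow one trial call
        try:
            result = func(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0          # a success closes the circuit again
        return result


# Demonstration with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky():
    raise ConnectionError("backend down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except ConnectionError:
        outcomes.append("failed")
    except RuntimeError:
        outcomes.append("skipped")

now[0] = 11.0                       # move past the cool-down
recovered = breaker.call(lambda: "ok")
```
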
  9. DNS Record: foo.com • Load Balancers: 10.0.1.1, 10.0.1.2 • Application Servers: collector-0.foo.com, collector-1.foo.com, collector-2.foo.com, collector-3.foo.com
  10. Case: at least once • Diagram labels: Sender, Receiver, Message, ACK, Failure, Resend, Duplicate
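
The at-least-once case can be sketched as a sender that resends until it sees an ACK. If the message arrives but the ACK is lost, the receiver ends up with a duplicate. The function names and the simulated lossy channel are assumptions for illustration.

```python
def send_at_least_once(message, deliver, max_attempts=5):
    """Resend until an ACK arrives; returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if deliver(message):        # True stands for "ACK received"
            return attempt
    raise TimeoutError("no ACK after %d attempts" % max_attempts)


received = []
acks_to_drop = [True, False]        # first ACK is lost, second gets through


def lossy_deliver(message):
    received.append(message)        # the receiver always gets the message
    return not acks_to_drop.pop(0)  # but the sender may not see the ACK


attempts = send_at_least_once({"id": "m-1", "event": "play"}, lossy_deliver)
```

After this run the sender has made two attempts and the receiver holds the same message twice, which is the duplicate the slide refers to.
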
  11. Exactly Once*: De-duplication • Diagram labels: Sender, Receiver, Message, ACK, Failure, Duplicate Message, Filter
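
A de-duplication filter can be sketched as a receiver that remembers the IDs it has already processed. This assumes every message carries a unique `id` field, which the slide does not specify, and it presumably reflects the asterisk on "Exactly Once": delivery is still at least once, but processing happens once per ID.

```python
class DeduplicatingReceiver:
    """Drops messages whose ID has already been processed."""

    def __init__(self):
        self._seen_ids = set()      # unbounded here; a real filter would expire entries
        self.processed = []

    def receive(self, message: dict) -> bool:
        """Always ACK (return True), but process each message ID only once."""
        if message["id"] not in self._seen_ids:
            self._seen_ids.add(message["id"])
            self.processed.append(message)
        return True


receiver = DeduplicatingReceiver()
receiver.receive({"id": "m-1", "event": "play"})
receiver.receive({"id": "m-1", "event": "play"})   # resent duplicate, filtered out
receiver.receive({"id": "m-2", "event": "pause"})
```
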
  12. RabbitMQ: master/slave, complex topology, short lived queues, at least once • Kafka: masterless, Zookeeper, long lived logs, at least once
  13. HDFS: self hosted, cost effective at scale, can run multi tenant, non trivial operational cost • S3: managed, cost prohibitive at scale, network to Map Reduce, EMR cost
  14. Columnar Store: Redshift/Vertica, SQL, expensive • Key Value: Cassandra/Riak, simple queries, complex • Hadoop Based: Pig/Hive/Spark, shared tenancy, highly scalable
  15. Architecture Summary • Connected set of components • Different failure modes • Devil's in the details
  16. Event Sourcing • Share data not state • Decoupling producers/consumers • Materialize State • Scale with volume / complexity
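
To make "share data not state" and "materialize state" concrete, the sketch below folds one immutable event log into two independent views. The event types and field names are invented for the example; each consumer derives its own state without coordinating with the producer or with the other consumer.

```python
from collections import defaultdict

# Immutable, append-only log of events (illustrative data).
EVENTS = (
    {"type": "track_played", "track": "t-1", "user": "u-1"},
    {"type": "track_played", "track": "t-1", "user": "u-2"},
    {"type": "track_liked", "track": "t-1", "user": "u-2"},
    {"type": "track_played", "track": "t-2", "user": "u-1"},
)


def materialize_play_counts(events):
    """One consumer's view: plays per track."""
    counts = defaultdict(int)
    for event in events:
        if event["type"] == "track_played":
            counts[event["track"]] += 1
    return dict(counts)


def materialize_likes(events):
    """A second, independent consumer: which users liked which track."""
    likes = defaultdict(set)
    for event in events:
        if event["type"] == "track_liked":
            likes[event["track"]].add(event["user"])
    return dict(likes)


play_counts = materialize_play_counts(EVENTS)
likes = materialize_likes(EVENTS)
```

Because the log is never mutated, either view can be rebuilt from scratch at any time, and new views can be added later without changing the producers.
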
  17. Conclusion • Motivated tracking • Stream of immutable data • Decouple producers / consumers • Mature ideas • Commoditized in the cloud