Architecture for Behavioural Tracking

This talk covers the architecture of a data pipeline, from front-end integration to report generation.

Sean Braithwaite

November 20, 2014

Transcript

  1. Outline • Motivation • Technical requirements • Architecture • Example: Stitch • Event sourcing • Alternative implementation • Open problems • Q and A
  2. Tracking Pixels • Sending data from browser • Backbone of Ad-tech • Ubiquitous throughout the web
  3. Motivation for tracking: Query • Operational (Logs, Payment reports) • Insight (Analytics, Research) • Product (Creator analytics, Recommendation)
  4. Client side implementation • Instrument application logic/state • Schedule + plan for HTTP Calls • Handle Failure • Pipeline stages: Production, Collection, Transmission, Augmentation, Storage, Query
  5. Platforms Matter • Battery life on mobile is an issue • Keep alive, Scheduling, Batching • Device ID and local storage
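
The batching and failure-handling points on the client-side slides can be illustrated with a minimal sketch. It assumes a hypothetical `send` callable that returns `True` when the collector accepted the payload; the class name, the thresholds and the event shape are illustrative and do not come from the talk.

```python
import json
import time
from typing import Callable, List, Optional


class EventBatcher:
    """Buffers events and flushes them in batches to limit HTTP calls."""

    def __init__(self, send: Callable[[str], bool], max_batch: int = 20,
                 max_age_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._send = send              # hypothetical transport, True on success
        self._max_batch = max_batch    # flush once this many events are buffered
        self._max_age_s = max_age_s    # or once the oldest event is this old
        self._clock = clock
        self._buffer: List[dict] = []
        self._oldest: Optional[float] = None

    def track(self, name: str, **props) -> None:
        """Record one event and flush if a threshold has been reached."""
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append({"event": name, "props": props})
        too_many = len(self._buffer) >= self._max_batch
        too_old = self._clock() - self._oldest >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self) -> bool:
        """Send everything buffered as a single payload."""
        if not self._buffer:
            return True
        if self._send(json.dumps(self._buffer)):
            self._buffer = []
            self._oldest = None
            return True
        return False  # keep the events so a later flush can retry them

    def pending(self) -> int:
        return len(self._buffer)
```

Fewer, larger requests let a mobile radio stay idle for longer, which is one way to read the battery-life bullet. A real client would probably also persist the buffer to local storage and back off between retries.
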
  6. Collection • Front door of the pipeline • Highly available • Consistent application protocol
  7. Application Protocol • Schemas matter • Must communicate failure • Evolutionary path
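
One possible reading of the three protocol bullets is sketched below: events carry a schema version, the collector validates them, and the response tells the sender whether the event was accepted. The `SCHEMAS` table, the field names and the status codes are assumptions made for illustration, not the protocol used in the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical schemas keyed by (event name, version): required field -> type.
SCHEMAS: Dict[Tuple[str, int], Dict[str, type]] = {
    ("play", 1): {"track_id": int},
    # Version 2 adds a field while version 1 stays valid, so old clients
    # keep working during a rollout (the "evolutionary path").
    ("play", 2): {"track_id": int, "position_ms": int},
}


def validate(event: dict) -> List[str]:
    """Return a list of problems; an empty list means the event is valid."""
    key = (event.get("event"), event.get("version"))
    schema = SCHEMAS.get(key)
    if schema is None:
        return [f"unknown schema {key!r}"]
    props = event.get("props", {})
    errors = []
    for field, expected in schema.items():
        if field not in props:
            errors.append(f"missing field {field!r}")
        elif not isinstance(props[field], expected):
            errors.append(f"field {field!r} should be {expected.__name__}")
    return errors


def respond(event: dict) -> Tuple[int, dict]:
    """Map validation to an HTTP-style status so failure reaches the sender."""
    errors = validate(event)
    if errors:
        return 400, {"accepted": False, "errors": errors}
    return 202, {"accepted": True}
```

Returning the error list rather than silently dropping bad events gives client developers something to act on, which is the point of "must communicate failure".
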
  8. HA HTTP services • DNS offers high level routing • Load Balancers • Circuit Breakers
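
The circuit-breaker bullet can be sketched as follows. This is a simplified version of the general pattern, not code from the talk: the thresholds, the class name and the single-trial half-open behaviour are all assumptions.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call an unhealthy backend."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                # Fail fast instead of piling requests onto a broken backend.
                raise CircuitOpenError("backend marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A collector might wrap its calls to a downstream queue with `breaker.call(publish, message)` so that a failing dependency produces quick, explicit errors rather than slow timeouts.
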
  9. DNS Record: foo.com • Load Balancers: 10.0.1.1, 10.0.1.2 • Application Servers: collector-0.foo.com, collector-1.foo.com, collector-2.foo.com, collector-3.foo.com
  10. Case: at least once • Diagram labels: Sender, Receiver, Message, ACK, Failure, Resend, Duplicate
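
The at-least-once case can be sketched as a sender that resends until it sees an ACK. The `transmit` callable below is hypothetical: it returns `True` only when an ACK came back, so a lost ACK looks the same to the sender as a lost message. The demo shows how that produces a duplicate at the receiver.

```python
from typing import Callable, List


def send_at_least_once(message: dict, transmit: Callable[[dict], bool],
                       max_attempts: int = 5) -> int:
    """Resend until acknowledged and return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if transmit(message):
            return attempt
    raise TimeoutError(f"no ACK after {max_attempts} attempts")


# Demo: the receiver stores the message, but the first ACK is lost.
received: List[str] = []


def lossy_transmit(message: dict) -> bool:
    received.append(message["id"])   # the message did arrive
    return len(received) > 1         # only the second ACK reaches the sender


attempts = send_at_least_once({"id": "m1"}, lossy_transmit)
# The receiver now holds the same message twice.
```

The sketch leaves out delays and back-off between attempts, which a real sender would need.
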
  11. Exactly Once*: De-duplication • Diagram labels: Sender, Receiver, Message, ACK, Failure, Duplicate, Filter
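
A de-duplication filter on the receiving side might look like the sketch below. It assumes every message carries a unique ID assigned by the sender, and it remembers only a bounded number of recent IDs. Both are assumptions; the talk does not say how its filter is built.

```python
from collections import OrderedDict


class DedupFilter:
    """Remembers recently seen message IDs and rejects repeats."""

    def __init__(self, capacity: int = 100_000):
        self._capacity = capacity
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def accept(self, message_id: str) -> bool:
        """Return True the first time an ID is seen, False for a duplicate."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return True
```

Because memory is bounded, a duplicate that arrives after its ID has been evicted is accepted again. That limitation may be the kind of caveat the asterisk on "Exactly Once*" is pointing at.
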
  12. RabbitMQ • Master/Slave • Complex Topology • Short lived queues • At least once
    Kafka • Masterless • Zookeeper • Long lived logs • At least once
  13. HDFS • Self hosted • Cost effective at scale • Can run multi tenant • Non trivial operational cost
    S3 • Managed • Cost prohibitive at scale • Network to Map Reduce • EMR cost
  14. Columnar Store • Redshift/Vertica • SQL • Expensive
    Key Value • Cassandra/Riak • Simple queries • Complex
    Hadoop Based • Pig/Hive/Spark • Shared tenancy • Highly Scalable
  15. Architecture Summary • Connected set of components • Different failure modes • Devil's in the details
  16. Event Sourcing • Share data not state • Decoupling producers/consumers • Materialize State • Scale with volume / complexity
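
"Share data not state" and "Materialize State" can be illustrated with a small sketch in which two independent consumers derive different views from the same immutable event log. The `Event` type and both view functions are hypothetical examples, not taken from the talk.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set


class Event(NamedTuple):
    """An immutable fact: something that happened to a track."""
    kind: str
    track_id: int


def play_counts(events: Iterable[Event]) -> Counter:
    """One consumer materializes per-track play counts."""
    counts: Counter = Counter()
    for e in events:
        if e.kind == "play":
            counts[e.track_id] += 1
    return counts


def touched_tracks(events: Iterable[Event]) -> Set[int]:
    """Another consumer materializes the set of tracks with any activity."""
    return {e.track_id for e in events}


log = (
    Event("play", 1),
    Event("pause", 1),
    Event("play", 1),
    Event("play", 2),
)
```

Neither function knows about the other or about the producer. Adding a third view means replaying the same log through new code, with no change to the producer.
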
  17. Conclusion • Motivated tracking • Stream of immutable data • Decouple producers / consumers • Mature ideas • Commoditized in the cloud