Architecture for Behavioural Tracking

This talk covers the architecture of a data pipeline, from front-end integration to report generation.

Sean Braithwaite

November 20, 2014

Transcript

  1. Behavioural Tracking Architecture for data pipelines

  2. Outline • Motivation • Technical requirements • Architecture • Example: Stitch • Event sourcing • Alternative implementation • Open problems • Q and A
  3. None
  4. Text

  5. Tracking Pixels • Sending data from browser • Backbone of ad-tech • Ubiquitous throughout the web
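
As a rough illustration of how a tracking pixel carries data from the browser, the sketch below encodes event fields into the query string of an image request and decodes them on the receiving side. The helper names, the `pixel.gif` path and the field names are assumptions made for illustration; the talk does not specify them.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def pixel_url(base: str, params: dict) -> str:
    """Client side: encode event fields into the image URL's query string."""
    return f"{base}?{urlencode(params)}"


def parse_pixel(url: str) -> dict:
    """Collector side: recover the event fields from the requested URL."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; this sketch keeps the first value only.
    return {key: values[0] for key, values in query.items()}


# Hypothetical endpoint and fields, for illustration only.
url = pixel_url("https://foo.com/pixel.gif", {"event": "page view", "uid": "42"})
event = parse_pixel(url)
```

The browser only needs to request an image, which is why the technique works almost anywhere a page can be rendered.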
  6. Motivation for tracking (Query) • Operational: logs, payment reports • Insight: analytics, research • Product: creator analytics, recommendation
  7. Technical Requirements • Collect and transmit data • Partition, process, filter • Make available to people
  8. Architecture Overview • Production • Collection • Transmission • Augmentation • Storage • Query

  9. Client side implementation • Instrument application logic/state • Schedule + plan for HTTP calls • Handle failure
  10. Platforms Matter • Battery life on mobile is an issue • Keep alive, scheduling, batching • Device ID and local storage
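
The scheduling, batching and failure-handling points on the client slides could be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: `EventBatcher`, the `send` callback and the thresholds are all hypothetical.

```python
import json
import time
from typing import Callable, Dict, List, Optional


class EventBatcher:
    """Buffers events and sends them in batches to save radio and battery."""

    def __init__(self, send: Callable[[str], bool], max_batch: int = 20,
                 max_age_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.send = send              # hypothetical transport, True on success
        self.max_batch = max_batch    # flush once this many events are buffered
        self.max_age_s = max_age_s    # or once the oldest event is this old
        self.clock = clock
        self.buffer: List[Dict] = []
        self.oldest: Optional[float] = None

    def track(self, event: Dict) -> None:
        if not self.buffer:
            self.oldest = self.clock()
        self.buffer.append(event)
        too_many = len(self.buffer) >= self.max_batch
        too_old = self.clock() - self.oldest >= self.max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self) -> bool:
        """Try to send the buffer; keep the events if the send fails."""
        if not self.buffer:
            return True
        try:
            ok = self.send(json.dumps(self.buffer))
        except Exception:
            ok = False
        if ok:
            self.buffer = []
            self.oldest = None
        return ok
```

Keeping failed batches in the buffer means the next `track` or `flush` call retries them, at the cost of possible duplicates downstream.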
  11. Collection • Front door of the pipeline • Highly available • Consistent application protocol
  12. Application Protocol • Schemas matter • Must communicate failure • Evolutionary path
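
One possible reading of "schemas matter", "must communicate failure" and "evolutionary path" is a versioned message format that the collector validates and answers with an explicit status. The sketch below assumes a `v` version field, two made-up schema versions and HTTP-style status codes; none of these come from the talk.

```python
from typing import Any, Dict, Tuple

# Hypothetical schemas: required fields and their types, keyed by version.
# Version 2 adds a field while version 1 stays accepted (evolutionary path).
SCHEMAS = {
    1: {"event": str, "ts": int},
    2: {"event": str, "ts": int, "device_id": str},
}


def validate(message: Dict[str, Any]) -> Tuple[int, str]:
    """Return (status, reason) so the client always learns about failures."""
    version = message.get("v")
    if not isinstance(version, int) or version not in SCHEMAS:
        return 400, f"unsupported schema version: {version!r}"
    for field, kind in SCHEMAS[version].items():
        if field not in message:
            return 400, f"missing field: {field}"
        if not isinstance(message[field], kind):
            return 400, f"bad type for field: {field}"
    return 202, "accepted"
```

A client that receives a 400 can drop or repair the event rather than retrying it forever.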
  13. HA HTTP services • DNS offers high level routing • Load Balancers • Circuit Breakers
  14. DNS Record: foo.com • Load Balancers: 10.0.1.1, 10.0.1.2 • Application Servers: collector-0.foo.com, collector-1.foo.com, collector-2.foo.com, collector-3.foo.com
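
The circuit breaker mentioned among the HA building blocks can be sketched as below. This is a generic illustration of the pattern under assumed thresholds; the class and exception names are hypothetical and not taken from the talk.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when calls are rejected without reaching the backend."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open, or re-open after a trial
            raise
        self.failures = 0                       # success closes the circuit
        self.opened_at = None
        return result
```

Failing fast keeps an unhealthy collector from tying up callers while it recovers.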
  15. Transmission • Delivery guarantees • Highly available • Queuing behaviour
  16. Ideal Case • Sender • Receiver • Message • ACK • Valid Data
  17. At most once • Sender • Receiver • Message • ACK • Failure • Unknown
  18. Duplicate Case: at least once • Sender • Receiver • Message • ACK • Failure • Resend Message
  19. Exactly Once*: De-duplication • Sender • Receiver • Message • ACK • Failure • Duplicate Message • Filter
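
The de-duplication filter on the "Exactly Once*" slide can be approximated by remembering recently seen message IDs on the receiving side. The sketch assumes every message carries a unique `id` field and that a bounded memory of IDs is acceptable; both are assumptions, and the bounded window is one plausible reading of the asterisk.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List


class Deduplicator:
    """Remembers the most recent message IDs and rejects repeats."""

    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self.seen: "OrderedDict[str, bool]" = OrderedDict()

    def accept(self, message_id: str) -> bool:
        if message_id in self.seen:
            return False                     # duplicate from a resend
        self.seen[message_id] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)    # forget the oldest ID
        return True


def deliver(messages: Iterable[Dict], dedup: Deduplicator) -> List[Dict]:
    """Pass through only the first copy of each message."""
    return [m for m in messages if dedup.accept(m["id"])]


# A resend after a lost ACK produces a second copy of message "a".
received = [{"id": "a"}, {"id": "b"}, {"id": "a"}]
unique = deliver(received, Deduplicator())
```

A duplicate that arrives after its ID has been evicted would slip through, so the guarantee only holds within the remembered window.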
  20. RabbitMQ • Master/Slave • Complex Topology • Short lived queues • At least once

    Kafka • Masterless • Zookeeper • Long lived logs • At least once
  21. Augmentation • Transformation • Fan-out • Fan-in
  22. Transmission • Storage • Collection • Augmentation

  23. Transmission • Storage • Collection • Classifier • Validated
  24. Fan-out • Transmission • Storage • Collection • a • b • c
  25. Fan-in • Transmission • Storage • Collection • a • b • c • Composite event
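
The fan-out and fan-in steps of the augmentation stage might look roughly like this: one event is handed to several independent enrichers (the a, b, c of the diagram), and their outputs are joined back into a composite event. The enricher functions and field names are invented for illustration.

```python
from typing import Callable, Dict

Event = Dict[str, object]


def fan_out(event: Event, enrichers: Dict[str, Callable[[Event], object]]) -> Dict[str, object]:
    """Send the same event to every enricher and collect each result by name."""
    return {name: enrich(event) for name, enrich in enrichers.items()}


def fan_in(event: Event, parts: Dict[str, object]) -> Event:
    """Join the enrichment results back onto a copy of the original event."""
    composite = dict(event)   # never mutate the original event
    composite.update(parts)
    return composite


# Hypothetical enrichers standing in for a, b and c.
enrichers = {
    "a": lambda e: str(e["ip"]).startswith("10."),   # e.g. internal traffic flag
    "b": lambda e: len(str(e["path"])),              # e.g. a derived metric
    "c": lambda e: str(e["path"]).split("/")[1],     # e.g. a section label
}
raw = {"ip": "10.0.1.1", "path": "/tracks/123"}
composite = fan_in(raw, fan_out(raw, enrichers))
```

In a real pipeline the enrichers would typically run as separate consumers, with the fan-in step keyed on an event ID.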
  26. Transmission • Writer • Storage • /data/2014/08/01.seq • /data/2014/08/02.seq • /data/2014/08/03.seq • /data/2014/08/03.seq
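
The file names on the writer slide suggest one sequence file per day. A writer could derive the target path from an event timestamp as sketched below, assuming Unix timestamps in seconds and UTC day boundaries; both are assumptions, as is the `partition_path` name.

```python
from datetime import datetime, timezone


def partition_path(ts_seconds: float, root: str = "/data") -> str:
    """Map an event timestamp to its daily file, e.g. /data/2014/08/01.seq."""
    day = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
    return f"{root}/{day:%Y/%m/%d}.seq"


# 1406851200 is 2014-08-01 00:00:00 UTC.
path = partition_path(1406851200)
```

Partitioning by date keeps each batch job's input to the days it actually needs.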
  27. Storage • Source of truth • Scalable • Replicated • Available
  28. HDFS • Self hosted • Cost effective at scale • Can run multi tenant • Non trivial operational cost

    S3 • Managed • Cost prohibitive at scale • Network to Map Reduce • EMR cost
  29. Query • Database • Low vs. high latency • Common vs. custom operations
  30. Columnar Store • Redshift/Vertica • SQL • Expensive

    Key Value • Cassandra/Riak • Simple queries • Complex

    Hadoop Based • Pig/Hive/Spark • Shared tenancy • Highly Scalable
  31. Architecture Summary • Connected set of components • Different failure modes • The devil is in the details
  32. Example: Stitch • Thousands of writes/sec • 100k Cassandra reads/sec • Billions of counts
  33. Clients Stitch Cassandra Aggregation HDFS RabbitMQ RoR

  34. Event Sourcing • Share data not state • Decoupling producers/consumers • Materialize State • Scale with volume / complexity
  35. Application A State Immutable State Application B Log State Log

    read append append read Materialize
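
The append, read and materialize arrows of the event sourcing diagram can be illustrated with a small in-memory log. Producers only append immutable events; each consumer reads the log and derives its own state. The `EventLog` class and the play-count example are assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, List, Tuple


class EventLog:
    """Append-only log: events are copied on write and never modified."""

    def __init__(self) -> None:
        self._events: List[Dict] = []

    def append(self, event: Dict) -> int:
        self._events.append(dict(event))   # copy so later edits cannot leak in
        return len(self._events) - 1       # offset of the new event

    def read(self, offset: int = 0) -> Tuple[Dict, ...]:
        return tuple(self._events[offset:])


def materialize_play_counts(log: EventLog) -> Counter:
    """One consumer's view: plays per track, rebuilt from the whole log."""
    counts: Counter = Counter()
    for event in log.read():
        if event["type"] == "play":
            counts[event["track"]] += 1
    return counts


log = EventLog()
log.append({"type": "play", "track": "t1"})     # appended by application A
log.append({"type": "like", "track": "t1"})
log.append({"type": "play", "track": "t1"})     # appended by application B
counts = materialize_play_counts(log)
```

Because the log is the shared artefact, a second consumer can materialize a different view from the same events without coordinating with the first.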
  36. Lambda Architecture • Developed by Nathan Marz • Real time and batch • Scalable and reliable
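
In the Lambda Architecture as commonly described, a batch view is recomputed from the full dataset, a real-time view covers events the batch has not reached yet, and queries merge the two. The sketch below mimics that with plain counters; the `batch_cutoff` parameter and the event fields are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Dict[str, object]


def count_plays(events: Iterable[Event]) -> Counter:
    counts: Counter = Counter()
    for event in events:
        counts[event["track"]] += 1
    return counts


def build_views(log: List[Event], batch_cutoff: int) -> Tuple[Counter, Counter]:
    """Batch view covers events before the cutoff, speed view the rest."""
    batch = count_plays(e for e in log if e["ts"] < batch_cutoff)
    speed = count_plays(e for e in log if e["ts"] >= batch_cutoff)
    return batch, speed


def query(track: str, batch: Counter, speed: Counter) -> int:
    """Serving layer: merge both views at read time."""
    return batch[track] + speed[track]


log = [{"track": "a", "ts": 1}, {"track": "a", "ts": 5}, {"track": "b", "ts": 7}]
batch, speed = build_views(log, batch_cutoff=5)
total_a = query("a", batch, speed)
```

When the next batch run completes, the cutoff moves forward and the speed view can discard what the batch view now covers.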
  37. None
  38. None
  39. Conclusion • Motivated tracking • Stream of immutable data • Decouple producers / consumers • Mature ideas • Commoditized in the cloud
  40. Thank you