Technical Requirements
• Collect and transmit data
• Partition, process, filter
• Make available to people
Architecture Overview
Production → Collection → Transmission → Augmentation → Storage → Query
Client side implementation
• Instrument application logic/state
• Schedule and plan for HTTP calls
• Handle Failure
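The three client-side duties above (instrument, schedule HTTP calls, handle failure) can be sketched as a small in-memory buffer. This is a minimal illustration, not the deck's implementation: `EventBuffer`, `track`, `flush`, and the `send` callable are hypothetical names, and `send` stands in for whatever HTTP call the client actually makes.

```python
import json
import time
from typing import Callable, List


class EventBuffer:
    """Buffers instrumented events and flushes them in batches."""

    def __init__(self, send: Callable[[str], bool], batch_size: int = 3,
                 max_retries: int = 3, backoff_s: float = 0.0):
        self.send = send              # hypothetical transport; returns True on success
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.pending: List[dict] = []

    def track(self, name: str, **props) -> None:
        """Record one event; flush once a full batch has accumulated."""
        self.pending.append({"event": name, "props": props})
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> bool:
        """Try to send everything pending, retrying with exponential backoff."""
        if not self.pending:
            return True
        payload = json.dumps(self.pending)
        for attempt in range(self.max_retries):
            if self.send(payload):
                self.pending.clear()
                return True
            time.sleep(self.backoff_s * (2 ** attempt))
        # Every attempt failed: keep the events so a later flush can retry.
        return False
```

Keeping failed events in `pending` rather than dropping them is one possible design choice; a real client would also need to cap the buffer size.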
Production Transmission Augmentation Storage Query
Collection
Platforms Matter
• Battery life on mobile is an issue
• Keep alive, Scheduling, Batching
• Device ID and local storage
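For the "Device ID and local storage" point, one common approach is to generate an identifier once and persist it locally so later sessions reuse it. The sketch below assumes a plain file as the local store; `get_device_id` and the file path are hypothetical, and a mobile platform would use its own storage API instead.

```python
import uuid
from pathlib import Path


def get_device_id(path: Path) -> str:
    """Return the stored device ID, creating and persisting one on first use."""
    if path.exists():
        return path.read_text().strip()
    device_id = str(uuid.uuid4())  # random ID; not tied to any hardware identifier
    path.write_text(device_id)
    return device_id
```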
Collection
• Front door of the pipeline
• Highly available
• Consistent application protocol
Application Protocol
• Schemas matter
• Must communicate failure
• Evolutionary path
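A minimal sketch of these three points, assuming messages are JSON-like dicts that carry an explicit version field. The field names (`v`, `event`, `ts`, `device_id`), the `SCHEMAS` table, and the status codes are illustrative assumptions, not the protocol from the talk.

```python
from typing import Any, Dict, Tuple

# Hypothetical schemas: version 2 adds a field, version 1 stays accepted,
# which gives older clients an evolutionary path.
SCHEMAS: Dict[int, Dict[str, type]] = {
    1: {"event": str, "ts": int},
    2: {"event": str, "ts": int, "device_id": str},
}


def validate(msg: Dict[str, Any]) -> Tuple[int, str]:
    """Return an HTTP-style (status, reason) so failures are communicated."""
    version = msg.get("v")
    schema = SCHEMAS.get(version)
    if schema is None:
        return 400, f"unsupported schema version: {version!r}"
    for field, expected in schema.items():
        if field not in msg:
            return 400, f"missing field: {field}"
        if not isinstance(msg[field], expected):
            return 400, f"bad type for field: {field}"
    return 200, "ok"
```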
HA HTTP services
• DNS offers high level routing
• Load Balancers
• Circuit Breakers
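The circuit breaker idea can be sketched as a wrapper that fails fast after repeated errors and lets a trial call through once a timeout has passed. This is a simplified illustration: the class name, thresholds, and the injectable `clock` parameter are assumptions chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected without trying."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```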
DNS Record: foo.com
Load Balancers: 10.0.1.1, 10.0.1.2
Application Servers: collector-0.foo.com, collector-1.foo.com, collector-2.foo.com, collector-3.foo.com
Transmission
• Delivery guarantees
• Highly available
• Queuing behaviour
Ideal Case
• Sender → Receiver: Message
• Receiver → Sender: ACK
• Result: Valid Data
At most once
• Sender → Receiver: Message
• Receiver → Sender: ACK Failure
• Result: Unknown; the sender does not resend, so the message may be lost
Duplicate Case: at least once
• Sender → Receiver: Message
• Receiver → Sender: ACK Failure
• Sender → Receiver: Resend Message
• Result: Receiver holds the Message twice
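The duplicate case can be reproduced in a few lines: the sender resends until it sees an ACK, and the receiver avoids storing duplicates by remembering message IDs. All names here (`Receiver`, `send_at_least_once`, `lossy_ack_channel`) are hypothetical, and deduplicating by ID is one common mitigation rather than something the slides prescribe.

```python
class Receiver:
    """Stores each message once, even if it is delivered several times."""

    def __init__(self):
        self.seen = set()
        self.stored = []
        self.deliveries = 0

    def receive(self, msg_id, body):
        self.deliveries += 1
        if msg_id not in self.seen:   # idempotent: duplicates are ignored
            self.seen.add(msg_id)
            self.stored.append(body)
        return True                   # ACK


def lossy_ack_channel(receiver, drop_first_n):
    """Simulated link that delivers every message but loses the first N ACKs."""
    dropped = [0]

    def deliver(msg_id, body):
        ack = receiver.receive(msg_id, body)
        if dropped[0] < drop_first_n:
            dropped[0] += 1
            return False              # the ACK was lost in transit
        return ack

    return deliver


def send_at_least_once(deliver, msg_id, body, max_attempts=5):
    """Resend until an ACK arrives; returns the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if deliver(msg_id, body):
            return attempt
    raise RuntimeError("no ACK after max attempts")
```

With one lost ACK the receiver sees the message twice but stores it once, which is the at-least-once trade: no loss, at the cost of handling duplicates downstream.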
RabbitMQ
• Master/Slave
• Complex Topology
• Short lived queues
• At least once
Kafka
• Masterless
• Zookeeper
• Long lived logs
• At least once
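The "short lived queues" versus "long lived logs" distinction can be illustrated with two toy in-memory models. These are deliberately simplified and are not the RabbitMQ or Kafka client APIs; the class and method names are invented for the example.

```python
from collections import deque


class ShortLivedQueue:
    """Queue model: a message is gone once a consumer has taken it."""

    def __init__(self):
        self._items = deque()

    def publish(self, msg):
        self._items.append(msg)

    def consume(self):
        return self._items.popleft() if self._items else None


class LongLivedLog:
    """Log model: messages are retained; each consumer tracks its own offset."""

    def __init__(self):
        self._entries = []
        self._offsets = {}

    def publish(self, msg):
        self._entries.append(msg)

    def consume(self, consumer):
        offset = self._offsets.get(consumer, 0)
        if offset >= len(self._entries):
            return None
        self._offsets[consumer] = offset + 1
        return self._entries[offset]

    def seek(self, consumer, offset):
        """Rewind or skip ahead, e.g. to replay after a downstream bug."""
        self._offsets[consumer] = offset
```

In the log model a second consumer, or the same consumer after a `seek`, can read earlier messages again; in the queue model they are no longer there.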
Storage
• Source of truth
• Scalable
• Replicated
• Available
HDFS
• Self hosted
• Cost effective at scale
• Can run multi tenant
• Non-trivial operational cost
S3
• Managed
• Cost prohibitive at scale
• Data crosses the network to reach MapReduce
• EMR cost
Query
• Database
• Low vs. high latency
• Common vs. custom operations
Columnar Store
• Redshift/Vertica
• SQL
• Expensive
Key Value
• Cassandra/Riak
• Simple queries
• Complex
Hadoop Based
• Pig/Hive/Spark
• Shared tenancy
• Highly Scalable
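To make the contrast between key value and columnar access concrete, here is a toy example over three made-up rows. The data, field names, and `mean_latency_by_country` are illustrative only and say nothing about how Redshift, Vertica, Cassandra, or Riak work internally.

```python
from collections import defaultdict

# Made-up sample events.
rows = [
    {"user": "u1", "country": "DE", "latency_ms": 120},
    {"user": "u2", "country": "US", "latency_ms": 80},
    {"user": "u3", "country": "DE", "latency_ms": 100},
]

# Key value layout: a lookup by key is cheap; anything else needs a full scan.
by_user = {r["user"]: r for r in rows}

# Columnar layout: one list per column, so an aggregate touches only
# the columns it needs.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def mean_latency_by_country(cols):
    """Group-by aggregate that reads just two columns."""
    totals = defaultdict(lambda: [0, 0])  # country -> [sum, count]
    for country, ms in zip(cols["country"], cols["latency_ms"]):
        totals[country][0] += ms
        totals[country][1] += 1
    return {c: s / n for c, (s, n) in totals.items()}
```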
Architecture Summary
• Connected set of components
• Different failure modes
• The devil's in the details