Essential ingredients for real time stream processing @Scale by Kartik Paramasivam at Big Data Spain 2015

At LinkedIn, we ingest more than 1 trillion events per day pertaining to user behavior, application and system health, etc. into our pub-sub system (Kafka). Another source of events is the updates happening on our SQL and NoSQL databases. For example, every time a user changes their LinkedIn profile, a ton of downstream applications need to know what happened and react to it. We have a system (Databus) which listens to changes in the database transaction logs and makes them available for downstream processing. We process ~2.1 trillion such database change events per week.

We use Apache Samza for processing these event streams in real time. In this presentation we will discuss some of the challenges we faced and the various techniques we used to overcome them.

Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.bigdataspain.org/program/thu/slot-3.html

Big Data Spain

October 22, 2015
Transcript

  1. About Me
     • ‘Streams Infrastructure’ at LinkedIn
       – Pub-sub messaging: Apache Kafka
       – Change capture from various data systems: Databus
       – Stream processing platform: Apache Samza
     • Previously
       – Built Microsoft Cloud Messaging (EventHub) and Enterprise Messaging (Queues/Topics)
       – .NET Web Services and Workflow stack
       – BizTalk Server
  2. Agenda
     • What is Stream Processing?
     • Scenarios
     • Canonical Architecture
     • Essential Ingredients of Stream Processing
     • Close
  3. Agenda
     • Stream Processing Intro
     • Scenarios
     • Canonical Architecture
     • Essential Ingredients of Stream Processing
     • Close
  4. Agenda
     • Stream Processing Intro
     • Scenarios
     • Canonical Architecture
     • Essential Ingredients of Stream Processing
     • Close
  5. Canonical Architecture
     (Diagram: clients (browsers, devices, sensors, …) talk to the services tier, backed by Espresso; events are ingested through Kafka and database changes through Databus; real-time processing (Samza) and batch processing (Hadoop/Spark) consume them; results are bulk-uploaded to serving stores such as Voldemort R/O and Espresso, which feed the services tier.)
  6. Agenda
     • Stream Processing Intro
     • Scenarios
     • Canonical Architecture
     • Essential Ingredients of Stream Processing
     • Close
  7. Basics: Scaling Ingestion
     - Streams are partitioned
     - Messages are sent to partitions based on a PartitionKey
     - Time-based message retention
     (Diagram: producers write to Stream A with partition keys such as Pkey=10, 25, 45; consumer A instances on machine 1 and machine 2 each read a subset of the partitions. Examples: Kafka, AWS Kinesis, Azure EventHub.)
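A minimal sketch of keyed ingestion with the Kafka Java producer (broker address, topic name, key and payload are placeholders): the default partitioner hashes the record key, so all events carrying the same PartitionKey land in the same partition and are consumed in order by the same consumer instance.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedIngestion {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record key acts as the PartitionKey: the default partitioner
            // hashes it, so every event for memberId "10" goes to one partition.
            String memberId = "10";
            producer.send(new ProducerRecord<>("StreamA", memberId, "{\"event\":\"pageView\"}"));
        }
    }
}
```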
  8. Scaling Processing, e.g. Samza
     (Diagram: a Samza job consumes Stream A and Stream B; Task 1, Task 2 and Task 3 each own one partition of the input streams.)
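The per-partition task model above can be sketched with Samza's StreamTask interface; the class name and processing logic here are placeholders. Samza creates one task instance per input partition, so throughput scales with the partition count.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// One task instance is created per input partition of Stream A / Stream B,
// so adding partitions (and containers) scales the job horizontally.
public class PassThroughTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        Object event = envelope.getMessage();
        // ... per-event processing logic goes here ...
    }
}
```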
  9. Horizontal Scaling is great! But…
     • More machines means more $$
     • Need to do more with less
     • So what’s the key bottleneck during event/stream processing?
  10. Key Bottleneck: “Accessing Data”
     • Big impact on CPU, network, disk
     • Types of data access
       1. Adjunct data – read-only data
       2. Scratchpad/derived data – read-write data
  11. Adjunct Data – typical access
     (Diagram: AdClick events arrive via Kafka; the processing job reads member info from a remote Member database for every event and emits AdQuality updates back to Kafka.)
     Concerns: 1. Latency 2. CPU 3. Network 4. DDoS
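A sketch of this “typical access” pattern, with MemberDbClient as an invented stand-in for a remote database client: every event pays a blocking network round trip, which is where the latency, CPU, network and DDoS concerns come from.

```java
// Hypothetical remote-lookup pattern: one database round trip per event.
// MemberDbClient is an invented interface standing in for a real remote client.
public class RemoteLookupEnricher {

    public interface MemberDbClient {
        String getMemberInfo(String memberId); // blocking network call
    }

    private final MemberDbClient memberDb;

    public RemoteLookupEnricher(MemberDbClient memberDb) {
        this.memberDb = memberDb;
    }

    public String enrich(String memberId, String adClickEvent) {
        // Latency, CPU and network cost are paid here for every single event,
        // and a reprocessing run can effectively DDoS the member database.
        String memberInfo = memberDb.getMemberInfo(memberId);
        return adClickEvent + " | " + memberInfo;
    }
}
```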
  12. Scratchpad/Derived Data – typical access
     (Diagram: sensor data arrives via Kafka; the processing job reads and updates per-device state in a remote Device State database and emits alerts to Kafka.)
     Concerns: 1. Latency 2. CPU 3. Network 4. DDoS
  13. Adjunct Data – with Samza
     (Diagram: AdClick events arrive via Kafka; member updates flow from the Member database (Espresso) through Databus into the same job; Task 1, Task 2 and Task 3 each keep a local RocksDB store and write output to Kafka.)
     Kafka, Databus, the database and the Samza job are all partitioned by MemberId.
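A sketch of that co-partitioned join in Samza, assuming a RocksDB-backed store named "member-store" and illustrative stream names: change-capture events from "MemberUpdates" keep the local store fresh, and AdClick events are enriched with a purely local lookup instead of a remote database call.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Joins AdClick events against a local copy of member data that is kept
// up to date from the change-capture (member update) stream.
public class AdClickEnricherTask implements StreamTask, InitableTask {
    // "member-store" is an illustrative store name configured as a RocksDB store.
    private KeyValueStore<String, String> memberStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        memberStore = (KeyValueStore<String, String>) context.getStore("member-store");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        if (stream.equals("MemberUpdates")) {
            // Change-capture event: refresh the local copy of this member's profile.
            memberStore.put((String) envelope.getKey(), (String) envelope.getMessage());
        } else {
            // AdClick event: look up member info locally instead of calling a remote DB.
            String memberInfo = memberStore.get((String) envelope.getKey());
            // ... combine the click and memberInfo and send downstream via collector ...
        }
    }
}
```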
  14. Fault Tolerance in a stateful Samza job
     (Diagram: Task-0 through Task-3 run on Host-A, Host-B and Host-C, each with a local store partition P0–P3 backed by a changelog stream.) Stable state.
  15. Fault Tolerance in a stateful Samza job
     (Same diagram.) Host A dies/fails.
  16. Fault Tolerance in a stateful Samza job
     (Same diagram, with the affected tasks now on Host-E.) YARN allocates the tasks to a container on a different host!
  17. Fault Tolerance in a stateful Samza job
     (Same diagram.) Restore local state by reading from the changelog.
  18. Fault Tolerance in a stateful Samza job
     (Same diagram.) Back to stable state.
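The recovery shown above relies on each store having a changelog. In a real job these settings live in the job's .properties file; the sketch below assembles the same keys into a Samza MapConfig, with "member-store" and the changelog topic name as illustrative values.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.samza.config.MapConfig;

public class ChangelogConfigExample {
    public static MapConfig storeConfig() {
        Map<String, String> cfg = new HashMap<>();
        // Local RocksDB store for the task ("member-store" is an illustrative name).
        cfg.put("stores.member-store.factory",
                "org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory");
        cfg.put("stores.member-store.key.serde", "string");
        cfg.put("stores.member-store.msg.serde", "string");
        // Every write to the store is also sent to a log-compacted Kafka topic;
        // when a task moves to a new host, Samza replays this changelog to
        // rebuild the local state before processing resumes.
        cfg.put("stores.member-store.changelog", "kafka.member-store-changelog");
        return new MapConfig(cfg);
    }
}
```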
  19. Performance Numbers with Samza
     Hardware spec: 24 cores, 1-Gig NIC, SSD
     • (Baseline) Simple pass-through job with no local state – 1.2 million msg/sec
     • Samza job with local state – 400k msg/sec
     • Samza job with local state with Kafka backup – 300k msg/sec
  20. Local State – Summary
     • Great for both read-only data and read-write data
     • Secret sauce to make local state work:
       1. Change capture system: Databus / DynamoDB Streams
       2. Durable backup with Kafka log-compacted topics
  21. Reprocessing: why do we need it?
     • Software upgrades… yes, bugs are a reality
     • Business logic changes
     • First-time job deployment
  22. Reprocessing Data – with Samza
     (Diagram: the Company/Title/Location standardization job bootstraps from the Member database (Espresso) via Databus, applies a machine-learning model, keeps consuming ongoing member updates, and writes its output to Kafka.)
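Samza expresses this bootstrap step through per-stream configuration: a stream marked as bootstrap is read from the oldest offset and fully caught up before other inputs are processed. A sketch with illustrative system and stream names ("kafka", "MemberProfiles"); in a real job these would be .properties entries.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.samza.config.MapConfig;

public class BootstrapConfigExample {
    public static MapConfig bootstrapConfig() {
        Map<String, String> cfg = new HashMap<>();
        // Read the snapshot/change stream from the very beginning on each deploy ...
        cfg.put("systems.kafka.streams.MemberProfiles.samza.offset.default", "oldest");
        cfg.put("systems.kafka.streams.MemberProfiles.samza.reset.offset", "true");
        // ... and fully catch up on it before processing any other input stream.
        cfg.put("systems.kafka.streams.MemberProfiles.samza.bootstrap", "true");
        return new MapConfig(cfg);
    }
}
```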
  23. Reprocessing – Caveats
     • Stream processors are fast… they can DoS the system if you reprocess
       – Control the max concurrency of your job
       – Quotas for Kafka and databases
       – Async load into databases (Project Venice)
     • Capacity – reprocessing a 100 TB source?
     • Doesn’t reprocessing mean you are no longer real-time?
  24. Essential Ingredients of Stream Processing
     1. Scale, but not at any cost
     2. Reprocessing
     3. Accuracy of results
     4. Easy to program
  25. Querying over an infinite stream
     (Diagram: User1 generates an Ad View event at 1:00 pm and an Ad Click event at 1:01 pm; the Ad Quality processor must answer: did the user click the ad within 2 minutes of seeing it?)
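One way to answer that question with the local-state machinery described earlier: remember the view time per (user, ad) pair in a local store and compare it with the click time when the click arrives. This is a sketch with illustrative store and stream names and a placeholder for extracting the event timestamp, not the exact job LinkedIn runs.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Answers "did the user click within 2 minutes of the view?" by keeping the
// view time per (user, ad) pair in a local store.
public class AdQualityTask implements StreamTask, InitableTask {
    private static final long WINDOW_MS = 2 * 60 * 1000L;

    // "view-times" is an illustrative store name; key = "userId:adId", value = view time.
    private KeyValueStore<String, Long> viewTimes;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        viewTimes = (KeyValueStore<String, Long>) context.getStore("view-times");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        String key = (String) envelope.getKey();      // e.g. "user1:ad42"
        long eventTime = extractEventTime(envelope);  // placeholder helper

        if (stream.equals("AdViews")) {
            viewTimes.put(key, eventTime);
        } else if (stream.equals("AdClicks")) {
            Long viewTime = viewTimes.get(key);
            boolean clickedInWindow =
                viewTime != null && eventTime - viewTime <= WINDOW_MS;
            // ... emit an ad-quality result for this (user, ad) pair ...
        }
    }

    private long extractEventTime(IncomingMessageEnvelope envelope) {
        // Assume the producer embeds an event timestamp in the message payload.
        return 0L; // placeholder for illustration
    }
}
```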
  26. Why delays happen
     (Diagram: a load balancer routes the AdView event to the services tier in Datacenter 1; it is written to the local Kafka cluster, mirrored to Datacenter 2, and consumed by the Ad Quality processor (Samza) in each datacenter.)
  27. Why delays happen
     (Same diagram, but for the AdClick event, which the load balancer may route to Datacenter 2 and which reaches the other datacenter only after mirroring.)
  28. What do we need to do to get accurate results?
     Deal with:
     • Late arrivals – e.g. the AdClick event showed up 5 minutes late
     • Out-of-order arrival – e.g. the AdClick event showed up before the AdView event
     • Influenced by “Google MillWheel”
  29. Solution
     (Diagram: the processing job consumes AdView and AdClick events from Kafka; Task 1, Task 2 and Task 3 each keep a local message store and write output to Kafka.)
     1. All events are stored locally
     2. Find the impacted window(s) for late arrivals
     3. Recompute the result
     4. Choose a strategy for emitting results (absolute or relative)
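A sketch of steps 1 to 4, assuming a per-task key-value store and illustrative helper methods: every event is appended to its window's entry in the store, and a late arrival simply triggers a recompute and re-emit of that window's (absolute) result.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.samza.storage.kv.KeyValueStore;

// Recompute-on-late-arrival strategy: keep all events locally, keyed by their
// window, and re-emit the window's result whenever an event (late or not) lands in it.
public class LateArrivalHandler {
    private static final long WINDOW_MS = 60_000L; // illustrative 1-minute tumbling windows

    // key = "userId:windowStart", value = events that fell into that window
    private final KeyValueStore<String, List<String>> windowStore;

    public LateArrivalHandler(KeyValueStore<String, List<String>> windowStore) {
        this.windowStore = windowStore;
    }

    public void onEvent(String userId, long eventTimeMs, String event) {
        long windowStart = (eventTimeMs / WINDOW_MS) * WINDOW_MS;
        String windowKey = userId + ":" + windowStart;

        // 1. Store the event locally, whether it is on time or late.
        List<String> events = windowStore.get(windowKey);
        if (events == null) {
            events = new ArrayList<>();
        }
        events.add(event);
        windowStore.put(windowKey, events);

        // 2–4. Recompute the affected window and emit a new (absolute) result
        // that replaces whatever was emitted for this window earlier.
        emitResult(windowKey, computeResult(events));
    }

    private Object computeResult(List<String> events) { /* aggregation logic */ return events.size(); }

    private void emitResult(String windowKey, Object result) { /* send downstream */ }
}
```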
  30. Myth: this isn’t a problem with the Lambda Architecture…
     • Theory: since the processing happens an hour or several hours later, delays are not a problem.
     • OK… but what about the “edges”?
       – Some “sessions” start before the cut-off time for processing and end after it.
       – Delays and out-of-order processing make things worse at the edges.
  31. Essential Ingredients of Stream Processing
     1. Scale, but not at any cost
     2. Reprocessing
     3. Accuracy of results
     4. Easy programmability
  32. Easy Programmability
     • Support for “accurate” windowing/joins (Google Cloud Dataflow)
     • Ability to express workflows/DAGs in config and DSL (e.g. Storm)
     • SQL support for querying over streams – Azure Stream Insight
     • Apache Samza – working on the above
  33. Agenda
     • Stream Processing Intro
     • Scenarios
     • Canonical Architecture
     • Essential Ingredients of Stream Processing
     • Close
  34. Some scale numbers at LinkedIn
     • 1.3 trillion messages are ingested into Kafka per day
       – Each message gets consumed 4–5 times
     • Database change capture:
       – A few trillion messages get consumed per week
     • Samza jobs in production process more than 1 million messages/sec