Essential ingredients for real time stream processing @Scale by Kartik Paramasivam at Big Data Spain 2015

At LinkedIn, we ingest more than 1 trillion events per day pertaining to user behavior, application and system health, etc. into our pub-sub system (Kafka). Another source of events is the updates happening on our SQL and NoSQL databases. For example, every time a user changes their LinkedIn profile, many downstream applications need to know what happened and react to it. We have a system (Databus) which listens to changes in the database transaction logs and makes them available for downstream processing. We process ~2.1 trillion such database change events per week.

We use Apache Samza for processing these event streams in real time. In this presentation we will discuss some of the challenges we faced and the various techniques we used to overcome them.

Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.bigdataspain.org/program/thu/slot-3.html

Big Data Spain

October 22, 2015

Transcript

  1. About Me

    • ‘Streams Infrastructure’ at LinkedIn
      – Pub-sub messaging: Apache Kafka
      – Change capture from various data systems: Databus
      – Stream processing platform: Apache Samza
    • Previous
      – Built Microsoft Cloud Messaging (EventHub) and Enterprise Messaging (Queues/Topics)
      – .NET WebServices and Workflow stack
      – BizTalk Server
  2. Agenda

    • What is Stream Processing?
    • Scenarios
    • Canonical Architecture
    • Essential Ingredients of Stream Processing
    • Close
  3. Agenda

    • Stream processing Intro
    • Scenarios
    • Canonical Architecture
    • Essential Ingredients of Stream Processing
    • Close
  4. Agenda

    • Stream processing Intro
    • Scenarios
    • Canonical Architecture
    • Essential Ingredients of Stream Processing
    • Close
  5. Canonical Architecture

    Diagram labels: Clients (browser, devices, sensors ...), Services Tier, Kafka, Databus, Real Time Processing (Samza), Batch Processing (Hadoop/Spark), Voldemort R/O, Espresso, bulk upload. Stages: Ingestion, Processing, Serving.
  6. Agenda

    • Stream processing Intro
    • Scenarios
    • Canonical Architecture
    • Essential Ingredients of Stream Processing
    • Close
  7. Basics: Scaling Ingestion

    • Streams are partitioned
    • Messages are sent to partitions based on PartitionKey
    • Time-based message retention
    • e.g. Kafka, AWS Kinesis, Azure EventHub

    Diagram labels: producers, Stream A, Pkey=10, Pkey=25, Pkey=45, consumerA (machine1), consumerA (machine2).
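Key-based partitioning as described on this slide can be sketched as follows. This is a minimal illustration, not how any particular broker does it: the function names are hypothetical, and CRC32 is only a stand-in for whatever stable hash a real system uses.

```python
import zlib


def partition_for(partition_key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index using a stable hash."""
    # zlib.crc32 gives the same value in every process, unlike Python's
    # built-in hash() for strings, which is randomized per process.
    return zlib.crc32(partition_key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Append each (key, payload) message to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in messages:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


# Hypothetical messages using the partition keys shown on the slide.
messages = [("10", "a"), ("25", "b"), ("45", "c"), ("10", "d")]
partitions = route(messages, 3)
```

Because the mapping depends only on the key, all messages for one key land in the same partition and keep their relative order, which is what lets a single consumer instance own all the data for that key.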
  8. Scaling Processing, e.g. Samza

    Diagram labels: Stream A, Task 1, Task 2, Task 3, Stream B, Samza Job.
  9. Horizontal Scaling is great! But...

    • More machines means more $$
    • Need to do more with less
    • So what’s the key bottleneck during Event/Stream Processing?
  10. Key Bottleneck: “Accessing Data”

    • Big impact on CPU, Network, Disk
    • Types of Data Access
      1. Adjunct data – read-only data
      2. Scratchpad/derived data – read-write data
  11. Adjunct Data – typical access

    Diagram labels: Kafka, AdClicks, Processing Job, AdQuality update, Kafka, Member Database, Read Member Info.

    Concerns
      1. Latency
      2. CPU
      3. Network
      4. DDOS
  12. Scratchpad/Derived Data – typical access

    Diagram labels: Kafka, Sensor Data, Processing Job, Alerts, Kafka, Device State Database, Read + Update per Device Info.

    Concerns
      1. Latency
      2. CPU
      3. Network
      4. DDOS
  13. Adjunct Data – with Samza

    Diagram labels: Kafka, AdClicks, Processing Job (Task1, Task2, Task3, each with RocksDB), output, Kafka, Member Database (Espresso), Databus, Member Updates.

    Kafka, Databus, Database, Samza Job are all partitioned by MemberId.
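The idea on this slide, replacing remote database reads with a task-local copy fed by the change stream, can be sketched like this. It is an illustration only: the class, method names and the `industry` field are hypothetical, and a plain dict stands in for the embedded RocksDB store.

```python
class AdClickTask:
    """One task owning the same MemberId partition of both input streams."""

    def __init__(self):
        # Stand-in for the task's embedded key-value store (RocksDB on the slide).
        self.member_store = {}

    def on_member_update(self, member_id, profile):
        # Change-capture event: keep the local copy of the member row current.
        self.member_store[member_id] = profile

    def on_ad_click(self, member_id, ad_id):
        # Local lookup; no remote database call while processing the click.
        profile = self.member_store.get(member_id)
        if profile is None:
            return None  # no member data seen yet for this id
        return {"ad_id": ad_id, "member_id": member_id,
                "industry": profile["industry"]}


task = AdClickTask()
task.on_member_update(7, {"industry": "software"})
enriched = task.on_ad_click(7, "ad-1")
```

This only works because both streams are partitioned by MemberId, so the task that receives a member's clicks also receives that member's updates.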
  14. Fault Tolerance in a stateful Samza job

    Diagram labels: two sets of partitions P0 to P3, Task-0 to Task-3, Host-A, Host-B, Host-C, Changelog Stream.

    Stable State
  15. Fault Tolerance in a stateful Samza job

    Same diagram (Host-A, Host-B, Host-C).

    Host A dies/fails
  16. Fault Tolerance in a stateful Samza job

    Same diagram, with Host-E in place of Host-A.

    YARN allocates the tasks to a container on a different host!
  17. Fault Tolerance in a stateful Samza job

    Same diagram (Host-E, Host-B, Host-C).

    Restore local state by reading from the ChangeLog
  18. Fault Tolerance in a stateful Samza job

    Same diagram (Host-E, Host-B, Host-C).

    Back to Stable State
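The recovery sequence in slides 14 to 18 can be sketched as a store that logs every write and is rebuilt by replaying that log. This is a simplified model under stated assumptions: the class is hypothetical, a Python list stands in for a changelog partition, and `None` is used as a delete marker.

```python
class ChangelogBackedStore:
    """Local key-value state whose writes are also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # list standing in for a changelog partition
        self.state = {}

    def put(self, key, value):
        self.state[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.state.pop(key, None)
        self.changelog.append((key, None))  # None marks a deletion

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on a new host by replaying the changelog in order."""
        store = cls(changelog)
        for key, value in list(changelog):
            if value is None:
                store.state.pop(key, None)
            else:
                store.state[key] = value
        return store


changelog = []
original = ChangelogBackedStore(changelog)
original.put("device-1", "ok")
original.put("device-1", "overheating")
original.put("device-2", "ok")
original.delete("device-2")

# The original host is lost; a replacement task replays the log.
restored = ChangelogBackedStore.restore(changelog)
```

Only the latest entry per key matters for the restored state, which is why a compacted log is enough as the durable backup.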
  19. Performance Numbers with Samza

    Hardware spec: 24 cores, 1 Gig NIC, SSD

    • (Baseline) Simple pass-through job with no local state – 1.2 million msg/sec
    • Samza job with local state – 400k msg/sec
    • Samza job with local state with Kafka backup – 300k msg/sec
  20. Local State – Summary

    • Great for both read-only data and read-write data
    • Secret sauce to make local state work
      1. Change capture system: Databus/DynamoDB Streams
      2. Durable backup with Kafka log-compacted topics
  21. Why do we need it?

    • Software upgrades. Yes, bugs are a reality
    • Business logic changes
    • First-time job deployment
  22. Reprocessing Data – with Samza

    Diagram labels: Member Database (Espresso), Databus, Member Updates, bootstrap, Company/Title/Location Standardization Job, Machine Learning model, output, Kafka.
  23. Reprocessing – Caveats

    • Stream processors are fast. They can DOS the system if you reprocess
      – Control max-concurrency of your job
      – Quotas for Kafka, databases
      – Async load into databases (Project Venice)
    • Capacity
      – Reprocessing a 100 TB source?
    • Doesn’t reprocessing mean you are no longer being real-time?
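One common way to keep a reprocessing job from overwhelming a downstream system is a rate quota. The sketch below uses a token bucket as an assumed mechanism; the slide does not say how quotas are enforced at LinkedIn, and the class and parameter names are hypothetical. Time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Allow at most `rate_per_sec` operations per second, with a small burst."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def try_acquire(self, now):
        # Refill in proportion to the time elapsed since the last call.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should wait or back off before writing


bucket = TokenBucket(rate_per_sec=2, burst=2)
burst_results = [bucket.try_acquire(now=0.0) for _ in range(3)]
later_result = bucket.try_acquire(now=0.5)
```

A reprocessing task would check the bucket before each database write and pause when it returns False, trading catch-up speed for safety of the serving path.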
  24. Essential Ingredients to Stream Processing

    1. Scale, but not at any cost
    2. Reprocessing
    3. Accuracy of results
    4. Easy to program
  25. Querying over an infinite stream

    Diagram labels: User1, 1:00 pm Ad View Event, 1:01 pm Ad Click Event, Ad Quality Processor.

    Did the user click the ad within 2 minutes of seeing the ad?
  26. Why delays happen?

    Diagram labels: kartik, LB, AdViewEvent, DATACENTER 1 and DATACENTER 2, each with Services Tier, Kafka and Ad Quality Processor (Samza); Kafka is mirrored between the datacenters.
  27. Why delays happen?

    Diagram labels: kartik, LB, AdClick Event, DATACENTER 1 and DATACENTER 2, each with Services Tier, Kafka and Real Time Processing (Samza); Kafka is mirrored between the datacenters.
  28. What do we need to do to get accurate results?

    Deal with
    • Late arrivals
      – E.g. AdClick event showed up 5 minutes late
    • Out-of-order arrival
      – E.g. AdClick event showed up before AdView event
    • Influenced by “Google MillWheel”
  29. Solution

    Diagram labels: Kafka (AdClicks), Kafka (AdView), Processing Job (Task1, Task2, Task3, each with a Message Store), output, Kafka.

    1. All events are stored locally
    2. Find impacted ‘window/s’ for late arrivals
    3. Recompute result
    4. Choose strategy for emitting results (absolute or relative)
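The steps above can be sketched for the "click within 2 minutes of a view" question. This is a simplified illustration, not Samza's windowing API: the class and method names are hypothetical, dicts stand in for the local message store, and results are recomputed per (user, ad) key rather than per window.

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # "within 2 minutes of seeing the ad"


class AdQualityTask:
    """Stores every event locally and recomputes the answer on each arrival."""

    def __init__(self):
        self.views = defaultdict(list)   # (user, ad) -> view event times
        self.clicks = defaultdict(list)  # (user, ad) -> click event times

    def on_event(self, kind, user, ad, event_time):
        key = (user, ad)
        # Step 1: store the event, whatever order it arrived in.
        (self.views if kind == "view" else self.clicks)[key].append(event_time)
        # Steps 2 and 3: the impacted key is known, so recompute its result.
        return self.clicked_within_window(key)

    def clicked_within_window(self, key):
        # Compare event times, not arrival times.
        return any(
            0 <= click - view <= WINDOW_SECONDS
            for view in self.views[key]
            for click in self.clicks[key]
        )


task = AdQualityTask()
# Out-of-order arrival: the click (event time 60s) shows up before its view (0s).
before_view = task.on_event("click", "user1", "ad9", 60)
after_view = task.on_event("view", "user1", "ad9", 0)
```

The first call cannot yet match the click to a view; once the late view arrives, the stored click is picked up and the result is corrected. How the corrected result is emitted (step 4) is left out of this sketch.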
  30. Myth: This isn’t a problem with Lambda Architecture

    • Theory: Since the processing happens one or several hours later, delays are not a problem.
    • OK, but what about the “edges”?
      – Some “sessions” start before the cut-off time for processing and end after it.
      – Delays and out-of-order processing make things worse on the edges.
  31. Essential Ingredients to Stream Processing

    1. Scale, but not at any cost
    2. Reprocessing
    3. Accuracy of results
    4. Easy programmability
  32. Easy Programmability

    • Support for “accurate” windowing/joins (Google Cloud Dataflow)
    • Ability to express workflows/DAGs in config and DSL (e.g. Storm)
    • SQL support for querying over streams
      – Azure Stream Insight
    • Apache Samza – working on the above
  33. Agenda

    • Stream processing Intro
    • Scenarios
    • Canonical Architecture
    • Essential Ingredients of Stream Processing
    • Close
  34. Some scale numbers at LinkedIn

    • 1.3 trillion messages get ingested into Kafka per day
      – Each message gets consumed 4-5 times
    • Database change capture:
      – A few trillion messages get consumed per week
    • Samza jobs in production which process more than 1 million messages/sec
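For a rough sense of scale, the daily ingestion figure can be converted to an average per-second rate. This is plain arithmetic on the slide's numbers, assuming a uniform rate over the day, which real traffic will not follow.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

ingested_per_day = 1.3e12  # messages per day, from the slide
avg_ingest_per_sec = ingested_per_day / SECONDS_PER_DAY

# Each message is consumed 4-5 times, so average read traffic is 4-5x ingest.
avg_reads_per_sec_low = 4 * avg_ingest_per_sec
avg_reads_per_sec_high = 5 * avg_ingest_per_sec

print(f"ingest: {avg_ingest_per_sec / 1e6:.1f} million msg/sec")
print(f"reads:  {avg_reads_per_sec_low / 1e6:.1f} to "
      f"{avg_reads_per_sec_high / 1e6:.1f} million msg/sec")
```

That works out to roughly 15 million messages ingested per second on average, and on the order of 60 to 75 million messages read per second.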