Samza in LinkedIn: How LinkedIn Processes Billions of Events Every Day in Real Time

We are enjoying something of a renaissance in data infrastructure. The old workhorses like MySQL and Oracle still exist, but they are complemented by new specialized distributed data systems like Cassandra, Redis, Druid, and Hadoop. At the same time, what we consider data has changed too: user activity, monitoring, logging, and other event data are becoming first-class citizens for data-driven companies. Taking full advantage of all these systems and the relevant data creates a massive data integration problem. This problem is important to solve because these specialized systems are not very useful without a complete and reliable data flow.

One of the most powerful ways of solving this data integration problem is by restructuring your digital business logic around a centralized firehose of immutable events.

Once your data is captured in real time and available as real-time subscriptions, you can start to compute new data sets off these feeds as the data arrives. This style of stream processing is seen as something of a niche today, but the model is extremely powerful and general. Much of what people compute offline in systems like Hadoop can also be done in real time using a stream-processing model. On top of these real-time data feeds, we can run continual processing and transformations to derive new data feeds (which are themselves logs) and publish these in the same way. We have open sourced our stream processing layer, Apache Samza (http://samza.incubator.apache.org/), which does this.

In this talk, I will share our experience of successfully building LinkedIn’s data pipeline infrastructure around Kafka and Samza. These lessons are relevant to anyone building a data-driven company.

nehanarkhede

June 14, 2015
Transcript

  1. STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day
  2. Neha Narkhede
    - Co-founder and Head of Engineering @ Stealth Startup
    - Lead, Streams Infrastructure @ LI (Kafka & Samza)
    - Apache Kafka committer and PMC member
    - Reach out at @nehanarkhede
  3. Agenda
    - Real-time Data Integration
    - Introduction to Logs & Apache Kafka
    - Logs & Stream processing
    - Apache Samza
    - Stateful stream processing
  4. Agenda
    - Real-time Data Integration
    - Introduction to Logs & Apache Kafka
    - Logs & Stream processing
    - Apache Samza
    - Stateful stream processing
  5. Increase in diversity of data. [Timeline: 1980+, 2000+, 2010+]
    - Siloed data feeds
    - Database data (users, products, orders etc.)
    - IoT sensors
    - Events (clicks, impressions, pageviews)
    - Application logs (errors, service calls)
    - Application metrics (CPU usage, requests/sec)
  6. Explosion in diversity of systems
    - Live Systems
      - Voldemort
      - Espresso
      - GraphDB
      - Search
      - Samza
    - Batch
      - Hadoop
      - Teradata
  7. Data integration disaster. [Diagram labels: Oracle (x3), User Tracking, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Email, Voldemort (x3), Espresso (x3), Logs, Operational Metrics, Production Services, ..., Security]
  8. Centralized service. [Diagram labels: Oracle (x3), User Tracking, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec Engine & Life, Search, Email, Voldemort (x3), Espresso (x3), Logs, Operational Metrics, Production Services, ..., Security, Data Pipeline]
  9. Agenda
    - Real-time Data Integration
    - Introduction to Logs & Apache Kafka
    - Logs & Stream processing
    - Apache Samza
    - Stateful stream processing
  10. Kafka at 10,000 ft
    - Distributed from ground up
    - Persistent
    - Multi-subscriber
    - [Diagram labels: Cluster of brokers, Producers, Consumers]
  11. Key design principles
    - Scalability of a file system
      - Hundreds of MB/sec/server throughput
      - Many TBs per server
    - Guarantees of a database
      - Messages strictly ordered
      - All data persistent
    - Distributed by default
      - Replication model
      - Partitioning model
  12. Apache Kafka @ LinkedIn
    - 175 TB of in-flight log data per colo
    - Low-latency: ~1.5ms
    - Replicated to each datacenter
    - Tens of thousands of data producers
    - Thousands of consumers
    - 7 million messages written/sec
    - 35 million messages read/sec
    - Hadoop integration
  13. The data structure every systems engineer should know: Logs
  14. The Log
    - Ordered
    - Append only
    - Immutable
    - [Diagram: records at offsets 0 to 12, labeled "1st record" and "next record written"]
  15. The Log: Partitioning. [Diagram: Partition 0, Partition 1, and Partition 2, each with its own sequence of offsets starting at 0]
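One common way to spread records across the partitions on slide 15 is to hash a record key, so that all records with the same key land in the same partition and keep their relative order there. The sketch below assumes string keys and uses `String.hashCode()`; the class name `KeyPartitioner` is hypothetical, and a real Kafka client may use a different hash function.

```java
// Hypothetical helper for illustration; not Kafka's own partitioner.
final class KeyPartitioner {
    private final int numPartitions;

    KeyPartitioner(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // floorMod keeps the result in [0, numPartitions) even when hashCode() is negative.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Ordering is then guaranteed only within a partition, not across the whole log.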
  16. Logs: pub/sub done right. [Diagram: a data source writes to a log with offsets 0 to 12; Destination system A reads at time = 7 and Destination system B reads at time = 11]
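The log on slides 14 to 16 can be sketched as an append-only list in which every record gets an offset and each subscriber keeps its own read position. This is a minimal in-memory sketch, not Kafka's implementation; `AppendOnlyLog` and `LogReader` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory log: ordered, append-only, records are never modified.
final class AppendOnlyLog<T> {
    private final List<T> records = new ArrayList<>();

    // Appends a record at the end and returns its offset.
    synchronized long append(T record) {
        records.add(record);
        return records.size() - 1;
    }

    // Returns up to maxRecords records starting at fromOffset; reading removes nothing.
    synchronized List<T> read(long fromOffset, int maxRecords) {
        int start = (int) Math.max(0, Math.min(fromOffset, records.size()));
        int end = Math.min(start + Math.max(0, maxRecords), records.size());
        return Collections.unmodifiableList(new ArrayList<>(records.subList(start, end)));
    }

    // Offset at which the next record will be written.
    synchronized long endOffset() {
        return records.size();
    }
}

// Each subscriber tracks its own position; the log holds no per-reader state.
final class LogReader<T> {
    private final AppendOnlyLog<T> log;
    private long position;

    LogReader(AppendOnlyLog<T> log, long startOffset) {
        this.log = log;
        this.position = startOffset;
    }

    // Reads the next batch and advances this reader's position only.
    List<T> poll(int maxRecords) {
        List<T> batch = log.read(position, maxRecords);
        position += batch.size();
        return batch;
    }

    long position() {
        return position;
    }
}
```

Because positions live with the readers, a slow destination system simply sits at a lower offset and does not affect a fast one.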
  17. Logs for data integration. [Diagram labels: User updates profile with new job, KAFKA, Newsfeed, Search, Hadoop, Standardization engine]
  18. Agenda
    - Real-time Data Integration
    - Introduction to Logs & Apache Kafka
    - Logs & Stream processing
    - Apache Samza
    - Stateful stream processing
  19. Stream processing = f(log). [Diagram labels: Log A, Job 1, Log B]
  20. Stream processing = f(log). [Diagram labels: Log A, Job 1, Job 2, Log B, Log C, Log D, Log E]
  21. Apache Samza at LinkedIn. [Diagram labels: User updates profile with new job, KAFKA, Newsfeed, Search, Hadoop, Standardization engine]
  22. Latency spectrum of data systems. [Diagram: latency axis running from RPC / Synchronous (milliseconds), through Asynchronous processing (seconds to minutes), to Batch (hours)]
  23. Agenda
    - Real-time Data Integration
    - Introduction to Logs & Apache Kafka
    - Logs & Stream processing
    - Apache Samza
    - Stateful stream processing
  24. Samza API

        public interface StreamTask {
          void process(IncomingMessageEnvelope envelope,
                       MessageCollector collector,
                       TaskCoordinator coordinator);
        }

    - envelope: getKey(), getMsg()
    - collector: sendMsg(topic, key, value)
    - coordinator: commit(), shutdown()
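A task written against the interface on slide 24 might look like the sketch below. To keep it runnable without the Samza library, it declares small stand-in interfaces (`Envelope`, `Collector`, `Coordinator`, `SimpleStreamTask`) that follow the slide's abbreviated method names; these are assumptions for illustration, not the released Samza classes. The task and the topic name `cleaned-events` are hypothetical.

```java
// Stand-ins mirroring the slide's abbreviated names; NOT the real Samza classes.
interface Envelope {
    Object getKey();
    Object getMsg();
}

interface Collector {
    void sendMsg(String topic, Object key, Object value);
}

interface Coordinator {
    void commit();
    void shutdown();
}

interface SimpleStreamTask {
    void process(Envelope envelope, Collector collector, Coordinator coordinator);
}

// Hypothetical task: drops blank messages and forwards trimmed text under the same key.
final class NonEmptyFilterTask implements SimpleStreamTask {
    static final String OUTPUT_TOPIC = "cleaned-events"; // hypothetical topic name

    @Override
    public void process(Envelope envelope, Collector collector, Coordinator coordinator) {
        Object msg = envelope.getMsg();
        if (msg == null) {
            return; // nothing to forward
        }
        String text = msg.toString().trim();
        if (!text.isEmpty()) {
            collector.sendMsg(OUTPUT_TOPIC, envelope.getKey(), text);
        }
    }
}
```

The task is called once per incoming message and writes its output to another log, which is the "job reads a log, writes a log" shape shown earlier in the deck.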
  25. Samza Architecture (Logical view). [Diagram labels: Task 1, Task 2, Task 3; Log A (partition 0, partition 1, partition 2); Log B (partition 0, partition 1)]
  26. Samza Architecture (Logical view). [Diagram labels: Task 1, Task 2, Task 3; Log A (partition 0, partition 1, partition 2); Log B (partition 0, partition 1); Samza container 1, Samza container 2]
  27. Samza Architecture (Physical view). [Diagram labels: Samza container 1, Samza container 2, Host 1, Host 2]
  28. Samza Architecture (Physical view). [Diagram labels: Samza container 1, Samza container 2, Host 1, Host 2, Samza YARN AM, Node manager (x2)]
  29. Samza Architecture (Physical view). [Diagram labels: Samza container 1, Samza container 2, Host 1, Host 2, Samza YARN AM, Node manager (x2), Kafka (x2)]
  30. Samza Architecture: Equivalence to Map Reduce. [Diagram labels: Map Reduce (x2), YARN AM, Node manager (x2), HDFS (x2), Host 1, Host 2]
  31. M/R Operation Primitives
    - Filter: records matching some condition
    - Map: record = f(record)
    - Join: two or more datasets by key
    - Group: records with same key
    - Aggregate: f(records within the same group)
    - Pipe: job 1’s output => job 2’s input
  32. M/R Operation Primitives on streams
    - Filter: records matching some condition
    - Map: record = f(record)
    - Join: two or more datasets by key
    - Group: records with same key
    - Aggregate: f(records within the same group)
    - Pipe: job 1’s output => job 2’s input
    - [Slide annotation: Requires state maintenance]
  33. Agenda
    - Real-time Data Integration
    - Introduction to Logs & Apache Kafka
    - Logs & Stream processing
    - Apache Samza
    - Stateful stream processing
  34. Example: Newsfeed
    - Status update log: User 567 posted "Hello World"; User 989 posted "Blah Blah"; User ... posted "..."
    - Fan out messages to followers
    - External connection DB: 567 -> [123, 679, 789, ...]; 999 -> [156, 343, ...]
    - Push notification log: Refresh user 123's newsfeed; Refresh user 679's newsfeed; Refresh user ...'s newsfeed
  35. Local state vs Remote state: Remote
    - ❌ Performance
    - ❌ Isolation
    - ❌ Limited APIs
    - [Diagram labels: Samza task partition 0 (100-500K msg/sec/node), Samza task partition 1 (100-500K msg/sec/node), Remote state on Disk (1-5K queries/sec ??), ex: Cassandra, MongoDB, etc]
  36. Local state: Bring data closer to computation. [Diagram labels: Samza task partition 0, Samza task partition 1, Local LevelDB/RocksDB (x2)]
  37. Local state: Bring data closer to computation. [Diagram labels: Samza task partition 0, Samza task partition 1, Local LevelDB/RocksDB (x2), Disk, Change log stream]
  38. Example Revisited: Newsfeed
    - Status update log: User 567 posted "Hello World"; User 989 posted "Blah Blah"; User ... posted "..."
    - New connection log: User 123 followed 567; User 890 followed 234; User ... followed ...
    - Fan out messages to followers, using local state: 567 -> [123, 679, 789, ...]; 999 -> [156, 343, ...]
    - Push notification log: Refresh user 123's newsfeed; Refresh user 679's newsfeed; Refresh user ...'s newsfeed
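The revisited newsfeed example can be sketched as one task that consumes two logs: "follow" events update a local follower table, and status updates are fanned out using that table, with no remote lookup. This is a minimal sketch in which a `HashMap` stands in for the local LevelDB/RocksDB store; `FanOutTask` and its method names are hypothetical. It also assumes both logs are partitioned by the followed user's id, so the task that holds a user's follower list also receives that user's posts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical task joining a status update log against locally held connection state.
final class FanOutTask {
    // Local state: followed user id -> ids of that user's followers (insertion-ordered).
    private final Map<Integer, Set<Integer>> followersByUser = new HashMap<>();

    // Handles one record from the new connection log, e.g. "User 123 followed 567".
    void onFollow(int followerId, int followedId) {
        followersByUser.computeIfAbsent(followedId, id -> new LinkedHashSet<>()).add(followerId);
    }

    // Handles one record from the status update log and returns the messages
    // that would be written to the push notification log.
    List<String> onStatusUpdate(int authorId, String text) {
        List<String> notifications = new ArrayList<>();
        Set<Integer> followers =
                followersByUser.getOrDefault(authorId, Collections.<Integer>emptySet());
        for (int followerId : followers) {
            notifications.add("Refresh user " + followerId + "'s newsfeed");
        }
        return notifications;
    }
}
```

Every lookup is a local read, so fan-out throughput is no longer bounded by the query rate of a remote database.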
  39. Fault tolerance? [Diagram labels: Samza container 1, Samza container 2, Host 1, Host 2, Samza YARN AM, Node manager (x2), Kafka (x2)]
  40. Fault tolerance in Samza. [Diagram labels: Samza task partition 0, Samza task partition 1, Local LevelDB/RocksDB (x2), Durable change log]
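The idea on slide 40 can be sketched as a store that appends every write to a durable change log and rebuilds its local contents by replaying that log after a failure. In this minimal sketch a `List` stands in for the durable Kafka topic and a `HashMap` for the local LevelDB/RocksDB store; `ChangeLoggedStore` is a hypothetical name, not a Samza class.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical key-value store backed by a change log of {key, value} entries.
final class ChangeLoggedStore {
    private final Map<String, String> local = new HashMap<>();
    private final List<String[]> changeLog; // stand-in for a durable, replicated log

    // Restores local state by replaying the change log from the beginning.
    ChangeLoggedStore(List<String[]> changeLog) {
        this.changeLog = changeLog;
        for (String[] entry : changeLog) {
            local.put(entry[0], entry[1]); // later entries overwrite earlier ones
        }
    }

    // Records the change in the log first, then applies it locally.
    void put(String key, String value) {
        changeLog.add(new String[] {key, value});
        local.put(key, value);
    }

    // Reads are served from local state only.
    String get(String key) {
        return local.get(key);
    }
}
```

If the host running a task is lost, a replacement task on another host can construct a new store over the same change log and continue from the restored state.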
  41. Slow jobs. [Diagram labels: Log A, Job 1, Job 2, Log B, Log C, Log D, Log E]
    - ❌ Drop data
    - ❌ Backpressure
    - ❌ Queue
    - ❌ In memory
    - ✅ On disk (KAFKA)
  42. Summary
    - Real-time data integration is crucial for the success and adoption of stream processing
    - Logs form the basis for real-time data integration
    - Stream processing = f(logs)
    - Samza differentiator => performance & fault-tolerant stateful stream processing
  43. Thank you!
    - Logs
      - http://bit.ly/the_log
    - Apache Kafka
      - http://kafka.apache.org
    - Apache Samza
      - http://samza.incubator.apache.org
    - Me
      - @nehanarkhede
      - http://www.linkedin.com/in/nehanarkhede