Liquid: Unifying nearline and offline big data integration

With more sophisticated data-parallel processing systems, the new bottleneck in data-intensive companies shifts from the back-end data systems to the data integration stack, which is responsible for the pre-processing of data for back-end applications. The use of back-end data systems with different access latencies and data integration requirements poses new challenges that current data integration stacks based on distributed file systems (proposed a decade ago for batch-oriented processing) cannot address.
In this paper, we describe Liquid, a data integration stack that provides low latency data access to support near real-time in addition to batch applications. It supports incremental processing, and is cost-efficient and highly available. Liquid has two layers: a processing layer based on a stateful stream processing model, and a messaging layer with a highly-available publish/subscribe system. We report our experience of a Liquid deployment with back-end data systems at LinkedIn, a data-intensive company with over 300 million users.

nehanarkhede

January 06, 2015

Transcript

  1. 2.

    Neha Narkhede

    - Co-founder and Head of Engineering @ Confluent
    - Prior to this…
      - Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza)
      - Apache Kafka committer and PMC member
    - Reach out at @nehanarkhede
  2. 3.

    Agenda

    - Introduction
    - Evolution of data integration
    - Current state of support for nearline systems
    - Liquid
      - Messaging layer
      - Processing layer
    - Summary
  3. 4.

    Latency spectrum of data systems

    [Figure: a latency axis labeled Synchronous (milliseconds, RPC), Asynchronous processing (seconds to minutes), and Batch (hours), with the annotation "Need for data integration".]
  4. 5.

    Nearline processing at LinkedIn

    [Figure: a user updates their profile with a new job. Labeled components: Liquid, Newsfeed, Search, Hadoop, Standardization engine, ...]
  5. 7.

    Increase in diversity of data

    [Figure: a timeline labeled 1980+, 2000+, 2010+, starting from siloed data feeds.]

    - Database data (users, products, orders, etc.)
    - IoT sensors
    - Events (clicks, impressions, pageviews)
    - Application logs (errors, service calls)
    - Application metrics (CPU usage, requests/sec)
  6. 8.

    Explosion in diversity of systems

    - Live systems
      - Voldemort
      - Espresso
      - GraphDB
      - Search
      - Samza
    - Batch
      - Hadoop
      - Teradata
  7. 11.

    Problems

    - Data warehouse is a batch system
    - Relational mapping is non-trivial
    - Organizationally unscalable
      - Central team that cleans all data?
      - Labor intensive
      - Doesn’t give good data coverage
  8. 13.

    Problem

    - Rise in nearline processing systems requires low latency data integration
    - Data integration stack is the new bottleneck
    - MR/DFS based data integration stack does not address all processing systems
      - Latency
      - Incremental processing is not straightforward
  9. 16.

    Support for nearline systems

    - Bypass the data integration layer
      - Point-to-point connections
      - Reduces reusability of data
    - Architectural patterns
      - Lambda architecture
      - Kappa architecture
  10. 17.

    Point-to-Point Pipelines

    [Figure: systems wired directly to one another. Labeled systems: Oracle (x3), Voldemort (x3), Espresso (x3), User Tracking, Logs, Operational Metrics, Production Services, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Email, Security, ...]
  11. 18.

    Centralized Pipeline

    [Figure: the same systems connected through a single central Data Pipeline. Labeled systems: Oracle (x3), Voldemort (x3), Espresso (x3), User Tracking, Logs, Operational Metrics, Production Services, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec Engine, Search, Email, Security, ...]
  14. 23.
  15. 24.

    Liquid Architecture

    [Figure: two layers. Processing Layer (Apache Samza): jobs, including a stateful job with its state, each made up of tasks, with an input feed and an output feed. Messaging Layer (Apache Kafka): topics, each divided into partitions. Labeled arrows: Data In, Data Out.]
  16. 25.

    Key takeaways

    1. Central commit log for all data
    2. Push data cleanliness upstream
    3. O(1) ETL
  17. 26.

    What is a commit log?

    [Figure: a commit log drawn as numbered slots 0 to 12, from the 1st record to the slot where the next record is written. Readers labeled Table, Index, Materialized view, ... are shown with positions Seq: 10, Seq: 12, Seq: 12.]
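The commit log on this slide can be sketched in a few lines of Java. This is a minimal in-memory illustration, not Kafka's implementation: `CommitLog` and `LogReader` are hypothetical names, and the reader positions used in the usage note are chosen only to mirror the sequence numbers shown in the figure.

```java
import java.util.ArrayList;
import java.util.List;

/** Append-only log: each record gets the next sequence number; nothing is overwritten. */
final class CommitLog<T> {
    private final List<T> records = new ArrayList<>();

    /** Appends a record and returns its sequence number (0 for the first record). */
    long append(T record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Sequence number that the next appended record will receive. */
    long nextSeq() {
        return records.size();
    }

    /** Returns a copy of all records with sequence number >= fromSeq, in log order. */
    List<T> readFrom(long fromSeq) {
        return new ArrayList<>(records.subList((int) fromSeq, records.size()));
    }
}

/** A subscriber (table, index, materialized view) that tracks its own position. */
final class LogReader<T> {
    private final CommitLog<T> log;
    private long position; // sequence number of the next record to apply

    LogReader(CommitLog<T> log) {
        this.log = log;
    }

    /** Consumes up to maxRecords pending records and advances the position. */
    List<T> poll(int maxRecords) {
        List<T> pending = log.readFrom(position);
        List<T> batch = new ArrayList<>(pending.subList(0, Math.min(maxRecords, pending.size())));
        position += batch.size();
        return batch;
    }

    long position() {
        return position;
    }
}
```

With 13 records appended, a reader that has polled 10 records sits at position 10 while another that has polled 12 sits at position 12. Each reader advances independently because the position belongs to the reader, not to the log.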
  18. 27.

    Liquid: Messaging Layer

    [Figure: three producers publish to three brokers (Partitioned Data Publication). Consumer One and Consumer Two read from the brokers (Ordered Subscription).]
  19. 28.

    Liquid: Messaging Layer

    - Topic-based publish/subscribe
    - Backed by a distributed commit log
    - Rewindability
    - Multi-subscriber
    - Performance
      - High throughput: 50 MB/s writes, 110 MB/s reads
      - Low latency: 1.5 ms writes
    - At LinkedIn: 500 billion messages/day
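The three properties named on this slide (partitioned topics, multiple subscribers, rewindability) can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not the Kafka client API: `Topic` and `Subscriber` are hypothetical classes, and routing by key hash is one common partitioning choice rather than something the slide specifies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a topic split into ordered partitions. */
final class Topic {
    private final List<List<String>> partitions = new ArrayList<>();

    Topic(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    /** Routes by key so that all messages for one key land in one ordered partition. */
    int publish(String key, String value) {
        int p = Math.floorMod(key.hashCode(), partitions.size());
        partitions.get(p).add(value);
        return p;
    }

    /** Returns a copy of the messages in a partition from the given offset onward. */
    List<String> read(int partition, int fromOffset) {
        List<String> log = partitions.get(partition);
        return new ArrayList<>(log.subList(Math.min(fromOffset, log.size()), log.size()));
    }
}

/** Each subscriber keeps its own offsets, so subscribers never affect one another. */
final class Subscriber {
    private final Topic topic;
    private final Map<Integer, Integer> offsets = new HashMap<>();

    Subscriber(Topic topic) {
        this.topic = topic;
    }

    /** Returns everything published to the partition since this subscriber's last poll. */
    List<String> poll(int partition) {
        int from = offsets.getOrDefault(partition, 0);
        List<String> batch = topic.read(partition, from);
        offsets.put(partition, from + batch.size());
        return batch;
    }

    /** Rewinds (or skips) to an offset, for example to reprocess data after a bug fix. */
    void seek(int partition, int offset) {
        offsets.put(partition, offset);
    }
}
```

Because messages stay in the log after being read, a second subscriber sees the same data as the first, and calling `seek(partition, 0)` replays the partition from the beginning.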
  21. 30.

    Metadata

    - Problem
      - Hundreds of message types
      - Thousands of fields
      - What do they all mean?
      - What happens when they change?
    - Solution
      - Need a formal contract (Avro)
      - Central repository of all schemas
      - Programmatic compatibility model
      - Reader always uses same schema as writer
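A "programmatic compatibility model" backed by a central schema repository might look like the sketch below. It is deliberately simplified and does not use the Avro library: `SchemaField` and `SchemaRegistry` are hypothetical, and the single rule checked here (a newly added field needs a default, and an existing field may not change type) is only a small subset of Avro's actual schema resolution rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One field of a record schema: a type name and whether a default value exists. */
final class SchemaField {
    final String type;
    final boolean hasDefault;

    SchemaField(String type, boolean hasDefault) {
        this.type = type;
        this.hasDefault = hasDefault;
    }
}

/** Hypothetical central registry holding the schema history for each message type. */
final class SchemaRegistry {
    private final Map<String, List<Map<String, SchemaField>>> versions = new LinkedHashMap<>();

    /** True if a reader using `next` can decode data written with `previous`. */
    static boolean canRead(Map<String, SchemaField> next, Map<String, SchemaField> previous) {
        for (Map.Entry<String, SchemaField> e : next.entrySet()) {
            SchemaField old = previous.get(e.getKey());
            if (old == null) {
                if (!e.getValue().hasDefault) {
                    return false; // a newly added field needs a default
                }
            } else if (!old.type.equals(e.getValue().type)) {
                return false; // simplification: no type promotion allowed
            }
        }
        return true; // fields that the reader dropped are simply ignored
    }

    /** Registers a schema and returns its version number, or -1 if it is rejected. */
    int register(String messageType, Map<String, SchemaField> schema) {
        List<Map<String, SchemaField>> history =
                versions.computeIfAbsent(messageType, t -> new ArrayList<>());
        for (Map<String, SchemaField> earlier : history) {
            if (!canRead(schema, earlier)) {
                return -1;
            }
        }
        history.add(schema);
        return history.size();
    }
}
```

Rejecting an incompatible schema at registration time is what pushes the check upstream: the producer finds out before any consumer sees data it cannot decode.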
  23. 32.

    Automated ETL

    - Map/Reduce job does data load
    - One job loads all events
    - Hive registration done automatically
    - Schema changes handled transparently
    - ~5 minute lag on average to HDFS
  24. 33.

    Liquid: Processing Layer

    - ETL-as-a-service
      - Resource isolation
      - Processing job isolation
    - Incremental processing
    - Fault tolerant stateful processing
    - Apache Samza
  25. 34.

    Processing API

    public interface StreamTask {
      void process(IncomingMessageEnvelope envelope,  // getKey(), getMsg()
                   MessageCollector collector,        // sendMsg(topic, key, value)
                   TaskCoordinator coordinator);      // commit(), shutdown()
    }
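To show how a task written against this shape of API might look, here is a self-contained sketch. The `Envelope`, `Collector`, and `Coordinator` interfaces are stand-ins that use the slide's shorthand method names; they are not the real Samza classes, whose signatures differ. The `TitleStandardizer` task, its output topic name, and the commit interval are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-ins for the three arguments on the slide; NOT the real Samza classes.
interface Envelope { String getKey(); String getMsg(); }
interface Collector { void sendMsg(String topic, String key, String value); }
interface Coordinator { void commit(); void shutdown(); }

/** Simplified counterpart of the StreamTask interface shown on the slide. */
interface SimpleStreamTask {
    void process(Envelope envelope, Collector collector, Coordinator coordinator);
}

/** Normalizes free-text job titles and republishes them to an output topic. */
final class TitleStandardizer implements SimpleStreamTask {
    static final String OUTPUT_TOPIC = "standardized-titles"; // assumed topic name
    private int sinceCommit = 0;

    @Override
    public void process(Envelope envelope, Collector collector, Coordinator coordinator) {
        String cleaned = envelope.getMsg().trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
        if (cleaned.isEmpty()) {
            return; // skip blank titles
        }
        collector.sendMsg(OUTPUT_TOPIC, envelope.getKey(), cleaned);
        if (++sinceCommit >= 100) { // ask for a checkpoint every 100 outputs
            coordinator.commit();
            sinceCommit = 0;
        }
    }
}

/** Collector that records what was sent, so the task can be tried without a broker. */
final class RecordingCollector implements Collector {
    final List<String> sent = new ArrayList<>();

    @Override
    public void sendMsg(String topic, String key, String value) {
        sent.add(topic + "|" + key + "|" + value);
    }
}
```

The task handles one message at a time and keeps the input key on the output, so messages for the same member stay together in the output topic's partitioning.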
  26. 35.

    Processing Layer Architecture

    [Figure: Task 1, Task 2, and Task 3 read from Log A (partitions 0, 1, 2) and Log B (partitions 0, 1).]
  27. 36.

    Resource isolation

    [Figure: Task 1, Task 2, and Task 3 read from Log A (partitions 0, 1, 2) and Log B (partitions 0, 1). The tasks are grouped into Samza container 1 and Samza container 2.]
  28. 37.

    Processing isolation

    [Figure: Job 1 and Job 2 connected through Log A, Log B, Log C, Log D, and Log E.]

    - ❌ Drop data
    - ❌ Backpressure
    - ❌ Queue
    - ❌ In memory
    - ✅ On disk (Kafka)
  29. 38.

    Incremental processing

    - Data access at message granularity using log sequence number
    - Periodically checkpoint position in a special feed stored by the messaging layer
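The checkpointing idea on this slide can be sketched as follows. This is an in-memory illustration under stated assumptions, not Samza's checkpoint mechanism: `IncrementalJob` is hypothetical, a plain list stands in for both the input partition and the special checkpoint feed, and the fixed checkpoint interval is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Processes an input feed incrementally and checkpoints its position to a second feed. */
final class IncrementalJob {
    private final List<String> input;        // stand-in for one input partition
    private final List<Long> checkpointFeed; // stand-in for the special checkpoint feed
    private final int checkpointEvery;
    final List<String> processedMessages = new ArrayList<>();

    IncrementalJob(List<String> input, List<Long> checkpointFeed, int checkpointEvery) {
        this.input = input;
        this.checkpointFeed = checkpointFeed;
        this.checkpointEvery = checkpointEvery;
    }

    /** Position to resume from: the last checkpointed sequence number, or 0 if none. */
    long resumePosition() {
        return checkpointFeed.isEmpty() ? 0L : checkpointFeed.get(checkpointFeed.size() - 1);
    }

    /** Processes at most maxMessages from the last checkpoint; returns the count processed. */
    int run(int maxMessages) {
        long pos = resumePosition();
        int processed = 0;
        while (pos < input.size() && processed < maxMessages) {
            processedMessages.add(input.get((int) pos));
            pos++;
            processed++;
            if (pos % checkpointEvery == 0) {
                checkpointFeed.add(pos); // record the next sequence number to read
            }
        }
        return processed;
    }
}
```

If a job stops between checkpoints, a restarted job resumes from the last recorded position and reprocesses the messages after it. In this sketch that gives at-least-once processing, so downstream effects need to tolerate repeats.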
  30. 39.

    Fault tolerant stateful processing

    [Figure: a Samza task for partition 0 and a Samza task for partition 1, each with its own local LevelDB/RocksDB store, both connected to a durable change log.]
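The pairing of a local store with a durable change log can be sketched like this. It is a minimal model, not Samza's state API: `ChangeloggedStore` is a hypothetical class, a `HashMap` stands in for LevelDB/RocksDB, and a list stands in for the durable change log feed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Local key-value state whose every write is also appended to a durable change log. */
final class ChangeloggedStore {
    private final Map<String, Long> local = new HashMap<>(); // stand-in for LevelDB/RocksDB
    private final List<String[]> changelog;                   // stand-in for a durable feed

    /** Builds the local state by replaying the change log from the beginning. */
    ChangeloggedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) {
            local.put(entry[0], Long.parseLong(entry[1])); // later entries overwrite earlier ones
        }
    }

    long get(String key) {
        return local.getOrDefault(key, 0L);
    }

    void put(String key, long value) {
        changelog.add(new String[] {key, Long.toString(value)}); // log first, then apply
        local.put(key, value);
    }

    /** Example stateful operation: a per-key counter. */
    long increment(String key) {
        long next = get(key) + 1;
        put(key, next);
        return next;
    }
}
```

When a task is restarted on another host, constructing a new store over the same change log rebuilds the local state, which is why reads can stay local while the state still survives machine failure.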
  31. 40.

    Summary

    - Data integration needs have evolved with the rise of nearline processing
    - DFS based data integration stacks fall short, leading to anti-patterns
    - Liquid
      - Data integration stack rethought
      - Provides ETL-as-a-service for near real-time as well as batch applications
      - Built on top of Apache Kafka and Apache Samza
  32. 41.

    Thank you!

    - The Log
      - http://bit.ly/the_log
    - Apache Kafka
      - http://kafka.apache.org
    - Apache Samza
      - http://samza.incubator.apache.org
    - Me
      - @nehanarkhede
      - http://www.linkedin.com/in/nehanarkhede
  33. 44.

    Liquid use cases

    - Data cleaning and normalization
    - Site speed monitoring
    - Call graph assembly
    - Operational analysis
  34. 45.

    Performance Tricks

    - Batching
      - Producer
      - Broker
      - Consumer
    - Avoid large in-memory structures
      - Pagecache friendly
    - Avoid data copying
      - sendfile
    - Batch compression
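The "avoid data copying" item refers to the `sendfile` system call. In Java, the closest standard API is `FileChannel.transferTo`, which on many operating systems can move bytes from the file system cache to the target channel without copying them through the application. The sketch below is an illustration only: `ZeroCopy.send` is a hypothetical helper, not code from Kafka, and whether the transfer is truly zero-copy depends on the platform and the target channel.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ZeroCopy {
    /** Sends up to `count` bytes of a file, starting at `position`, to the target channel. */
    static long send(Path segment, long position, long count, WritableByteChannel target)
            throws IOException {
        try (FileChannel source = FileChannel.open(segment, StandardOpenOption.READ)) {
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done or end of file.
            while (sent < count) {
                long n = source.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break;
                }
                sent += n;
            }
            return sent;
        }
    }
}
```

In a broker the target would typically be a socket channel for a consumer connection. The helper accepts any `WritableByteChannel`, so it can also be tried against a file.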
  35. 46.

    Usage @ LinkedIn

    - 175 TB of in-flight log data per colo
    - Low latency: ~1.5 ms
    - Replicated to each datacenter
    - Tens of thousands of data producers
    - Thousands of consumers
    - 7 million messages written/sec
    - 35 million messages read/sec
    - Hadoop integration
  36. 47.

    Samza Architecture (Physical view)

    [Figure: Host 1 and Host 2, each running a YARN node manager, a Samza container (container 1 and container 2), and Kafka. A Samza YARN AM is also shown.]
  37. 48.

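    Samza Architecture: Equivalence to Map Reduce

    [Figure: the same two-host layout with Map Reduce in place of the Samza containers and HDFS in place of Kafka: Host 1 and Host 2, each with a node manager, Map Reduce, and HDFS, plus a YARN AM.]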