Liquid: Unifying nearline and offline big data integration


As data-parallel processing systems grow more sophisticated, the bottleneck in data-intensive companies shifts from the back-end data systems to the data integration stack, which is responsible for pre-processing data for back-end applications. The use of back-end data systems with different access latencies and data integration requirements poses new challenges that current data integration stacks based on distributed file systems (proposed a decade ago for batch-oriented processing) cannot address.
In this paper, we describe Liquid, a data integration stack that provides low-latency data access to support near real-time in addition to batch applications. It supports incremental processing, and is cost-efficient and highly available. Liquid has two layers: a processing layer based on a stateful stream processing model, and a messaging layer with a highly available publish/subscribe system. We report our experience of a Liquid deployment with back-end data systems at LinkedIn, a data-intensive company with over 300 million users.


nehanarkhede

January 06, 2015

Transcript

  1. LIQUID Unifying nearline and offline big data integration

  2. Neha Narkhede
     - Co-founder and Head of Engineering @ Confluent
     - Prior to this...
       - Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza)
       - Apache Kafka committer and PMC member
     - Reach out at @nehanarkhede
  3. Agenda
     - Introduction
     - Evolution of data integration
     - Current state of support for nearline systems
     - Liquid
       - Messaging layer
       - Processing layer
     - Summary
  4. Latency spectrum of data systems
     - Synchronous (milliseconds): RPC
     - Asynchronous processing (seconds to minutes)
     - Batch (hours)
     Data integration is needed across this spectrum.
  5. Nearline processing at LinkedIn
     [Diagram: a user updates their profile with a new job; the event flows through Liquid to a standardization engine and on to Newsfeed, Search, Hadoop, and other systems]
  6. 2 trends impact data integration

  7. Increase in diversity of data (1980+ -> 2000+ -> 2010+)
     - Siloed data feeds
     - Database data (users, products, orders, etc.)
     - Events (clicks, impressions, pageviews)
     - Application logs (errors, service calls)
     - Application metrics (CPU usage, requests/sec)
     - IoT sensors
  8. Explosion in diversity of systems
     - Live systems: Voldemort, Espresso, GraphDB, Search, Samza
     - Batch: Hadoop, Teradata
  9. Evolution of data integration

  10. The Enterprise Data Warehouse

  11. Problems
     - Data warehouse is a batch system
     - Relational mapping is non-trivial
     - Organizationally unscalable
       - Central team that cleans all data?
       - Labor intensive
       - Doesn't give good data coverage
  12. DFS based data integration

  13. Problem
     - Rise in nearline processing systems requires low-latency data integration
     - Data integration stack is the new bottleneck
     - MR/DFS-based data integration stack does not address all processing systems
       - Latency
       - Incremental processing is not straightforward
  14. Current state of support for nearline systems

  15. Data integration anti-pattern: each system has its own source of truth
  16. Support for nearline systems
     - Bypass the data integration layer
       - Point-to-point connections
       - Reduces reusability of data
     - Architectural patterns
       - Lambda architecture
       - Kappa architecture
  17. Point-to-point pipelines
     [Diagram: every source (Oracle, Espresso, Voldemort, user tracking, logs, operational metrics, production services, ...) wired directly to every destination (Hadoop, log search, monitoring, data warehouse, social graph, rec. engine, search, email, security, ...)]
  18. Centralized pipeline
     [Diagram: the same sources and destinations as slide 17, but all connected through a single central data pipeline]
  19. Support for nearline systems
     - Bypass the data integration layer
       - Point-to-point connections
       - Reduces reusability of data
     - Architectural patterns
       - Lambda architecture
       - Kappa architecture
  20. Lambda Architecture

  21. Support for nearline systems
     - Bypass the data integration layer
       - Point-to-point connections
       - Reduces reusability of data
     - Architectural patterns
       - Lambda architecture
       - Kappa architecture
  22. Kappa Architecture

  23. Liquid

  24. Liquid architecture
     [Diagram: a processing layer (Apache Samza) of jobs, stateful jobs, and their tasks on top of a messaging layer (Apache Kafka) of partitioned topics; input feeds carry data in, output feeds carry data out]
  25. Key takeaways
     1. Central commit log for all data
     2. Push data cleanliness upstream
     3. O(1) ETL
  26. What is a commit log?
     [Diagram: an append-only log with sequence numbers 0-12; the 1st record at position 0, the next record written at the head; a table (Seq: 10), an index (Seq: 12), and a materialized view (Seq: 12) each track their own position in the log]
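The structure on this slide can be sketched in a few lines of Java (a toy in-memory sketch with illustrative names, not Kafka's on-disk implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Toy append-only commit log: records get monotonically increasing
// sequence numbers (offsets); readers can consume from any offset.
class CommitLog {
    private final List<String> records = new ArrayList<>();

    // Append a record and return its sequence number.
    public synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read all records from the given offset (inclusive) to the head.
    public synchronized List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }

    // The offset at which the next record will be written.
    public synchronized long nextOffset() {
        return records.size();
    }
}
```

A table, an index, and a materialized view each remember their own Seq position and replay forward from it, which is what lets several downstream systems stay consistent with one upstream log.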
  27. Liquid: Messaging Layer
     [Diagram: producers publish partitioned data to a cluster of brokers; consumers one and two each read an ordered subscription]
  28. Liquid: Messaging Layer
     - Topic-based publish/subscribe
     - Backed by a distributed commit log
     - Rewindability
     - Multi-subscriber
     - Performance
       - High throughput: 50 MB/s writes, 110 MB/s reads
       - Low latency: 1.5 ms writes
     - At LinkedIn: 500 billion messages/day
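Rewindability and multi-subscriber reads fall out of the log-plus-offsets model: each consumer owns its own position, so rewinding is just resetting an offset. A toy Java sketch (illustrative names; this is not the Kafka client API):

```java
import java.util.*;

// Toy topic-based publish/subscribe backed by in-memory logs.
class MessageBus {
    private final Map<String, List<String>> topics = new HashMap<>();

    // Append a message to a topic's log and return its offset.
    public synchronized long publish(String topic, String msg) {
        List<String> log = topics.computeIfAbsent(topic, t -> new ArrayList<>());
        log.add(msg);
        return log.size() - 1;
    }

    // Return all messages from the given offset to the head.
    public synchronized List<String> fetch(String topic, long offset) {
        List<String> log = topics.getOrDefault(topic, List.of());
        return new ArrayList<>(log.subList((int) offset, log.size()));
    }
}

// Each consumer tracks its own offset per topic, so any number of
// consumers can read the same data independently.
class Consumer {
    private final MessageBus bus;
    private final Map<String, Long> offsets = new HashMap<>();

    Consumer(MessageBus bus) { this.bus = bus; }

    // Fetch new messages and advance this consumer's offset.
    public List<String> poll(String topic) {
        long offset = offsets.getOrDefault(topic, 0L);
        List<String> msgs = bus.fetch(topic, offset);
        offsets.put(topic, offset + msgs.size());
        return msgs;
    }

    // Rewind to an earlier offset to re-read history.
    public void seek(String topic, long offset) {
        offsets.put(topic, offset);
    }
}
```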
  29. Key takeaways
     1. Central commit log for all data
     2. Push data cleanliness upstream
     3. O(1) ETL
  30. Metadata
     - Problem
       - Hundreds of message types
       - Thousands of fields
       - What do they all mean?
       - What happens when they change?
     - Solution
       - Need a formal contract (Avro)
       - Central repository of all schemas
       - Programmatic compatibility model
       - Reader always uses the same schema as the writer
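"Reader always uses the same schema as the writer" is typically implemented by tagging each message with the id of its schema in the central repository. A toy Java sketch of that idea (illustrative names; the real system stores Avro schemas and also checks compatibility programmatically):

```java
import java.util.ArrayList;
import java.util.List;

// Toy central schema registry: writers register a schema once and tag
// every message with its schema id; readers resolve the id back to the
// exact schema the writer used.
class SchemaRegistry {
    private final List<String> schemas = new ArrayList<>();

    // Register a schema, returning its id; re-registering is idempotent.
    public synchronized int register(String schema) {
        int existing = schemas.indexOf(schema);
        if (existing >= 0) return existing;
        schemas.add(schema);
        return schemas.size() - 1;
    }

    // Resolve a schema id back to the schema text.
    public synchronized String lookup(int id) {
        return schemas.get(id);
    }
}

// A message carries the id of the schema it was written with.
class TaggedMessage {
    final int schemaId;
    final String payload;
    TaggedMessage(int schemaId, String payload) {
        this.schemaId = schemaId;
        this.payload = payload;
    }
}
```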
  31. Key takeaways
     1. Central commit log for all data
     2. Push data cleanliness upstream
     3. O(1) ETL
  32. Automated ETL
     - Map/Reduce job does the data load
     - One job loads all events
     - Hive registration done automatically
     - Schema changes handled transparently
     - ~5 minute lag on average to HDFS
  33. Liquid: Processing Layer
     - ETL-as-a-service
       - Resource isolation
       - Processing job isolation
     - Incremental processing
     - Fault-tolerant stateful processing
     - Apache Samza
  34. Processing API

    public interface StreamTask {
      void process(IncomingMessageEnvelope envelope,  // getKey(), getMsg()
                   MessageCollector collector,        // sendMsg(topic, key, value)
                   TaskCoordinator coordinator);      // commit(), shutdown()
    }
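To see how the interface is used, here is a small self-contained example. It defines minimal stand-ins for the Samza types on this slide (assumed, simplified signatures, so it runs without the framework) plus a task that keeps a running count per key:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the framework types (assumptions, not the
// real Samza interfaces).
interface IncomingMessageEnvelope { Object getKey(); Object getMsg(); }
interface MessageCollector { void sendMsg(String topic, Object key, Object value); }
interface TaskCoordinator { void commit(); void shutdown(); }

interface StreamTask {
    void process(IncomingMessageEnvelope envelope,
                 MessageCollector collector,
                 TaskCoordinator coordinator);
}

// Example task: counts messages per key and emits the running count
// to a downstream "counts" topic on every message.
class CountingTask implements StreamTask {
    private final Map<Object, Integer> counts = new HashMap<>();

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        int n = counts.merge(envelope.getKey(), 1, Integer::sum);
        collector.sendMsg("counts", envelope.getKey(), n);
    }

    public int countFor(Object key) {
        return counts.getOrDefault(key, 0);
    }
}
```

The framework drives the loop: it deserializes each record into an envelope, invokes process, and routes anything the task emits via the collector back into the messaging layer.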
  35. Processing Layer Architecture
     [Diagram: tasks 1-3 consume from Log A (partitions 0-2) and Log B (partitions 0-1)]
  36. Resource isolation
     [Diagram: the same tasks and logs as slide 35, with the tasks grouped into Samza container 1 and Samza container 2]
  37. Processing isolation
     [Diagram: jobs 1 and 2 consume Log A and produce Logs B-E; when a downstream job lags, the options are: drop data (no), apply backpressure (no), queue in memory (no), queue on disk in Kafka (yes)]
  38. Incremental processing
     - Data access at message granularity using the log sequence number
     - Periodically checkpoint position in a special feed stored by the messaging layer
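A minimal sketch of that checkpointing loop, with a plain list standing in for the checkpoint feed (illustrative Java, not the Samza implementation):

```java
import java.util.List;

// Toy incremental reader: processes a log by sequence number and
// periodically records its position in a checkpoint feed, so a
// restarted reader resumes where it left off instead of reprocessing
// the whole log.
class CheckpointedReader {
    private final List<String> checkpointFeed;  // stands in for the special feed
    private long position;

    CheckpointedReader(List<String> checkpointFeed) {
        this.checkpointFeed = checkpointFeed;
        // Resume from the most recent checkpoint, if any.
        this.position = checkpointFeed.isEmpty()
                ? 0
                : Long.parseLong(checkpointFeed.get(checkpointFeed.size() - 1));
    }

    // Process messages from the current position to the head,
    // checkpointing every `interval` messages; returns how many
    // messages were processed.
    public int run(List<String> log, int interval) {
        int processed = 0;
        while (position < log.size()) {
            position++;  // "process" the message at the old position
            processed++;
            if (position % interval == 0) {
                checkpointFeed.add(Long.toString(position));
            }
        }
        return processed;
    }
}
```

Messages between the last checkpoint and the crash get reprocessed on restart, so downstream consumers must tolerate a small amount of duplication.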
  39. Fault-tolerant stateful processing
     [Diagram: each Samza task (one per partition) keeps its state in a local LevelDB/RocksDB store, backed by a durable change log]
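The change-log pattern on this slide can be sketched as a local map whose writes are mirrored to a durable log, so a restarted task rebuilds its state by replay (toy Java; in the real system the store is LevelDB/RocksDB and the log is a Kafka topic):

```java
import java.util.*;

// Toy durable key-value store: every local write is also appended to a
// change log (a list standing in for a durable Kafka topic), and the
// constructor restores state by replaying that log.
class DurableStore {
    private final Map<String, String> local = new HashMap<>();  // stands in for RocksDB
    private final List<String[]> changelog;

    DurableStore(List<String[]> changelog) {
        this.changelog = changelog;
        // Restore: replay the change log into the local store.
        for (String[] entry : changelog) {
            local.put(entry[0], entry[1]);
        }
    }

    public void put(String key, String value) {
        local.put(key, value);
        changelog.add(new String[] {key, value});  // the durable write
    }

    public String get(String key) {
        return local.get(key);
    }
}
```

Because the log records every write to a key, log compaction (keeping only the latest entry per key) keeps recovery time bounded; the sketch above skips that detail.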
  40. Summary
     - Data integration needs have evolved with the rise of nearline processing
     - DFS-based data integration stacks fall short, leading to anti-patterns
     - Liquid
       - Data integration stack rethought
       - Provides ETL-as-a-service for near real-time as well as batch applications
       - Built on top of Apache Kafka and Apache Samza
  41. Thank you!
     - The Log: http://bit.ly/the_log
     - Apache Kafka: http://kafka.apache.org
     - Apache Samza: http://samza.incubator.apache.org
     - Me: @nehanarkhede, http://www.linkedin.com/in/nehanarkhede
  42. Extra content

  43. Real world use cases

  44. Liquid use cases
     - Data cleaning and normalization
     - Site speed monitoring
     - Call graph assembly
     - Operational analysis
  45. Performance Tricks
     - Batching: producer, broker, consumer
     - Avoid large in-memory structures (pagecache friendly)
     - Avoid data copying (sendfile)
     - Batch compression
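The "avoid data copying" trick refers to the sendfile system call; in Java the same idea is exposed as FileChannel.transferTo, which lets the kernel move bytes directly instead of staging them in user-space buffers. A small file-to-file sketch (Kafka's hot path transfers from file to socket, but the call is the same):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopy {
    // Copy a file via FileChannel.transferTo: on Linux this maps to
    // sendfile, avoiding the read-into-buffer/write-from-buffer round trip.
    static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long transferred = 0;
            long size = in.size();
            while (transferred < size) {
                transferred += in.transferTo(transferred, size - transferred, out);
            }
            return transferred;
        }
    }

    // Self-contained demo: write a temp file, zero-copy it, verify contents.
    static boolean demo() {
        try {
            Path src = Files.createTempFile("zerocopy", ".src");
            Path dst = Files.createTempFile("zerocopy", ".dst");
            Files.write(src, "hello sendfile".getBytes());
            long n = transfer(src, dst);
            boolean ok = n == 14
                    && new String(Files.readAllBytes(dst)).equals("hello sendfile");
            Files.deleteIfExists(src);
            Files.deleteIfExists(dst);
            return ok;
        } catch (IOException e) {
            return false;
        }
    }
}
```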
  46. Usage @ LinkedIn
     - 175 TB of in-flight log data per colo
     - Low latency: ~1.5 ms
     - Replicated to each datacenter
     - Tens of thousands of data producers
     - Thousands of consumers
     - 7 million messages written/sec
     - 35 million messages read/sec
     - Hadoop integration
  47. Samza Architecture (physical view)
     [Diagram: Samza containers 1 and 2 on hosts 1 and 2, each launched by a YARN node manager and coordinated by the Samza YARN AM, reading from and writing to Kafka on each host]
  48. Samza Architecture: equivalence to Map/Reduce
     [Diagram: the same layout as slide 47 with Map and Reduce tasks in place of Samza containers, a Map/Reduce YARN AM, and HDFS in place of Kafka]