Liquid: Unifying nearline and offline big data integration


As data-parallel processing systems grow more sophisticated, the bottleneck in data-intensive companies shifts from the back-end data systems to the data integration stack, which is responsible for pre-processing data for back-end applications. The use of back-end data systems with different access latencies and data integration requirements poses new challenges that current data integration stacks based on distributed file systems (proposed a decade ago for batch-oriented processing) cannot address.
In this paper, we describe Liquid, a data integration stack that provides low-latency data access to support near real-time in addition to batch applications. It supports incremental processing, and is cost-efficient and highly available. Liquid has two layers: a processing layer based on a stateful stream processing model, and a messaging layer with a highly available publish/subscribe system. We report our experience of a Liquid deployment with back-end data systems at LinkedIn, a data-intensive company with over 300 million users.


nehanarkhede

January 06, 2015

Transcript

  1. LIQUID Unifying nearline and offline big data integration

  2. Neha Narkhede
     - Co-founder and Head of Engineering @ Confluent
     - Prior to this...
       - Lead, Streams Infrastructure @ LinkedIn (Kafka & Samza)
       - Apache Kafka committer and PMC member
     - Reach out at @nehanarkhede
  3. Agenda
     - Introduction
     - Evolution of data integration
     - Current state of support for nearline systems
     - Liquid
       - Messaging layer
       - Processing layer
     - Summary
  4. Latency spectrum of data systems
     - Synchronous (milliseconds): RPC
     - Asynchronous processing (seconds to minutes)
     - Batch (hours)
     Data integration is needed across this spectrum.
  5. Nearline processing at LinkedIn
     [Diagram: a user updates their profile with a new job; the event flows through Liquid to a standardization engine and on to Newsfeed, Search, Hadoop, and other systems]
  6. 2 trends impact data integration

  7. Increase in diversity of data (1980+ -> 2000+ -> 2010+)
     - Siloed data feeds
     - Database data (users, products, orders, etc.)
     - Events (clicks, impressions, pageviews)
     - Application logs (errors, service calls)
     - Application metrics (CPU usage, requests/sec)
     - IoT sensors
  8. Explosion in diversity of systems
     - Live systems: Voldemort, Espresso, GraphDB, Search, Samza
     - Batch: Hadoop, Teradata
  9. Evolution of data integration

  10. The Enterprise Data Warehouse

  11. Problems
     - Data warehouse is a batch system
     - Relational mapping is non-trivial
     - Organizationally unscalable
       - Central team that cleans all data?
       - Labor intensive
       - Doesn't give good data coverage
  12. DFS based data integration

  13. Problem
     - Rise in nearline processing systems requires low-latency data integration
     - Data integration stack is the new bottleneck
     - MR/DFS-based data integration stack does not address all processing systems
       - Latency
       - Incremental processing is not straightforward
  14. Current state of support for nearline systems

  15. Data integration anti-pattern: each system has its own source of truth
  16. Support for nearline systems
     - Bypass the data integration layer
       - Point-to-point connections
       - Reduces reusability of data
     - Architectural patterns
       - Lambda architecture
       - Kappa architecture
  17. Point-to-point pipelines
     [Diagram: every source (Oracle, Espresso, Voldemort, user tracking, logs, operational metrics, production services, ...) wired directly to every destination (Hadoop, log search, monitoring, data warehouse, social graph, rec. engine, search, email, security, ...)]
  18. Centralized pipeline
     [Diagram: the same sources and destinations as slide 17, but all connected through a single central data pipeline]
  19. Support for nearline systems
     - Bypass the data integration layer
       - Point-to-point connections
       - Reduces reusability of data
     - Architectural patterns
       - Lambda architecture
       - Kappa architecture
  20. Lambda Architecture

  21. Support for nearline systems
     - Bypass the data integration layer
       - Point-to-point connections
       - Reduces reusability of data
     - Architectural patterns
       - Lambda architecture
       - Kappa architecture
  22. Kappa Architecture

  23. Liquid

  24. Liquid architecture
     [Diagram: a processing layer (Apache Samza) of jobs, stateful jobs, and their tasks on top of a messaging layer (Apache Kafka) of partitioned topics; input feeds carry data in, output feeds carry data out]
  25. Key takeaways
     1. Central commit log for all data
     2. Push data cleanliness upstream
     3. O(1) ETL
  26. What is a commit log?
     [Diagram: an append-only log with sequence numbers 0-12; the 1st record at position 0, the next record written at the head; a table (Seq: 10), an index (Seq: 12), and a materialized view (Seq: 12) each track their own position in the log]
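The structure on this slide can be sketched in a few lines of Java (a toy in-memory sketch with illustrative names, not Kafka's on-disk implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Toy append-only commit log: records get monotonically increasing
// sequence numbers (offsets); readers can consume from any offset.
class CommitLog {
    private final List<String> records = new ArrayList<>();

    // Append a record and return its sequence number.
    public synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read all records from the given offset (inclusive) to the head.
    public synchronized List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }

    // The offset at which the next record will be written.
    public synchronized long nextOffset() {
        return records.size();
    }
}
```

A table, an index, and a materialized view each remember their own Seq position and replay forward from it, which is what lets several downstream systems stay consistent with one upstream log.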
  27. Liquid: Messaging Layer
     [Diagram: producers publish partitioned data to a cluster of brokers; consumers one and two each read an ordered subscription]
  28. Liquid: Messaging Layer
     - Topic-based publish/subscribe
     - Backed by a distributed commit log
     - Rewindability
     - Multi-subscriber
     - Performance
       - High throughput: 50 MB/s writes, 110 MB/s reads
       - Low latency: 1.5 ms writes
     - At LinkedIn: 500 billion messages/day
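Rewindability and multi-subscriber reads fall out of the log-plus-offsets model: each consumer owns its own position, so rewinding is just resetting an offset. A toy Java sketch (illustrative names; this is not the Kafka client API):

```java
import java.util.*;

// Toy topic-based publish/subscribe backed by in-memory logs.
class MessageBus {
    private final Map<String, List<String>> topics = new HashMap<>();

    // Append a message to a topic's log and return its offset.
    public synchronized long publish(String topic, String msg) {
        List<String> log = topics.computeIfAbsent(topic, t -> new ArrayList<>());
        log.add(msg);
        return log.size() - 1;
    }

    // Return all messages from the given offset to the head.
    public synchronized List<String> fetch(String topic, long offset) {
        List<String> log = topics.getOrDefault(topic, List.of());
        return new ArrayList<>(log.subList((int) offset, log.size()));
    }
}

// Each consumer tracks its own offset per topic, so any number of
// consumers can read the same data independently.
class Consumer {
    private final MessageBus bus;
    private final Map<String, Long> offsets = new HashMap<>();

    Consumer(MessageBus bus) { this.bus = bus; }

    // Fetch new messages and advance this consumer's offset.
    public List<String> poll(String topic) {
        long offset = offsets.getOrDefault(topic, 0L);
        List<String> msgs = bus.fetch(topic, offset);
        offsets.put(topic, offset + msgs.size());
        return msgs;
    }

    // Rewind to an earlier offset to re-read history.
    public void seek(String topic, long offset) {
        offsets.put(topic, offset);
    }
}
```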
  29. Key takeaways
     1. Central commit log for all data
     2. Push data cleanliness upstream
     3. O(1) ETL
  30. Metadata
     - Problem
       - Hundreds of message types
       - Thousands of fields
       - What do they all mean?
       - What happens when they change?
     - Solution
       - Need a formal contract (Avro)
       - Central repository of all schemas
       - Programmatic compatibility model
       - Reader always uses the same schema as the writer
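"Reader always uses the same schema as the writer" is typically implemented by tagging each message with the id of its schema in the central repository. A toy Java sketch of that idea (illustrative names; the real system stores Avro schemas and also checks compatibility programmatically):

```java
import java.util.ArrayList;
import java.util.List;

// Toy central schema registry: writers register a schema once and tag
// every message with its schema id; readers resolve the id back to the
// exact schema the writer used.
class SchemaRegistry {
    private final List<String> schemas = new ArrayList<>();

    // Register a schema, returning its id; re-registering is idempotent.
    public synchronized int register(String schema) {
        int existing = schemas.indexOf(schema);
        if (existing >= 0) return existing;
        schemas.add(schema);
        return schemas.size() - 1;
    }

    // Resolve a schema id back to the schema text.
    public synchronized String lookup(int id) {
        return schemas.get(id);
    }
}

// A message carries the id of the schema it was written with.
class TaggedMessage {
    final int schemaId;
    final String payload;
    TaggedMessage(int schemaId, String payload) {
        this.schemaId = schemaId;
        this.payload = payload;
    }
}
```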
  31. Key takeaways
     1. Central commit log for all data
     2. Push data cleanliness upstream
     3. O(1) ETL
  32. Automated ETL
     - Map/Reduce job does the data load
     - One job loads all events
     - Hive registration done automatically
     - Schema changes handled transparently
     - ~5 minute lag on average to HDFS
  33. Liquid: Processing Layer
     - ETL-as-a-service
       - Resource isolation
       - Processing job isolation
     - Incremental processing
     - Fault-tolerant stateful processing
     - Apache Samza
  34. Processing API

    public interface StreamTask {
      void process(IncomingMessageEnvelope envelope,  // getKey(), getMsg()
                   MessageCollector collector,        // sendMsg(topic, key, value)
                   TaskCoordinator coordinator);      // commit(), shutdown()
    }
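To see how the interface is used, here is a small self-contained example. It defines minimal stand-ins for the Samza types on this slide (assumed, simplified signatures, so it runs without the framework) plus a task that keeps a running count per key:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for the framework types (assumptions, not the
// real Samza interfaces).
interface IncomingMessageEnvelope { Object getKey(); Object getMsg(); }
interface MessageCollector { void sendMsg(String topic, Object key, Object value); }
interface TaskCoordinator { void commit(); void shutdown(); }

interface StreamTask {
    void process(IncomingMessageEnvelope envelope,
                 MessageCollector collector,
                 TaskCoordinator coordinator);
}

// Example task: counts messages per key and emits the running count
// to a downstream "counts" topic on every message.
class CountingTask implements StreamTask {
    private final Map<Object, Integer> counts = new HashMap<>();

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        int n = counts.merge(envelope.getKey(), 1, Integer::sum);
        collector.sendMsg("counts", envelope.getKey(), n);
    }

    public int countFor(Object key) {
        return counts.getOrDefault(key, 0);
    }
}
```

The framework drives the loop: it deserializes each record into an envelope, invokes process, and routes anything the task emits via the collector back into the messaging layer.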
  35. Processing Layer Architecture
     [Diagram: tasks 1-3 consume from Log A (partitions 0-2) and Log B (partitions 0-1)]
  36. Resource isolation
     [Diagram: the same tasks and logs as slide 35, with the tasks grouped into Samza container 1 and Samza container 2]
  37. Processing isolation
     [Diagram: jobs 1 and 2 consume Log A and produce Logs B-E; when a downstream job lags, the options are: drop data (no), apply backpressure (no), queue in memory (no), queue on disk in Kafka (yes)]
  38. Incremental processing
     - Data access at message granularity using the log sequence number
     - Periodically checkpoint position in a special feed stored by the messaging layer
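A minimal sketch of that checkpointing loop, with a plain list standing in for the checkpoint feed (illustrative Java, not the Samza implementation):

```java
import java.util.List;

// Toy incremental reader: processes a log by sequence number and
// periodically records its position in a checkpoint feed, so a
// restarted reader resumes where it left off instead of reprocessing
// the whole log.
class CheckpointedReader {
    private final List<String> checkpointFeed;  // stands in for the special feed
    private long position;

    CheckpointedReader(List<String> checkpointFeed) {
        this.checkpointFeed = checkpointFeed;
        // Resume from the most recent checkpoint, if any.
        this.position = checkpointFeed.isEmpty()
                ? 0
                : Long.parseLong(checkpointFeed.get(checkpointFeed.size() - 1));
    }

    // Process messages from the current position to the head,
    // checkpointing every `interval` messages; returns how many
    // messages were processed.
    public int run(List<String> log, int interval) {
        int processed = 0;
        while (position < log.size()) {
            position++;  // "process" the message at the old position
            processed++;
            if (position % interval == 0) {
                checkpointFeed.add(Long.toString(position));
            }
        }
        return processed;
    }
}
```

Messages between the last checkpoint and the crash get reprocessed on restart, so downstream consumers must tolerate a small amount of duplication.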
  39. Fault-tolerant stateful processing
     [Diagram: each Samza task (one per partition) keeps its state in a local LevelDB/RocksDB store, backed by a durable change log]
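The change-log pattern on this slide can be sketched as a local map whose writes are mirrored to a durable log, so a restarted task rebuilds its state by replay (toy Java; in the real system the store is LevelDB/RocksDB and the log is a Kafka topic):

```java
import java.util.*;

// Toy durable key-value store: every local write is also appended to a
// change log (a list standing in for a durable Kafka topic), and the
// constructor restores state by replaying that log.
class DurableStore {
    private final Map<String, String> local = new HashMap<>();  // stands in for RocksDB
    private final List<String[]> changelog;

    DurableStore(List<String[]> changelog) {
        this.changelog = changelog;
        // Restore: replay the change log into the local store.
        for (String[] entry : changelog) {
            local.put(entry[0], entry[1]);
        }
    }

    public void put(String key, String value) {
        local.put(key, value);
        changelog.add(new String[] {key, value});  // the durable write
    }

    public String get(String key) {
        return local.get(key);
    }
}
```

Because the log records every write to a key, log compaction (keeping only the latest entry per key) keeps recovery time bounded; the sketch above skips that detail.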
  40. Summary
     - Data integration needs have evolved with the rise of nearline processing
     - DFS-based data integration stacks fall short, leading to anti-patterns
     - Liquid
       - Data integration stack rethought
       - Provides ETL-as-a-service for near real-time as well as batch applications
       - Built on top of Apache Kafka and Apache Samza
  41. Thank you!
     - The Log: http://bit.ly/the_log
     - Apache Kafka: http://kafka.apache.org
     - Apache Samza: http://samza.incubator.apache.org
     - Me: @nehanarkhede, http://www.linkedin.com/in/nehanarkhede
  42. Extra content

  43. Real world use cases

  44. Liquid use cases
     - Data cleaning and normalization
     - Site speed monitoring
     - Call graph assembly
     - Operational analysis
  45. Performance Tricks
     - Batching: producer, broker, consumer
     - Avoid large in-memory structures (pagecache friendly)
     - Avoid data copying (sendfile)
     - Batch compression
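The "avoid data copying" trick refers to the sendfile system call; in Java the same idea is exposed as FileChannel.transferTo, which lets the kernel move bytes directly instead of staging them in user-space buffers. A small file-to-file sketch (Kafka's hot path transfers from file to socket, but the call is the same):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopy {
    // Copy a file via FileChannel.transferTo: on Linux this maps to
    // sendfile, avoiding the read-into-buffer/write-from-buffer round trip.
    static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long transferred = 0;
            long size = in.size();
            while (transferred < size) {
                transferred += in.transferTo(transferred, size - transferred, out);
            }
            return transferred;
        }
    }

    // Self-contained demo: write a temp file, zero-copy it, verify contents.
    static boolean demo() {
        try {
            Path src = Files.createTempFile("zerocopy", ".src");
            Path dst = Files.createTempFile("zerocopy", ".dst");
            Files.write(src, "hello sendfile".getBytes());
            long n = transfer(src, dst);
            boolean ok = n == 14
                    && new String(Files.readAllBytes(dst)).equals("hello sendfile");
            Files.deleteIfExists(src);
            Files.deleteIfExists(dst);
            return ok;
        } catch (IOException e) {
            return false;
        }
    }
}
```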
  46. Usage @ LinkedIn
     - 175 TB of in-flight log data per colo
     - Low latency: ~1.5 ms
     - Replicated to each datacenter
     - Tens of thousands of data producers
     - Thousands of consumers
     - 7 million messages written/sec
     - 35 million messages read/sec
     - Hadoop integration
  47. Samza Architecture (physical view)
     [Diagram: Samza containers 1 and 2 on hosts 1 and 2, each launched by a YARN node manager and coordinated by the Samza YARN AM, reading from and writing to Kafka on each host]
  48. Samza Architecture: equivalence to Map/Reduce
     [Diagram: the same layout as slide 47 with Map and Reduce tasks in place of Samza containers, a Map/Reduce YARN AM, and HDFS in place of Kafka]