Event Streaming Rises

Lee Dongjin
January 21, 2020

On the ideas behind distributed log technologies such as Apache Kafka: what problems do traditional architectures have, and how does event streaming (with the Log as a primary programming model) address them?
Presented at Furiosa AI, a hardware startup developing a high-performance AI inference processor, on January 21, 2020.

Slides: Korean. Presentation: English.


Transcript

  1. Contents
     • Background: Modern Enterprise Architecture
       ◦ Problems
     • Event Streaming
       ◦ Approaches
       ◦ Log
     • Related Projects
     • Conclusion
  2. Problems: Application Layer
     • Polyglot Persistence
       ◦ ACID Property
       ◦ Atomicity or Abortability - but how?
     • Heterogeneous Distributed Transaction
       ◦ cf: Homogeneous Distributed Transaction
       ◦ ex: VoltDB, Google Spanner
     • Limitations of 2PC
       ◦ Blocking Protocol
       ◦ Not well supported
     (Diagram: Process)
  3. Problems: Data Integration Layer
     • ETL: Extract, Transform, Load
     • Low coverage
     • Too slow
       ◦ Can’t handle real-time data!
     • Not scalable
       ◦ Hard to Extract
       ◦ Repeated Load
       ◦ Complexity: O(m * n)
  4. Problems: Data Processing Layer
     • All problems in Data Integration Layer AND...
     • Processing Logic = DAG of Processors
       ◦ No reliable real-time input / output
       ◦ Complex to connect storages
         ▪ Schema: Easy to break
       ◦ Processor w/ Internal State
         ▪ How to recover on failure?
     • In short: A TOTAL MESS.
     (Diagram: Filter, Enrich, Agg)
  5. Event Streaming
     • Storage as a Mutable State
       ◦ … of final snapshot
       ◦ Bounded collection of records
     • Storage as a Sequence of Immutable Events
       ◦ … being accumulated
       ◦ Unbounded collection of immutable events
  6. Distributed Transaction w/ Event Streaming (1)
     • Reactive approach (detects events from & emits events to event store)
     (Diagram: Process, Event Store, Storage A, Storage B, Storage C)
  7. Distributed Transaction w/ Event Streaming (2)
     • Process: emits transaction request
     (Diagram: Process, Event Store, Storage A, Storage B, Storage C; events: TX 1)
  8. Distributed Transaction w/ Event Streaming (3)
     • Storage A: detects transaction request & emits success event on A
     (Diagram: Process, Event Store, Storage A, Storage B, Storage C; events: TX 1, TX 1/A)
  9. Distributed Transaction w/ Event Streaming (4)
     • Storage B: detects success event on A & emits success event on B
     (Diagram: Process, Event Store, Storage A, Storage B, Storage C; events: TX 1, TX 1/A, TX 1/B, …)
  10. Distributed Transaction w/ Event Streaming (5)
      • Storage C: detects success event on B & emits success event on C
      (Diagram: Process, Storage A, Storage B, Storage C; events: TX 1, TX 1/A, TX 1/B, …, …, …, TX 1/C)
  11. Distributed Transaction w/ Event Streaming (6)
      • Process: detects success event on C & emits transaction success
      (Diagram: Process, Storage A, Storage B, Storage C; events: TX 1, TX 1/A, TX 1/B, …, …, …, TX 1/C, TX 1, …)
  12. Distributed Transaction w/ Event Streaming (7)
      • If failure: for example, Storage B emits failure event on B
      (Diagram: Process, Storage A, Storage B, Storage C; events: TX 1, TX 1/A, TX 1/B, …, TX 1/C, TX 1, …, TX 2, TX 2/A, TX 2/B)
  13. Distributed Transaction w/ Event Streaming (8)
      • Process: detects failure event on A, B, or C and emits transaction failure
      (Diagram: Process, Storage A, Storage B, Storage C; events: TX 1, TX 1/A, TX 1/B, …, TX 1/C, TX 1, TX 2, TX 2/A, TX 2/B, TX 2)
  14. Distributed Transaction w/ Event Streaming (9)
      • Storage A, B, or C: detects transaction failure and rolls back the transaction
      (Diagram: Process, Storage A, Storage B, Storage C; events: TX 1, TX 1/A, TX 1/B, …, TX 1/C, TX 1, TX 2, TX 2/A, TX 2/B, TX 2)
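The walkthrough in slides 6 to 14 can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: `EventStore`, `run_transaction`, and the event names (`request`, `success/A`, `rollback/A`, ...) are hypothetical and do not come from the slides, and a real deployment would use a distributed log with independent subscribers rather than one loop.

```python
from dataclasses import dataclass, field


@dataclass
class EventStore:
    """In-memory stand-in for an event store: an append-only list of events."""
    events: list = field(default_factory=list)

    def emit(self, tx, kind):
        self.events.append((tx, kind))


def run_transaction(store, tx, storages, failing=None):
    """Drive one transaction through the storages in order.

    Each storage "detects" the previous event and emits its own success or
    failure event; the process then emits the final outcome.
    """
    store.emit(tx, "request")                  # Process: emits transaction request
    applied = []
    for name in storages:
        if name == failing:
            store.emit(tx, f"failure/{name}")  # Storage: emits failure event
            store.emit(tx, "failed")           # Process: emits transaction failure
            for done in reversed(applied):     # Storages: roll back on failure
                store.emit(tx, f"rollback/{done}")
            return False
        applied.append(name)
        store.emit(tx, f"success/{name}")      # Storage: emits success event
    store.emit(tx, "succeeded")                # Process: emits transaction success
    return True


store = EventStore()
ok1 = run_transaction(store, "TX 1", ["A", "B", "C"])
ok2 = run_transaction(store, "TX 2", ["A", "B", "C"], failing="B")
for event in store.events:
    print(event)
```

Here TX 1 passes through A, B, and C, while TX 2 fails at B, so only Storage A has anything to roll back.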
  15. Conditions for Event Store
      • Optimized for writes
      • Ordering guarantee
        ◦ Read events in stored order
      • Subscribable
      • Scalability & Fault tolerance
      (Diagram: Event Store)
  16. Log: A data structure (1)
      • Append-only Data Structure
        ◦ Log Sequence No. (=id), Payload
        ◦ No update or delete
      • Provides…
        ◦ Optimized for writes
        ◦ Ordering guarantee
        ◦ Logical timestamp
          ▪ Independent of physical time
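A minimal in-memory sketch of the log described on this slide, assuming a hypothetical `Log` class (not any library's API): records are only appended, each gets an increasing log sequence number, and reads return records in stored order.

```python
class Log:
    """Append-only log: records get increasing sequence numbers and never change."""

    def __init__(self):
        self._records = []

    def append(self, payload):
        """Append a payload and return its log sequence number (LSN)."""
        self._records.append(payload)
        return len(self._records) - 1

    def read(self, from_lsn=0):
        """Yield (lsn, payload) pairs in stored order, starting at from_lsn."""
        for lsn in range(from_lsn, len(self._records)):
            yield lsn, self._records[lsn]


log = Log()
first = log.append("created")
second = log.append("updated")
tail = list(log.read(from_lsn=1))
```

The sequence number doubles as the logical timestamp: it orders records without referring to wall-clock time.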
  17. Log: A data structure (2)
      • Not seen, but Everywhere
        ◦ At the heart of the data system
      • From Traditional Databases to Distributed Systems
        ◦ A Tool to achieve Atomicity and Durability
          ▪ Write Ahead Log (WAL), Commit Log, Transaction Log, Replication Log, ...
        ◦ A Storage itself
          ▪ Log Structured Merge tree
          ▪ LevelDB (by Google), RocksDB (by Facebook)
          ▪ Google BigTable, HBase HFile, Cassandra SSTable
  18. Log as a Distributed Storage (1)
      • Log as an application programming model
        ◦ Append Only
        ◦ Sequential Read w/ Subscribing
        ◦ Durable
      • Log as a Distributed Storage
        ◦ Scalability by Partitioning
          ▪ Partition Count = Maximum Parallelism
          ▪ Ordering guarantee only in Partition
          ▪ Key for partitioning
        ◦ Fault Tolerance by Replication
          ▪ Maintains multiple copies of each partition
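Key-based partitioning can be sketched as follows. This is an assumption-laden illustration: `partition_for` and `PartitionedLog` are hypothetical names, and CRC32 is used only because it is a stable hash in the standard library, not because any particular system partitions this way.

```python
import zlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition with a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


class PartitionedLog:
    """A log split into independent append-only partitions."""

    def __init__(self, partition_count):
        self.partitions = [[] for _ in range(partition_count)]

    def append(self, key, value):
        """Append to the key's partition; return (partition, offset)."""
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


plog = PartitionedLog(partition_count=4)
p1, o1 = plog.append("user-1", "login")
p2, o2 = plog.append("user-1", "logout")
```

Records with the same key always land in the same partition, so their relative order is preserved; no order is defined across partitions.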
  19. Log as a Distributed Storage (2)
      • Implementations
        ◦ Apache Kafka (by LinkedIn)
        ◦ Apache BookKeeper (by Yahoo! Research)
        ◦ Apache Pulsar
        ◦ CORFU (by Microsoft Research)
        ◦ LogDevice (by Facebook)
        ◦ AWS Kinesis, GCP Pub/Sub, Azure Event Hubs
  20. OLEP: OnLine Event Processing
      • Advantages
        ◦ Scalable Distributed Transaction
      • Limitations
        ◦ Eventual Consistency
        ◦ Isolation Level
          ▪ Read Committed only
          ▪ Snapshot Isolation is under research
      (Diagram: P1, P2, P3; t1, t2)
  21. What type of Streams?
      • Event Stream
        ◦ Accumulates events
        ◦ A way to communicate between distributed services
          ▪ Mainly microservices
        ◦ Key for partition assignment (optional)
      • Update Stream (AKA Changelog stream)
        ◦ Represents modification history of a state
        ◦ A way to represent the logical state of a storage
          ▪ Including (but not restricted to) databases
        ◦ Key for record identity (required)
      (Diagram: Update Stream, Event Stream)
  22. Update Stream (AKA Changelog Stream) (1)
      • Example
        ◦ State 1: k1 = v1
        ◦ State 2: k1 = v1, k2 = v2 (update: (k2, v2))
        ◦ State 3: k1 = v1-1, k2 = v2 (update: (k1, v1-1))
        ◦ State 4: k1 = v1-1 (update: (k2, null))
      (Diagram: State, Update Stream)
  23. Update Stream (AKA Changelog Stream) (2)
      • Stream-Table Duality
        ◦ Table → Stream
          ▪ All states (including storages) can be represented as an update stream
        ◦ Stream → Table
          ▪ … and Vice Versa
          ▪ State as a Materialized View w/ timestamp
      • Examples
        ◦ Replication Log
        ◦ CDC (Change Data Capture) features such as Oracle GoldenGate
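Both directions of the duality can be sketched in a few lines. The function names `materialize` and `changelog` are hypothetical, `None` stands in for the `null` delete marker, and the sample stream follows the example on slide 22 with an initial `(k1, v1)` update assumed.

```python
def materialize(update_stream):
    """Stream -> Table: fold keyed updates into the latest state."""
    table = {}
    for key, value in update_stream:
        if value is None:          # a null value deletes the key
            table.pop(key, None)
        else:
            table[key] = value
    return table


def changelog(old, new):
    """Table -> Stream: the updates that turn table `old` into table `new`."""
    updates = [(k, v) for k, v in new.items() if old.get(k) != v]
    updates += [(k, None) for k in old if k not in new]
    return updates


stream = [("k1", "v1"), ("k2", "v2"), ("k1", "v1-1"), ("k2", None)]
table = materialize(stream)
diff = changelog({"k1": "v1", "k2": "v2"}, table)
```

Replaying `changelog({}, table)` through `materialize` reproduces the table, which is the round trip the slide calls "vice versa".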
  24. Data Integration with Update Stream (1)
      • Example: replicating MySQL
      (Diagram: MySQL (original), MySQL (replicated), MySQL (w/ Index), BigQuery, Changelog)
  25. Data Integration with Update Stream (2)
      • Extract once
        ◦ w/ Predefined Schema
        ◦ w/ Cleansing
      • Transform as many times as you need
      • Load per storage type
      • Advantages: Scalable & Real-time!
      (Diagram: Original Storage, Changelog Stream, Derived Stream, Destination Storages)
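A back-of-the-envelope comparison with the O(m * n) complexity noted on slide 3, assuming m source storages and n destination storages. The function names are hypothetical, and the m + n figure is an inference from "extract once" and "load per storage type", not a number stated in the slides.

```python
def point_to_point_pipelines(sources: int, destinations: int) -> int:
    """Traditional ETL: one pipeline per (source, destination) pair."""
    return sources * destinations


def log_based_pipelines(sources: int, destinations: int) -> int:
    """Changelog-based: one extractor per source plus one loader per destination."""
    return sources + destinations


direct = point_to_point_pipelines(5, 4)
via_log = log_based_pipelines(5, 4)
```

With 5 sources and 4 destinations this is 20 pipelines against 9, and the gap widens as either side grows.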
  26. Schema as a Contract
      • In the event-streaming world, the schema of events stored in the Log works as a contract between different microservices or teams.
        ◦ … unlike the RPC world, where the API works as the contract.
      • Each DevOps team that publishes an event stream is responsible for keeping the stream clean and forward-compatible.
      • Example: A dedicated service to register, document, evolve, and validate schemas
        ◦ https://issues.apache.org/jira/browse/AVRO-1124
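What "forward-compatible" can mean in practice is sketched below, under a simplified assumption: a schema is a plain dict mapping field names to a `type` and an optional `default`, and a consumer still on the old schema must be able to read records written with the new one. `is_forward_compatible` is a hypothetical helper, not the API of Avro or of any schema registry.

```python
def is_forward_compatible(old_schema, new_schema):
    """True if a reader using old_schema can read records written with new_schema."""
    for name, spec in old_schema.items():
        if name in new_schema:
            if new_schema[name]["type"] != spec["type"]:
                return False      # the field changed type under the reader
        elif "default" not in spec:
            return False          # the field vanished and the reader has no fallback
    return True                   # fields added by the writer are simply ignored


old = {"id": {"type": "long"}, "note": {"type": "string", "default": ""}}
new_ok = {"id": {"type": "long"}, "email": {"type": "string"}}
new_bad = {"email": {"type": "string"}}

compatible = is_forward_compatible(old, new_ok)
broken = is_forward_compatible(old, new_bad)
```

A publishing team could run a check of this kind before releasing a schema change, so that existing subscribers keep working.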
  27. Machine Replication Principle
      • “Updating state deterministically based on ordered events”
      • “A processor with internal state can be serialized into a changelog, and vice versa.”
      (Diagram: Processor w/ State)
  28. Real-time Data Processing Model w/ Log
      • Data Processing Task
        ◦ A DAG of Processors
        ◦ Read from Log, Write to Log
      • Processor
        ◦ Stateless Processor
        ◦ Stateful Processor
          ▪ Fault Tolerance w/ Internal State Changelog
          ▪ Crash: Can recover from Changelog!
      (Diagram: Filter, Enrich, Aggregation)
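Recovery of a stateful processor from its changelog can be sketched as below. `CountingProcessor` is a hypothetical example, and a plain Python list stands in for the durable changelog, which in practice would live in the distributed log and survive the crash.

```python
class CountingProcessor:
    """Stateful processor: counts events per key and logs every state change."""

    def __init__(self, changelog):
        self.changelog = changelog     # durable list standing in for a Log
        self.counts = {}
        for key, count in changelog:   # recovery: replay the changelog in order
            self.counts[key] = count

    def process(self, key):
        new_count = self.counts.get(key, 0) + 1
        self.changelog.append((key, new_count))  # record the update first
        self.counts[key] = new_count
        return new_count


changelog = []
processor = CountingProcessor(changelog)
for key in ["a", "b", "a"]:
    processor.process(key)

# Simulate a crash: the in-memory state is lost, the changelog survives.
recovered = CountingProcessor(changelog)
```

Because the updates are applied deterministically and in order, the recovered instance ends up with the same counts as the one that crashed.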
  29. Related Projects: Extract and Load (1)
      • Kafka Connect
        ◦ De-facto Standard
        ◦ Supports both Table → Stream and Stream → Table
        ◦ Provides ‘Extract and Load’ functionality (w/ little Transformation)
      • Brooklin (by LinkedIn)
        ◦ Provides a generic bridge between heterogeneous Streaming / Storage Systems
          ▪ Table → Table, Table → Stream, Stream → Table, Stream → Stream
        ◦ Deployed as a dedicated system
        ◦ Open-sourced in 2019
  30. Related Projects: Extract and Load (2)
      • Debezium
        ◦ Focuses on CDC only (Table → Stream)
        ◦ Supports MySQL, PostgreSQL, Oracle, SQL Server, MongoDB, Cassandra, ...
        ◦ Deployed as a Kafka Connect plugin or an Embedded Engine
      • DBLog (by Netflix)
        ◦ Focuses on CDC only (Table → Stream)
        ◦ Supports MySQL and PostgreSQL
        ◦ Deployed as a dedicated system
        ◦ Planned to be open-sourced in 2020
  31. Related Projects: Processing Framework
      • Real-time MapReduce
        ◦ Spark
        ◦ Flink
      • Event-Driven Microservices
        ◦ Kafka Streams
        ◦ ksqlDB
          ▪ “Streaming application as a Query”
  32. Conclusion (1)
      • Event streaming is a paradigm shift in storing & processing data.
      • Log is a primary programming model for Event streaming.
  33. Conclusion (2)
      • Log as a primary programming model
        ◦ … provides an excellent abstraction for a wide range of problems
      • Application
        ◦ Distributed Heterogeneous Transaction
      • Data Integration
        ◦ Scalable ETL w/ great coverage in real-time
      • Data Processing
        ◦ Stateful processor w/ fault tolerance
  34. Conclusion (3)
      • Traditional concepts in enterprise applications are changing
        ◦ No boundaries, No orders anymore
        ◦ “All processes are just a kind of processor w/ state store interconnected via Log.”
  35. … And one more thing (1)
      Distributed Application | Database
      Distributed Log | Log
      Changelog stream → State | Table updates → Materialized View
      State per access pattern | Table with Index
      State to Changelog Stream | Table to Transaction Log
      Log to Log processing | WAL to Transaction Log processing
  36. … And one more thing (2)
      “Any sufficiently advanced distributed application is indistinguishable from a database.”
  37. References (1)
      • Jay Kreps, ‘The Log: What every software engineer should know about real-time data's unifying abstraction’, LinkedIn Engineering, 2013
        ◦ https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
      • Adam Jacobs, ‘The Pathologies of Big Data’, acmqueue volume 7, issue 6, 2009
        ◦ https://queue.acm.org/detail.cfm?id=1563874
      • Martin Kleppmann et al, ‘Online Event Processing’, acmqueue volume 17, issue 1, 2019
        ◦ https://queue.acm.org/detail.cfm?id=3321612
      • O’Neil et al, ‘The log-structured merge-tree (LSM-tree)’, Acta Informatica volume 33, 1996
        ◦ https://dl.acm.org/doi/10.1007/s002360050048
  38. References (2)
      • Jeffrey Dean et al, ‘Bigtable: A Distributed Storage System for Structured Data’, Usenix, 2006
        ◦ https://research.google/pubs/pub27898/
      • Jay Kreps, ‘I ♥ Logs: Apache Kafka, Stream Processing, and Real-time Data’, Stanford Seminar, 2014
        ◦ https://www.youtube.com/watch?v=SU8LaHLh6Ng
      • Jay Kreps, ‘Questioning Lambda Architecture’, oreilly.com, 2014
        ◦ https://www.oreilly.com/radar/questioning-the-lambda-architecture/
      • Matthias J. Sax et al, ‘Streams and Tables: Two Sides of the Same Coin’, Proceedings of the international workshop on Real-time business intelligence and Analytics, 2018
        ◦ https://dl.acm.org/doi/10.1145/3242153.3242155
  39. References (3)
      • Martin Kleppmann and Jay Kreps, ‘Kafka, Samza and the Unix philosophy of distributed data’, IEEE Data Engineering Bulletin 38, 2015
        ◦ https://martin.kleppmann.com/2015/12/15/stream-processing-data-engineering-bulletin.html
      • Haruna Isah et al, ‘A Survey of Distributed Data Stream Processing Frameworks’, IEEE Access volume 7, 2019
        ◦ https://ieeexplore.ieee.org/document/8864052
      • Samarth Shetty, ‘Streaming Data Pipelines with Brooklin’, LinkedIn Engineering, 2017
        ◦ https://engineering.linkedin.com/blog/2017/10/streaming-data-pipelines-with-brooklin
      • Celia Kung, ‘Open Sourcing Brooklin: Near Real-Time Data Streaming at Scale’, LinkedIn Engineering, 2019
        ◦ https://engineering.linkedin.com/blog/2019/brooklin-open-source
  40. References (4)
      • Andreas Andreakis and Ioannis Papapanagiotou, ‘DBLog: A Generic Change-Data-Capture Framework’, Netflix Technology, 2019
        ◦ https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b
      • Jay Kreps, ‘Introducing ksqlDB’, Confluent, 2019
        ◦ https://ksqldb.io/