Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data Spain 2017

The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.

https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes

Big Data Spain 2017
November 16th - 17th Kinépolis Madrid

Big Data Spain

December 04, 2017

Transcript

  2. 1 Unified Processing at Scale with Apache Samza Jake Maes

    Staff SW Engineer at LinkedIn Apache Samza PMC
  3. 2 About Me • Apache Samza PMC member • LinkedIn

    3 years • 8 years performance & infra development • Passionate about scale • Long walks on the peaks
  4. 3 Agenda Intro to Stream Processing Stream Processing Ecosystem at

    LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  5. 4 Agenda Intro to Stream Processing Stream Processing Ecosystem at

    LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  6. 5 About • Production at LinkedIn since 2014 • Apache

    top level project since 2014 • 16 Committers • 74 Contributors • Known for  Scale  Pluggability  Kafka integration
  7. 6 • Low latency • One message at a time

    • Checkpointing, state, durability • All I/O with high-performance message brokers Traditional Stream Processing
  8. 7 Stateful Processing Task Task0 State0 Changelog Stream (partition 0)

    Checkpoint Stream Processor Output Streams Input Streams (partition 0)
  9. 8 Co-Partitioned Streams

  10. 9 Typical Flow - Two Stages Minimum

    Operators: re-partition, window, map, sendTo. Streams: PageViewEvent, PageViewEventByMemberId, PageViewEventPerMemberStream. Tasks: PageViewRepartitionTask, PageViewByMemberIdCounterTask
  11. 10 Agenda Intro to Stream Processing Stream Processing Ecosystem at

    LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  12. 11 Stream Processing Ecosystem – The Dream Applications and Services

    Samza Kafka Storage External Streams Storage & Serving Brooklin
  13. 12 Stream Processing Ecosystem - Reality Applications and Services

    Samza Kafka Storage External Streams Storage & Serving Brooklin
  14. 13 Expansion of Stream Processing at LinkedIn • Influx of

    new applications  10 -> over 200 • New use cases  Batch  Streaming  Remote I/O  Composable API • Incoming applications have different expectations • Let’s take a look at two Services
  15. 14 Agenda Intro to Stream Processing Stream Processing Ecosystem at

    LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  16. 15 Online Service + Stream Processing Requirements: • Deployment model

     Cluster environment not suitable • Remote I/O  Dependencies on other services  I/O latency stalls single threaded processor  Container parallelism - too much overhead Services
  17. 16 App Instance Embedded Samza • ZooKeeper-based JobCoordinator  Uses

    ZooKeeper for leader election  Leader assigns work to the processors ZooKeeper ZooKeeper Stream Processor Samza Container Job Coordinator* App Instance Stream Processor Samza Container Job Coordinator App Instance Stream Processor Samza Container Job Coordinator * Leader
  18. 17 Asynchronous Event Loop Stream Processor Event Loop  Single

    thread  1 : Task  n : Task Restful Services Java NIO, Netty
  19. 18 Checkpointing • Sync – Barrier • Async - Watermark

    t1 t2 t3 tc t4 checkpoint callback 3 complete time callback 1 complete callback 2 complete callback 4 complete
  20. 19 Performance for Remote I/O Baseline Thread pool size =

    10 Max concurrency = 1 Thread pool size = 10 Max concurrency = 3 Sync I/O with Multithreading Single thread
  21. 20 Case Study – Notification Scheduler Processor User Chat Event

    User Action Event Connection Activity Event Restful Services Member profile database Aggregation Engine Channel Selection State store input1 input2 input3 ① Local Data Access ② Remote Database Lookup ③ Remote Service Call output
  22. 21 Agenda Intro to Stream Processing Stream Processing Ecosystem at

    LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  23. 22 Offline Jobs Requirements: • Performance and low latency •

    Resource hungry  Finite jobs can hog resources  Infinite jobs need to be better citizens • Composable API • Same app in batch and streaming  Best of both worlds • HDFS I/O
  24. 23 Low Level Logic

        public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
          private final SystemStream pageViewCounter = new SystemStream("kafka", "MemberPageViews");
          private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
          private Long windowSize;

          @Override
          public void init(Config config, TaskContext context) throws Exception {
            this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>) context.getStore("windowed-counter-store");
            this.windowSize = config.getLong("task.window.ms");
          }

          @Override
          public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
            getWindowCounterEvent().forEach(counter ->
                collector.send(new OutgoingMessageEnvelope(pageViewCounter, counter.memberId, counter)));
          }

          @Override
          public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) throws Exception {
            PageViewEvent pve = (PageViewEvent) envelope.getMessage();
            countPageViewEvent(pve);
          }
        }
  25. 24 High Level Logic

        public class RepartitionAndCounterExample implements StreamApplication {
          @Override
          public void init(StreamGraph graph, Config config) {
            MessageStream<PageViewEvent> pve =
                graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
            OutputStream<String, MyOutputType, MyOutputType> mpv =
                graph.getOutputStream("memberPageViews", m -> m.memberId, m -> m);

            pve
                .partitionBy(m -> m.memberId)
                .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), () -> 0, (m, c) -> c + 1))
                .map(MyOutputType::new)
                .sendTo(mpv);
          }
        }

    Built-in transform functions
  26. 25 High Level API - Composable Operators

    • filter: select a subset of messages from the stream
    • map: map one input message to an output message
    • flatMap: map one input message to 0 or more output messages
    • merge: union all inputs into a single output stream
    • partitionBy: re-partition the input messages based on a specific field
    • sendTo: send the result to an output stream
    • sink: send the result to an external system (e.g. external DB)
    • window: window aggregation on the input stream
    • join: join messages from two input streams

    Groups: Stateless Functions, I/O Functions, Stateful Functions
  27. 26 Batch AND Streaming

    Streaming config
        streams.pageViewEvent.system=kafka
        streams.pageViewEvent.physical.name=PageViewEvent
        streams.memberPageViews.system=kafka
        streams.memberPageViews.physical.name=MemberPageViews

    Batch config
        streams.pageViewEvent.system=hdfs
        streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/
        streams.memberPageViews.system=hdfs
        streams.memberPageViews.physical.name=hdfs://myoutputdb/MemberPageViews
  28. 27 Case Study - Unified Metrics with Samza UMP Analyst

    Pig Script “Compile” Author Generate Fluent Code + Runtime Config Deploy + +
  29. 28 Performance - HDFS • Profile count, group by country

    • 500 files • 250GB
  30. 29 Agenda Intro to Stream Processing Stream Processing Ecosystem at

    LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  31. 30 What’s Next? • SQL  Prototyped 2015  Now

    getting full time attention • High Level API extensions  Better config, I/O, windowing, and more • Beam Runner  Samza performance with Beam API • Table support
  32. 31 Questions Contact: • Email: dev@samza.apache.org • Social: http://twitter.com/jakemaes Links:

    • http://samza.apache.org • http://github.com/apache/samza • https://engineering.linkedin.com/blog
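
The "Checkpointing" slide contrasts a synchronous barrier with an asynchronous watermark. As a rough illustration of the watermark idea only, the sketch below tracks callbacks that complete out of order and reports the highest offset below which everything has finished. It is plain Java, not Samza's implementation; the names `WatermarkTracker`, `complete`, and `WatermarkDemo` are hypothetical, and the completion order in `main` is an assumed example.

```java
import java.util.TreeSet;

/** Hypothetical helper for illustration only; not part of the Samza API. */
final class WatermarkTracker {
    // Offsets whose callbacks finished but that still sit above an unfinished offset.
    private final TreeSet<Long> completedAboveGap = new TreeSet<>();
    // Highest offset such that it and every earlier offset have completed.
    private long watermark;

    WatermarkTracker(long firstOffset) {
        this.watermark = firstOffset - 1; // nothing has completed yet
    }

    /** Records one completed callback and returns the offset that is safe to checkpoint. */
    synchronized long complete(long offset) {
        if (offset > watermark) {
            completedAboveGap.add(offset);
        }
        // Advance while the next expected offset has already completed.
        while (!completedAboveGap.isEmpty() && completedAboveGap.first() == watermark + 1) {
            watermark = completedAboveGap.pollFirst();
        }
        return watermark;
    }

    synchronized long watermark() {
        return watermark;
    }
}

final class WatermarkDemo {
    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(1);
        // Assumed out-of-order completion: callback 3 first, then 1, 2, and 4.
        for (long offset : new long[] {3, 1, 2, 4}) {
            System.out.println("completed " + offset + " -> checkpoint " + tracker.complete(offset));
        }
    }
}
```

With this order the reported checkpoint is 0, 1, 3, 4: completing offset 3 early does not move the checkpoint until offsets 1 and 2 are also done, so no in-flight message is skipped on restart.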
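
The "Typical Flow" and "High Level Logic" slides describe re-partitioning page views by member ID and then counting them in five-minute tumbling windows. The following is a minimal stand-alone sketch of those two steps in plain Java, without Samza. The `PageView` class, its `timestampMs` field, and the use of event timestamps to pick a window are assumptions made for illustration; the slides do not show the fields of `PageViewEvent` or how Samza assigns messages to windows.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; names and fields are not taken from Samza. */
final class TumblingCountSketch {
    /** Stand-in for a page view event with only the fields this sketch needs. */
    static final class PageView {
        final String memberId;
        final long timestampMs;

        PageView(String memberId, long timestampMs) {
            this.memberId = memberId;
            this.timestampMs = timestampMs;
        }
    }

    /** Stage 1: choose a partition from the key so one member's events land together. */
    static int partitionFor(String memberId, int numPartitions) {
        return Math.floorMod(memberId.hashCode(), numPartitions);
    }

    /** Stage 2: count events per "windowStart/memberId" for fixed-size, non-overlapping windows. */
    static Map<String, Integer> countPerWindow(List<PageView> events, Duration windowSize) {
        long sizeMs = windowSize.toMillis();
        Map<String, Integer> counts = new TreeMap<>();
        for (PageView event : events) {
            long windowStart = (event.timestampMs / sizeMs) * sizeMs;
            counts.merge(windowStart + "/" + event.memberId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageView> events = Arrays.asList(
            new PageView("alice", 0L),
            new PageView("alice", 60_000L),
            new PageView("bob", 120_000L),
            new PageView("alice", 300_000L));
        System.out.println(countPerWindow(events, Duration.ofMinutes(5)));
        System.out.println("alice -> partition " + partitionFor("alice", 4));
    }
}
```

Keying the partition on the member ID is what lets each counter task keep its counts in local state: every event for a given member reaches the same task.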
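
The "Batch AND Streaming" slide shows the same logical stream IDs (`pageViewEvent`, `memberPageViews`) bound to Kafka in one config and to HDFS in the other. The sketch below illustrates that indirection with `java.util.Properties`: application code names only the logical ID, and the config decides the system and physical name. The class `StreamConfigSketch` and its `resolve` method are hypothetical and do not reflect how Samza itself reads configuration; the property keys and values are the ones on the slide.

```java
import java.util.Properties;

/** Hypothetical sketch of logical-to-physical stream resolution; not Samza code. */
final class StreamConfigSketch {
    static Properties streamingConfig() {
        Properties p = new Properties();
        p.setProperty("streams.pageViewEvent.system", "kafka");
        p.setProperty("streams.pageViewEvent.physical.name", "PageViewEvent");
        p.setProperty("streams.memberPageViews.system", "kafka");
        p.setProperty("streams.memberPageViews.physical.name", "MemberPageViews");
        return p;
    }

    static Properties batchConfig() {
        Properties p = new Properties();
        p.setProperty("streams.pageViewEvent.system", "hdfs");
        p.setProperty("streams.pageViewEvent.physical.name", "hdfs://mydbsnapshot/PageViewEvent/");
        p.setProperty("streams.memberPageViews.system", "hdfs");
        p.setProperty("streams.memberPageViews.physical.name", "hdfs://myoutputdb/MemberPageViews");
        return p;
    }

    /** Resolves a logical stream ID to "system:physicalName" using the given config. */
    static String resolve(Properties config, String streamId) {
        String system = config.getProperty("streams." + streamId + ".system");
        String name = config.getProperty("streams." + streamId + ".physical.name");
        if (system == null || name == null) {
            throw new IllegalArgumentException("No config for stream " + streamId);
        }
        return system + ":" + name;
    }

    public static void main(String[] args) {
        // The application refers to "pageViewEvent" in both modes; only the config differs.
        System.out.println(resolve(streamingConfig(), "pageViewEvent"));
        System.out.println(resolve(batchConfig(), "pageViewEvent"));
    }
}
```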