Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data Spain 2017

1 Unified Processing at Scale with Apache Samza
Jake Maes, Staff SW Engineer at LinkedIn, Apache Samza PMC

2 About Me
• Apache Samza PMC member
• LinkedIn 3 years
• 8 years performance & infra development
• Passionate about scale
• Long walks on the peaks

3 Agenda
• Intro to Stream Processing
• Stream Processing Ecosystem at LinkedIn
• Use Case: Pre-Existing Service
• Use Case: Batch and Streaming
• Future

5 About Apache Samza
• Production at LinkedIn since 2014
• Apache top level project since 2014
• 16 Committers
• 74 Contributors
• Known for
  - Scale
  - Pluggability
  - Kafka integration

6 Traditional Stream Processing
• Low latency
• One message at a time
• Checkpointing, state, durability
• All I/O with high-performance message brokers

7 Stateful Processing Task Task0 State0 Changelog Stream (partition 0)
Checkpoint Stream Processor Output Streams Input Streams (partition 0)

8 Co-Partitioned Streams

9 Typical Flow - Two Stages Minimum
[Diagram labels: Repartition; window; map; sendTo; streams PageViewEvent, PageViewEventByMemberId, PageViewEventPerMemberStream; tasks PageViewRepartitionTask, PageViewByMemberIdCounterTask]
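
The two stages named on this slide (repartition by member, then count per member) can be sketched in plain Java without the Samza API. The class and method names (RepartitionSketch, partitionFor, repartition, countPerMember) are hypothetical, and the hash-modulo partitioner is an assumption chosen for illustration, not necessarily what Kafka or Samza uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: route page views to partitions by memberId so that all
// events for one member land in the same partition and one task can count them.
class RepartitionSketch {

    // Assumption: a simple hash-modulo partitioner, used only for illustration.
    static int partitionFor(String memberId, int numPartitions) {
        return Math.floorMod(memberId.hashCode(), numPartitions);
    }

    // Stage 1: repartition the incoming member ids by key.
    static Map<Integer, List<String>> repartition(List<String> memberIds, int numPartitions) {
        Map<Integer, List<String>> partitions = new HashMap<>();
        for (String memberId : memberIds) {
            partitions
                .computeIfAbsent(partitionFor(memberId, numPartitions), p -> new ArrayList<>())
                .add(memberId);
        }
        return partitions;
    }

    // Stage 2: count page views per member within a single partition.
    static Map<String, Integer> countPerMember(List<String> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String memberId : partition) {
            counts.merge(memberId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> pageViews = List.of("alice", "bob", "alice", "carol", "alice");
        Map<Integer, List<String>> partitions = repartition(pageViews, 4);
        partitions.forEach((p, events) ->
            System.out.println("partition " + p + " -> " + countPerMember(events)));
    }
}
```

Because every event for a given member hashes to the same partition, the per-partition counts are complete without any cross-task coordination, which is why the flow needs the repartition stage before the counter.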

11 Stream Processing Ecosystem – The Dream
[Diagram labels: Applications and Services; Samza; Kafka; Storage; External Streams; Storage & Serving; Brooklin]

12 Stream Processing Ecosystem - Reality
[Diagram labels: Applications and Services; Samza; Kafka; Storage; External Streams; Storage & Serving; Brooklin]

13 Expansion of Stream Processing at LinkedIn
• Influx of new applications
  - 10 -> over 200
• New use cases
  - Batch and streaming
  - Remote I/O
  - Composable API
• Incoming applications have different expectations
• Let’s take a look at two Services

15 Online Service + Stream Processing
Requirements:
• Deployment model
  - Cluster environment not suitable
• Remote I/O
  - Dependencies on other services
  - I/O latency stalls single threaded processor
  - Container parallelism - too much overhead
[Diagram label: Services]

16 Embedded Samza
• Zookeeper-based JobCoordinator
  - Uses Zookeeper for leader election
  - Leader assigns work to the processors
[Diagram labels: ZooKeeper; three App Instances, each with a Stream Processor, a Samza Container, and a Job Coordinator; * marks the leader]
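
The coordination model on this slide (elect a leader, then let the leader assign work) can be sketched without ZooKeeper. Everything below is hypothetical: the class CoordinatorSketch, the processor ids, the rule that the smallest id wins (which imitates the common ZooKeeper recipe where the lowest sequence node leads), and the round-robin assignment. The slide does not say how Samza's JobCoordinator actually distributes tasks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "leader assigns work to the processors".
class CoordinatorSketch {

    // Assumption: the processor with the smallest id becomes the leader.
    static String electLeader(List<String> processorIds) {
        return Collections.min(processorIds);
    }

    // Assumption: the leader spreads task ids over the processors round-robin.
    static Map<String, List<Integer>> assignTasks(List<String> processorIds, int numTasks) {
        List<String> sorted = new ArrayList<>(processorIds);
        Collections.sort(sorted);
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String id : sorted) {
            assignment.put(id, new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            assignment.get(sorted.get(task % sorted.size())).add(task);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> processors = List.of("p-0002", "p-0001", "p-0003");
        System.out.println("leader = " + electLeader(processors));
        // prints: leader = p-0001
        System.out.println(assignTasks(processors, 5));
        // prints: {p-0001=[0, 3], p-0002=[1, 4], p-0003=[2]}
    }
}
```

When a processor joins or leaves, the same two steps would run again on the new membership list, which is the part ZooKeeper's watches make practical in a real deployment.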

17 Asynchronous Event Loop
[Diagram labels: Stream Processor; Event Loop (Single thread; 1 : Task; n : Task); Restful Services; Java NIO, Netty]
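
A minimal sketch of the idea behind this slide: one loop thread hands remote calls to an I/O pool and keeps going, blocking only when a concurrency limit is reached. The names (AsyncLoopSketch, run, simulateRemoteCall) and the semaphore-based limit are assumptions for illustration and do not reproduce Samza's internal event loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a single dispatching thread with bounded in-flight remote calls.
class AsyncLoopSketch {

    // Stand-in for a remote service call with a few milliseconds of latency.
    static void simulateRemoteCall() {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Dispatches numMessages calls and returns the highest number seen in flight at once.
    static int run(int numMessages, int maxConcurrency) {
        ExecutorService ioPool = Executors.newFixedThreadPool(10);
        Semaphore permits = new Semaphore(maxConcurrency);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < numMessages; i++) {
                // The loop thread waits only when maxConcurrency calls are outstanding.
                permits.acquireUninterruptibly();
                pending.add(CompletableFuture.runAsync(() -> {
                    maxSeen.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                    simulateRemoteCall();
                    inFlight.decrementAndGet();
                }, ioPool).whenComplete((result, error) -> permits.release()));
            }
            // Wait for every outstanding callback before returning.
            CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        } finally {
            ioPool.shutdown();
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        System.out.println("max in flight = " + run(20, 3));
    }
}
```

With a limit of 1 the loop behaves like a synchronous processor that stalls on every call; raising the limit lets remote latency overlap, which is the effect the following performance slide compares.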

18 Checkpointing
• Sync – Barrier
• Async - Watermark
[Timeline labels: t1, t2, t3, tc, t4 on the time axis; checkpoint; callback 1 complete; callback 2 complete; callback 3 complete; callback 4 complete]
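
One plausible reading of the "Async - Watermark" bullet is that, when callbacks finish out of order, a checkpoint may only cover the longest prefix of messages whose callbacks have all completed. The sketch below assumes that reading. The class WatermarkSketch, offsets starting at 1, and the completion order 3, 1, 2, 4 are hypothetical choices for illustration.

```java
import java.util.TreeSet;

// Hypothetical sketch: track the highest offset that is safe to checkpoint
// when completion callbacks arrive out of order.
class WatermarkSketch {
    // Every offset <= watermark has completed (offsets are assumed to start at 1).
    private long watermark = 0;
    // Offsets that completed ahead of a still-pending earlier offset.
    private final TreeSet<Long> completedAhead = new TreeSet<>();

    void onComplete(long offset) {
        completedAhead.add(offset);
        // Advance over the contiguous run of completed offsets, if any.
        while (!completedAhead.isEmpty() && completedAhead.first() == watermark + 1) {
            watermark = completedAhead.pollFirst();
        }
    }

    long safeCheckpoint() {
        return watermark;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch();
        long[] completionOrder = {3, 1, 2, 4}; // assumed out-of-order completion
        for (long offset : completionOrder) {
            w.onComplete(offset);
            System.out.println("completed " + offset + " -> safe checkpoint " + w.safeCheckpoint());
        }
        // prints:
        // completed 3 -> safe checkpoint 0
        // completed 1 -> safe checkpoint 1
        // completed 2 -> safe checkpoint 3
        // completed 4 -> safe checkpoint 4
    }
}
```

A barrier-style (sync) checkpoint would instead stop dispatching and wait for all outstanding callbacks, so it trades throughput for a simpler guarantee.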

19 Performance for Remote I/O
[Chart labels: Baseline; Single thread; Sync I/O with Multithreading; Thread pool size = 10, Max concurrency = 1; Thread pool size = 10, Max concurrency = 3]

20 Case Study – Notification Scheduler
[Diagram labels: Processor; User Chat Event; User Action Event; Connection Activity Event; Restful Services; Member profile database; Aggregation Engine; Channel Selection; State store; input1; input2; input3; output; ① Local Data Access; ② Remote Database Lookup; ③ Remote Service Call]

22 Offline Jobs
Requirements:
• Performance and low latency
• Resource hungry
  - Finite jobs can hog resources
  - Infinite jobs need to be better citizens
• Composable API
• Same app in batch and streaming
  - Best of both worlds
• HDFS I/O

23 Low Level Logic

public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
    private final SystemStream pageViewCounter = new SystemStream("kafka", "MemberPageViews");
    private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
    private Long windowSize;

    @Override
    public void init(Config config, TaskContext context) throws Exception {
        this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>)
            context.getStore("windowed-counter-store");
        this.windowSize = config.getLong("task.window.ms");
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
        getWindowCounterEvent().forEach(counter ->
            collector.send(new OutgoingMessageEnvelope(pageViewCounter, counter.memberId, counter)));
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) throws Exception {
        PageViewEvent pve = (PageViewEvent) envelope.getMessage();
        countPageViewEvent(pve);
    }
}

24 High Level Logic

public class RepartitionAndCounterExample implements StreamApplication {
    @Override
    public void init(StreamGraph graph, Config config) {
        MessageStream<PageViewEvent> pve =
            graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
        OutputStream<String, MyOutputType, MyOutputType> mpv =
            graph.getOutputStream("memberPageViews", m -> m.memberId, m -> m);

        pve
            .partitionBy(m -> m.memberId)
            .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
                () -> 0, (m, c) -> c + 1))
            .map(MyOutputType::new)
            .sendTo(mpv);
    }
}

[Callout: Built-in transform functions]
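
To make the keyed tumbling window concrete, here is a minimal plain-Java sketch of the same count, with an initial value of 0 and a fold that adds 1 per message. It does not use the Samza API. The PageView class, its timestampMs field, and the use of event timestamps to pick the window are assumptions; the slide does not show the event type or say which notion of time the window uses.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count page views per member per 5-minute tumbling window.
class TumblingWindowSketch {

    // Hypothetical event type; the real PageViewEvent is not shown on the slides.
    static final class PageView {
        final String memberId;
        final long timestampMs;

        PageView(String memberId, long timestampMs) {
            this.memberId = memberId;
            this.timestampMs = timestampMs;
        }
    }

    // Returns counts keyed by "memberId@windowStartMs".
    static Map<String, Integer> countPerWindow(List<PageView> events, Duration windowSize) {
        long sizeMs = windowSize.toMillis();
        Map<String, Integer> counts = new TreeMap<>();
        for (PageView e : events) {
            // Tumbling windows do not overlap: each timestamp maps to exactly one window.
            long windowStart = e.timestampMs - (e.timestampMs % sizeMs);
            counts.merge(e.memberId + "@" + windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageView> events = List.of(
            new PageView("alice", 0L),
            new PageView("alice", 60_000L),
            new PageView("bob", 120_000L),
            new PageView("alice", 300_000L));
        System.out.println(countPerWindow(events, Duration.ofMinutes(5)));
        // prints: {alice@0=2, alice@300000=1, bob@0=1}
    }
}
```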

25 High Level API - Composable Operators
filter        select a subset of messages from the stream
map           map one input message to an output message
flatMap       map one input message to 0 or more output messages
merge         union all inputs into a single output stream
partitionBy   re-partition the input messages based on a specific field
sendTo        send the result to an output stream
sink          send the result to an external system (e.g. external DB)
window        window aggregation on the input stream
join          join messages from two input streams
[Operator groups on the slide: Stateless Functions; I/O Functions; Stateful Functions]

26 Batch AND Streaming

Streaming config:
streams.pageViewEvent.system=kafka
streams.pageViewEvent.physical.name=PageViewEvent
streams.memberPageViews.system=kafka
streams.memberPageViews.physical.name=MemberPageViews

Batch config:
streams.pageViewEvent.system=hdfs
streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/
streams.memberPageViews.system=hdfs
streams.memberPageViews.physical.name=hdfs://myoutputdb/MemberPageViews
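
The point of the two configs is that the application refers only to logical stream ids (pageViewEvent, memberPageViews), and the config decides where they live. A minimal sketch of that lookup, using java.util.Properties rather than Samza's own Config class, follows. The class StreamBindingSketch and the resolve method are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: the same logical stream id resolves to Kafka or HDFS
// depending only on which config is loaded.
class StreamBindingSketch {

    static final String STREAMING_CONFIG = String.join("\n",
        "streams.pageViewEvent.system=kafka",
        "streams.pageViewEvent.physical.name=PageViewEvent",
        "streams.memberPageViews.system=kafka",
        "streams.memberPageViews.physical.name=MemberPageViews");

    static final String BATCH_CONFIG = String.join("\n",
        "streams.pageViewEvent.system=hdfs",
        "streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/",
        "streams.memberPageViews.system=hdfs",
        "streams.memberPageViews.physical.name=hdfs://myoutputdb/MemberPageViews");

    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Resolves a logical stream id to "system:physicalName".
    static String resolve(Properties config, String streamId) {
        String system = config.getProperty("streams." + streamId + ".system");
        String physicalName = config.getProperty("streams." + streamId + ".physical.name");
        return system + ":" + physicalName;
    }

    public static void main(String[] args) {
        System.out.println(resolve(parse(STREAMING_CONFIG), "pageViewEvent"));
        // prints: kafka:PageViewEvent
        System.out.println(resolve(parse(BATCH_CONFIG), "pageViewEvent"));
        // prints: hdfs:hdfs://mydbsnapshot/PageViewEvent/
    }
}
```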

27 Case Study - Unified Metrics with Samza
[Diagram labels: UMP Analyst; Pig Script; “Compile”; Author; Generate; Fluent Code + Runtime Config; Deploy]

28 Performance - HDFS
• Profile count, group by country
• 500 files
• 250GB

30 What’s Next?
• SQL
  - Prototyped 2015
  - Now getting full time attention
• High Level API extensions
  - Better config, I/O, windowing, and more
• Beam Runner
  - Samza performance with Beam API
• Table support

31 Questions
Contact:
• Email: [email protected]
• Social: http://twitter.com/jakemaes
Links:
• http://samza.apache.org
• http://github.com/apache/samza
• https://engineering.linkedin.com/blog
