Slide 1

Implementing Stream Processing Systems
By Juan Martín Pampliega

Slide 2

Juan Pampliega
Information Engineering @ ITBA | Co-Founder @ Mutt Data | Professor @ ITBA - Especialización en Ciencia de Datos
● Working on data projects since 2010 at ITBA, Globant (Google), Despegar, Socialmetrix, Jampp, Claro, etc.
● Co-Founder @ Mutt Data, a company specialized in developing projects using Big Data and Data Science.
● Developed his first production-ready stream processing system in 2015.
@juanpampliega | [email protected]

Slide 3

Contents: Implementing Stream Processing Systems
01 Why Stream Processing?
02 Fundamental Concepts
03 Technology Landscape
04 Resources

Slide 4

01 Why Stream Processing?

Slide 5

Event-Command vs Event-Driven
For organizations to become real time, they must become event driven.
In the Event-Command pattern (e.g., REST applications), the endpoint is known, the method being called is known, and the call tends to return a value.
In the Event-Driven pattern, services communicate only by producing events that any other service in the system can consume, which leads to looser coupling.
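
As an illustration (not from the slides), here is a minimal plain-Python sketch contrasting the two patterns. The event log list stands in for a durable log such as Kafka, and all names are hypothetical:

```python
# Event-Command style: the caller knows the target service and expects a reply.
def place_order_command(billing_service, order):
    return billing_service.charge(order)  # tight coupling to one endpoint

# Event-Driven style: the producer only appends an event; it does not know
# (or care) which services will react to it.
event_log = []  # stand-in for a durable log such as Kafka

def place_order_event(order):
    event_log.append({"type": "OrderPlaced", "payload": order})

# Any number of independent consumers can later read the same event.
def billing_consumer(event):
    if event["type"] == "OrderPlaced":
        print("billing:", event["payload"])

def analytics_consumer(event):
    if event["type"] == "OrderPlaced":
        print("analytics:", event["payload"])

place_order_event({"order_id": 1, "amount": 42})
for ev in event_log:
    billing_consumer(ev)
    analytics_consumer(ev)
```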

Slide 6

Service Oriented

Slide 7

Event-Driven: Unified Log

Slide 8

Event Streams
Data is generally born as continuous event streams.
Batch processing => bounded datasets
Stream processing => unbounded datasets
Stream processing means computing on data directly as it is produced or received.
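
A minimal Python sketch of the bounded vs. unbounded distinction (illustrative only; the sensor_stream generator is a made-up stand-in for a real source):

```python
import itertools
import random
import time

# Bounded dataset: a finite collection we can iterate to completion.
bounded = [random.random() for _ in range(5)]
print("batch sum:", sum(bounded))

# Unbounded dataset: a generator that never ends; results must be
# produced incrementally, as each event arrives.
def sensor_stream():
    while True:
        yield random.random()
        time.sleep(0.1)

running_sum = 0.0
for value in itertools.islice(sensor_stream(), 5):  # islice only to keep the demo finite
    running_sum += value
    print("running sum so far:", running_sum)
```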

Slide 9

Batch Processing
Traditional data processing: streams of data are stored in databases and later processed using ETLs.
Still needed when you want to explore data and are not yet sure what you want to do with it.

Slide 10

Stream Processing In stream processing the application logic, analytics, and queries exist continuously, and data flows through them continuously.

Slide 11

Unified event-driven applications and real-time analytics
To produce results in real time, a system must continuously compute and update results with each new event.
Modern applications and microservices should operate in an event-driven fashion: their logic and computation are triggered by events.

Slide 12

02 Fundamental Concepts

Slide 13

Operations over Streams: FILTER, JOIN
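
For example, a filter and a key-based join over two toy streams might look like this in plain Python (the click/purchase event shapes are assumed for illustration):

```python
clicks = [
    {"user_id": 1, "page": "/home"},
    {"user_id": 2, "page": "/pricing"},
    {"user_id": 1, "page": "/checkout"},
]
purchases = [
    {"user_id": 1, "amount": 30},
]

# FILTER: keep only the events that match a predicate.
checkout_clicks = (c for c in clicks if c["page"] == "/checkout")

# JOIN: enrich one stream with state built from another, keyed by user_id.
purchases_by_user = {}
for p in purchases:
    purchases_by_user[p["user_id"]] = p

for click in checkout_clicks:
    purchase = purchases_by_user.get(click["user_id"])
    print(click, "joined with", purchase)
```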

Slide 14

Stateful Stream Processing
Stateful stream processing is a subset of stream processing in which the computation maintains contextual state. This state is used to store information derived from previously seen events.
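
A sketch of what that state looks like in the simplest case: a per-key running count kept between events (plain Python, no real engine; a framework such as Flink would additionally make this state fault tolerant):

```python
from collections import defaultdict

# The "state" is a per-key counter that survives from one event to the next.
counts_per_user = defaultdict(int)

def process(event):
    counts_per_user[event["user_id"]] += 1                       # update state
    return event["user_id"], counts_per_user[event["user_id"]]   # emit derived result

events = [{"user_id": "a"}, {"user_id": "b"}, {"user_id": "a"}]
for ev in events:
    print(process(ev))   # ('a', 1) ('b', 1) ('a', 2)
```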

Slide 15

Windows
Windows partition a stream of data into discrete, finite buckets. They are needed to compute metrics that require context (per-minute averages, partial counts, etc.).
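
For instance, a tumbling one-minute window that buckets events by timestamp and averages each bucket could be sketched as follows (illustrative Python, not a framework API):

```python
# Tumbling (non-overlapping) one-minute windows keyed by the event timestamp.
WINDOW_SIZE = 60  # seconds

def window_start(timestamp):
    return timestamp - (timestamp % WINDOW_SIZE)

windows = {}  # window start -> list of values

events = [(0, 10.0), (30, 20.0), (65, 5.0), (100, 7.0)]  # (timestamp, value)
for ts, value in events:
    windows.setdefault(window_start(ts), []).append(value)

for start, values in sorted(windows.items()):
    print(f"window [{start}, {start + WINDOW_SIZE}): avg = {sum(values) / len(values)}")
```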

Slide 16

Event Time vs Processing Time
The lag between event time and processing time varies due to:
● Processing and transport technologies
● Characteristics of the data source (distribution, throughput, burstiness, etc.)
● Hardware (network, RAM, CPUs, etc.)
● Failures and retries
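
A tiny sketch of the difference, assuming events carry their own event_time field (a hypothetical name): event time is read from the record itself, processing time from the wall clock when the operator finally sees it:

```python
import time

def handle(event):
    event_time = event["event_time"]      # when the event actually happened
    processing_time = time.time()         # when we got around to processing it
    lag = processing_time - event_time
    print(f"event lagged {lag:.1f}s behind processing time")

# An event that took 42 seconds to reach us (e.g., after retries).
handle({"event_time": time.time() - 42, "payload": "late click"})
```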

Slide 17

Event Time vs Processing Time

Slide 18

Operations over Streams: Processing Time vs. Event Time

Slide 19

Streaming State and Windows

Slide 20

Fault-Tolerant Local State
Local state is a fundamental primitive in stream processing. It can be indexed and accessed in a variety of rich ways. Local, in-process data access is much faster and easier to isolate.
It is typically implemented with in-memory hash tables, Bloom filters, bitmaps, embedded stores like RocksDB, etc.
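
As a rough illustration of the idea (not how any specific engine does it), here is local state kept in an in-process dict and made recoverable by writing periodic snapshots; the file name and snapshot cadence are arbitrary:

```python
import json

STATE_FILE = "state_checkpoint.json"  # hypothetical snapshot location
state = {}

def process(event):
    # Local, in-process state update: fast, no network round trip.
    state[event["key"]] = state.get(event["key"], 0) + event["value"]

def checkpoint():
    # Persist a snapshot so state can be rebuilt after a failure.
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def restore():
    global state
    try:
        with open(STATE_FILE) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}

restore()
for i, ev in enumerate([{"key": "a", "value": 1}, {"key": "a", "value": 2}]):
    process(ev)
    if i % 100 == 0:   # checkpoint every N events in this toy example
        checkpoint()
print(state)
```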

Slide 21

Critical Questions in Stream Processing
Tyler Akidau (tech lead for internal streaming data processing systems @ Google) defined 4 critical questions any stream processing system should be able to answer:
● What results are calculated?
● Where in event time are results calculated?
● When in processing time are results materialized?
● How do refinements of results relate?
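
A toy sketch of how the Where/When questions interact, assuming a fixed allowed lateness: a per-window count (what) is assigned by event time (where) and emitted only once a simple watermark passes the end of the window (when). This is illustrative Python, not Beam or Flink API:

```python
# Toy watermark: assume events arrive at most 10 seconds late, so the
# watermark trails the maximum event time seen by that allowance.
ALLOWED_LATENESS = 10
WINDOW_SIZE = 60

windows = {}        # window start -> count (the "what": a per-window count)
max_event_time = 0
emitted = set()

def on_event(event_time):
    global max_event_time
    start = event_time - (event_time % WINDOW_SIZE)      # "where" in event time
    windows[start] = windows.get(start, 0) + 1
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    # "when" in processing time: trigger a window once the watermark passes its end.
    for w_start in list(windows):
        if w_start + WINDOW_SIZE <= watermark and w_start not in emitted:
            print(f"window [{w_start}, {w_start + WINDOW_SIZE}): count={windows[w_start]}")
            emitted.add(w_start)

for t in [5, 20, 61, 75, 130]:
    on_event(t)
```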

Slide 22

Watermarks and Triggers

Slide 23

03 Technology Landscape

Slide 24

Ververica Platform

Slide 25

Apache Flink

Slide 26

Companies that use Apache Flink

Slide 27

Apache Beam
[Diagram: the Beam model. Pipelines are constructed with the Beam Java or Beam Python SDKs (and other languages) and executed via Fn runners on Apache Flink, Apache Spark, or Cloud Dataflow.]
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines, starting with Java
3. Runners for existing distributed processing backends:
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local (in-process) runner for testing
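
For reference, a minimal word-count pipeline with the Beam Python SDK; it runs on the local runner by default, and the same code can be submitted to Flink, Spark, or Cloud Dataflow by choosing a different runner (assumes apache-beam is installed):

```python
import apache_beam as beam

lines = ["stream processing with beam", "beam runs on flink and dataflow"]

# Pipeline() without options defaults to the local (DirectRunner) for testing.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(lines)                       # bounded toy source
        | "Split" >> beam.FlatMap(lambda line: line.split())   # one element per word
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```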

Slide 28

Google Cloud Dataflow

Slide 29

Spark Streaming Platform

Slide 30

Confluent Platform

Slide 31

04 Resources

Slide 32

Ben Stopford, The Data Dichotomy: Rethinking the Way We Treat Data and Services
https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
Tyler Akidau, Streaming 101: The world beyond batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Tyler Akidau, Streaming 102: The world beyond batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Ververica, What is Stream Processing?
https://www.ververica.com/what-is-stream-processing
Apache Flink Documentation
https://ci.apache.org/projects/flink/flink-docs-release-1.8/
Apache Beam
https://beam.apache.org/

Slide 33

Thank you!
Implementing Stream Processing Systems
By Juan Martín Pampliega
We are hiring Data Engineers! [email protected]