Implementing Stream Processing Systems
By Juan Martín Pampliega
Slide 2
Juan Pampliega
Information Engineering @ ITBA
Co-Founder @ Mutt Data
Professor @ ITBA - Especialización en Ciencia de Datos
● Working on data projects since 2010 at ITBA, Globant (Google), Despegar, Socialmetrix, Jampp, Claro, etc.
● Co-Founder @ Mutt Data, a company specialized in developing Big Data and Data Science projects.
● Developed his first production-ready stream processing system in 2015.
@juanpampliega | [email protected]
Slide 3
Contents
01 Why Stream Processing?
02 Fundamental Concepts
03 Technology Landscape
04 Summary
Slide 4
01 Why Stream Processing?
Slide 5
Event-Command vs Event-Driven
For organizations to become real-time, they must become event-driven.
In the Event-Command pattern (e.g. REST applications), the endpoint is known, the method being called is known, and the call returns a value to the caller.
In the Event-Driven pattern, services communicate only by publishing events that any service in the system can consume, which leads to looser coupling.
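A minimal, framework-free Python sketch of the contrast (the class and function names here are invented for illustration, not taken from any library): a command call targets one known service and expects a return value, while an event is simply published and any number of subscribers may react to it.

```python
# Event-Command style: the caller knows the service and expects a return value.
class BillingService:
    def charge(self, order_id: str, amount: float) -> bool:
        print(f"charging {amount} for {order_id}")
        return True

# Event-Driven style: the producer only publishes an event; it does not know
# (or care) which services will consume it.
class EventBus:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event: dict):
        for handler in self.subscribers:
            handler(event)

bus = EventBus()
bus.subscribe(lambda e: print("billing saw", e))
bus.subscribe(lambda e: print("analytics saw", e))

# Command: direct, coupled call.
BillingService().charge("order-1", 99.0)

# Event: decoupled broadcast to every interested service.
bus.publish({"type": "OrderPlaced", "order_id": "order-1", "amount": 99.0})
```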
Slide 6
Service Oriented
Slide 7
Event-Driven: Unified Log
Slide 8
Event Streams
Data is generally born as continuous event streams.
Batch processing => bounded datasets
Stream processing => unbounded datasets
Stream processing means computing on data directly as it is produced or received.
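A small illustrative Python sketch of the bounded vs. unbounded distinction (the generator and function names are made up for the example): a batch job iterates over a finite dataset and terminates, while a streaming job consumes a potentially endless iterator and updates its result as each event arrives.

```python
import itertools
import random

# Bounded dataset: the job reads everything, produces one result, and finishes.
def batch_average(values):
    return sum(values) / len(values)

print(batch_average([10, 12, 9, 11]))  # runs once over a finite input

# Unbounded dataset: an endless source of events.
def sensor_readings():
    for i in itertools.count():
        yield {"event_id": i, "value": random.uniform(0, 100)}

# The streaming job never "sees the end"; it updates its result per event.
count, total = 0, 0.0
for event in itertools.islice(sensor_readings(), 5):  # islice only to keep the demo finite
    count += 1
    total += event["value"]
    print(f"running average after event {event['event_id']}: {total / count:.2f}")
```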
Slide 9
Batch Processing
Traditional data processing stores streams of data in databases and later processes them using ETLs.
Still needed when you need to explore data and when you are not yet sure what you want to do with it.
Slide 10
Stream Processing
In stream processing the application logic, analytics, and queries exist continuously, and data flows through them continuously.
Slide 11
Unified event-driven applications and real-time analytics
To produce results in real time, a system must continuously compute and update results with each new event.
Modern applications and microservices should operate in an event-driven fashion: their logic and computation are triggered by events.
Slide 12
02 Fundamental Concepts
Slide 13
Operations over Streams
FILTER
JOIN
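A brief illustrative sketch of these two common stream operations, filter and join, in plain Python (generators stand in for streams; the helper names are invented for the example). The join shown here enriches a stream against a static lookup table keyed by user, one common streaming-join variant.

```python
# Filter: keep only the events that match a predicate.
def filter_stream(events, predicate):
    for event in events:
        if predicate(event):
            yield event

# Join: enrich each event with data from another dataset keyed by user_id.
def join_stream(events, users_by_id):
    for event in events:
        user = users_by_id.get(event["user_id"])
        if user is not None:
            yield {**event, "country": user["country"]}

clicks = iter([
    {"user_id": 1, "page": "/home"},
    {"user_id": 2, "page": "/checkout"},
    {"user_id": 1, "page": "/checkout"},
])
users = {1: {"country": "AR"}, 2: {"country": "UY"}}

checkout_clicks = filter_stream(clicks, lambda e: e["page"] == "/checkout")
for enriched in join_stream(checkout_clicks, users):
    print(enriched)
```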
Slide 14
Stateful Stream Processing
Stateful stream processing is a subset of stream processing in which the computation maintains contextual state.
This state is used to store information derived from previously-seen events.
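A minimal illustrative sketch of stateful processing in plain Python (no framework; all names are invented): the operator keeps a per-key counter derived from previously-seen events and uses it to flag the third click from the same user.

```python
from collections import defaultdict

# Contextual state: counts derived from all previously-seen events, keyed by user.
clicks_per_user = defaultdict(int)

def process(event):
    clicks_per_user[event["user_id"]] += 1
    # The output depends not only on this event but on the accumulated state.
    if clicks_per_user[event["user_id"]] == 3:
        print(f"user {event['user_id']} reached 3 clicks")

for event in [{"user_id": 1}, {"user_id": 2}, {"user_id": 1}, {"user_id": 1}]:
    process(event)
```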
Slide 15
Windows
Partition a stream of data into discrete batches.
Needed to compute metrics that require context (average by minute, partial counts, etc.).
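An illustrative plain-Python sketch of tumbling (fixed, non-overlapping) one-minute windows computing a per-minute average; the timestamps and helper names are made up for the example.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds: tumbling, non-overlapping one-minute windows

def window_start(timestamp: int) -> int:
    """Align a timestamp to the start of its one-minute window."""
    return timestamp - (timestamp % WINDOW_SIZE)

sums = defaultdict(float)
counts = defaultdict(int)

events = [
    {"ts": 0, "value": 10.0},
    {"ts": 30, "value": 20.0},
    {"ts": 65, "value": 5.0},   # falls into the next window
    {"ts": 90, "value": 15.0},
]

for event in events:
    w = window_start(event["ts"])
    sums[w] += event["value"]
    counts[w] += 1

for w in sorted(sums):
    print(f"window [{w}, {w + WINDOW_SIZE}): average = {sums[w] / counts[w]:.2f}")
```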
Slide 16
Event Time vs Processing Time
Event time is when an event actually occurred; processing time is when the system observes it. The skew between them is caused by:
● Processing and transport technologies.
● Characteristics of the data source (distribution, throughput, burstiness, etc.).
● Hardware (network, RAM, CPUs, etc.).
● FAILURES & RETRIES
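A small plain-Python sketch (field names invented) that makes the distinction concrete: each event carries the timestamp at which it occurred (event time), while the consumer records the wall-clock time at which it handles the event (processing time); transport delays, failures, and retries make the two diverge.

```python
import time

def handle(event):
    processing_time = time.time()          # when the system sees the event
    event_time = event["event_time"]       # when the event actually happened
    lag = processing_time - event_time
    print(f"event {event['id']}: lag between event time and processing time = {lag:.1f}s")

now = time.time()
events = [
    {"id": 1, "event_time": now - 2},    # arrived quickly
    {"id": 2, "event_time": now - 120},  # delayed, e.g. a retried mobile upload
]
for e in events:
    handle(e)
```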
Slide 17
Event Time vs Processing Time
Slide 18
Operations over Streams
PROCESSING TIME
EVENT TIME
Slide 19
Streaming State and Windows
Slide 20
Fault Tolerant Local State
Local state is a fundamental primitive in stream processing.
It can be indexed and accessed in a variety of rich ways.
Local, in-process data access is much faster.
It’s easier to isolate.
Implemented with in-memory hash tables, bloom filters, bitmaps, RocksDB-like systems, etc.
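A minimal plain-Python sketch (the file path and function names are invented) of fault-tolerant local state: the operator keeps counts in an in-memory dict for fast, in-process access and periodically checkpoints them to durable storage so they can be restored after a failure.

```python
import json
import os

CHECKPOINT_PATH = "state_checkpoint.json"  # stand-in for durable storage

def restore_state() -> dict:
    """Recover local state from the last checkpoint after a restart."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {}

def checkpoint_state(state: dict) -> None:
    """Persist the in-memory state so a failure loses at most one interval."""
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump(state, f)

state = restore_state()  # fast, local, in-process access from here on

events = [{"key": "a"}, {"key": "b"}, {"key": "a"}]
for i, event in enumerate(events, start=1):
    state[event["key"]] = state.get(event["key"], 0) + 1
    if i % 2 == 0:          # checkpoint every 2 events in this toy example
        checkpoint_state(state)

print(state)
```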
Slide 21
Critical Questions in Stream Processing
Tyler Akidau (tech lead for internal streaming data processing systems @ Google) defined 4 critical questions any stream processing system should be able to answer:
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
Slide 22
Watermarks and Triggers
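As a rough illustration of the concept (plain Python, all names and values invented): a watermark estimates how far event time has progressed, here as the maximum event time seen minus an allowed lateness, and a window's trigger fires once the watermark passes the end of that window, materializing its result.

```python
MAX_LATENESS = 10   # seconds of out-of-orderness we tolerate
WINDOW_END = 60     # end of the [0, 60) event-time window being aggregated

watermark = 0.0
window_sum = 0.0
fired = False

events = [  # (event_time, value), slightly out of order
    (12, 1.0), (45, 2.0), (38, 3.0), (71, 4.0), (80, 5.0),
]

for event_time, value in events:
    if event_time < WINDOW_END:
        window_sum += value  # event belongs to the window being aggregated

    # Watermark: our estimate of event-time progress.
    watermark = max(watermark, event_time - MAX_LATENESS)

    # Trigger: emit the window result once the watermark passes the window end.
    if not fired and watermark >= WINDOW_END:
        print(f"window [0, {WINDOW_END}) result: {window_sum}")
        fired = True
```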
Slide 23
03 Technology Landscape
Slide 24
Ververica Platform
Slide 25
Apache Flink
Slide 26
Companies that use Apache Flink
Slide 27
Apache Beam
[Diagram: Beam Model pipeline construction with the Beam Java, Beam Python, and other language SDKs, executed by Fn runners such as Apache Flink, Apache Spark, and Cloud Dataflow]
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines -- starting with Java
3. Runners for Existing Distributed Processing Backends
○ Apache Apex
○ Apache Flink
○ Apache Spark
○ Google Cloud Dataflow
○ Local (in-process) runner for testing
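A minimal Beam pipeline sketch using the Python SDK and the local (in-process) runner mentioned above (the element values are made up): it expresses the computation with the Beam model's transforms, and the same pipeline code could run on Flink, Spark, or Cloud Dataflow by selecting a different runner in the pipeline options.

```python
import apache_beam as beam

# Runs on the local DirectRunner by default; other runners can be selected
# via pipeline options without changing the pipeline code.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create events" >> beam.Create([1, 2, 3, 4, 5])
        | "Keep evens" >> beam.Filter(lambda x: x % 2 == 0)
        | "Square" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )
```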
Slide 28
Google Cloud Dataflow
Slide 29
Spark Streaming Platform
Slide 30
Confluent Platform
Slide 31
04 Resources
Slide 32
Ben Stopford, The Data Dichotomy: Rethinking the Way We Treat Data and Services
https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
Tyler Akidau, Streaming 101: The world beyond batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Tyler Akidau, Streaming 102: The world beyond batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Ververica, What is Stream Processing?
https://www.ververica.com/what-is-stream-processing
Apache Flink Documentation
https://ci.apache.org/projects/flink/flink-docs-release-1.8/
Apache Beam
https://beam.apache.org/
Slide 33
Thank you!
Implementing Stream Processing Systems
By Juan Martín Pampliega
We are hiring Data Engineers!
[email protected]