Stream Processing for the Serverless Generation
Ben Stopford
Office of the CTO, Confluent
Slide 2
Slide 2 text
When it comes to
data, we tend to
think in databases
Slide 3
Slide 3 text
Increasing Complexity
[Diagram: a growing number of apps, search, NoSQL, monitoring, security, DWH and Hadoop systems, connected through streaming platforms]
Kafka
Evolution of software systems
Monolith, Distributed Monolith, Microservices, Event-Driven Microservices
Slide 4
Slide 4 text
[Diagram: many backend apps, alongside search, NoSQL, monitoring, security, DWH and Hadoop, connected through streaming platforms]
Backend services grow faster than the front end
Tightly coupled
Loosely coupled
e.g. Netflix have ~400 backend microservices fed by Kafka
Slide 5
Slide 5 text
In the serverless world, which is inherently
event driven, stream processors will
become as important as databases are
today.
Slide 6
Slide 6 text
[Diagram: producer and consumer systems (apps, search, NoSQL, monitoring, DWH, Hadoop) connected through a streaming platform]
Streaming Platform
Slide 7
Slide 7 text
Event Storage: Kafka stores petabytes of data
Stream Processing: real-time processing over streams and tables
Scalability: clusters of hundreds of machines. Global.
Roots in big data messaging
Slide 8
Slide 8 text
Are Serverless Functions and
Stream Processors related?
They are both functions we define that are triggered by
streams of events.
Slide 9
Slide 9 text
FaaS in brief
• Write a function
• Upload
• Configure a trigger (HTTP, Event, Object Store, Database, Timer etc.)
Slide 11
Slide 11 text
FaaS in a Nutshell
• Fully managed (Runs in a container)
• Pay as you use
• Auto-scales with load ~ 0-1000 concurrent functions
• Short lived (max ~5 mins)
• Weak ordering guarantees
• Cold starts can be slow: 100ms – 45s (AWS 250ms-7s)
Slide 12
Slide 12 text
Where is FaaS useful?
• Interesting for spiky workloads (i.e. extremes)
• Grid compute: HPC, Genomics, Finance
• Interesting for use cases that wouldn’t typically warrant the cost of conventional massive parallelism, e.g. CI systems
• Serverless programming model
Slide 13
Slide 13 text
But there are open questions
Slide 14
Slide 14 text
Serverless Developer Ecosystem
• Runtime diagnostics
• Monitoring
• Deploy loop
• Testing
• IDE integration
Currently quite poor
Slide 15
Slide 15 text
[Chart: Amazon, Google and Microsoft placed on a scale from “Harder than current approaches” to “Easier than current approaches”]
Slide 16
Slide 16 text
We’ll come back to this one!
Slide 17
Slide 17 text
FaaS is event-driven
But it isn’t streaming
Slide 18
Slide 18 text
Simple online retail example
When the order is created
...and the payment has been completed
=> get the customer’s info and send them an email confirming the purchase.
Slide 19
Slide 19 text
Serverless Way: event-driven (not streaming)
[Diagram: Orders, Customers and Payments, each handled by a FaaS function, with lookups against all customer data and all payment data]
- Too slow for high velocity use cases ~ 5-10 messages per second
- Correctness: what if the payment isn’t there when the order arrives?
Slide 20
Slide 20 text
[Diagram: the Orders and Payments streams and a Customers table, joined by KStreams inside one process boundary]
Event-Streaming Platforms sew these operations together
Stateful or Stateless
• No network calls.
• 50,000-100,000 messages per second, per thread
• Better correctness
Slide 21
Slide 21 text
Event Driven vs Stream Processing
Slide 22
Slide 22 text
Stream processors can be considered the
databases of the event driven world
Slide 23
Slide 23 text
A little detail
Slide 24
Slide 24 text
Three key features
• Stream-stream join (combine in real-time)
  • Unlike a database join, you only need to consider how late data might be
• Stream-table join (enrich)
  • More like a database join, on one side only
• Aggregate (summarize)
  • Big data sets are too large
Slide 25
Slide 25 text
Join events that happened recently
Stream-Stream Join
Slide 26
Slide 26 text
KSQL
Joining two streams
orders.join(payments)
[Diagram: the Orders and Payments streams, carrying Bob’s order, Bob’s payment, Jill’s payment and Jill’s order]
Slide 27
Slide 27 text
KSQL
Join is on the key (messages have keys in Kafka)
orders.join(payments)
[Diagram: Bob’s order, Bob’s payment, Jill’s payment, Jill’s order]
Slide 28
Slide 28 text
KSQL
Joining two streams: a streaming system doesn’t know when data is going to arrive
orders.join(payments)
[Diagram: Bob’s order, Bob’s payment, Jill’s payment, Jill’s order]
Slide 29
Slide 29 text
KSQL
Bob’s payment arrives – nothing to join with
orders.join(payments)
[Diagram: Bob’s payment, Bob’s order, Jill’s payment, Jill’s order]
Slide 30
Slide 30 text
KSQL
Message gets buffered
Key-value store
[Diagram: Bob’s order, Bob’s payment, Jill’s payment, Jill’s order]
Slide 31
Slide 31 text
KSQL
Jill’s order arrives and gets buffered
Key-value store
[Diagram: Bob’s order, Jill’s payment, Jill’s order, Bob’s payment]
Slide 32
Slide 32 text
KSQL
Another non-matching record is buffered
Key-value store
[Diagram: Bob’s order, Jill’s payment, Jill’s order, Bob’s payment]
Slide 33
Slide 33 text
KSQL
MATCH - based on key comparison
[Diagram: Bob’s order, Jill’s payment, Jill’s order, Bob’s payment]
Slide 34
Slide 34 text
KSQL
MATCH – Create output event
[Diagram: Bob’s order, Jill’s payment, Jill’s order, Bob’s payment]
Slide 35
Slide 35 text
KSQL
Continue
[Diagram: Bob’s order, Jill’s payment, Jill’s order, Bob’s payment]
Slide 36
Slide 36 text
KSQL
2nd MATCH – Create another output event
[Diagram: Bob’s order, Jill’s payment, Jill’s order, Bob’s payment]
Slide 37
Slide 37 text
KSQL
2nd MATCH – Create another output event
[Diagram: Jill’s payment, Jill’s order, Bob’s payment, Bob’s order]
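The buffered join walked through on the preceding slides can be sketched in plain Java. This is only an illustration of the idea, not the Kafka Streams or KSQL implementation: the class and method names are hypothetical, it assumes at most one order and one payment per key, and it never expires buffered events, whereas a real stream processor bounds the buffer by how late data is allowed to be.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, in-memory stand-in for a stream-stream join on the message key.
class BufferedJoinSketch {
    // Unmatched events wait in a key-value buffer, one buffer per side.
    private final Map<String, String> bufferedOrders = new HashMap<>();
    private final Map<String, String> bufferedPayments = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    void onOrder(String key, String order) {
        String payment = bufferedPayments.remove(key);
        if (payment == null) {
            bufferedOrders.put(key, order);          // nothing to join with yet: buffer it
        } else {
            output.add(key + ": " + order + " + " + payment); // MATCH: create output event
        }
    }

    void onPayment(String key, String payment) {
        String order = bufferedOrders.remove(key);
        if (order == null) {
            bufferedPayments.put(key, payment);      // nothing to join with yet: buffer it
        } else {
            output.add(key + ": " + order + " + " + payment); // MATCH: create output event
        }
    }

    List<String> output() { return output; }

    int buffered() { return bufferedOrders.size() + bufferedPayments.size(); }

    public static void main(String[] args) {
        BufferedJoinSketch join = new BufferedJoinSketch();
        join.onPayment("bob", "payment");   // buffered
        join.onOrder("jill", "order");      // buffered
        join.onOrder("bob", "order");       // first match
        join.onPayment("jill", "payment");  // second match
        System.out.println(join.output());
    }
}
```

The arrival order in `main` follows the slides: Bob’s payment and Jill’s order are buffered first, then each is matched when its counterpart arrives.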
Slide 38
Slide 38 text
Enrichment of event stream
using a table
Stream-Table join
Slide 39
Slide 39 text
KSQL
Join a Stream with a Table
[Diagram: events on the Orders stream query the table of Customers, e.g. for Cust1]
Slide 40
Slide 40 text
First we need some source: database, stream processor etc.
[Diagram: a source system publishes Customers to the streaming platform]
Data is now saved in a topic in Kafka (event storage)
Slide 41
Slide 41 text
KSQL
A kind of data virtualization
Table of Customers
1. Reload
2. Keep up to date
[Diagram: the Customers topic in event storage feeds the table of Customers]
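The two steps above (reload the table from the topic, then keep it up to date) and the enrichment lookup can be sketched as follows. This is a minimal sketch in plain Java, not the KSQL or Kafka Streams API; the class name, method names and the email field are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical local table built from a stream of customer events.
class CustomerTableSketch {
    private final Map<String, String> emailByCustomerId = new HashMap<>();

    // Called for every record on the customers topic: first while replaying the
    // topic from the start (reload), then for each new record (keep up to date).
    // The latest value for a key replaces the previous one.
    void onCustomerEvent(String customerId, String email) {
        emailByCustomerId.put(customerId, email);
    }

    // Enrich an order with customer data using a local lookup, no network call.
    Optional<String> enrich(String orderId, String customerId) {
        String email = emailByCustomerId.get(customerId);
        if (email == null) {
            return Optional.empty();   // customer not (yet) in the table
        }
        return Optional.of(orderId + " -> " + email);
    }

    public static void main(String[] args) {
        CustomerTableSketch table = new CustomerTableSketch();
        table.onCustomerEvent("cust1", "old@example.com");
        table.onCustomerEvent("cust1", "new@example.com"); // update wins
        System.out.println(table.enrich("order-1", "cust1"));
    }
}
```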
Slide 42
Slide 42 text
Summarizing Event Streams
Filters and Aggregations
Slide 43
Slide 43 text
Summarizing data streams (Event Sourced)
[Diagram: KStreams reads Payments from event storage and keeps User -> Balance in a read/write KV store; current state is streamed to Kafka]
Slide 44
Slide 44 text
[Diagram: Calc Balance reads Payments from event storage]
User    Balance
Bob     $500
Sally   $250
George  $32
Sum payments to get the user’s account balance
SELECT user, sum(amount)
FROM payments
GROUP BY user;
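As a rough illustration of what the query above maintains, the running sum per user can be kept in a map that is updated on every payment event. This is a sketch in plain Java rather than KSQL internals; the class name is hypothetical and the individual payment amounts in `main` are made up so that they add up to the balances shown on the slide.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical continuously updated aggregate: user -> sum of payments.
class BalanceSketch {
    private final Map<String, Long> balanceByUser = new TreeMap<>();

    // Apply one payment event and return the user's new balance.
    long onPayment(String user, long amount) {
        return balanceByUser.merge(user, amount, Long::sum);
    }

    Map<String, Long> balances() { return balanceByUser; }

    public static void main(String[] args) {
        BalanceSketch sketch = new BalanceSketch();
        sketch.onPayment("Bob", 300);
        sketch.onPayment("Sally", 250);
        sketch.onPayment("Bob", 200);
        sketch.onPayment("George", 32);
        System.out.println(sketch.balances());
    }
}
```

The state held here grows with the number of distinct users, not with the number of payments.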
Slide 45
Slide 45 text
Summarizing data streams (Windowed)
[Diagram: KStreams reads Page Views from event storage and counts page views per min, in 1 minute windows, in a read/write KV store; current state is streamed to Kafka]
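A windowed count like the one on this slide can be sketched by assigning each event to the one-minute window that contains its timestamp. This is an illustrative sketch in plain Java, not the Kafka Streams windowing API; the class name is hypothetical, timestamps are assumed to be non-negative epoch milliseconds, and late-arriving events and window expiry are ignored.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical tumbling-window counter: window start (ms) -> page views.
class PageViewWindowSketch {
    private static final long WINDOW_MS = 60_000L; // 1 minute window

    private final Map<Long, Long> viewsPerWindow = new TreeMap<>();

    // Assign the event to the window containing its timestamp and count it.
    void onPageView(long timestampMs) {
        long windowStart = timestampMs - (timestampMs % WINDOW_MS);
        viewsPerWindow.merge(windowStart, 1L, Long::sum);
    }

    long viewsIn(long windowStart) {
        return viewsPerWindow.getOrDefault(windowStart, 0L);
    }

    public static void main(String[] args) {
        PageViewWindowSketch sketch = new PageViewWindowSketch();
        sketch.onPageView(0L);
        sketch.onPageView(59_999L);
        sketch.onPageView(60_000L);   // falls into the next window
        System.out.println(sketch.viewsIn(0L) + " then " + sketch.viewsIn(60_000L));
    }
}
```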
Slide 46
Slide 46 text
Three key features
• The stream-stream join (combine in real-time)
• The stream-table join (enrich)
• The aggregate (summarize)
(lots more: transactions, chained operations, queryable state etc.)
Slide 47
Slide 47 text
The operations are stateful to different
degrees
• The stream-stream join (state ∝ buffer)
• The stream-table join (state ∝ table size)
• The aggregate (state ∝ cardinality of aggregation key)
Slide 48
Slide 48 text
Two modes of operation
Slide 49
Slide 49 text
API-based (Most common today)
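// Kafka Streams DSL; the join arguments (value joiner, join window) are elided, as on the slide
orders
    .filter((id, order) -> order.state().equals("CREATED")) // keep newly created orders
    .join(payments)                                         // stream-stream join on the message key
    .transform(MyEmailer::new, STORE)                       // stateful step backed by the STORE state store
    .to("sent-emails");                                     // write the results to an output topic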
JVM Only
Similar to Flink, Storm, Samza, Spark etc.
Slide 50
Slide 50 text
Use the API
[Diagram: Orders, Payments and the Customers table joined by KStreams, in the same process boundary as the business logic]
Slide 51
Slide 51 text
[Diagram: Orders, Payments and the Customers table joined by KStreams]
Is mixing state and business logic a good idea?
Avoid Being too Stateful
Slide 52
Slide 52 text
Use KSQL
CREATE STREAM order_payments AS
  SELECT *
  FROM orders
    LEFT JOIN payments WITHIN 1 HOUR  -- window size is an assumption; stream-stream joins need one
    ON orders.orderId = payments.orderId
  WHERE orders.state = 'CREATED';
Use KSQL Server (or Cloud Service)
[Diagram: KSQL joins Orders, Payments and the Customers table and passes a joined, enriched event stream to my code, which holds the business logic]
SELECT *
FROM orders, payments,
customers
WHERE …
Slide 55
Slide 55 text
Stateless layer scales easily
[Diagram: three event streams from different event sources, held in event storage, feed KSQL (the stateful data layer), which emits denormalized events to a stateless application layer (any language) that scales quickly]
Slide 56
Slide 56 text
Pattern Should be Familiar
[Diagram: streaming platforms linking apps, search and monitoring, split into a stateful layer and a stateless layer]
Slide 57
Slide 57 text
Comparing back to Serverless Functions
FaaS
- Autoscale based on demand
- Pay as you use
- Simple programming model
- Stateless
- High latency
- High throughput (if batched)
- One event source
Stream Processors
- Stateful & stateless operations at high throughputs
- Join different event sources / enrich
- Correctness even after failure
- Rich semantics for dataflow programming
- Effectively infinite storage in Kafka
- Doesn’t autoscale (scale manually / programmatically)
- More complex
Slide 58
Slide 58 text
They complement each other: stream processors act as a “data layer” for FaaS
[Diagram: Orders, Payments and Customers feed stateful KSQL (including a Customers table); its output reaches stateless FaaS functions through the AWS Lambda connector]
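To make the split concrete, here is a sketch of what the stateless side might look like for the retail example: a function that receives an already joined and enriched event and only applies business logic. The event shape, field names and message wording are assumptions for illustration, and the function is written as plain Java rather than against any particular FaaS provider’s handler interface.

```java
import java.util.Locale;

// Hypothetical stateless function: everything it needs arrives in the event,
// so it does no lookups and keeps no state between invocations.
class ConfirmationEmailFunction {

    // Hypothetical denormalized event produced by the stateful data layer.
    static final class EnrichedOrderEvent {
        final String orderId;
        final String customerName;
        final String customerEmail;
        final long amountPaidCents;

        EnrichedOrderEvent(String orderId, String customerName,
                           String customerEmail, long amountPaidCents) {
            this.orderId = orderId;
            this.customerName = customerName;
            this.customerEmail = customerEmail;
            this.amountPaidCents = amountPaidCents;
        }
    }

    // Build the confirmation email text from the event alone.
    static String handle(EnrichedOrderEvent event) {
        String amount = String.format(Locale.ROOT, "%d.%02d",
                event.amountPaidCents / 100, event.amountPaidCents % 100);
        return "To: " + event.customerEmail + "\n"
                + "Hi " + event.customerName + ", your payment of " + amount
                + " for order " + event.orderId + " was received.";
    }

    public static void main(String[] args) {
        System.out.println(handle(
                new EnrichedOrderEvent("o-1", "Bob", "bob@example.com", 50000)));
    }
}
```

Because `handle` depends only on its argument, any number of copies can run in parallel, which is what lets the stateless layer scale on demand.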
Slide 59
Slide 59 text
Broader pattern
(easier to consume, keep apps stateless)
[Diagram: the Orders, Payment and Customer services publish Order, Payment and Customer events to the streaming platform, which serves denormalized events to downstream apps; most languages supported]
Slide 60
Slide 60 text
Event storage + stream processing make data self service (real time & historical)
Event Streams: Orders, Payments, Customers, Distinct Visits
Destination: C*, Postgres, Lambda, Other Kafka
History: 1w, All
Select Organizational Events (Stream Processing):
SELECT *
FROM ORDERS O, CUSTOMERS C
WHERE O.REGION = 'EU'
AND C.TYPE = 'Platinum'
[Diagram: Customers and Orders pass through stream processing to C* and Lambda, with a Msgs/Day readout]
Slide 61
Slide 61 text
Streaming Platforms apply these patterns across ecosystems
Event Streaming Platform
(Storage + Stream Processing)
In Summary
• Stream processors can operate like databases for this event-driven world.
• FaaS is one of many “end points”
• FaaS has unique properties:
  • Pay as you use
  • Load-driven autoscaling
  • Programming model
• Trick: split application logic from data preparation & adopt an event-first model.
Slide 66
Slide 66 text
In this increasingly event driven
world stream processors become as
important as databases are today.
Slide 67
Slide 67 text
[Diagram: CRUD (an application on a database) compared with event-driven / event-streaming (KSQL as the stateful data layer, FaaS functions as the stateless compute layer)]
Massive linear scalability with elasticity
Slide 69
Slide 69 text
Streaming platforms provide a unique alternative.
[Diagram: Billing, Shipping, Fraud and Fulfilment services connected through a streaming platform]
Slide 70
Slide 70 text
Thank you
@benstopford
Book:
https://www.confluent.io/designing-event-driven-systems
Slide 71
Slide 71 text
Rate today’s session
Session page on oreillysacon.com/ny O’Reilly Events App