PRINCIPLES AND PATTERNS FOR
STREAMING DATA ANALYSIS
Voxxed Days Ticino
Galder Zamarreño Arrizabalaga
@galderz
20th October 2018
@GALDERZ #INFINISPAN #VDT18
Since 2006
ENGINEER
@galderz
Community Lead and
Core Developer
INFINISPAN
CO-FOUNDER (2008)
OTIS
PAIR PROGRAMMING BUDDY
DATA IS OVERWHELMING US
EXPONENTIAL DATA GROWTH YEAR ON YEAR
Smartphones, IoT devices, trillions of internet-connected devices...
REAL-TIME STREAMING DATA PROCESSING IS CHALLENGING
Delays can have a big impact
HIGH LEVEL ARCHITECTURE
COLLECTION TIER
COMMON INTERACTION PATTERNS
WHY DECOUPLE?
Fast collection tier combined with slow analysis tier
e.g. analysis tier being more processing-intensive, or analysis tier consuming messages in batches
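A minimal sketch of the decoupling idea, assuming an in-memory bounded queue stands in for the message queue tier. The class name `DecouplingBuffer` and its methods are hypothetical, chosen only for illustration; a real deployment would use a message broker rather than a JDK queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical buffer sitting between a fast collection tier and a slower analysis tier. */
final class DecouplingBuffer {
    private final BlockingQueue<String> queue;

    DecouplingBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Collection tier: never blocks; returns false when the buffer is full. */
    boolean collect(String message) {
        return queue.offer(message);
    }

    /** Analysis tier: pulls up to maxBatch messages, in arrival order, at its own pace. */
    List<String> nextBatch(int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    /** Number of messages still waiting for the analysis tier. */
    int backlog() {
        return queue.size();
    }
}
```

The collection tier keeps accepting messages until the buffer is full, while the analysis tier drains them in batches whenever it is ready. What to do when `collect` returns false (drop, retry, spill to disk) is left open here.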
DELIVERY SEMANTICS
At-least-once: messages not lost but might be repeated
Exactly-once: messages not lost and consumed only once
At-most-once: messages might get lost
BULLSHIT!
EXACTLY-ONCE: MISLEADING OR A LIE!
• Guaranteed delivery vs guaranteed processing of a message
• What if a subscriber consumes the message and then crashes?
• Guaranteed delivery and processing requires application awareness and collaboration
• So the subscriber can IDEMPOTENTLY process a message and know up to which point it has processed
• At this point you're capable of doing at-least-once
• Also requires the consumer to acknowledge processing to the publisher
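The points above can be sketched as a subscriber that remembers how far it has processed and ignores redelivered messages. This is only an illustration under stated assumptions: the class `IdempotentSubscriber` is hypothetical, offsets are assumed to be assigned by the publisher in increasing order, and the last processed offset is kept in memory here although it would need to be stored durably, together with the side effect, to survive a crash.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical subscriber that tolerates redelivery by tracking the last processed offset. */
final class IdempotentSubscriber {
    // Assumption: a real system would persist this atomically with the side effect.
    private long lastProcessedOffset = -1;
    private final List<String> applied = new ArrayList<>();

    /**
     * Applies the message only if its offset has not been processed yet,
     * then returns the offset to acknowledge back to the publisher.
     */
    long onMessage(long offset, String payload) {
        if (offset > lastProcessedOffset) {
            applied.add(payload);          // the side effect, performed once per offset
            lastProcessedOffset = offset;
        }
        return lastProcessedOffset;        // acknowledgement: processed up to here
    }

    /** Payloads whose side effect has been applied, in processing order. */
    List<String> applied() {
        return applied;
    }
}
```

With at-least-once delivery the publisher may resend a message whose acknowledgement was lost. The subscriber then acknowledges it again without repeating the side effect, which gives the effect of processing each message once.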
ANALYSIS TIER
IN-FLIGHT ANALYSIS
Traditional RDBMS: data at rest, queried for answers
Streaming: data moves through the query
Data always in motion from the message queue tier
CONTINUOUS QUERY
Query constantly evaluated
New data that matches the query is pushed to the client
Use cases: tracking behaviour, traffic/safety, fraud analytics...
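A minimal sketch of the continuous query idea in plain Java: the query is a predicate that stays registered, and every new item is evaluated against it as it arrives. The class `ContinuousQuerySketch` is hypothetical and is not Infinispan's continuous query API; it only illustrates the push model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical continuous query: a standing predicate evaluated against every new item. */
final class ContinuousQuerySketch<T> {
    private final Predicate<T> query;
    private final List<Consumer<T>> clients = new ArrayList<>();

    ContinuousQuerySketch(Predicate<T> query) {
        this.query = query;
    }

    /** Registers a client that will be pushed every matching item. */
    void subscribe(Consumer<T> client) {
        clients.add(client);
    }

    /** Called for each new item in the stream; matches are pushed, the rest are ignored. */
    void onData(T item) {
        if (query.test(item)) {
            for (Consumer<T> client : clients) {
                client.accept(item);
            }
        }
    }
}
```

For example, with a query such as `delay -> delay > 5` over a stream of train delays in minutes, clients are notified only about delays above five minutes and never have to poll.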
SLIDING WINDOW
Combines queries with time constraints
e.g. traffic information in my area for the last hour
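A sliding window can be sketched as a queue of event timestamps from which entries older than the window are evicted. The class `SlidingWindowCounter` is hypothetical; it assumes timestamps are in milliseconds, arrive in non-decreasing order, and that the window is queried at a time no earlier than the latest recorded event.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding window that counts events seen during the last windowMillis. */
final class SlidingWindowCounter {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    SlidingWindowCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records an event; assumes timestamps arrive in non-decreasing order. */
    void record(long timestampMillis) {
        timestamps.addLast(timestampMillis);
    }

    /** Evicts events at or before (now - window) and returns how many remain. */
    int countAt(long nowMillis) {
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size();
    }
}
```

With a one-hour window (3,600,000 ms), an event recorded at time 0 still counts at 3,000,000 ms but has slid out of the window by 3,600,000 ms.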
DATA ACCESS TIER
PROTOCOLS TO SEND DATA TO CLIENTS
Protocol            Message frequency  Communication direction             Message latency  Efficiency  Fault tolerance / Reliability
Webhooks            Low                Uni-directional (server to client)  Average          Low         None
HTTP Long Polling   Average            Bi-directional                      Average          Average     None
Server-sent events  High               Uni-directional                     Low              High        None by default. Can be implemented.
WebSockets          High               Bi-directional                      Low              High        None by default. Can be implemented.
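As an illustration of how reliability "can be implemented" on top of server-sent events, the sketch below formats an event in the SSE text format with an `id` field. A browser client sends the last id it saw in the `Last-Event-ID` header when it reconnects, so the server can resume from that point. The class `SseFrames` and the event name used in the example are hypothetical, and the sketch assumes the id and event name contain no line breaks.

```java
/** Hypothetical helper that renders one event in the server-sent events text format. */
final class SseFrames {
    private SseFrames() {
    }

    /** Builds a frame with id, event name and data; multi-line data gets one "data:" line per line. */
    static String format(String id, String event, String data) {
        StringBuilder frame = new StringBuilder();
        frame.append("id: ").append(id).append('\n');
        frame.append("event: ").append(event).append('\n');
        for (String line : data.split("\n", -1)) {
            frame.append("data: ").append(line).append('\n');
        }
        frame.append('\n');   // a blank line terminates the event
        return frame.toString();
    }
}
```

A server that keeps recent events keyed by id could replay everything after the id reported by a reconnecting client; that replay logic is application code and is not shown here.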
THE PLATFORM
OpenShift is a Kubernetes distro with extras
Platform-as-a-Service (PaaS)
Platform for developing and running applications
Public or private and multi-language
THE GLUE
Vert.x is a toolkit for building reactive apps
On the JVM, event-driven and non-blocking
RxJava integrates with Vert.x
Great at event transformation and coordination
Works best with many sources of events (modern apps!)
COMPONENT ARCHITECTURE
[Diagram labels]
datagrid: three infinispan pods, datagrid-hotrod service
app (vert.x verticle pod): main http, station boards, train positions, delayed positions, delayed trains
Event bus endpoints: /eventbus/delayed-trains, /eventbus/delayed-positions