I have a suspicion: Data Store Software is not social
Martin Scholl
@zeit_geist
Slide 4
Slide 4 text
Column-Store
Slide 5
Slide 5 text
Column-Store
Row-Store
Slide 8
Slide 8 text
•Notes
• are facts; they can be copied.
• consist of immutable & absolute entities
• have a fixed beginning and ending
• music sheet = music essence; “Music’s NoSQL DB”
•Music
• Making music is a process. You can record music, but not copy it.
• flows with the rhythm
• lives by the interactions
• is uniquely determined in space and time: the music’s context
[Diagram: a stack of Webservers, API-Endpoints, and data stores (Graph-DB, Flock-DB); bracket label: “just Data”]
Slide 19
Slide 19 text
[Diagram: a stack of Webservers, API-Endpoints, and data stores (Graph-DB, Flock-DB); bracket labels: “context-freed Data” and “just Data”]
Slide 20
Slide 20 text
•Data Stores
• store facts.
• facts are fixed and absolute
• facts are uniquely determined by key / ID
• Data Stores are the source of “truth”
• contain what has happened.
Slide 21
Slide 21 text
•Notes
• are facts; they can be copied.
• consist of immutable & absolute entities
• have a fixed beginning and ending
• music sheet = music essence; “Music’s NoSQL DB”
•Data Stores
• store facts.
• facts are fixed and absolute
• facts are uniquely determined by key / ID
• Data Stores are the source of “truth”
• contain what has happened.
Slide 22
Slide 22 text
Lose Information with Your Favorite Data Store
• Data in a Data Store gets de-contextualized.
• You don’t get to know the origin of the data, just the fact itself.
• Irrecoverable information loss!
• There is a severe social impedance mismatch.
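To make the information loss concrete, here is a minimal sketch. It assumes a hypothetical `FollowEvent` with two context fields (`origin`, `timestamp`) and a plain dict standing in for the data store; none of these names come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FollowEvent:
    follower: str
    followee: str
    origin: str       # hypothetical context: where the action happened
    timestamp: float  # hypothetical context: when it happened


def store_fact(db, event):
    """Keep only the keyed fact; origin and timestamp are dropped."""
    db[(event.follower, event.followee)] = True


db = {}
event = FollowEvent("A", "B", origin="recommendation-widget", timestamp=1325376000.0)
store_fact(db, event)
# db now only says that A follows B; how and when it came about is gone.
```

Once the event itself is discarded, nothing in `db` allows the origin to be reconstructed, which is the loss the slide describes.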
Slide 24
Slide 24 text
How can we fix the social impedance mismatch?
Slide 25
Slide 25 text
[Diagram: a stack of Webservers, API-Endpoints, and data stores (Graph-DB, Flock-DB); bracket labels: “Data” (next to the data stores) and “Data + Context”]
Slide 26
Slide 26 text
[Diagram: a stack of Webservers, API-Endpoints, and data stores (Graph-DB, Flock-DB), plus three Context-Engines; bracket labels: “Data” (next to the data stores), “Data + Context”, and “Data + Context Logic”]
Slide 27
Slide 27 text
Context Engine Requirements
• must have a flexible programming model
• must be scalable and resilient
• must be able to integrate and process data from high-velocity data sources
Slide 28
Slide 28 text
Nathan Marz’s Storm
• has a flexible programming model
• is scalable and resilient
• integrates and processes data from high-velocity data sources
Slide 29
Slide 29 text
Nathan Marz’s Storm
• implemented in Clojure + Java
• was proprietary to BackType
• open-sourced Sep 2011
• licensed under the Eclipse Public License
• http://github.com/nathanmarz/storm
Slide 30
Slide 30 text
What does Storm do?
• it’s like M/R but for real-time computation
• works over streams
• communicates tuples in a cluster
[Diagram: topology of one Spout and five Bolts]
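The spout/bolt model can be sketched in a single process with Python generators. This is only an illustration of the idea of a stream of tuples flowing from a source through processing steps; it is not Storm's API, and the function names are hypothetical.

```python
from collections import Counter


def sentence_spout():
    """Hypothetical spout: the source of the stream (here a fixed list)."""
    for sentence in ["storm works over streams", "storm communicates tuples"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: turn every sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: emit a running count for every word that passes through."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Wire spout -> split -> count and drain the (finite) stream.
results = list(count_bolt(split_bolt(sentence_spout())))
```

In a real cluster the stages would run in parallel on different workers and the stream would be unbounded; the generator chain only mirrors the dataflow.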
Slide 31
Slide 31 text
What does Storm do?
• Local development mode or distributed mode
• Starts JVMs (workers)
• at-least-once message processing guarantee
• Storm’s contributions: scalability, resiliency and processing guarantee
[Diagram: topology of one Spout and five Bolts]
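What "at-least-once" implies for application code can be sketched as follows. The helper `run_at_least_once` and the failing `flaky_bolt` are hypothetical stand-ins, not Storm functions: a message is replayed until processing succeeds, so a step may see the same message more than once.

```python
def run_at_least_once(messages, process, max_attempts=3):
    """Replay each message until process() succeeds (hypothetical helper)."""
    attempts = {}
    for msg_id, payload in messages:
        for attempt in range(1, max_attempts + 1):
            attempts[msg_id] = attempt
            try:
                process(msg_id, payload)
                break  # processed successfully, move on to the next message
            except RuntimeError:
                pass   # processing failed, replay the same message
        else:
            raise RuntimeError(f"message {msg_id} failed {max_attempts} times")
    return attempts


seen = []            # every delivery, duplicates included
failed_once = set()


def flaky_bolt(msg_id, payload):
    """Records the delivery, then fails the first time it sees message 2."""
    seen.append(msg_id)
    if msg_id == 2 and msg_id not in failed_once:
        failed_once.add(msg_id)
        raise RuntimeError("simulated worker crash")


attempts = run_at_least_once([(1, "a"), (2, "b"), (3, "c")], flaky_bolt)
```

Message 2 is delivered twice here, which is why side effects in such a pipeline are usually written to be idempotent.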
Slide 32
Slide 32 text
Some Use-Cases
• Analysis on Event-Streams:
  • Filtering, Counting, Aggregation
  • Monitoring, etc.
• Parallel and Distributed RPC
• Contextualization
[Diagram: topology of one Spout and three Bolts]
Slide 33
Slide 33 text
Message Processing Guarantee
[Diagram: Spout, Acker, three Bolts; Acker table with columns ID and V, still empty]
Slide 34
Slide 34 text
Message Processing Guarantee
[Diagram: Spout, Acker, three Bolts; Tuple(id=42) in flight; message (id=42); Acker table with columns ID and V]
Slide 35
Slide 35 text
Resiliency
[Diagram: Spout, Acker, three Bolts; Tuple(id=42) in flight; Acker table: ID = 42, V = 42]
Slide 37
Slide 37 text
Resiliency
[Diagram: Spout, Acker, three Bolts; Tuple(id=40) and Tuple(id=4) in flight; Acker table: ID = 42, V = 42]
Slide 38
Slide 38 text
Resiliency
[Diagram: Spout, Acker, three Bolts; Tuple(id=40) and Tuple(id=4) in flight; message (id=[40,4]); Acker table: ID = 42, V = 42^40^4]
Slide 39
Slide 39 text
Resiliency
[Diagram: Spout, Acker, three Bolts; Tuple(id=40) and Tuple(id=4) in flight; message (ack=42); Acker table: ID = 42, V = 42^40^4]
Slide 40
Slide 40 text
Message Processing Guarantee
[Diagram: Spout, Acker, three Bolts; Tuple(id=40) and Tuple(id=4) in flight; Acker table: ID = 42, V = 40^4]
Slide 41
Slide 41 text
Message Processing Guarantee
[Diagram: Spout, Acker, three Bolts; Tuple(id=40) and Tuple(id=4) in flight; message (id=40); Acker table: ID = 42, V = 40^4]
Slide 42
Slide 42 text
Message Processing Guarantee
[Diagram: Spout, Acker, three Bolts; Tuple(id=40) and Tuple(id=4); message (id=4); Acker table: ID = 42, V = 4]
Slide 43
Slide 43 text
Message Processing Guarantee
[Diagram: Spout, Acker, three Bolts; Acker table: ID = 42, V = 0; message ack(id=42)]
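The Acker table in the preceding diagrams can be replayed with a small sketch. It reads `^` as bitwise XOR, which matches the values shown (42^40^4 with 42 acked leaves 40^4, and so on). The class and method names are hypothetical, and the bookkeeping is simplified to what the slides show rather than Storm's actual implementation.

```python
class Acker:
    """Tracks one XOR checksum per spout tuple (hypothetical, simplified)."""

    def __init__(self):
        self.pending = {}    # root tuple ID -> XOR of all unacked tuple IDs
        self.completed = []  # root IDs whose tuple tree is fully processed

    def update(self, root_id, *tuple_ids):
        """XOR tuple IDs into the checksum of the given root tuple.

        The same call serves for "tuple emitted" and "tuple acked": an ID
        that is XORed in twice cancels out.
        """
        value = self.pending.get(root_id, 0)
        for tuple_id in tuple_ids:
            value ^= tuple_id
        if value == 0:
            # Everything emitted has been acked: report back to the spout.
            self.pending.pop(root_id, None)
            self.completed.append(root_id)
        else:
            self.pending[root_id] = value
        return value


acker = Acker()
trace = [
    acker.update(42, 42),     # spout emits Tuple(id=42): V = 42
    acker.update(42, 40, 4),  # bolt emits tuples 40 and 4: V = 42^40^4
    acker.update(42, 42),     # tuple 42 is acked: V = 40^4
    acker.update(42, 40),     # tuple 40 is acked: V = 4
    acker.update(42, 4),      # tuple 4 is acked: V = 0, ack(id=42)
]
```

The appeal of the XOR scheme is that the Acker needs a single fixed-size value per spout tuple, no matter how many downstream tuples are created.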
Slide 44
Slide 44 text
Resilience
• a centralized component (Nimbus) coordinates deployment and starts workers
• Workers run distributed & are supervised
• Online state is persisted into ZooKeeper
• Every component may fail
[Diagram: Nimbus, three ZooKeeper (ZK) nodes, two Workers]
Slide 45
Slide 45 text
Use-Case
• Use-Case: Online A/B Testing
• Contextualization: determine Clique (A | B) online
• Reconfigure the A/B test really quickly
[Diagram: Spout (Clickstream), two User nodes, an aggregation node (∑), and a “New Configuration” feed]
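A minimal sketch of the online A/B assignment, assuming a hash-based split and a hypothetical `ABTestBolt` class (neither the class nor the hashing scheme is taken from the slides): every click is assigned to clique A or B as it arrives, and the split can be changed while the stream keeps flowing.

```python
import hashlib
from collections import Counter


class ABTestBolt:
    """Assigns users to clique A or B online and counts clicks per clique."""

    BUCKETS = 10_000

    def __init__(self, share_a=0.5):
        self.share_a = share_a  # fraction of users routed to clique A
        self.clicks = Counter()

    def reconfigure(self, share_a):
        """Apply a new configuration without restarting anything."""
        self.share_a = share_a

    def clique(self, user_id):
        # Stable hash: the same user always lands in the same bucket.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % self.BUCKETS
        return "A" if bucket < self.share_a * self.BUCKETS else "B"

    def on_click(self, user_id):
        group = self.clique(user_id)
        self.clicks[group] += 1
        return group


bolt = ABTestBolt(share_a=1.0)     # everyone is in clique A
before = bolt.on_click("alice")
bolt.reconfigure(share_a=0.0)      # new configuration: everyone is in clique B
after = bolt.on_click("alice")
```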
Slide 46
Slide 46 text
Use-Case
• Use-Case: Social Graph Update Propagation
• Send E-Mail to B
• Update Recommendation Matrix for A (and B)
[Diagram: Spout emitting ‘A follows B now’; Bolts for A and B; labels: “New Configuration”, “New ML Model”, “Send EMail”]
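The fan-out of a single ‘A follows B now’ event can be sketched as two independent handlers. The function names and the event layout are hypothetical; the e-mail is only appended to a list and the recommendation update is reduced to marking rows as stale.

```python
def email_bolt(event, outbox):
    """Notify the followee (B) that the follower (A) follows them now."""
    outbox.append((event["followee"], f"{event['follower']} follows you now"))


def recommendation_bolt(event, stale_rows):
    """Mark the recommendation rows of A (and B) for recomputation."""
    stale_rows.add(event["follower"])
    stale_rows.add(event["followee"])


def propagate(events, outbox, stale_rows):
    """Hypothetical spout loop: fan every follow event out to both bolts."""
    for event in events:
        email_bolt(event, outbox)
        recommendation_bolt(event, stale_rows)


outbox, stale_rows = [], set()
propagate([{"follower": "A", "followee": "B"}], outbox, stale_rows)
```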
Slide 47
Slide 47 text
Contextualization with Storm
• Contextualization ✓
• Store Users’ context in-memory using Bolts
• Continuously persist state into stable storage
• Towards real-time context for every request
[Diagram: Spout (Consolidated Event-Stream) with three User nodes, a Recommender, Trending Stuff / global stats, and Anti-Spam]
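Keeping per-user context in memory and persisting it continuously might look like the following sketch. `UserContextBolt`, its parameters, and the dict used as "stable storage" are assumptions made for illustration, not Storm components.

```python
from collections import defaultdict, deque


class UserContextBolt:
    """Keeps the most recent actions per user in memory (hypothetical)."""

    def __init__(self, store, window=3, flush_every=2):
        self.store = store              # stand-in for stable storage
        self.flush_every = flush_every  # persist after this many events
        self.context = defaultdict(lambda: deque(maxlen=window))
        self.dirty = set()              # users changed since the last flush
        self.seen = 0

    def on_event(self, user, action):
        self.context[user].append(action)
        self.dirty.add(user)
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        """Persist the context of every user that changed."""
        for user in self.dirty:
            self.store[user] = list(self.context[user])
        self.dirty.clear()

    def context_for(self, user):
        """Answer a request with the user's current in-memory context."""
        return list(self.context.get(user, ()))


store = {}
bolt = UserContextBolt(store, window=2, flush_every=2)
events = [("ann", "view:1"), ("bob", "view:2"), ("ann", "buy:1"), ("ann", "view:3")]
for user, action in events:
    bolt.on_event(user, action)
```

Requests are served from memory via `context_for`, while `store` trails behind by at most `flush_every` events.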
Slide 48
Slide 48 text
On Storm
• Storm is not a silver bullet
• Rather, Storm is a petri dish for real-time computation and coordination tasks
• Topology changes: stop-start cycle required
• There is no Pig Latin / Hive for Storm
• Advanced topics are added with every release (e.g. Transactional Semantics)
Slide 49
Slide 49 text
Lessons Learned
• De-contextualization is a bad thing.
• Your data store won’t help you.
• You have to add some magic to your stack.
• Storm has the potential to become the Next Big Thing after Hadoop
• Use Storm to fix the Social Impedance Mismatch issue
Slide 50
Slide 50 text
Want to change the world with real-time data?
contact me:
Martin Scholl
@zeit_geist
Slide 52
Slide 52 text
             | Data Stores (DBMS, NoSQL)             | Event Systems (e.g. Storm, S4)
Model        | Pull                                  | Push
Queries      | Run Once                              | Run Continuously
Data         | Historic                              | Live
Focus        | Retrieval & Storage Format Efficiency | Throughput & Latency
Dataset Size | 10^9                                  | 10^6
Domain       | Volume                                | Velocity
Slide 53
Slide 53 text
A Note on Time
• Real-Time: milliseconds - seconds
• Near Real-Time: seconds-minutes
• Batch: minutes-