Slide 1

Slide 1 text

What is “social”? Data in a personal context.

Slide 2

Slide 2 text

I have a suspicion: Data Store Software is not social
Martin Scholl, @zeit_geist

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Column-Store

Slide 5

Slide 5 text

Column-Store, Row-Store

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

• Notes
  • are a fact; they can be copied.
  • consist of immutable & absolute entities
  • have a fixed beginning and ending
  • music sheet = music essence; “Music’s NoSQL DB”
• Music
  • Making music is a process. You can record music, but not copy it.
  • flows with the rhythm
  • lives by the interactions
  • is uniquely determined in space and time: the Music’s context

Slide 9

Slide 9 text

Music = Music Sheet + Context
[Diagram label: Music Sheet]

Slide 10

Slide 10 text

Music = Music Sheet + Context
[Diagram labels: Music Sheet, ?]

Slide 11

Slide 11 text

Music = Music Sheet + Context
[Diagram labels: Music Sheet, ?]

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Social Context Data

Slide 14

Slide 14 text

[Diagram: Webserver ×4, API-Endpoint ×4, Graph-DB ×3, Flock-DB]

Slide 15

Slide 15 text

[Diagram: Webserver ×4, API-Endpoint ×4, Graph-DB ×3, Flock-DB]
1. Social Interaction: Data + Context

Slide 16

Slide 16 text

[Diagram: Webserver ×4, API-Endpoint ×4, Graph-DB ×3, Flock-DB]
1. Social Interaction: Data + Context
2. just Data

Slide 17

Slide 17 text

[Diagram: Webserver ×4, API-Endpoint ×4, Graph-DB ×3, Flock-DB]

Slide 18

Slide 18 text

[Diagram: Webserver ×4, API-Endpoint ×4, Graph-DB ×3, Flock-DB; brace label: just Data]

Slide 19

Slide 19 text

[Diagram: Webserver ×4, API-Endpoint ×4, Graph-DB ×3, Flock-DB; brace labels: context-freed Data, just Data]

Slide 20

Slide 20 text

• Data Stores
  • store facts.
  • facts are fixed and absolute
  • facts are uniquely determined by key / ID
  • Data Stores are the source of “truth”
  • contain what has happened.

Slide 21

Slide 21 text

• Notes
  • are a fact; they can be copied.
  • consist of immutable & absolute entities
  • have a fixed beginning and ending
  • music sheet = music essence; “Music’s NoSQL DB”
• Data Stores
  • store facts.
  • facts are fixed and absolute
  • facts are uniquely determined by key / ID
  • Data Stores are the source of “truth”
  • contain what has happened.

Slide 22

Slide 22 text

Lose Information w/ your fav. Data Store
• Data in a Data Store gets de-contextualized.
• You don’t get to know the origin of data, just the fact itself.
• Irrecoverable information loss!
• There is a severe social impedance mismatch.

Slide 23

Slide 23 text

Lose Information w/ your fav. Data Store
• Data in a Data Store gets de-contextualized.
• You don’t get to know the origin of data, just the fact itself.
• Irrecoverable information loss!
• There is a severe social impedance mismatch.

Slide 24

Slide 24 text

How can we fix the social impedance mismatch?

Slide 25

Slide 25 text

[Diagram: Webserver ×4, API-Endpoint ×4, Graph-DB ×3, Flock-DB; brace labels: Data, Data + Context]

Slide 26

Slide 26 text

[Diagram: Webserver ×4, API-Endpoint ×4, Graph-DB ×3, Flock-DB, Context-Engine ×3; brace labels: Data, Data + Context, Data + Context Logic]

Slide 27

Slide 27 text

Context Engine Requirements
• must have a flexible programming model
• must be scalable and resilient
• must be able to integrate and process data from high velocity data sources

Slide 28

Slide 28 text

Nathan Marz’s Storm
• has a flexible programming model
• is scalable and resilient
• integrates and processes data from high velocity data sources

Slide 29

Slide 29 text

Nathan Marz’s Storm
• implemented in Clojure + Java
• was proprietary to Backtype
• open-sourced in Sep 2011
• is licensed under the Eclipse Public License

Slide 30

Slide 30 text

What does Storm do?
• it’s like M/R, but for real-time computation
• works over streams
• communicates tuples in a cluster
[Diagram: Spout, Bolt ×5]

Slide 31

Slide 31 text

What does Storm do?
• Local development mode or distributed
• Starts JVMs (workers)
• At-least-once message processing guarantee
• Storm’s contributions: scalability, resiliency and processing guarantee
[Diagram: Spout, Bolt ×5]

Slide 32

Slide 32 text

Some Use-Cases
• Analysis on Event-Streams:
  • Filtering, Counting, Aggregation
  • Monitoring, etc.
• Parallel and Distributed RPC
• Contextualization
[Diagram: Spout, Bolt ×3]
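
The filtering and counting use-case can be sketched without Storm at all. The following is a minimal, single-process illustration in plain Python, assuming a hypothetical event shape of `{"user": ..., "action": ...}`; the names `spout`, `filter_bolt` and `count_bolt` are illustrative only and are not Storm's API.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, str]  # hypothetical event shape: {"user": ..., "action": ...}


def spout(events: Iterable[Event]) -> Iterator[Event]:
    """Source of the stream: emits one event at a time."""
    for event in events:
        yield event


def filter_bolt(stream: Iterable[Event],
                predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Passes on only the events that match the predicate."""
    for event in stream:
        if predicate(event):
            yield event


def count_bolt(stream: Iterable[Event], key: str) -> Iterator[Dict[str, int]]:
    """Keeps a running count per key and emits a snapshot after every event."""
    counts: Counter = Counter()
    for event in stream:
        counts[event[key]] += 1
        yield dict(counts)


events = [
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "click"},
    {"user": "bob", "action": "click"},
]

# Wire spout -> filter -> count, the way bolts are chained in a topology.
clicks = filter_bolt(spout(events), lambda e: e["action"] == "click")
snapshots = list(count_bolt(clicks, "user"))
final_counts = snapshots[-1]
```

Each stage consumes one event at a time, so results are available continuously rather than after a batch has finished. In a real cluster the stages would run in parallel on different workers.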

Slide 33

Slide 33 text

Message Processing Guarantee
[Diagram: Spout, Acker, Bolt ×3; Acker table columns: ID, V]

Slide 34

Slide 34 text

Message Processing Guarantee
[Diagram: Spout, Acker, Bolt ×3; labels: Tuple(id=42), (id=42); Acker table columns: ID, V]

Slide 35

Slide 35 text

Resiliency
[Diagram: Spout, Acker, Bolt ×3; label: Tuple(id=42); Acker table: ID = 42, V = 42]

Slide 36

Slide 36 text

Resiliency
[Diagram: Spout, Acker, Bolt ×3; label: Tuple(id=42); Acker table: ID = 42, V = 42]

Slide 37

Slide 37 text

Resiliency
[Diagram: Spout, Acker, Bolt ×3; labels: Tuple(id=40), Tuple(id=4); Acker table: ID = 42, V = 42]

Slide 38

Slide 38 text

Resiliency
[Diagram: Spout, Acker, Bolt ×3; labels: Tuple(id=40), Tuple(id=4), (id=[40,4]); Acker table: ID = 42, V = 42^40^4]

Slide 39

Slide 39 text

Resiliency
[Diagram: Spout, Acker, Bolt ×3; labels: Tuple(id=40), Tuple(id=4), (ack=42); Acker table: ID = 42, V = 42^40^4]

Slide 40

Slide 40 text

Message Processing Guarantee
[Diagram: Spout, Acker, Bolt ×3; labels: Tuple(id=40), Tuple(id=4); Acker table: ID = 42, V = 40^4]

Slide 41

Slide 41 text

Message Processing Guarantee
[Diagram: Spout, Acker, Bolt ×3; labels: Tuple(id=40), Tuple(id=4), (id=40); Acker table: ID = 42, V = 40^4]

Slide 42

Slide 42 text

Message Processing Guarantee
[Diagram: Spout, Acker, Bolt ×3; labels: Tuple(id=40), Tuple(id=4), (id=4); Acker table: ID = 42, V = 4]

Slide 43

Slide 43 text

Message Processing Guarantee
[Diagram: Spout, Acker, Bolt ×3; label: ack(id=42); Acker table: ID = 42, V = 0]
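
The V column in the slides above is a running XOR: every tuple id is XOR-ed into the value once when the tuple is created and once when it is acked, so the value returns to 0 exactly when every tuple in the tree has been acked. Below is a minimal sketch of that bookkeeping, using the slides' own ids (42, 40, 4). The `Acker` class and its method names are hypothetical and are not Storm's actual implementation.

```python
from typing import Dict, Iterable, List


class Acker:
    """Tracks one XOR value per root tuple, as in the slides' ID / V table."""

    def __init__(self) -> None:
        self.pending: Dict[int, int] = {}   # root id -> current XOR value (V)
        self.completed: List[int] = []      # root ids reported back to the spout

    def track(self, root_id: int) -> None:
        """The spout emitted a root tuple: V starts as the root id itself."""
        self.pending[root_id] = root_id

    def update(self, root_id: int, tuple_ids: Iterable[int]) -> int:
        """XOR newly created or acked tuple ids into V; V == 0 means done."""
        value = self.pending[root_id]
        for tuple_id in tuple_ids:
            value ^= tuple_id
        if value == 0:
            del self.pending[root_id]
            self.completed.append(root_id)  # corresponds to ack(id=42)
        else:
            self.pending[root_id] = value
        return value


acker = Acker()
acker.track(42)                               # V = 42
history = [acker.pending[42]]
history.append(acker.update(42, [40, 4]))     # new tuples 40 and 4: V = 42^40^4
history.append(acker.update(42, [42]))        # ack 42: V = 40^4
history.append(acker.update(42, [40]))        # ack 40: V = 4
history.append(acker.update(42, [4]))         # ack 4:  V = 0, root 42 is done
```

The appeal of this scheme is that the acker needs a constant amount of memory per root tuple, no matter how many tuples the tree contains. It relies on ids being unlikely to collide, so the tiny ids used here are for illustration only.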

Slide 44

Slide 44 text

Resilience
• A centralized component (Nimbus) coordinates deployment and starts workers
• Workers run distributed & are supervised
• Online state is persisted into ZooKeeper
• Every component may fail
[Diagram: Nimbus, ZK ×3, Worker ×2]

Slide 45

Slide 45 text

Use-Case
• Use-Case: Online A/B Testing
• Contextualization: determine Clique (A | B) online
• Reconfigure the A/B test really quickly
[Diagram: Spout, Clickstream, ∑, New Configuration, User ×2]
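
One way to picture "determine Clique online" plus quick reconfiguration is a small piece of state that assigns each user to A or B on the fly and can be re-parameterized while the stream keeps flowing. The sketch below assumes a hash-based split controlled by a single `share_a` parameter. That mechanism, the `ABTest` class and its method names are assumptions made for illustration; the slides do not say how the clique is actually determined.

```python
import hashlib


class ABTest:
    """Assigns users to clique A or B; the split can be changed at runtime."""

    def __init__(self, share_a: float) -> None:
        self.share_a = share_a  # fraction of users that should land in A

    def reconfigure(self, share_a: float) -> None:
        """Apply a new configuration without restarting anything."""
        self.share_a = share_a

    def clique(self, user_id: str) -> str:
        """Deterministically map a user id to a bucket in [0, 1)."""
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
        return "A" if bucket < self.share_a else "B"


test = ABTest(share_a=1.0)
before = test.clique("user-1")   # everyone is in A while share_a is 1.0
test.reconfigure(share_a=0.0)
after = test.clique("user-1")    # everyone is in B once share_a is 0.0
```

Because the assignment is a pure function of the user id and the current configuration, any bolt instance can compute it without coordination.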

Slide 46

Slide 46 text

Use-Case
• Use-Case: Social Graph Update Propagation
• Send E-Mail to B
• Update Recommendation Matrix for A (and B)
[Diagram: Spout, ‘A follows B now’, A ×2, B ×2, Bolt ×2, New Configuration, New ML Model, Send EMail]

Slide 47

Slide 47 text

Contextualization with Storm
• Contextualization ✓
• Store Users’ context in-memory using Bolts
• Continuously persist state into stable storage
• Towards real-time context to every request
[Diagram: Spout, Consolidated Event-Stream, User ×3, Recommender, Trending Stuff / global stats, Anti-Spam]
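
The idea of keeping each user's context in memory inside a bolt and persisting it continuously can be sketched as follows. This is a single-process illustration under stated assumptions: `ContextBolt` is a hypothetical class, a plain dict stands in for stable storage, and the event shape `{"user": ..., "action": ...}` is invented for the example.

```python
import json
from collections import defaultdict
from typing import Dict, List


class ContextBolt:
    """Keeps per-user context in memory and flushes it to a store regularly."""

    def __init__(self, store: Dict[str, str], flush_every: int = 2) -> None:
        self.context: Dict[str, List[str]] = defaultdict(list)
        self.store = store              # stands in for stable storage
        self.flush_every = flush_every  # flush after this many events
        self.seen = 0

    def process(self, event: Dict[str, str]) -> None:
        """Fold one event into the in-memory context of its user."""
        self.context[event["user"]].append(event["action"])
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self) -> None:
        """Persist a snapshot of every user's context."""
        for user, actions in self.context.items():
            self.store[user] = json.dumps(actions)

    def context_for(self, user: str) -> List[str]:
        """Answer a request with the live, in-memory context."""
        return list(self.context.get(user, []))


store: Dict[str, str] = {}
bolt = ContextBolt(store, flush_every=2)
for event in [
    {"user": "alice", "action": "view"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "click"},
]:
    bolt.process(event)

live = bolt.context_for("alice")          # includes the latest click
persisted = json.loads(store["alice"])    # snapshot taken after two events
```

The in-memory view is always at least as fresh as the persisted one. After a failure, a bolt would rebuild its state from the last snapshot plus the replayed events.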

Slide 48

Slide 48 text

On Storm
• Storm is not a silver bullet
• Rather, Storm is a petri dish for real-time computation and coordination tasks
• Topology changes: stop-start cycle required
• There is no Pig Latin / Hive for Storm
• Advanced topics are added with every release (e.g. Transactional Semantics)

Slide 49

Slide 49 text

Lessons Learned
• De-Contextualization is a bad thing.
• Your data store won’t help you.
• You have to add some magic to your stack.
• Storm has the potential to become the Next Big Thing after Hadoop
• Use Storm to fix the Social Impedance Mismatch Issue

Slide 50

Slide 50 text

Want to change the world with real-time data?
Contact me: Martin Scholl, @zeit_geist

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

                Data Stores (DBMS, NoSQL)                Event Systems (e.g. Storm, S4)
Model           Pull                                     Push
Queries         Run Once                                 Run Continuously
Data            Historic                                 Live
Focus           Retrieval & Storage Format Efficiency    Throughput & Latency
Dataset Size    10^9                                     10^6
Domain          Volume                                   Velocity
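
The pull versus push contrast (a query that runs once over stored data versus a query that runs continuously over arriving events) can be shown side by side. The two classes below are hypothetical stand-ins written for this comparison and do not correspond to any particular product's API.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


class DataStore:
    """Pull model: data is stored first, a query runs once when asked."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def insert(self, row: Row) -> None:
        self.rows.append(row)

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for row in self.rows if predicate(row)]


class EventSystem:
    """Push model: the query is registered first and runs on every event."""

    def __init__(self) -> None:
        self.subscriptions: List[Tuple[Callable[[Row], bool],
                                       Callable[[Row], None]]] = []

    def subscribe(self, predicate: Callable[[Row], bool],
                  callback: Callable[[Row], None]) -> None:
        self.subscriptions.append((predicate, callback))

    def publish(self, event: Row) -> None:
        for predicate, callback in self.subscriptions:
            if predicate(event):
                callback(event)


def is_large(row: Row) -> bool:
    return row["value"] > 10


store = DataStore()
events = EventSystem()
pushed: List[Row] = []
events.subscribe(is_large, pushed.append)   # continuous query, registered once

for value in (5, 20, 30):
    row = {"value": value}
    store.insert(row)     # historic data accumulates
    events.publish(row)   # live data is matched as it arrives

pulled = store.query(is_large)  # one-off query over everything stored so far
```

Both sides end up with the same matches here. The difference is when the work happens: the pull query pays at read time, the push query pays per event and has its answer ready immediately.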

Slide 53

Slide 53 text

A Note on Time
• Real-Time: milliseconds - seconds
• Near Real-Time: seconds - minutes
• Batch: minutes-