
On The Social Impedance Mismatch in Data Storage

Data storage affects the way systems are modelled. This slide set gives a short justification of why the data storage and processing layer needs an overhaul to overcome its current limitations.

Martin Scholl

February 27, 2012

Transcript

  1. I have a suspicion: Data Store Software is not social

    Martin Scholl <martin@infinipool.com> @zeit_geist
  2. • Notes
      • are a fact; can get copied.
      • consist of immutable & absolute entities
      • have a fixed beginning and ending
      • music sheet = music essence; “Music’s NoSQL DB”
    • Music
      • Making music is a process. You can record but not copy music.
      • flows with the rhythm
      • lives by the interactions
      • is uniquely determined in space and time: the music’s context
  3. [Diagram: Webservers, API-Endpoints, and data stores (Graph-DB, Flock-DB). Annotations: 1. Social Interaction: Data + Context; 2. just Data]
  4. • Data Stores
      • store facts.
      • facts are fixed and absolute
      • facts are uniquely determined by key / ID
      • Data Stores are the source of “truth”
      • contain what has happened.
  5. • Notes
      • are a fact; can get copied.
      • consist of immutable & absolute entities
      • have a fixed beginning and ending
      • music sheet = music essence; “Music’s NoSQL DB”
    • Data Stores
      • store facts.
      • facts are fixed and absolute
      • facts are uniquely determined by key / ID
      • Data Stores are the source of “truth”
      • contain what has happened.
  6. Lose Information w/ your fav. Data Store
    • Data in a Data Store gets de-contextualized.
    • You don’t get to know the origin of data but just the fact itself.
    • irrecoverable information loss!
    • There is a severe social impedance mismatch
  8. [Diagram: Webservers, API-Endpoints, Context-Engines, and data stores (Graph-DB, Flock-DB). Annotations: Data; Data + Context; Data + Context Logic]
  9. Context Engine Requirements
    • must have a flexible programming model
    • must be scalable and resilient
    • must be able to integrate and process data from high velocity data sources
  10. Nathan Marz’s Storm
    • has a flexible programming model
    • is scalable and resilient
    • integrates and processes data from high velocity data sources
  11. Nathan Marz’s Storm
    • implemented in Clojure + Java
    • was proprietary to BackType
    • open-sourced in Sep 2011
    • licensed under the Eclipse Public License
    • http://github.com/nathanmarz/storm
  12. What does Storm do?
    • it’s like M/R but for real-time computation
    • works over streams
    • communicates tuples in a cluster
    [Diagram: topology of one Spout and five Bolts]
  13. What does Storm do?
    • runs in local development mode or distributed
    • starts JVMs (workers)
    • at-least-once message processing guarantee
    • Storm’s contributions: scalability, resiliency and processing guarantee
    [Diagram: topology of one Spout and five Bolts]
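The at-least-once guarantee named on slide 13 can be illustrated with a minimal sketch in plain Python. It does not use Storm's API; `run_at_least_once`, `flaky_bolt`, and `max_attempts` are hypothetical names chosen for illustration. The sketch assumes that a failed message is simply replayed, which means a bolt's side effects may run more than once.

```python
from collections import deque

def run_at_least_once(messages, bolt, max_attempts=3):
    """Feed each message to `bolt`; replay it if the bolt raises.

    Hypothetical helper, not part of Storm: it only mimics the idea
    that unacknowledged messages are sent again.
    """
    pending = deque((msg, 1) for msg in messages)
    while pending:
        msg, attempt = pending.popleft()
        try:
            bolt(msg)
        except Exception:
            if attempt < max_attempts:
                pending.append((msg, attempt + 1))  # schedule a replay

seen = []            # side effects performed by the bolt
failed_once = set()  # messages that already triggered the simulated crash

def flaky_bolt(msg):
    seen.append(msg)  # the side effect happens before the failure
    if msg == "b" and msg not in failed_once:
        failed_once.add(msg)
        raise RuntimeError("simulated worker crash")

run_at_least_once(["a", "b", "c"], flaky_bolt)
```

In this run every message is processed, but "b" is recorded twice. Under an at-least-once guarantee, downstream logic therefore has to tolerate duplicates, for example by being idempotent.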
  14. Some Use-Cases
    • Analysis on Event-Streams:
      • Filtering, Counting, Aggregation
      • Monitoring, etc.
    • Parallel and Distributed RPC
    • Contextualization
    [Diagram: one Spout and three Bolts]
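The filtering and counting use-cases on slide 14 can be sketched as a chain of Python generators, where each stage plays the role of a spout or bolt. This is an illustration under assumptions, not Storm code: `spout`, `filter_bolt`, `count_bolt`, and the event fields `type` and `page` are hypothetical.

```python
from collections import Counter

def spout(events):
    """Hypothetical source: yields events one at a time, like a stream."""
    yield from events

def filter_bolt(stream, wanted_type):
    """Pass through only events of the wanted type."""
    for event in stream:
        if event["type"] == wanted_type:
            yield event

def count_bolt(stream, key):
    """Keep a running count per key and emit a snapshot after each event."""
    counts = Counter()
    for event in stream:
        counts[event[key]] += 1
        yield dict(counts)

events = [
    {"type": "click", "page": "/home"},
    {"type": "view", "page": "/home"},
    {"type": "click", "page": "/pricing"},
    {"type": "click", "page": "/home"},
]

snapshots = list(count_bolt(filter_bolt(spout(events), "click"), "page"))
```

Each stage consumes its input lazily, so counts are updated as events arrive instead of after a batch has completed.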
  15. Message Processing Guarantee
    [Diagram: Spout, Acker, and three Bolts. Acker table (ID, V): 42, 40^4. Labels: Tuple(id=40), Tuple(id=4)]
  16. Message Processing Guarantee
    [Diagram: Spout, Acker, and three Bolts. Acker table (ID, V): 42, 40 ^ 4. Labels: (id=40), Tuple(id=40), Tuple(id=4)]
  17. Message Processing Guarantee
    [Diagram: Spout, Acker, and three Bolts. Acker table (ID, V): 42, 4. Labels: (id=4), Tuple(id=40), Tuple(id=4)]
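The acker table on slides 15 to 17 suggests XOR-based tracking: the acker keeps one value per root ID, XORs in the ID of every tuple that is emitted, and XORs it in again when the tuple is acknowledged. Once the value returns to zero, everything has been processed. The following is a minimal sketch of that idea using the numbers from the slides (root ID 42, tuple IDs 40 and 4). The class and method names are hypothetical, and real tuple IDs would be large random numbers, not small integers.

```python
class Acker:
    """Toy XOR tracker; a sketch of the idea, not Storm's implementation."""

    def __init__(self):
        self.pending = {}  # root ID -> XOR of all outstanding tuple IDs

    def track(self, root_id, tuple_id):
        """Register a newly emitted tuple belonging to `root_id`."""
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def ack(self, root_id, tuple_id):
        """Acknowledge a tuple; return True once nothing is outstanding."""
        value = self.pending[root_id] ^ tuple_id
        if value == 0:
            del self.pending[root_id]
            return True
        self.pending[root_id] = value
        return False

acker = Acker()
acker.track(42, 40)
acker.track(42, 4)
value_after_emit = acker.pending[42]       # 40 ^ 4
done_after_first = acker.ack(42, 40)       # leaves 4 outstanding
value_after_first = acker.pending[42]
done_after_second = acker.ack(42, 4)       # value drops to 0
```

The appeal of this design is constant memory per root ID: the acker stores a single number regardless of how many tuples descend from the original message.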
  18. Resilience
    • a centralized component (Nimbus) coordinates deployment and starts workers
    • Workers run distributed & are supervised
    • Online state is persisted into ZooKeeper
    • Every component may fail
    [Diagram: Nimbus, three ZooKeeper (ZK) nodes, and Workers]
  19. Use-Case: Online A/B Testing
    • Contextualization: determine clique (A | B) online
    • Reconfigure the A/B test quickly
    [Diagram: Spout (Clickstream), ∑, New Configuration, Users]
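One way to determine a user's group online and to reconfigure the split quickly is deterministic hashing into buckets. The slide does not say how the assignment is done, so this is only a sketch under that assumption; `ABAssigner`, `bucket_of`, and `percent_a` are hypothetical names.

```python
import hashlib

def bucket_of(user_id):
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

class ABAssigner:
    """Assigns users to group A or B; the split can be changed at runtime."""

    def __init__(self, percent_a=50):
        self.percent_a = percent_a  # share of users in group A, 0..100

    def reconfigure(self, percent_a):
        """Apply a new configuration without restarting anything."""
        self.percent_a = percent_a

    def assign(self, user_id):
        return "A" if bucket_of(user_id) < self.percent_a else "B"

assigner = ABAssigner(percent_a=50)
first = assigner.assign("user-1")
second = assigner.assign("user-1")   # same user, same group

assigner.reconfigure(percent_a=100)
all_a = [assigner.assign(f"user-{i}") for i in range(20)]
```

Because the bucket depends only on the user ID, no per-user state is needed for the assignment itself, and a new configuration takes effect with the next event.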
  20. Use-Case: Social Graph Update Propagation
    • Send e-mail to B
    • Update recommendation matrix for A (and B)
    [Diagram: Spout (‘A follows B now’), Bolts for A and B, New Configuration, New ML Model, Send EMail]
  21. Contextualization with Storm
    • Contextualization ✓
    • Store users’ context in-memory using Bolts
    • Continuously persist state into stable storage
    • Towards real-time context for every request
    [Diagram: Spout (Consolidated Event-Stream), Users, Recommender, Trending Stuff / global stats, Anti-Spam]
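Keeping users' context in memory while continuously persisting it can be sketched as follows. This is plain Python rather than a Storm bolt, and all names (`ContextBolt`, `flush_every`, the `user` and `action` event fields, the JSON snapshot file) are assumptions made for illustration.

```python
import json
import os

class ContextBolt:
    """Holds per-user context in memory and snapshots it to disk."""

    def __init__(self, snapshot_path, flush_every=2):
        self.snapshot_path = snapshot_path
        self.flush_every = flush_every  # events between two snapshots
        self.contexts = {}              # user -> context dict
        self.unflushed = 0

    def process(self, event):
        """Update the user's context with one event from the stream."""
        ctx = self.contexts.setdefault(
            event["user"], {"events": 0, "last_action": None}
        )
        ctx["events"] += 1
        ctx["last_action"] = event["action"]
        self.unflushed += 1
        if self.unflushed >= self.flush_every:
            self.flush()

    def flush(self):
        """Write a snapshot atomically: temp file first, then rename."""
        tmp_path = self.snapshot_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(self.contexts, f)
        os.replace(tmp_path, self.snapshot_path)
        self.unflushed = 0

    def context_for(self, user):
        """Answer a request from memory; returns None for unknown users."""
        return self.contexts.get(user)
```

Requests are served from the in-memory dictionary, and the snapshot on disk lags behind by at most `flush_every` events. That window is the trade-off between write load and how much state would need to be rebuilt after a crash.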
  22. On Storm
    • Storm is not a silver bullet
    • Rather, Storm is a petri dish for real-time computation and coordination tasks
    • Topology changes require a stop-start cycle
    • There is no Pig Latin / Hive for Storm
    • Advanced topics are added with every release (e.g. Transactional Semantics)
  23. Lessons Learned
    • De-Contextualization is a bad thing.
    • Your data store won’t help you.
    • You have to add some magic to your stack.
    • Storm has the potential to become the Next Big Thing after Hadoop
    • Use Storm to fix the Social Impedance Mismatch Issue
  24. Want to change the world with real-time data? contact me:

    Martin Scholl <martin@infinipool.com> @zeit_geist
  25. Data Stores (DBMS, NoSQL) vs. Event Systems (e.g. Storm, S4)
    • Model: Pull vs. Push
    • Queries: Run Once vs. Run Continuously
    • Data: Historic vs. Live
    • Focus: Retrieval & Storage Format Efficiency vs. Throughput & Latency
    • Dataset Size: 10^9 vs. 10^6
    • Domain: Volume vs. Velocity
  26. A Note on Time
    • Real-Time: milliseconds - seconds
    • Near Real-Time: seconds-minutes
    • Batch: minutes-