Ziggrid: Processing Data in Near Real-Time Using Couchbase

Chris Tse
September 13, 2013

Modern digital learning environments collect a multitude of performance, behavioral, and demographic signals from students. The relationships between these signals can be modeled and wired up using a paradigm called Functional Reactive Programming (FRP). This functional programming style should be familiar to anybody who has developed complex spreadsheets in Excel.

McGraw-Hill Education's R&D lab is leveraging Couchbase Server's support for running JavaScript-based MapReduce functions incrementally to implement a scalable continuous analytics system that updates higher-level aggregates as lower-level signals are added or changed. Christopher Tse, Head of R&D at MHE, and Dr. Gareth Powell, a functional programming expert and Chief Scientist of Ziniki Networks, will present a data processing architecture that provides Hadoop-like analytic power in near real-time.

Transcript

  1. Ziggrid: Processing Data in Near Real-Time Using Couchbase

    Christopher Tse (Head of R&D, McGraw-Hill Education)
    Gareth Powell, Ph.D. (Chief Scientist, Ziniki Network)
    CouchConf SF 2013 - Sep 13, 2013
  2. Leveraging EmberJS, a JavaScript MVC framework, to rethink the teaching and learning experiences on the Web and on mobile devices

    Slide label: HTML
  3. Collecting and analyzing multiple streams of student engagement, performance, and demographics for dashboards.

    Diagram labels: Data; FACT; Dimension (×5)
  4. EdSense: Real-time Reactions

    Diagram labels: Action Collections; Learning Style; Engagement; User Intents; Recommendations; Reaction; Activity Log; Previously; Achievements; Efficacy
  6. Learning Portal

    • Designed and built as a collaboration between MHE Labs and Couchbase
    • Serves as proof-of-concept and testing harness for Couchbase + ElasticSearch integration
    • Available for download and further development as open source code: http://github.com/couchbaselabs/learningportal

    Unveiled during CouchConf SF 2012
  7. When we say: SQL

    • We mean: Some-sort-of query language
    • So we can: Declaratively express the logic for the machine to calculate and process
    • But: Processing complex, multi-layered queries upon request can be slow

    When we say: ETL

    • We mean: To extract, transform and load in steps
    • So we can: Store the results from the intermediate or final steps of our calculations
    • But: Stored data gets out-of-sync with reality. And refresh is often expensive
  8. WTF is FRP?

    Functional reactive programming (FRP) is a programming paradigm for reactive programming using the building blocks of functional programming. The key traits of FRP are:

    • The concept of "behaviors" or "signals" which model values that vary over continuous time.
    • The concept of "events" which have occurrences at finitely many points in time.
    • A means to change the FRP system in response to events, generally termed "switching".
    • The separation of evaluation details such as sampling rate from the reactive model.

    An additional common but contentious trait is a notion of consistency when ordering events (not just within one stream). Variants include synchrony and glitch freedom.

    The semantic model of FRP in side-effect free languages is typically in terms of continuous functions, and typically over time. In contrast, integration with a host language that has side-effects is typically given in terms of data flow or dependency graphs by extending the typical operational semantics to manipulate and use them.
  9. WTF is FRP?

    The full definition from slide 8 is shown again, stamped "TL;DR".
  10. Excel is FRP

    • Functional: Every cell is either a value or an f(x) that generates a value
  11. Excel is FRP

    • Functional: Every cell is either a value or an f(x) that generates a value
    • Reactive: If you change one cell, all the other cells that refer to it change immediately
  13. Excel is FRP

    • Functional: Every cell is either a value or an f(x) that generates a value
    • Reactive: If you change one cell, all the other cells that refer to it change immediately
    • Programming: Yes, you are programming when you create a model in an Excel spreadsheet
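
The spreadsheet analogy above can be sketched in a few lines of JavaScript. This is only an illustration: the `Sheet` class and its `set`/`get` methods are hypothetical names, not part of Excel or Ziggrid, and the sketch recomputes formulas on read (pull-based) rather than pushing changes the way a real spreadsheet does.

```javascript
// Minimal spreadsheet-style sketch: a cell holds either a value or a formula.
// Sheet, set and get are hypothetical names used only for this illustration.
class Sheet {
  constructor() {
    this.cells = new Map(); // cell name -> number, or a formula function
  }

  // Store a plain value or a formula; formulas read other cells through `get`.
  set(name, valueOrFormula) {
    this.cells.set(name, valueOrFormula);
  }

  // Pull-based evaluation: formulas are recomputed on every read, so a change
  // to any input shows up the next time a dependent cell is read.
  get(name) {
    const cell = this.cells.get(name);
    return typeof cell === "function" ? cell((ref) => this.get(ref)) : cell;
  }
}

const sheet = new Sheet();
sheet.set("A1", 2);
sheet.set("A2", 3);
sheet.set("A3", (get) => get("A1") + get("A2")); // plays the role of =SUM(A1:A2)
sheet.set("B1", (get) => get("A3") * 10);        // depends on a formula cell

const before = sheet.get("B1"); // (2 + 3) * 10 = 50
sheet.set("A1", 7);             // change one input cell
const after = sheet.get("B1");  // (7 + 3) * 10 = 100
```

Recomputing on read keeps the sketch short; a system that must push updates to clients would instead track which cells depend on which and recompute only those.
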
  14. Excel is FRP

    • Start with a simple sum(): Adding numbers within one worksheet
    • Add more tabs: To reflect higher level aggregates
  15. Excel is FRP

    • Start with a simple sum(): Adding numbers within one worksheet
    • Add more tabs: To reflect higher level aggregates
    • Draw fancy graphs: That visualize the valuable aggregates
  16. What if...

    Instead of... → We have...

    • Data Model: Cells inside Sheets → Documents in JSON
    • Language: =SUM(A1:B10) → function Sum() { ... }
    • Calculating: When you open the file → All the time in the cloud
    • Visualization: Supported chart types → Anything drawable in HTML5
  18. Ziggrid is FRP

    • Stores values in JSON: Inside a Couchbase cluster
    • Specifies f(x) in JSON: Also builds a dependency graph

    Diagram labels: f(x), f(x), f(x)
  19. Ziggrid is FRP

    • Stores values in JSON: Inside a Couchbase cluster
    • Specifies f(x) in JSON: Also builds a dependency graph
    • Push data out via JSON: So clients can render data in HTML5, etc

    Diagram labels: f(x), f(x), f(x)
  20. Ziggrid is FRP

    • Stores values in JSON: Inside a Couchbase cluster
    • Specifies f(x) in JSON: Also builds a dependency graph
    • Push data out via JSON: So clients can render data in HTML5, etc

    Diagram labels: f(x), f(x), f(x); "The Ziggurat"
  21. Ziggrid is FRP

    • Stores values in JSON: Inside a Couchbase cluster
    • Specifies f(x) in JSON: Also builds a dependency graph
    • Push data out via JSON: So clients can render data in HTML5, etc

    Diagram labels: f(x), f(x), f(x); "The Ziggurat"; JSON logo
  22. Example: Baseball Data Analysis Model

    Layers: Raw Events; Enhanced Events; Summaries; Rankings; Correlations; Snapshots; Composites

    Diagram nodes: Plate Appearances; Player Situation Outcome; Player Totals; Correlate vs Situation; Snapshots of Player Totals; Player Profile; Snapshots of Correlation; Game Results; Leaderboards (HR, AVG, PROD); Win / Loss Record
  23. Beane Counter Architecture

    • Front-end: HTML5 Data Tables and SVG Visualization (Ember.js + D3.js)
    • via WebSockets
    • Middleware: Model Description, Calculation, and Event Chaining (Java)
    • via Memcached Protocol
    • Backend: Raw and Aggregated Data Storage and Indexing (Couchbase JSON Store + Incremental MapReduce)
  24. Ziggrid Models

    • Data model described in JSON structure

        {
          "name": "plateAppearance",
          "fields": [
            {
              "name": "team", // The team identifier from the Retrosheet Event file
              "type": "string",
              "key": true
            },
            {
              "name": "player", // The player identifier from the Retrosheet Event file
              "type": "string",
              "key": true
            },
            {
              "name": "season", // Year represented as YYYY
              "type": "string",
              "key": true
            },
            {
              "name": "dayOfYear", // 1-365, proxy for which game it was
              "type": "number",
              "key": true
            },
            {
              "name": "inning", // 1-9 for regular innings
              "type": "number",
              "key": true
            },
            ...
        }
  25. Ziggrid Algorithms

    • Data model described in JSON structure
    • Define all calculation via commutative and associative operators

        {
          "enhanced": "situation",
          "from": "plateAppearance",
          "enhance": {
            "player": "player",
            "season": "season",
            "dayOfYear": "dayOfYear",
            "atbat": {
              "op": "+",
              "args": [
                { "op": "*", "args": [ 3, "inning" ] },
                "outs",
                -3
              ]
            },
            "bases": "bases",
            "lead": {
              "op": "group",
              "value": {
                "op": "ifelse",
                "test": "home",
                "true": { "op": "-", "lhs": "homeScore", "rhs": "awayScore" },
                "false": { "op": "-", "lhs": "awayScore", "rhs": "homeScore" }
              },
              "dividers": [ -3, -1, 0, 2 ], // (-inf, -3], (-3, -1], (-1, 0], (0, 2],
              "moreThan": 3 // (2,inf)
            },
  26. Ziggrid Composites

    • Data model described in JSON structure
    • Define all calculation via commutative and associative operators
    • Projecting data via composite definition

        {
          "composeInto": "profile",
          "from": "correlate_on_situation_groupedBy_player_and_season",
          "key": [ "player/", { "field": "player" } ],
          "fields": { "clutchness": "correlation" }
        },
        {
          "leaderboard": "hotness",
          "from": "snapshot_playerSeasonToDate",
          "groupby": [ [ "season", "dayOfYear" ] ],
          "sortby": [ "average" ],
          "order": "desc",
          "values": [ "player" ]
        },
        {
          "composeInto": "profile",
          "from": "snapshot_playerSeasonToDate",
          "key": [ "player/", { "field": "player" } ],
          "fields": { "hotness": "average" }
        }
        ...
        ]
  27. Future Improvements

    • GREATER SCALABILITY: Using Couchbase View Engine to do more of the processing in the database via Incremental MapReduce. Currently, only the leaderboards are computed using views.
    • DEEPER ANALYTICS: Expand the functions supported by Ziggrid to perform transformation, statistical calculations typical of Big Data analysis, and even ones for machine learning.
    • EASIER MODEL DEVELOPMENT: Allow in-browser development of new models using a subset of data. We need to finish developing a pure JavaScript-based Ziggrid processing engine.
    • REDUCED LATENCY: Using UPR protocol to be notified of changes inside Couchbase to allow more immediate, and thus more real-time, propagation of events up the Ziggurat.
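
The emphasis on commutative and associative operators in slides 25 and 26 matters for the incremental processing mentioned here: if partial results can be merged in any order and any grouping, a stored aggregate can be updated by folding in only the new events instead of recomputing everything. The sketch below illustrates that idea in plain JavaScript. It does not use the Couchbase view API, and `merge`, `fromEvent`, and the `hit` field are hypothetical names chosen for the example.

```javascript
// Partial batting totals. Merging two partials is commutative and associative,
// so partials can be combined in any order or grouping.
const emptyTotals = { atBats: 0, hits: 0 };

function merge(a, b) {
  return { atBats: a.atBats + b.atBats, hits: a.hits + b.hits };
}

// Turn one raw event into a partial total ("hit" is a made-up field).
function fromEvent(event) {
  return { atBats: 1, hits: event.hit ? 1 : 0 };
}

// The average is derived from the stored sums, because averages themselves
// cannot be merged by simple addition.
function battingAverage(totals) {
  return totals.atBats === 0 ? 0 : totals.hits / totals.atBats;
}

const events = [{ hit: true }, { hit: false }, { hit: false }, { hit: true }];

// Full recomputation over every event.
const fullTotals = events.map(fromEvent).reduce(merge, emptyTotals);

// Incremental update: keep a stored aggregate and fold in only the new event.
let storedTotals = events.slice(0, 3).map(fromEvent).reduce(merge, emptyTotals);
storedTotals = merge(storedTotals, fromEvent(events[3]));

// Both routes end with 4 at-bats and 2 hits, an average of 0.5.
const incrementalAverage = battingAverage(storedTotals);
```
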
  28. Thanks to 2 members of the Ember.js Core Team

    Who helped us design and code the sexy Ember + D3.js + WebSockets front-end: @machty @stefanpenner