Oporto MongoDB User Group, Visualizing an evolving online social network using the twitter stream

link to demo video here: http://www.screenr.com/462H

Mário Cordeiro

October 15, 2013

Transcript

  1. objective: Visualizing an evolving online social network using the twitter stream

    “static” twitter directed network
    • following and followers user relations
    • do not reflect “realtime” interaction between users
  2. objective: Visualizing an evolving online social network using the twitter stream

    “dynamic” twitter directed network
    • reply, retweet, user mention
    • favorites, hash tags, urls
    • “current” interaction between users
    • information spreading
  3. objective: Visualizing an evolving online social network using the twitter stream

    random sample of all public tweet statuses
    • https://stream.twitter.com/1.1/statuses/sample.json
      – firehose: all public statuses
      – gardenhose: 10% of all public statuses
      – spritzer: 1% of all public statuses (free)
    • response in JavaScript Object Notation (JSON)
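
The interactions that make up the “dynamic” network (reply, retweet, user mention) arrive as fields of each JSON status. The sketch below is not taken from the deck: `extractEdges` is a hypothetical helper, and it assumes the v1.1 status fields `user.id`, `retweeted_status.user.id`, `in_reply_to_user_id` and `entities.user_mentions[].id`. The sample values are made up.

```javascript
// Hypothetical helper: turn one parsed status from the stream into directed edges.
function extractEdges(status) {
  // The stream also carries non-tweet messages (e.g. delete notices) with no user.
  if (!status.user) return [];
  const source = status.user.id;
  const edges = [];
  if (status.retweeted_status) {
    edges.push({ source, target: status.retweeted_status.user.id, type: 'retweet' });
  }
  if (status.in_reply_to_user_id != null) {
    edges.push({ source, target: status.in_reply_to_user_id, type: 'reply' });
  }
  const mentions = (status.entities && status.entities.user_mentions) || [];
  for (const mention of mentions) {
    edges.push({ source, target: mention.id, type: 'mention' });
  }
  return edges;
}

// One line of the stream, trimmed to the fields used here (values are made up).
const line = JSON.stringify({
  user: { id: 1 },
  in_reply_to_user_id: null,
  retweeted_status: { user: { id: 2 } },
  entities: { user_mentions: [{ id: 2 }, { id: 3 }] }
});

const edges = extractEdges(JSON.parse(line));
console.log(edges);
```

Each edge keeps its type, so a later step could weight or colour retweets, replies and mentions differently.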
  4. objective: Visualizing an evolving online social network using the twitter stream

    @EdwardTufte @mashable @kallemaatta @vegan_mum @janettwokay @panji_g @Feenix747 +34 retweets
  5. objective: Visualizing an evolving online social network using the twitter stream

    @EdwardTufte @mashable @kallemaatta @vegan_mum @janettwokay @panji_g @Feenix747 +34 retweets
  6. objective: Visualizing an evolving online social network using the twitter stream

    expected data throughput
    • average of 175 million tweets sent per day
    • spritzer (1%): roughly 1,750,000 tweets per day
    • ~20 tweets per second
    • 245 MB per day (1,750,000 tweets x 140 chars)
    • ~2800 bytes per second
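
The throughput figures above follow from simple arithmetic, reproduced here as a check. The variable names are illustrative. The 140 bytes per tweet is the slide's own assumption (message text only, one byte per character); the JSON document delivered by the stream is larger than that.

```javascript
const tweetsPerDay = 175e6;          // average from the slide
const spritzerPercent = 1;           // spritzer delivers 1% of public statuses
const bytesPerTweet = 140;           // slide's assumption: text only, 1 byte per char
const secondsPerDay = 24 * 60 * 60;  // 86400

const sampledPerDay = tweetsPerDay * spritzerPercent / 100;
const tweetsPerSecond = sampledPerDay / secondsPerDay;
const bytesPerDay = sampledPerDay * bytesPerTweet;
const bytesPerSecond = bytesPerDay / secondsPerDay;

console.log({ sampledPerDay, tweetsPerSecond, bytesPerDay, bytesPerSecond });
```

The results come out slightly above the rounded slide values: about 20.25 tweets per second and about 2836 bytes per second.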
  7. problems to solve

    data storage and acquisition:
    • JSON, fast inserts, minimal data transformation
    data processing and transformation:
    • group data by users (nodes)
    • build relations of users (edges)
    event-driven web server:
    • time from message arrival to visualization
    web client and visualization:
    • time decay: add new nodes/edges, remove old ones
  8. data storage

    NoSQL document database:
    • storing tweet data in native format (JSON)
    • bulk import of documents
    • decoupling from the tweet format
    • MongoDB
  9. data storage: NoSQL document database

    importing documents with mongoimport:
    • supports JSON and CSV documents
    • supports streams of documents
    • simple, fast and effective:

      curl -k https://stream.twitter.com/1.1/statuses/sample.json <OAuth> | mongoimport -d twitter -c some.capped.collection
  10. data storage

    query data:
    • using the mongo shell (mongo and find()):

      printjson(db.some.capped.collection.find().limit(1).toArray())
  11. data processing and transformation

    processing the data:
    • MongoDB supports map-reduce tasks
    • map-reduce is implemented in server-side JavaScript
    solving the limited resources problem:
    • capped collections
      – work in a way similar to circular buffers: once a collection fills its allocated space, it makes room for new documents by overwriting the oldest documents in the collection

      db.createCollection("some.capped.collection", {capped:true, size:1000000});
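
The circular-buffer behaviour described above can be illustrated in plain JavaScript. This is only an analogy, not MongoDB code: `CappedBuffer` is a hypothetical class, and it caps by document count, whereas the `size` option of `createCollection` is a limit in bytes.

```javascript
// Hypothetical in-memory analogy of a capped collection (capped by count, not bytes).
class CappedBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = new Array(capacity);
    this.next = 0;   // slot that the next insert will write to
    this.count = 0;  // number of documents currently stored
  }

  insert(doc) {
    // Once the buffer is full, this overwrites the oldest document.
    this.items[this.next] = doc;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Documents in insertion order, oldest first.
  toArray() {
    const start = this.count < this.capacity ? 0 : this.next;
    const out = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.items[(start + i) % this.capacity]);
    }
    return out;
  }
}

const buffer = new CappedBuffer(3);
for (let id = 1; id <= 5; id++) {
  buffer.insert({ _id: id });
}
const remaining = buffer.toArray().map((doc) => doc._id);
console.log(remaining); // [ 3, 4, 5 ]
```

After five inserts into a buffer of three, documents 1 and 2 have been overwritten, which is the property that keeps storage bounded while the stream keeps flowing.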
  12. data processing and transformation

    trigger event after new document insert:
    • tailable cursors
      – conceptually equivalent to the tail Unix command with the -f option (i.e. with “follow” mode). After clients insert additional documents into a capped collection, the tailable cursor will continue to retrieve documents

      while (1) {
        // (re)open the cursor just after the last document seen
        cursor = coll.find({ 'created_at': { '$gt': lastVal } });
        cursor.addOption(2);   // tailable
        cursor.addOption(32);  // await data
        while (cursor.hasNext()) {
          var doc = cursor.next();    // consume the document so the loop advances
          lastVal = doc.created_at;   // remember the position for the next find()
          // incremental map-reduce
        }
      }
  13. data processing and transformation

    map-reduce
    • map() – group the tweets by retweeted user
    • reduce() – build lists
      » retweeted_status_user_id
      » entities_user_mentions_id

      db.some.capped.collection.mapReduce(time_map, time_reduce, {
        out: { reduce: "some.capped.collection.retweets_per_time_interval" },
        scope: { sample_interval: sample_interval }
      });
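
The deck does not show the bodies of `time_map` and `time_reduce`, so the following is only a sketch of the grouping idea, run in memory rather than on the server. `mapTweet`, `reduceValues` and `runMapReduce` are hypothetical names, and the value shape (a `user_id` list plus a `retweeted_count`) is a simplified assumption. In MongoDB itself the map function reads the document from `this` and calls a global `emit`.

```javascript
// Hypothetical map step: key each retweet by the user who was retweeted.
function mapTweet(tweet, emit) {
  if (!tweet.retweeted_status) return;
  emit(tweet.retweeted_status.user.id, {
    user_id: [tweet.user.id],
    retweeted_count: 1
  });
}

// Hypothetical reduce step: merge values into one value of the same shape,
// so that it can safely be re-applied to already reduced values.
function reduceValues(key, values) {
  const result = { user_id: [], retweeted_count: 0 };
  for (const value of values) {
    result.user_id.push(...value.user_id);
    result.retweeted_count += value.retweeted_count;
  }
  return result;
}

// Minimal in-memory stand-in for the server-side map-reduce run.
function runMapReduce(tweets) {
  const groups = new Map();
  for (const tweet of tweets) {
    mapTweet(tweet, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const out = [];
  for (const [key, values] of groups) {
    out.push({ _id: key, value: reduceValues(key, values) });
  }
  return out;
}

// Made-up sample input.
const tweets = [
  { user: { id: 11 }, retweeted_status: { user: { id: 100 } } },
  { user: { id: 12 }, retweeted_status: { user: { id: 100 } } },
  { user: { id: 13 } }, // not a retweet, ignored by the map step
  { user: { id: 14 }, retweeted_status: { user: { id: 200 } } }
];
const results = runMapReduce(tweets);
console.log(JSON.stringify(results));
```

Keeping the reduced value in the same shape as the emitted value matters for the `out: { reduce: ... }` mode used above, where new results are reduced together with those already stored in the output collection.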
  14. data processing and transformation

    "RT ( @Diamantjunior17 ) #dames http://t.co/ihuQSEjZzM"
    • by: Diamantjunio, qwerty1, Asuka_00k and payne_hemmings
  15. data processing and transformation

      {
        "_id": 309971305,
        "value": {
          "created_at_date_truncated": ISODate("2013-06-15T08:55:30Z"),
          "retweeted_status_id": [ NumberLong("345673389547069442") ],
          "retweeted_status_user_id": 309971305,
          "retweeted_status_user_screen_name": "CuteOrNott",
          "text": [ "RT ( @Diamantjunior17 ) #dames http://t.co/ihuQSEjZzM" ],
          "user_id": [ 1476840536, 1235764556, 98346775345, 345356783 ],
          "user_name": [ "Diamantjunio", "qwerty1", "Asuka_00k", "payne_hemmings" ],
          "entities_user_mentions_id": [ 309971305 ],
          "entities_user_mentions_name": [ "Diamantjunior17" ],
          "retweeted_count": 4,
          "user_mention_count": 1,
          "action": "insert"
        }
      },
  16. data processing and transformation: Incremental map-reduce approach

    [diagram: tweets on a timeline t0, t1, t2, t3, … tn split into 5 s intervals; each interval is reduced to edges (RT + UM)]
  17. data processing and transformation: Incremental map-reduce approach

    [diagram: same timeline; the edges (RT + UM) of every interval go through “Add Edges to Graph” into the graph network]
  18. data processing and transformation: Incremental map-reduce approach

    • the graph will grow without bound
    • old nodes and edges should be removed after some predefined time
  19. data processing and transformation: Sliding window map-reduce approach

    [diagram: tweets on a timeline t0, t1, t2, t3, … tn split into 5 s intervals, each reduced to edges (RT + UM); a sliding window (2 x 5s) selects intervals for “Add Edges to Graph” into the graph network]
  20. data processing and transformation: Sliding window map-reduce approach

    [diagram: same elements as slide 19]
  21. data processing and transformation: Sliding window map-reduce approach

    [diagram: the sliding window (2 x 5s) advances; new intervals go through “Add Edges to Graph”, intervals leaving the window go through “Remove Edges from Graph”]
  22. data processing and transformation: Sliding window map-reduce approach

    [diagram: same elements as slide 21]
  23. data processing and transformation: Sliding window map-reduce approach

    [diagram: same elements as slide 21]
  24. data processing and transformation: Sliding window map-reduce approach

    [diagram: sliding window (2 x 5s) with no new edges to add; intervals leaving the window go through “Remove Edges from Graph”]
  25. data processing and transformation: Sliding window map-reduce approach

    [diagram: remaining intervals go through “Remove Edges from Graph”]
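
The sliding-window slides can be read as the following bookkeeping: every interval contributes a batch of edges, and a batch is withdrawn once it falls out of the window. The sketch is an in-memory illustration under that reading, not the deck's implementation. `SlidingWindowGraph` and `advance` are hypothetical names, and the window length is counted in intervals (2, matching the “2 x 5s” label).

```javascript
// Hypothetical sketch: keep only the edges seen in the last N intervals.
class SlidingWindowGraph {
  constructor(windowIntervals) {
    this.windowIntervals = windowIntervals;
    this.batches = [];            // edge batches inside the window, oldest first
    this.edgeCounts = new Map();  // "source->target" -> occurrences inside the window
  }

  // Push the edges of the newest interval; report what the client must add/remove.
  advance(edges) {
    const added = [];
    const removed = [];
    this.batches.push(edges);
    for (const [source, target] of edges) {
      const key = `${source}->${target}`;
      const seen = this.edgeCounts.get(key) || 0;
      if (seen === 0) added.push(key);
      this.edgeCounts.set(key, seen + 1);
    }
    // Drop batches that have slid out of the window.
    while (this.batches.length > this.windowIntervals) {
      for (const [source, target] of this.batches.shift()) {
        const key = `${source}->${target}`;
        const left = this.edgeCounts.get(key) - 1;
        if (left === 0) {
          this.edgeCounts.delete(key);
          removed.push(key);
        } else {
          this.edgeCounts.set(key, left);
        }
      }
    }
    return { added, removed };
  }

  edges() {
    return [...this.edgeCounts.keys()];
  }
}

const graph = new SlidingWindowGraph(2); // 2 intervals of 5 s each
const step1 = graph.advance([['a', 'b']]);
const step2 = graph.advance([['b', 'c']]);
const step3 = graph.advance([['a', 'c']]);
console.log(step3, graph.edges());
```

Counting occurrences per edge means an edge that reappears in a newer interval survives when its older batch expires, so the graph only loses edges that have had no activity inside the window.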
  26. real-time graph visualization: Backend architecture (webserver)

    • main requirement:
      – update the graph adding and removing nodes/edges
    • approach
      – event-driven: update of webpage using MongoDB tailable cursors
      – webserver implemented in node.js
      – events fired by sockets using socket.io
      – access to database using the node MongoDB driver
  27. real-time graph visualization: Backend architecture (webserver)

    • orchestration (node.js)
      – data acquisition (cURL + mongoimport)
      – execution of map-reduce tasks
    • firing of events to clients (node.js)
      – query the capped collection using a tailable cursor to listen for new events (addition or removal of nodes/edges)
      – tutorial: http://sett.ociweb.com/sett/settMar2012.html
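
The event flow described above (tailable cursor in, socket events out) can be sketched with Node's built-in `EventEmitter` standing in for both ends. This is an assumption-laden illustration: neither the MongoDB driver nor socket.io is used here, the event names `addEdges` and `removeEdges` are hypothetical, and only the `"action": "insert"` value appears in the deck, so the `"remove"` value is assumed.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the tailable cursor: emits 'data' for each new document.
const cursorStream = new EventEmitter();
// Stand-in for the socket.io channel towards the browsers.
const clients = new EventEmitter();

// Relay: translate each map-reduce output document into a client event.
cursorStream.on('data', (doc) => {
  const eventName = doc.value.action === 'insert' ? 'addEdges' : 'removeEdges';
  clients.emit(eventName, { node: doc._id, neighbours: doc.value.user_id });
});

// A fake client that records what it receives.
const received = [];
clients.on('addEdges', (msg) => received.push(['add', msg.node]));
clients.on('removeEdges', (msg) => received.push(['remove', msg.node]));

// Simulate two documents arriving in the capped collection.
cursorStream.emit('data', { _id: 309971305, value: { action: 'insert', user_id: [1476840536] } });
cursorStream.emit('data', { _id: 309971305, value: { action: 'remove', user_id: [1476840536] } });
console.log(received);
```

In the architecture from the slides, the first emitter would be replaced by the tailable cursor on the capped collection and the second by the socket.io connection to each browser.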
  28. real-time graph visualization: Client architecture (webpage)

    • process received events from server
    • update graph (sigma.js) with
      – new nodes and edges
      – removed nodes and edges
    • perform layout of the graph using the ForceAtlas2 algorithm
  29. summary

    MongoDB:
    • json
    • mongoimport
    • capped collections
    • tailable cursors
    • map-reduce
    node.js + socket.io + node MongoDB driver
    • event driven webserver
    sigma.js
  30. summary

    using MongoDB
    • powerful json datasource manipulation
      – perfect match between MongoDB and twitter stream
      – map-reduce, aggregation framework
    • building pipeline processing blocks
      – Capped/TTL collections and tailable cursors

    [diagram: pipeline of data transformation (map-reduce) blocks; in: capped collection, out: capped collection, linked by tailable cursors]