
High-throughput data analysis - J Singh, President, Early Stage IT

mongodb
October 07, 2011


MongoBoston 2011

Receiving data from a source that produces 5-10 GB per hour, sustained for 24 hrs, and presenting analysis results as the data streams in has... some interesting challenges. We used MongoDB running on Amazon EC2 to house the data, map/reduce to analyze it, and Django-nonrel to present the results in near real time.



Transcript

  1. High-throughput data analysis: A Streaming Reports Platform
     Authors: J Singh, Early Stage IT; David Zheng, Worcester Polytechnic Institute
     Contributor: Satya Gupta, Virsec Systems
     October 3, 2011
  2. Motivating Problem
     • Resolve Virtual Machine (a product of Virsec Systems) profiles an application and gathers data about it
       – The data analysis used to take several days before conclusions could be drawn
     • Project goals
       – Stream-mode analysis: analysis and reporting should begin within a few seconds of the start of profiling, and update continuously for the duration
       – Data rates up to 5 GB per hour
       – Ability to sustain that rate for 24 hrs
     • Analysis and reporting configured to run in the Amazon EC2 environment
       – Can be scaled up (bigger machines) or scaled out (more machines)
  3. Approach
     • A variant of Eric Ries' Lean Startup approach
       – Introduced in his book, The Lean Startup (Crown Publishing Group, 2011)
       – It is a recipe for learning quickly
       – Re-adapted by us for this and other "learning projects"
     1. Do what it takes to get an end-to-end solution
     2. Measure, Learn, Build, repeat
     3. When the cycle has stabilized, expand scope appropriately
  4. Requirements
     • Fast inserts into the database (a sketch follows this list)
     • The nature and amount of analysis required was hard to judge in the beginning
       – Previous experience with Map/Reduce in the Google App Engine environment had shown promise, but GAE was not appropriate for this application
     • Slick, demo-worthy web interface for presenting results
     • Stream-mode operation
       – Start showing results within a few seconds of starting the Resolve Virtual Machine, and update them periodically as more data is collected and analyzed
     The lessons from this project apply well to other "feed-based" data such as stock prices, server logs, and sensor data.
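In Python-driver terms, the "fast inserts" requirement maps naturally onto batched writes with a relaxed write concern. A minimal sketch, not the project's actual listener code; the database, collection, and record layout are assumptions for illustration:

```python
# Hypothetical illustration of "fast inserts" with pymongo: batch the
# documents and use an unacknowledged write concern (w=0) so the client
# does not wait for a server reply on every document.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

coll = MongoClient()["profiling"].get_collection(
    "raw_events", write_concern=WriteConcern(w=0)  # fire-and-forget writes
)

batch = [{"seq": i, "payload": "..."} for i in range(1000)]  # fake records
coll.insert_many(batch, ordered=False)  # unordered batches load fastest
```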
  5. Components of our solution (p1)
     • Listener
       – Receives the data from the Resolve Virtual Machine and stores it into MongoDB
         • Self-describing data
         • 12 different types of data fed over 12 different sockets
       – Written in C++
         • Socket interface at one end
         • MongoDB C++ driver at the other end
       – Goal: push the data into MongoDB as fast as possible
       – Slice data into 1-second chunks
         • Signal the next stage of the pipeline by using a "signaling collection" (sketched below)
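The production listener is C++, but the slice-and-signal pattern is easy to show with the Python driver. The slide specifies only the 1-second slicing and the signaling collection; the collection and field names below are illustrative assumptions:

```python
# Sketch of the listener's chunk-and-signal pattern with pymongo.
import time
from pymongo import MongoClient

db = MongoClient()["profiling"]

def store_slice(data_type, docs):
    """Insert one 1-second slice, then signal the analysis stage."""
    slice_id = int(time.time())          # slice boundary = wall-clock second
    for d in docs:
        d["slice_id"] = slice_id
    if docs:
        db[data_type].insert_many(docs)  # push data in as fast as possible
    # The signaling collection: analysis servers poll this to learn which
    # slices are complete and ready to be analyzed.
    db["signals"].insert_one(
        {"data_type": data_type, "slice_id": slice_id, "status": "ready"}
    )
```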
  6. Components of our solution (p2)
     • MongoDB
       – Goal: persistence
       – Will use replica sets for making the data available to analysis servers (sketched below)
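A minimal sketch of how the replica-set idea could look with the Python driver: writes go to the primary for persistence, while analysis servers read with a secondary-preferred preference so reporting queries do not compete with the listener's inserts. Host names and the replica-set name are invented for illustration:

```python
from pymongo import MongoClient, ReadPreference

# Connect to the (hypothetical) replica set.
client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com",
    replicaSet="rs0",
)

# Analysis reads can be served from secondaries, leaving the primary
# free to absorb the listener's write load.
analysis_db = client.get_database(
    "profiling", read_preference=ReadPreference.SECONDARY_PREFERRED
)
```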
  7. Components of our solution (p3)
     • Analysis Program
       – "Function Call Structure" data type
         • Calculation of ΔT was better done in the listener, so it was moved there
         • Could scale the solution up, but could not scale it out
         • Map/reduce was much faster and could scale out
       – "Memory Usage" data type
         • Needed multiple map/reduce stages
         • Needed reduce-less map/reduce, which MongoDB's map/reduce does not support (see the sketch after this list)
         • Switching to Hadoop + MongoDB
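For context on "reduce-less" map/reduce: MongoDB's mapReduce command always requires a reduce function, whereas Hadoop can run map-only jobs (for example, Hadoop Streaming with -numReduceTasks 0), which is one reason for the switch the slide mentions. A toy streaming-style mapper in Python, with an invented tab-separated record layout:

```python
#!/usr/bin/env python
# Toy Hadoop Streaming mapper: with zero reducers configured, whatever the
# mapper prints becomes the job output directly, i.e. "reduce-less"
# map/reduce. The tab-separated record layout here is an assumption.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed records
    fn_name, rest = fields[0], fields[1:]
    # Re-key the record by function name and pass it through unchanged.
    print(fn_name + "\t" + ",".join(rest))
```

Run with something like hadoop jar hadoop-streaming.jar -mapper mapper.py -numReduceTasks 0 -input ... -output ... (paths elided).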
  8. "Function Call Structure" Analysis (data type: SF_CFA; fields include FnName, TotalTime, SrcFnAddress, PID)
     • map: for each call record, emit (FnName, {TotalTime, min_addr, NumOfCalls: 1, PID, …})
     • Shuffle stage: groups the emitted values by FnName, e.g. for FnName: CreateRaceObjects:
       {TotalTime: 3, min_addr: 2, NumOfCalls: 1, PID: 1, …}
       {TotalTime: 7, min_addr: 3, NumOfCalls: 1, PID: 1, …}
       {TotalTime: 4, min_addr: 1, NumOfCalls: 1, PID: 1, …}
       {TotalTime: 6, min_addr: 1, NumOfCalls: 1, PID: 1, …}
     • reduce: combines the grouped values into one output document per function:
       {FnName: CreateRaceObjects, TotalTime: 20, min_addr: 1, NumOfCalls: 4, PID: 1, …}
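The same aggregation, sketched with the map_reduce helper of the Python driver (available in pymongo 3.x and earlier; removed in 4.0, where the aggregation pipeline is preferred). The SF_CFA collection name comes from the slide; the map and reduce functions mirror the emits shown above:

```python
from bson.code import Code
from pymongo import MongoClient

db = MongoClient()["profiling"]

# map: one emit per call record, keyed by function name.
map_fn = Code("""
function () {
    emit(this.FnName, {TotalTime: this.TotalTime,
                       min_addr: this.min_addr,
                       NumOfCalls: 1,
                       PID: this.PID});
}""")

# reduce: sum times and call counts, keep the smallest address seen.
reduce_fn = Code("""
function (key, values) {
    var out = {TotalTime: 0, min_addr: Infinity, NumOfCalls: 0, PID: null};
    values.forEach(function (v) {
        out.TotalTime  += v.TotalTime;
        out.NumOfCalls += v.NumOfCalls;
        if (v.min_addr < out.min_addr) { out.min_addr = v.min_addr; }
        out.PID = v.PID;
    });
    return out;
}""")

result = db["SF_CFA"].map_reduce(map_fn, reduce_fn, out="fn_call_structure")
for doc in result.find().limit(5):
    print(doc)
```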
  9. Components of our solution (p4)
     • Presentation
       – A separate page design for each data type
       – Tool of choice: DjangoNonRel
       – But the Python driver for MongoDB was sufficient for most work; DjangoNonRel was not really required (sketched below)
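A minimal sketch of the pattern the slide describes: a plain Django view reading the map/reduce output through the Python driver directly, with no DjangoNonRel ORM involved. The view, template, and collection names are invented:

```python
from django.shortcuts import render
from pymongo import MongoClient

db = MongoClient()["profiling"]

def function_call_report(request):
    # Each map/reduce output document is {"_id": FnName, "value": {...}}.
    rows = [
        {"fn": doc["_id"],
         "total_time": doc["value"]["TotalTime"],
         "num_calls": doc["value"]["NumOfCalls"]}
        for doc in db["fn_call_structure"]
                     .find()
                     .sort("value.TotalTime", -1)
                     .limit(50)
    ]
    return render(request, "report.html", {"rows": rows})
```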
  10. Progression of Solutions (Element: initial choice → final choice, with comments)
      • Database: MongoDB → MongoDB
        – Worked well; our solution can be made more robust by using replica sets and sharding
      • Analysis Programs: Java driver for MongoDB → MongoDB + Hadoop
        – Many iterations: (1) denormalization in the listener; (2) ΔT calculations in the listener, analysis using map/reduce
      • Presentation: DjangoNonRel → Django + Python driver for MongoDB
        – The "NonRel" requirement was pretty minimal; the Python driver for MongoDB was enough
  11. Thanks
      • To Virsec Systems
        – For providing a robust test case
      • To Amazon Web Services
        – For an educational grant to WPI for use of AWS resources
  12. About Us
      • Involved with Map/Reduce and NoSQL technologies on several platforms
      • Many students in J's Database Systems class at WPI did a project on a NoSQL database
      • DataThinks.org is a new service of Early Stage IT
        – Building and operating "Big Data" analytics services
      Thanks