
High-throughput data analysis - J Singh, President, Early Stage IT

mongodb
October 07, 2011


MongoBoston 2011

Receiving data from a source that produces 5-10 GB per hour, sustained for 24 hrs, and presenting analysis results as the data streams in has... some interesting challenges. We used MongoDB running on Amazon EC2 to house the data, map/reduce to analyze it, and Django-nonrel to present the results in near real time.



Transcript

  1. High-throughput data analysis: A Streaming Reports Platform
     Authors: J Singh, Early Stage IT; David Zheng, Worcester Polytechnic Institute
     Contributor: Satya Gupta, Virsec Systems
     October 3, 2011
  2. Motivating Problem
     • Resolve Virtual Machine (a product of Virsec Systems) profiles an application and gathers data about it
       – The data analysis used to take several days before conclusions could be drawn
     • Project goals
       – Stream-mode analysis: analysis and reporting should begin within a few seconds of the start of profiling, and update continuously for the duration
       – Data rates up to 5 GB per hour
       – Ability to sustain that rate for 24 hrs
     • Analysis and reporting configured to run in the Amazon EC2 environment
       – Can be scaled up (bigger machines) or scaled out (more machines)
  3. Approach
     • A variant of Eric Ries' Lean Startup approach
       – Introduced in his book, The Lean Startup (Crown Publishing Group, 2011)
       – It is a recipe for learning quickly
       – Re-adapted by us for this and other "learning projects"
     1. Do what it takes to get an end-to-end solution
     2. Measure, Learn, Build, repeat
     3. When the cycle has stabilized, expand scope appropriately
  4. Requirements
     • Fast inserts into the database (a sketch follows this list)
     • The nature and amount of analysis required was hard to judge in the beginning
       – Previous experience with Map/Reduce in the Google App Engine environment had shown promise, but GAE was not appropriate for this application
     • Slick, demo-worthy web interface for presenting results
     • Stream-mode operation
       – Start showing results within a few seconds of starting the Resolve Virtual Machine, and update them periodically as more data is collected and analyzed
     The lessons from this project apply well to other "feed-based" data such as stock prices, server logs, and sensor data.
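In Python-driver terms, the "fast inserts" requirement maps naturally onto batched writes with a relaxed write concern. A minimal sketch, not the project's actual listener code; the database, collection, and record layout are assumptions for illustration:

```python
# Hypothetical illustration of "fast inserts" with pymongo: batch the
# documents and use an unacknowledged write concern (w=0) so the client
# does not wait for a server reply on every document.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

coll = MongoClient()["profiling"].get_collection(
    "raw_events", write_concern=WriteConcern(w=0)  # fire-and-forget writes
)

batch = [{"seq": i, "payload": "..."} for i in range(1000)]  # fake records
coll.insert_many(batch, ordered=False)  # unordered batches load fastest
```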
  5. Components of our solution (p1)
     • Listener
       – Receives the data from the Resolve Virtual Machine and stores it into MongoDB
         • Self-describing data
         • 12 different types of data fed over 12 different sockets
       – Written in C++
         • Socket interface at one end
         • MongoDB C++ driver at the other end
       – Goal: push the data into MongoDB as fast as possible
       – Slice data into 1-second chunks
         • Signal the next stage of the pipeline by using a "signaling collection" (sketched below)
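The production listener is C++, but the slice-and-signal pattern is easy to show with the Python driver. The slide specifies only the 1-second slicing and the signaling collection; the collection and field names below are illustrative assumptions:

```python
# Sketch of the listener's chunk-and-signal pattern with pymongo.
import time
from pymongo import MongoClient

db = MongoClient()["profiling"]

def store_slice(data_type, docs):
    """Insert one 1-second slice, then signal the analysis stage."""
    slice_id = int(time.time())          # slice boundary = wall-clock second
    for d in docs:
        d["slice_id"] = slice_id
    if docs:
        db[data_type].insert_many(docs)  # push data in as fast as possible
    # The signaling collection: analysis servers poll this to learn which
    # slices are complete and ready to be analyzed.
    db["signals"].insert_one(
        {"data_type": data_type, "slice_id": slice_id, "status": "ready"}
    )
```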
  6. Components of our solution (p2)
     • MongoDB
       – Goal: persistence
       – Will use replica sets for making the data available to analysis servers (sketched below)
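A minimal sketch of how the replica-set idea could look with the Python driver: writes go to the primary for persistence, while analysis servers read with a secondary-preferred preference so reporting queries do not compete with the listener's inserts. Host names and the replica-set name are invented for illustration:

```python
from pymongo import MongoClient, ReadPreference

# Connect to the (hypothetical) replica set.
client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com",
    replicaSet="rs0",
)

# Analysis reads can be served from secondaries, leaving the primary
# free to absorb the listener's write load.
analysis_db = client.get_database(
    "profiling", read_preference=ReadPreference.SECONDARY_PREFERRED
)
```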
  7. Components of our solution (p3)
     • Analysis Program
       – "Function Call Structure" data type
         • Calculation of ΔT was better done in the listener, so it was moved there
         • Could scale the solution up, but could not scale it out
         • Map/reduce was much faster and could scale out
       – "Memory Usage" data type
         • Needed multiple map/reduce stages
         • Needed reduce-less map/reduce, which MongoDB's map/reduce does not support (see the sketch after this list)
         • Switching to Hadoop + MongoDB
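For context on "reduce-less" map/reduce: MongoDB's mapReduce command always requires a reduce function, whereas Hadoop can run map-only jobs (for example, Hadoop Streaming with -numReduceTasks 0), which is one reason for the switch the slide mentions. A toy streaming-style mapper in Python, with an invented tab-separated record layout:

```python
#!/usr/bin/env python
# Toy Hadoop Streaming mapper: with zero reducers configured, whatever the
# mapper prints becomes the job output directly, i.e. "reduce-less"
# map/reduce. The tab-separated record layout here is an assumption.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed records
    fn_name, rest = fields[0], fields[1:]
    # Re-key the record by function name and pass it through unchanged.
    print(fn_name + "\t" + ",".join(rest))
```

Run with something like hadoop jar hadoop-streaming.jar -mapper mapper.py -numReduceTasks 0 -input ... -output ... (paths elided).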
  8. "Function Call Structure" Analysis (data type: SF_CFA; fields include FnName, TotalTime, SrcFnAddress, PID)
     • map: for each call record, emit (FnName, {TotalTime, min_addr, NumOfCalls: 1, PID, …})
     • Shuffle stage: groups the emitted values by FnName, e.g. for FnName: CreateRaceObjects:
       {TotalTime: 3, min_addr: 2, NumOfCalls: 1, PID: 1, …}
       {TotalTime: 7, min_addr: 3, NumOfCalls: 1, PID: 1, …}
       {TotalTime: 4, min_addr: 1, NumOfCalls: 1, PID: 1, …}
       {TotalTime: 6, min_addr: 1, NumOfCalls: 1, PID: 1, …}
     • reduce: combines the grouped values into one output document per function:
       {FnName: CreateRaceObjects, TotalTime: 20, min_addr: 1, NumOfCalls: 4, PID: 1, …}
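The same aggregation, sketched with the map_reduce helper of the Python driver (available in pymongo 3.x and earlier; removed in 4.0, where the aggregation pipeline is preferred). The SF_CFA collection name comes from the slide; the map and reduce functions mirror the emits shown above:

```python
from bson.code import Code
from pymongo import MongoClient

db = MongoClient()["profiling"]

# map: one emit per call record, keyed by function name.
map_fn = Code("""
function () {
    emit(this.FnName, {TotalTime: this.TotalTime,
                       min_addr: this.min_addr,
                       NumOfCalls: 1,
                       PID: this.PID});
}""")

# reduce: sum times and call counts, keep the smallest address seen.
reduce_fn = Code("""
function (key, values) {
    var out = {TotalTime: 0, min_addr: Infinity, NumOfCalls: 0, PID: null};
    values.forEach(function (v) {
        out.TotalTime  += v.TotalTime;
        out.NumOfCalls += v.NumOfCalls;
        if (v.min_addr < out.min_addr) { out.min_addr = v.min_addr; }
        out.PID = v.PID;
    });
    return out;
}""")

result = db["SF_CFA"].map_reduce(map_fn, reduce_fn, out="fn_call_structure")
for doc in result.find().limit(5):
    print(doc)
```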
  9. Components of our solution (p4)
     • Presentation
       – A separate page design for each data type
       – Tool of choice: DjangoNonRel
       – But the Python driver for MongoDB was sufficient for most work; DjangoNonRel was not really required (sketched below)
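A minimal sketch of the pattern the slide describes: a plain Django view reading the map/reduce output through the Python driver directly, with no DjangoNonRel ORM involved. The view, template, and collection names are invented:

```python
from django.shortcuts import render
from pymongo import MongoClient

db = MongoClient()["profiling"]

def function_call_report(request):
    # Each map/reduce output document is {"_id": FnName, "value": {...}}.
    rows = [
        {"fn": doc["_id"],
         "total_time": doc["value"]["TotalTime"],
         "num_calls": doc["value"]["NumOfCalls"]}
        for doc in db["fn_call_structure"]
                     .find()
                     .sort("value.TotalTime", -1)
                     .limit(50)
    ]
    return render(request, "report.html", {"rows": rows})
```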
  10. Progression of Solutions (Element: initial choice → final choice, with comments)
      • Database: MongoDB → MongoDB
        – Worked well; our solution can be made more robust by using replica sets and sharding
      • Analysis Programs: Java driver for MongoDB → MongoDB + Hadoop
        – Many iterations: (1) denormalization in the listener; (2) ΔT calculations in the listener, analysis using map/reduce
      • Presentation: DjangoNonRel → Django + Python driver for MongoDB
        – The "NonRel" requirement was pretty minimal; the Python driver for MongoDB was enough
  11. Thanks
      • To Virsec Systems
        – For providing a robust test case
      • To Amazon Web Services
        – For an educational grant to WPI for use of AWS resources
  12. About Us
      • Involved with Map/Reduce and NoSQL technologies on several platforms
      • Many students in J's Database Systems class at WPI did a project on a NoSQL database
      • DataThinks.org is a new service of Early Stage IT
        – Building and operating "Big Data" analytics services
      Thanks