The Data Lifecycle

Data collection, computation and collaboration.

Matt Wood

June 06, 2013

Transcript

  1. Free steak campaign; Facebook page; Mars exploration ops; Consumer social app; Ticket pricing optimization; SAP & Sharepoint; Securities Trading; Data Archiving; Marketing web site; Interactive TV apps; Financial markets analytics; Consumer social app; Big data analytics; Web site & media sharing; Disaster recovery; Media streaming; Web and mobile apps; Streaming webcasts; Facebook app; Consumer social app; Business line of sight; Mobile analytics; IT operations; Digital media; Core IT and media; Ground campaign
  2. 2 TRILLION OBJECTS (chart; quarterly axis from Q4 2006 to Q1 2013)
  3. 5.5 MILLION HADOOP CLUSTERS (chart; time axis in three-week steps from 5/22/2010 to 4/6/2013)
  4. The Data Analysis Gap (chart; data volume over time, 1990 to 2020). Labels: Enterprise Data; Data in Warehouse; Generated data; Available for analysis. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
  5. CELL PHONES: incredible data generators. Number of calls; call duration; airtime purchase frequency and size; mobility patterns
  6. Women: more calls, longer calls, larger social network, more personal calls. Men: fewer calls, shorter calls, smaller social network, more work-related calls
  7. Average daily number of cells that moved out from the communal sections. Linus Bengtsson et al., PLoS Medicine, 2011
  8. Tweeting about Flu. "You Are What You Tweet: Analyzing Twitter for Public Health", M. J. Paul and M. Dredze, 2011
  9. Per day: 3.5 billion records; 13 TB of click stream logs; 71 million unique cookies
  10. Challenges: speeding server provisioning for R&D apps; extending capacity for internal grid environments; slowing internally hosted compute infrastructure growth; on-boarding security, validation services and compliance; hosting research data; reducing cost while extending capabilities
  11. 98% time saved for clinical trial simulations

    | | Internal System | AWS |
    | --- | --- | --- |
    | Individual Clinical Trial Simulation Run Time (min) | 56 | 56 |
    | Total Number of Clinical Trial Simulations | 2000 | 2000 |
    | No. of Servers | 2 | 256 |
    | No. of CPUs | 32 | 2048 |
    | Total Analysis Run Time (hr) | 60 | 1.2 |
    | Cost | ?? | $336 |
  12. Reduced burden on pediatric subjects

    | | Traditional Design | Design Optimized Using Clinical Trial Simulation |
    | --- | --- | --- |
    | # of subjects | 60 | 40 |
    | # of blood samples per subject | 12 | 5 |
    | Length of stay per subject | 72 hours | 26 hours |
    | Length of study | 2.5 years | 1.7 years |
    | Total study cost | $700K | $250K |

    Length and cost projected based on historical data in pediatric subjects
  13. AMAZON EC2; AMAZON REDSHIFT; AMAZON EMR; AMAZON DYNAMODB; AMAZON S3; AMAZON RDS; AWS STORAGE GATEWAY; AMAZON SWF; AMAZON GLACIER; AWS DATA PIPELINE; AMAZON MACHINE IMAGES; AMAZON PUBLIC DATASETS; CLUSTER COMPUTE INSTANCES; HIGH STORAGE INSTANCES; HIGH I/O INSTANCES; GPU INSTANCES (NO PHI YET, SORRY)
  14. GENERATION; COLLECTION & STORAGE; COMPUTATION & ANALYTICS; COLLABORATION & SHARING. Services shown: JASPERSOFT ON AMAZON EC2; AMAZON REDSHIFT; AMAZON EMR; AMAZON DYNAMODB
  15. GENERATION; COLLECTION & STORAGE; COMPUTATION & ANALYTICS; COLLABORATION & SHARING. Services shown: JASPERSOFT ON AMAZON EC2; AMAZON REDSHIFT; AMAZON EMR; AMAZON DYNAMODB; AMAZON RDS; AMAZON RDS; AMAZON EC2
  16. GENERATION; COLLECTION & STORAGE; COMPUTATION & ANALYTICS; COLLABORATION & SHARING. Services shown: JASPERSOFT ON AMAZON EC2; AMAZON REDSHIFT; AMAZON EMR; AMAZON DYNAMODB; AMAZON RDS; AMAZON RDS; AMAZON EC2; AMAZON S3
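
The "98% time saved" headline on slide 11 can be checked against the numbers in that slide's table. The following is a minimal Python sketch, not something taken from the deck. The function names are hypothetical. The `ideal_hours` estimate assumes each simulation runs on a single CPU and that the work parallelizes perfectly with no scheduling overhead; the slide states neither, so treat that figure as a rough lower bound only.

```python
# Values copied from the slide 11 table.
SIMULATIONS = 2000             # total number of clinical trial simulations
MINUTES_PER_SIMULATION = 56    # run time of one simulation
INTERNAL_HOURS = 60.0          # reported total run time, internal system
AWS_HOURS = 1.2                # reported total run time, AWS


def time_saved(before_hours: float, after_hours: float) -> float:
    """Fraction of wall-clock time saved when going from before to after."""
    return 1.0 - after_hours / before_hours


def ideal_hours(cpus: int) -> float:
    """Idealized wall-clock hours on `cpus` CPUs.

    Assumption (not from the slide): one simulation per CPU at a time,
    perfect parallelism, zero overhead.
    """
    total_cpu_minutes = SIMULATIONS * MINUTES_PER_SIMULATION
    return total_cpu_minutes / cpus / 60.0


if __name__ == "__main__":
    print(f"Reported time saved: {time_saved(INTERNAL_HOURS, AWS_HOURS):.0%}")
    print(f"Idealized run time on 32 CPUs: {ideal_hours(32):.1f} h")
    print(f"Idealized run time on 2048 CPUs: {ideal_hours(2048):.1f} h")
```

Under these assumptions the idealized run times come out close to the reported 60 and 1.2 hours, which suggests the reported totals are dominated by compute time rather than provisioning.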