The Data Lifecycle

Data collection, computation and collaboration.

Matt Wood

June 06, 2013

Transcript

  1. The DATA life cycle, a presentation by DR. MATT WOOD

  2. Hello, and THANK YOU

  3. SEVEN years young

  4. Broad and deep SERVICES to support virtually any workload

  5. Chart by year: 2007: 9, 2008: 24, 2009: 48, 2010: 61, 2011: 82, 2012: 159
  6. Comprehensive SECURITY capabilities

  7. None
  8. Add enough server capacity EVERY DAY to power amazon.com in 2003
  9. Computing delivered as a UTILITY

  10. Take advantage of the ECONOMIES of scale to lower prices

  11. Free steak campaign; Facebook page; Mars exploration ops; Consumer social app; Ticket pricing optimization; SAP & Sharepoint; Securities Trading; Data Archiving; Marketing web site; Interactive TV apps; Financial markets analytics; Consumer social app; Big data analytics; Web site & media sharing; Disaster recovery; Media streaming; Web and mobile apps; Streaming webcasts; Facebook app; Consumer social app; Business line of sight; Mobile analytics; IT operations; Digital media; Core IT and media; Ground campaign
  12. 2 TRILLION OBJECTS (chart by quarter, Q4 2006 to Q1 2013)
  13. 5.5 MILLION HADOOP CLUSTERS (chart at three-week intervals, 5/22/2010 to 4/6/2013)
  14. Let’s talk about DATA

  15. GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

  16. Decreasing cost of DATA generation

  17. GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

  18. GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

    Lower cost, Higher throughput
  19. GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

    Lower cost, Higher throughput Highly constrained
  20. The Data Analysis Gap (chart of data volume, 1990 to 2020; labels: Enterprise Data, Data in Warehouse, Generated data, Available for analysis)

    Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  21. Utility computing REMOVES resource constraints

  22. GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

    Lower cost, Higher throughput Highly constrained
  23. GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

  24. Technologies and techniques for working productively with data, at any scale.
  25. Let’s talk about CELL PHONES

  26. CELL PHONES: incredible data generators

    Number of calls; Call duration; Airtime purchase frequency and size; Mobility patterns
  27. Mobile airtime purchases (chart; axes: Size of purchase, Number of purchases; series: Lower income household, Higher income household)
  28. Women: More calls; Longer calls; Larger social network; More personal calls

    Men: Fewer calls; Shorter calls; Smaller social network; More work-related calls
  29. Let’s talk about HEALTH CARE

  30. Average daily number of cells that moved out from the communal sections. Linus Bengtsson et al., PLoS Medicine, 2011
  31. SOCIAL NETWORKS: incredible data generators

    Discussion topics; Sentiment and context; Social graph; Interactions
  32. Tweeting about Flu

    You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011
  33. Discussing unemployment: Ireland

  34. Discussing unemployment: America

  35. Tweeting about Food

  36. Tweeting about Food

    Tweets about the price of rice; Official food price inflation
  37. Let’s talk about VIDEO GAMES

  38. WEB APPLICATIONS: incredible data generators

    Search results; Ad placement; Buying history; Page views
  39. “Who buys video games?”

  40. Per day: 3.5 billion records; 13 TB of click stream logs; 71 million unique cookies
  41. None
  42. None
  43. Results: 500% return on ad spend; 17,000% reduction in procurement time
  44. Let’s talk about GALAXIES

  45. “How do galaxies form?”

  46. None
  47. None
  48. None
  49. None
  50. None
  51. Let’s talk about ME

  52. Chromosome 11 : ACTN3 : rs1815739

  53. Chromosome X : rs6625163

  54. Chromosome 19 : FUT2 : rs601338

  55. Chromosome 2 : rs10427255

  56. TYPE II Chromosome 10 : rs7903146

  57. +0.25 Chromosome 15 : rs2472297

  58. None
  59. None
  60. None
  61. None
  62. None
  63. GENERATION COLLECTION & STORAGE ANALYTICS & COMPUTATION COLLABORATION & SHARING

  64. Technologies and techniques for working productively with data, at any scale.
  65. Challenges

    Speeding server provisioning for R&D apps; Extending capacity for internal grid environments; Slowing internally hosted compute infrastructure growth; On-boarding security, validation services and compliance; Hosting research data; Reducing cost while extending capabilities
  66. Primary uses

    Clinical pharmacology and pharmacometrics; Molecular dynamics; Computational genomics; Research portfolio
  67. 98% time saved for clinical trial simulations

    Measure | Internal System | AWS
    Individual clinical trial simulation run time (min) | 56 | 56
    Total number of clinical trial simulations | 2000 | 2000
    No. of servers | 2 | 256
    No. of CPUs | 32 | 2048
    Total analysis run time (hr) | 60 | 1.2
    Cost | ?? | $336
  68. Reduced burden on pediatric subjects

    Measure | Traditional Design | Design Optimized Using Clinical Trial Simulation
    # of subjects | 60 | 40
    # of blood samples per subject | 12 | 5
    Length of stay per subject | 72 hours | 26 hours
    Length of study | 2.5 years | 1.7 years
    Total study cost | $700K | $250K

    Length and cost projected based on historical data in pediatric subjects
  69. Let’s talk about A PLATFORM for data and analytics

  70. Technologies and techniques for working productively with data, at any scale.
  71. AMAZON EC2; AMAZON REDSHIFT; AMAZON EMR; AMAZON DYNAMODB; AMAZON S3; AMAZON RDS; AWS STORAGE GATEWAY; AMAZON SWF; AMAZON GLACIER; AWS DATA PIPELINE; AMAZON MACHINE IMAGES; AMAZON PUBLIC DATASETS; CLUSTER COMPUTE INSTANCES; HIGH STORAGE INSTANCES; HIGH I/O INSTANCES; GPU INSTANCES (NO PHI YET, SORRY)
  72. Technologies and techniques for working productively with data, at any scale.
  73. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

  74. Ease of use is a PRE-REQUISITE

  75. Expose data at the RIGHT LEVEL

  76. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

  77. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

    JASPERSOFT ON AMAZON EC2 AMAZON REDSHIFT AMAZON EMR AMAZON DYNAMODB
  78. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

    JASPERSOFT ON AMAZON EC2 AMAZON REDSHIFT AMAZON EMR AMAZON DYNAMODB AMAZON RDS AMAZON RDS AMAZON EC2
  79. GENERATION COLLECTION & STORAGE COMPUTATION & ANALYTICS COLLABORATION & SHARING

    JASPERSOFT ON AMAZON EC2 AMAZON REDSHIFT AMAZON EMR AMAZON DYNAMODB AMAZON RDS AMAZON RDS AMAZON EC2 AMAZON S3
  80. Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive,

    Pig
  81. Amazon S3 http://www.youtube.com/watch?v=oGcZ7WVx6EI Legacy data warehousing Cassandra Aegisthus Hadoop, Hive,

    Pig Microstrategy Sting R
  82. None
  83. Technologies and techniques for working productively with data, at any scale.
  84. 12.5 years; 3 hours

  85. 12.5 years, $20M; 3 hours, $4k

  86. None
  87. Less than $1k today

  88. 1,000,000+ core hours

  89. matthew@amazon.com aws.amazon.com
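
The "98% time saved" headline on slide 67 can be checked from the slide's own figures. The Python sketch below is not from the deck: the function names are hypothetical, and the scheduling model (one simulation per CPU at a time, run in full waves) is an assumption used only to see whether the reported run times are plausible.

```python
import math

# Figures transcribed from slide 67; nothing here is measured independently.
SIMULATIONS = 2000
MINUTES_PER_SIMULATION = 56


def waves(cpus: int) -> int:
    """Number of full passes needed if each CPU runs one simulation at a time (assumed model)."""
    return math.ceil(SIMULATIONS / cpus)


def ideal_hours(cpus: int) -> float:
    """Wall-clock hours under the assumed wave model, ignoring any scheduling overhead."""
    return waves(cpus) * MINUTES_PER_SIMULATION / 60


def time_saved(before_hours: float, after_hours: float) -> float:
    """Fraction of wall-clock time saved when going from before_hours to after_hours."""
    return 1 - after_hours / before_hours


if __name__ == "__main__":
    # Reported total analysis run times from the slide: 60 hr internal, 1.2 hr on AWS.
    print(f"time saved (reported): {time_saved(60, 1.2):.0%}")
    print(f"ideal internal run time, 32 CPUs: {ideal_hours(32):.1f} hr")
    print(f"ideal AWS run time, 2048 CPUs: {ideal_hours(2048):.2f} hr")
```

Under this model the reported run times (60 hr and 1.2 hr) sit slightly above the ideal figures, which is what one would expect once start-up and scheduling overhead are included.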