The Data Lifecycle

Data collection, computation and collaboration.

Matt Wood

June 06, 2013

Transcript

  1. The DATA life cycle
    a presentation by DR. MATT WOOD

  2. Hello, and THANK YOU

  3. SEVEN
    years young

  4. Broad and deep SERVICES
    to support virtually any workload

  5. 2007  2008  2009  2010  2011  2012
       9    24    48    61    82   159

  6. Comprehensive SECURITY capabilities

  8. Add enough server capacity EVERY DAY
    to power amazon.com in 2003

  9. Computing delivered as a UTILITY

  10. Take advantage of the ECONOMIES
    of scale to lower prices

  11. Free steak campaign
    Facebook page
    Mars exploration ops
    Consumer social app
    Ticket pricing optimization
    SAP & Sharepoint Securities Trading Data Archiving
    Marketing web site Interactive TV apps Financial markets analytics
    Consumer social app Big data analytics
    Web site & media sharing
    Disaster recovery
    Media streaming Web and mobile apps
    Streaming webcasts
    Facebook app Consumer social app
    Business line of sight Mobile analytics
    IT operations Digital media Core IT and media
    Ground campaign

  12. 2 TRILLION OBJECTS
    [Chart: quarterly axis, Q4 2006 to Q1 2013]

  13. 5.5 MILLION HADOOP CLUSTERS
    [Chart: date axis at three-week intervals, 5/22/2010 to 4/6/2013]

  14. Let’s talk about DATA

  15. GENERATION
    COLLECTION & STORAGE
    ANALYTICS & COMPUTATION
    COLLABORATION & SHARING

  16. Decreasing cost of DATA generation

  17. GENERATION
    COLLECTION & STORAGE
    ANALYTICS & COMPUTATION
    COLLABORATION & SHARING

  18. GENERATION
    COLLECTION & STORAGE
    ANALYTICS & COMPUTATION
    COLLABORATION & SHARING
    Lower cost,
    Higher throughput

  19. GENERATION
    COLLECTION & STORAGE
    ANALYTICS & COMPUTATION
    COLLABORATION & SHARING
    Lower cost,
    Higher throughput
    Highly constrained

  20. The Data Analysis Gap
    [Chart: data volume over time, 1990 to 2020. Labels: Enterprise Data, Data in Warehouse, Generated data, Available for analysis]
    Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
    IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

  21. Utility computing REMOVES resource constraints

  22. GENERATION
    COLLECTION & STORAGE
    ANALYTICS & COMPUTATION
    COLLABORATION & SHARING
    Lower cost,
    Higher throughput
    Highly constrained

  23. GENERATION
    COLLECTION & STORAGE
    ANALYTICS & COMPUTATION
    COLLABORATION & SHARING

  24. Technologies and techniques
    for working productively with
    data, at any scale.

  25. Let’s talk about CELL PHONES

  26. CELL PHONES: incredible data generators
    Number of calls
    Call duration
    Airtime purchase frequency and size
    Mobility patterns

  27. Mobile airtime purchases
    [Chart: size of purchase and number of purchases, for lower income and higher income households]

  28. Women
    More calls
    Longer calls
    Larger social network
    More personal calls
    Men
    Fewer calls
    Shorter calls
    Smaller social network
    More work-related calls

  29. Let’s talk about HEALTH CARE

  30. Average daily number of cells that moved out from the communal sections.
    Linus Bengtsson et al. PLoS Medicine, 2011

  31. SOCIAL NETWORKS: incredible data generators
    Discussion topics
    Sentiment and context
    Social graph
    Interactions

  32. Tweeting about Flu
    You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011

  33. Discussing unemployment: Ireland

  34. Discussing unemployment: America

  35. Tweeting about Food

  36. Tweeting about Food
    [Chart: tweets about the price of rice, official food price inflation]

  37. Let’s talk about VIDEO GAMES

  38. WEB APPLICATIONS: incredible data generators
    Search results
    Ad placement
    Buying history
    Page views

  39. “Who buys video games?”

  40. Per day:
    3.5 billion records
    13 TB of click stream logs
    71 million unique cookies

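The per-day figures on slide 40 can be converted into per-second rates with simple arithmetic. The sketch below is an illustration and not part of the original deck: it assumes 1 TB = 10**12 bytes and traffic spread evenly over 24 hours, and every variable name is hypothetical.

```python
# Back-of-the-envelope rates from the per-day figures on slide 40.
# Assumptions (not from the deck): 1 TB = 10**12 bytes, and load is
# spread evenly across the day. Real click stream traffic is bursty.
SECONDS_PER_DAY = 24 * 60 * 60

records_per_day = 3.5e9          # "3.5 billion records"
log_bytes_per_day = 13e12        # "13 TB of click stream logs"
unique_cookies_per_day = 71e6    # "71 million unique cookies"

# Average ingest rates over a day.
records_per_second = records_per_day / SECONDS_PER_DAY
log_mb_per_second = log_bytes_per_day / SECONDS_PER_DAY / 1e6

# Average size of one record and average records per cookie.
avg_bytes_per_record = log_bytes_per_day / records_per_day
records_per_cookie = records_per_day / unique_cookies_per_day

print(f"{records_per_second:,.0f} records per second")
print(f"{log_mb_per_second:,.1f} MB of logs per second")
print(f"{avg_bytes_per_record:,.0f} bytes per record on average")
print(f"{records_per_cookie:,.1f} records per unique cookie")
```

These are averages only; peak rates would need the hourly distribution, which the slide does not give.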

  43. Results:
    500% return on ad spend
    17,000% reduction in procurement time

  44. Let’s talk about GALAXIES

  45. “How do galaxies form?”

  51. Let’s talk about ME

  52. Chromosome 11 : ACTN3 : rs1815739

  53. Chromosome X : rs6625163

  54. Chromosome 19 : FUT2 : rs601338

  55. Chromosome 2 : rs10427255

  56. TYPE II
    Chromosome 10 : rs7903146

  57. +0.25
    Chromosome 15 : rs2472297

  63. GENERATION
    COLLECTION & STORAGE
    ANALYTICS & COMPUTATION
    COLLABORATION & SHARING

  64. Technologies and techniques
    for working productively with
    data, at any scale.

  65. Challenges
    Speeding server provisioning for R&D apps
    Extending capacity for internal grid environments
    Slowing internally hosted compute infrastructure growth
    On-boarding security, validation services and compliance
    Hosting research data
    Reducing cost while extending capabilities

  66. Primary uses
    Clinical pharmacology and pharmacometrics
    Molecular dynamics
    Computational genomics
    Research portfolio

  67. 98% time saved for clinical trial simulations

                                                           Internal System   AWS
    Individual Clinical Trial Simulation Run Time (min)    56                56
    Total Number of Clinical Trial Simulations             2000              2000
    No. of Servers                                         2                 256
    No. of CPUs                                            32                2048
    Total Analysis Run Time (hr)                           60                1.2
    Cost                                                   ??                $336

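The headline figure on slide 67 follows from the table's own numbers. The sketch below is an illustration and not part of the original deck; the variable names are hypothetical, and the "ideal" run times rest on an assumption the slide does not state, namely that each simulation occupies one CPU and the work parallelizes perfectly.

```python
# Figures taken from the table on slide 67.
simulations = 2000
run_minutes_per_simulation = 56
internal_cpus, aws_cpus = 32, 2048
internal_hours, aws_hours = 60.0, 1.2
aws_cost_usd = 336.0

# Fraction of wall-clock time saved and the equivalent speedup.
time_saved = 1 - aws_hours / internal_hours
speedup = internal_hours / aws_hours

# How much more hardware was used, and the cost of one simulation on AWS.
cpu_ratio = aws_cpus / internal_cpus
cost_per_simulation = aws_cost_usd / simulations


def ideal_hours(cpus: int) -> float:
    """Run time in hours ASSUMING one simulation per CPU and perfect
    parallelism (an assumption for illustration, not stated in the deck)."""
    return simulations * run_minutes_per_simulation / cpus / 60


print(f"time saved: {time_saved:.0%}, speedup: {speedup:.0f}x")
print(f"CPU ratio: {cpu_ratio:.0f}x, cost per simulation: ${cost_per_simulation:.3f}")
print(f"ideal internal run: {ideal_hours(internal_cpus):.1f} hr")
print(f"ideal AWS run: {ideal_hours(aws_cpus):.2f} hr")
```

Under that assumption the ideal internal run time lands close to the reported 60 hours, which suggests the table's totals are roughly what an evenly divided workload would give.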

  68. Reduced burden on pediatric subjects

                                      Traditional Design   Design Optimized Using Clinical Trial Simulation
    # of subjects                     60                   40
    # of blood samples per subject    12                   5
    Length of stay per subject        72 hours             26 hours
    Length of study                   2.5 years            1.7 years
    Total study cost                  $700K                $250K

    Length and cost projected based on historical data in pediatric subjects

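The size of each improvement on slide 68 can be worked out from the two columns. The sketch below is an illustration and not part of the original deck; the dictionary keys and the helper function are hypothetical names, and the total sample count assumes every subject gives the full number of samples.

```python
# Figures taken from the table on slide 68.
traditional = {
    "subjects": 60,
    "samples_per_subject": 12,
    "stay_hours": 72,
    "study_years": 2.5,
    "cost_usd": 700_000,
}
optimized = {
    "subjects": 40,
    "samples_per_subject": 5,
    "stay_hours": 26,
    "study_years": 1.7,
    "cost_usd": 250_000,
}


def percent_reduction(before: float, after: float) -> float:
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return 100 * (before - after) / before


reductions = {
    key: percent_reduction(traditional[key], optimized[key])
    for key in traditional
}

# Total blood draws across the study, ASSUMING every subject gives
# the full number of samples (not stated on the slide).
total_samples_traditional = traditional["subjects"] * traditional["samples_per_subject"]
total_samples_optimized = optimized["subjects"] * optimized["samples_per_subject"]

for key, value in reductions.items():
    print(f"{key}: {value:.1f}% lower")
print(f"total blood samples: {total_samples_traditional} -> {total_samples_optimized}")
```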

  69. Let’s talk about A PLATFORM
    for data and analytics

  70. Technologies and techniques
    for working productively with
    data, at any scale.

  71. AMAZON EC2
    AMAZON REDSHIFT
    AMAZON EMR
    AMAZON DYNAMODB
    AMAZON S3
    AMAZON RDS
    AWS STORAGE GATEWAY
    AMAZON SWF
    AMAZON GLACIER
    AWS DATA PIPELINE
    AMAZON MACHINE IMAGES
    AMAZON PUBLIC DATASETS
    CLUSTER COMPUTE INSTANCES
    HIGH STORAGE INSTANCES
    HIGH I/O INSTANCES
    GPU INSTANCES
    (NO PHI YET, SORRY)

  72. Technologies and techniques
    for working productively with
    data, at any scale.

  73. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

  74. Ease of use is a PRE-REQUISITE

  75. Expose data at the RIGHT LEVEL

  76. GENERATION
    COLLECTION & STORAGE
    COMPUTATION & ANALYTICS
    COLLABORATION & SHARING

  77. GENERATION
    COLLECTION & STORAGE
    COMPUTATION & ANALYTICS
    COLLABORATION & SHARING
    JASPERSOFT ON AMAZON EC2
    AMAZON REDSHIFT
    AMAZON EMR
    AMAZON DYNAMODB

  78. GENERATION
    COLLECTION & STORAGE
    COMPUTATION & ANALYTICS
    COLLABORATION & SHARING
    JASPERSOFT ON AMAZON EC2
    AMAZON REDSHIFT
    AMAZON EMR
    AMAZON DYNAMODB
    AMAZON RDS
    AMAZON RDS
    AMAZON EC2

  79. GENERATION
    COLLECTION & STORAGE
    COMPUTATION & ANALYTICS
    COLLABORATION & SHARING
    JASPERSOFT ON AMAZON EC2
    AMAZON REDSHIFT
    AMAZON EMR
    AMAZON DYNAMODB
    AMAZON RDS
    AMAZON RDS
    AMAZON EC2
    AMAZON S3

  80. Amazon S3
    http://www.youtube.com/watch?v=oGcZ7WVx6EI
    Legacy data warehousing
    Cassandra Aegisthus Hadoop, Hive, Pig

  81. Amazon S3
    http://www.youtube.com/watch?v=oGcZ7WVx6EI
    Legacy data warehousing
    Cassandra Aegisthus Hadoop, Hive, Pig
    Microstrategy
    Sting
    R

  83. Technologies and techniques
    for working productively with
    data, at any scale.

  84. 12.5 years    3 hours

  85. 12.5 years    3 hours
    $20M          $4k

  87. Less than $1k today

  88. 1,000,000+
    core hours

  89. [email protected]
    aws.amazon.com
