From Data to Meaning (Cloud Developer Roadshow 2014)

Here we talk about the tools and methodologies that you can use to gain insight from your data using a specific example from the Personal Genome Project. This deck was delivered during the Google Cloud Platform Developer Roadshow events in 2014.

GoogleCloudPlatform

August 20, 2014

Transcript

  1. From Data to Meaning
    A step-by-step guide
    Google Cloud Platform Developer Roadshow - 2014

  2. Hadoop Cluster Anyone?
    Images by Connie Zhou
    $ ./bdutil deploy

  3. Some Definitions
    MapReduce
    A distributed computing paradigm that consists of:
    ● A map step, performed on subsets of the input data
    ● A reduce step that combines the map outputs
    A MapReduce job can run on structured or unstructured data
    Hadoop
    An open source implementation of MapReduce
    Apache™ Hadoop®
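
The two steps defined above can be sketched in plain Python, without Hadoop. This is only an illustrative, single-process example: the function names and the sample input are assumptions made for the sketch, not something the deck shows.

```python
from collections import defaultdict
from itertools import chain


def map_step(chunk):
    """Map: emit (key, value) pairs for one subset of the input."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_step(pairs):
    """Reduce: combine all values that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input, already split into subsets as a cluster would do.
chunks = [
    ["hom ref", "no-call complex"],
    ["hom ref", "no-call no-ref"],
]

# Each chunk is mapped independently; on a cluster these calls run in parallel.
mapped = chain.from_iterable(map_step(chunk) for chunk in chunks)
counts = reduce_step(mapped)
print(counts)
```

On a real cluster the framework, not the program, handles splitting the input, scheduling the map tasks, and routing their output to the reducers.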

  4. Google Cloud Storage, BigQuery and Datastore Connectors
    [Diagram: Hadoop, with a BigQuery Connector to BigQuery, a Datastore Connector to Datastore, and a Cloud Storage Connector]

  5. Deployment, Configuration, and Toolkits
    bdutil: a thin wrapper around the Cloud SDK command-line tools
    • Deploys a Hadoop cluster on demand
    • Installs and configures the GCS connector
    • Extensible: add scripts to run during deployment
    [Diagram: bdutil, Compute Engine API, Hadoop Master, Hadoop Workers]

  6. Let’s talk briefly about Big Data...
    Courtesy of Nimbus Ninety – www.nimbusninety.com

  7. Some examples
    ● Personalised insights into your customer base
    ○ Loyalty cards, in-app purchases, point-of-sale offers
    ● User retention activities
    ○ For games, mobile apps, anything that has pattern-based usage
    ● Predicting health problems
    ● Website optimization through analytics
    ● Enabling future breakthroughs in biology and medicine
    ● Autonomous traffic lights and flying cars

  8. https://developers.google.com/genomics/
    Source: iStockPhoto

  9. Your Business

  10. Personal Genome Project
    Where is your data?
    Genomic, Environmental, and Human Trait data

  11. Sometimes you just need to do something simple… but you need to do a lot of it!
    Source: iStockPhoto

  12. How Big is a Human Genome?
    Images by Connie Zhou
    $ gsutil ls gs://pgp-harvard-data-public/hu*/*/*/*/ASM/master*
    gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/masterVarBeta-GS000015172-ASM.tsv.bz2
    gs://pgp-harvard-data-public/hu016B28/GS000018110-DID/GS000014561-ASM/GS01669-DNA_H03/ASM/masterVarBeta-GS000014561-ASM.tsv.bz2
    + 173 more…
    $ wc masterVarBeta-GS000015172-ASM.tsv
    17615429 306261586 2349556328 masterVarBeta-GS000015172-ASM.tsv
    One file per genome
    Participant identifier buried in object name

  13. What we have isn’t what we want
    Images by Connie Zhou
    $ head -n 30 masterVarBeta-GS000015172-ASM.tsv | cut -f -10
    #ASSEMBLY_ID GS000015172-ASM
    #CNV_DIPLOID_WINDOW_WIDTH 2000
    #CNV_NONDIPLOID_WINDOW_WIDTH 100000
    [...]
    >locus ploidy chromosome begin end zygosity varType reference allele1Seq allele2Seq
    1 2 chr1 0 10000 no-call no-ref = ? ?
    2 2 chr1 10000 10476 no-call complex = ? ?
    3 2 chr1 10476 10481 hom ref = = =
    4 2 chr1 10481 10518 no-call complex = ? ?
    [...]
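
The layout shown above (metadata lines starting with `#`, one column-header line starting with `>`, then data rows) can be separated with a few lines of Python. This is a minimal sketch that assumes the file is tab-delimited, which is what `cut -f` splits on by default; the function name `parse_master_var` and the sample lines are hypothetical.

```python
def parse_master_var(lines):
    """Split masterVar lines into (metadata, column names, row dicts)."""
    metadata, columns, rows = {}, [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank separator lines
        if line.startswith("#"):
            # "#KEY<TAB>VALUE" metadata line
            key, _, value = line[1:].partition("\t")
            metadata[key] = value
        elif line.startswith(">"):
            # ">locus<TAB>ploidy<TAB>..." column header line
            columns = line[1:].split("\t")
        else:
            rows.append(dict(zip(columns, line.split("\t"))))
    return metadata, columns, rows


# Hypothetical sample built from the values shown on the slide.
sample = [
    "#ASSEMBLY_ID\tGS000015172-ASM\n",
    "#CNV_DIPLOID_WINDOW_WIDTH\t2000\n",
    "\n",
    ">locus\tploidy\tchromosome\tbegin\tend\tzygosity\tvarType"
    "\treference\tallele1Seq\tallele2Seq\n",
    "3\t2\tchr1\t10476\t10481\thom\tref\t=\t=\t=\n",
]
metadata, columns, rows = parse_master_var(sample)
print(metadata["ASSEMBLY_ID"], rows[0]["zygosity"])
```

All values stay as strings here; converting `begin` and `end` to integers would be a natural next step before loading the rows elsewhere.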

  14. Wheelbarrow or Giant Truck?
    Images by Connie Zhou
    $ ./cgi-mapper.py ./hu011C57/masterVarBeta-GS000015172-ASM.tsv
    demo-m$ ./hadoop-install/bin/hadoop jar \
    ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
    -mapper cgi-mapper.py \
    -file cgi-mapper.py \
    -numReduceTasks 0 \
    -output gs://big-data-roadshow/output
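
The deck does not show the contents of `cgi-mapper.py`. As a rough idea of what a map-only Hadoop Streaming mapper for this job could look like, the sketch below drops header lines and prefixes each data row with the participant ID taken from the input path. Everything in it is an assumption: the function names, the ID pattern (inferred from IDs such as `hu011C57` in the deck), and the output layout. Hadoop Streaming passes job configuration to the task as environment variables with dots replaced by underscores, which is where `map_input_file` comes from.

```python
#!/usr/bin/env python
import os
import re
import sys

# Assumption: participant IDs are "hu" plus six uppercase hex digits and
# appear as one path segment of the object name.
PARTICIPANT_RE = re.compile(r"/(hu[0-9A-F]{6})/")


def participant_id(path):
    """Return the huXXXXXX segment of a path, or 'unknown' if absent."""
    match = PARTICIPANT_RE.search(path)
    return match.group(1) if match else "unknown"


def map_lines(lines, participant):
    """Yield data rows prefixed with the participant ID; drop headers."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line[0] in "#>":
            continue  # blank, metadata, or column-header line
        yield participant + "\t" + line


def main(stdin, stdout, environ):
    # Current input file: map_input_file on Hadoop 1.x,
    # mapreduce_map_input_file on Hadoop 2.x.
    path = environ.get("mapreduce_map_input_file") or environ.get("map_input_file", "")
    for row in map_lines(stdin, participant_id(path)):
        stdout.write(row + "\n")


# Run as a filter only when data is piped in, as Hadoop Streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    main(sys.stdin, sys.stdout, os.environ)
```

With zero reduce tasks, whatever the mapper writes to standard output becomes the job output, so each output row already carries the identifier that was previously buried in the object name.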

  18. Goodbye Cluster!
    Images by Connie Zhou
    $ ./bdutil delete

  19. But what did it cost?
    434 GB, 3 billion rows
    bdutil deploy: ~3.5 minutes
    hadoop job: ~7 minutes
    bdutil delete: ~1.5 minutes
    Total time with manual entry: ~14 minutes
    n1-standard-16 instances: $1.16 / hour

  20. But what did it cost?
    Compute: $5.68
    What about storage? 434 GB @ $0.026 / GB / month
    $0.38 / day
    $11.28 / month
    What about networking?
    No Charge!
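
The figures above can be checked with a few lines of arithmetic. The rates, data size, and run time come from the deck; the cluster size does not. The value of 21 instances below is purely an assumption, chosen because it reproduces the quoted compute cost when the 14 minutes are billed pro rata. The 30-day month is also an assumption.

```python
HOURLY_RATE = 1.16      # n1-standard-16, USD per hour (from the deck)
TOTAL_MINUTES = 14      # deploy + job + delete, with manual entry (from the deck)
STORAGE_GB = 434        # size of the data set (from the deck)
STORAGE_RATE = 0.026    # USD per GB per month (from the deck)

# Assumption: the deck does not state the cluster size. 21 instances
# reproduce the quoted compute figure under pro rata billing.
INSTANCES = 21

compute = INSTANCES * HOURLY_RATE * TOTAL_MINUTES / 60
storage_month = STORAGE_GB * STORAGE_RATE
storage_day = storage_month / 30  # assumes a 30-day month

print(f"compute:       ${compute:.2f}")
print(f"storage/month: ${storage_month:.2f}")
print(f"storage/day:   ${storage_day:.2f}")
```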

  21. But there’s more...

  22. Google BigQuery

  23. BigQuery: Big Data Analytics in the Cloud
    Unrivaled Performance and Scale
    ● Scan multiple TBs in seconds
    ● Interactive query performance
    ● No limits on amount of data
    Ease of Use and Adoption
    ● No administration / provisioning
    ● Convenience of SQL
    ● Open interfaces (REST, WebUI, ODBC)
    ● First 1 TB of data processed per month is free
    Advanced “Big Data” Storage
    ● Familiar database structure
    ● Easy data management and ACLs
    ● Fast, atomic imports
  28. http://www.nih.gov/news/health/may2014/ninds-09.htm

  29. http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs9536314

  30. Insight: What do we expect to see?
    Subject   chromosome  reference  allele1Seq  allele2Seq  Note
    huCCAFD0  chr13       =          =           =           No Variant
    huE58004  chr13       T          G           T           Variant in one Allele
    hu040C0A  chr13       T          G           G           Variant in both Alleles
    huF1DC30  chr13       =          ?           ?           Not Sure
    Guanine (G), Thymine (T)
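
The Note column above follows a simple rule that can be written down as a function. This is a sketch of that rule as read off the four example rows, not code from the deck: the name `classify` is hypothetical, `=` is taken to mean "same as the reference", `?` is taken to mean a no-call, and treating a row with only one `?` as "Not Sure" is an assumption.

```python
NOTES = ["No Variant", "Variant in one Allele", "Variant in both Alleles"]


def classify(reference, allele1, allele2):
    """Label a row by how many alleles differ from the reference."""
    alleles = (allele1, allele2)
    if "?" in alleles:
        return "Not Sure"  # at least one allele could not be called
    # An allele is a variant if it is neither "=" nor the reference base.
    variants = sum(1 for allele in alleles if allele not in ("=", reference))
    return NOTES[variants]


# The four rows from the slide.
for subject, ref, a1, a2 in [
    ("huCCAFD0", "=", "=", "="),
    ("huE58004", "T", "G", "T"),
    ("hu040C0A", "T", "G", "G"),
    ("huF1DC30", "=", "?", "?"),
]:
    print(subject, classify(ref, a1, a2))
```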

  35. From Data to Meaning
    Data → Meaning
    Where’s your Data? Is it in the right place? Is it in the right format?
    What questions do you want to ask?
    Extract, Transform, Load

  36. Coming Soon
    Images by Connie Zhou
    Apache Spark
    • bdutil support
    • Machine learning via MLlib
    • SQL via Shark
    Cloud Pub/Sub
    • Managed Service
    • Many-to-many, async messaging between apps
    • Limited Preview
    Cloud Dataflow
    • Managed Service
    • Unified Batch and Streaming
    • Automatic Pipeline Optimization
    • Private Beta

  37. Resources
    Hadoop on Google Cloud Platform (covers the Hadoop Connectors and bdutil) - http://goo.gl/33PoKG
    Personal Genome Project - http://goo.gl/j1oTTF
    Cloud Developers console - https://console.developers.google.com
    Google BigQuery - https://developers.google.com/bigquery
    BigQuery Console - https://bigquery.cloud.google.com

  38. End
