From Data to Meaning (Cloud Developer Roadshow 2014)

This deck covers the tools and methodologies you can use to gain insight from your data, using a specific example from the Personal Genome Project. It was delivered during the Google Cloud Platform Developer Roadshow events in 2014.



August 20, 2014


  1. From Data to Meaning
    A step-by-step guide
    Google Cloud Platform Developer Roadshow - 2014
  2. Hadoop Cluster Anyone?
    $ ./bdutil deploy
    Images by Connie Zhou

  3. Some Definitions
    MapReduce: a distributed computing paradigm that consists of:
    • A map step, performed on subsets of the input data
    • A reduce step that combines the outputs
    A MapReduce job can run on structured or unstructured data.
    Hadoop: an open source implementation of MapReduce (Apache™ Hadoop®)
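
The two steps defined on slide 3 can be illustrated without a cluster. The following is a minimal single-process sketch in Python, not Hadoop code: `map_step`, `reduce_step`, `map_reduce`, and the sample chunks are hypothetical names and data chosen for illustration only.

```python
from collections import Counter
from functools import reduce


def map_step(chunk):
    """Map: count the words found in one subset of the input."""
    return Counter(word for line in chunk for word in line.split())


def reduce_step(left, right):
    """Reduce: merge two partial counts into a single count."""
    return left + right


def map_reduce(chunks):
    # On a real cluster each chunk would be mapped on a different worker;
    # here the map calls simply run one after another.
    partials = [map_step(chunk) for chunk in chunks]
    return reduce(reduce_step, partials, Counter())


# Two hypothetical input subsets.
chunks = [
    ["hom ref", "no-call complex"],
    ["hom ref", "no-call no-ref"],
]
totals = map_reduce(chunks)
print(totals)
```

Because the reduce step only merges partial results, the map calls are independent of each other, which is what allows a framework such as Hadoop to spread them across many machines.
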
  4. Google Cloud Storage, BigQuery and Datastore Connectors
    [Diagram labels: Hadoop; BigQuery Connector, Datastore Connector, Cloud Storage Connector; BigQuery, Datastore]
  5. Deployment, Configuration, and Toolkits
    bdutil - Thin wrapper around Cloud SDK command-line tools
    • Deploys a Hadoop cluster on demand
    • Installs and configures the GCS connector
    • Extensible - add scripts to run during deployment
    [Diagram labels: bdutil, Compute Engine API, Hadoop Master, Hadoop Workers]
  6. Let’s talk briefly about Big Data...
    Courtesy of Nimbus Ninety - www.nimbusninety.com
  7. Some examples
    • Personalised insights into your customer base
      ◦ Loyalty Cards, In App Purchases, Point of Sale offers
    • User retention activities
      ◦ For Games, Mobile Apps, anything that has pattern based usage
    • Predicting health problems
    • Website optimization through analytics
    • Enabling future breakthroughs in biology and medicine
    • Autonomous Traffic Lights and Flying Cars
  8. https://developers.google.com/genomics/
    Source: iStockPhoto

  9. Your Your Business

  10. Personal Genome Project
    Where is your data?
    Genomic, Environmental, and Human Trait data
  11. Sometimes you just need to do something simple… but you need to do a lot of it!
    Source: iStockPhoto
  12. How Big is a Human Genome?
    $ gsutil ls gs://pgp-harvard-data-public/hu*/*/*/*/ASM/master*
    gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/masterVarBeta-GS000015172-ASM.tsv.bz2
    gs://pgp-harvard-data-public/hu016B28/GS000018110-DID/GS000014561-ASM/GS01669-DNA_H03/ASM/masterVarBeta-GS000014561-ASM.tsv.bz2
    + 173 more…
    $ wc masterVarBeta-GS000015172-ASM.tsv
    17615429 306261586 2349556328 masterVarBeta-GS000015172-ASM.tsv
    One File Per Genome
    Participant Identifier Buried in Object Name
    Images by Connie Zhou
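
Slide 12 notes that the participant identifier is buried in the object name. One way to recover it is a small regular expression, sketched below. The pattern is an assumption inferred from the identifiers visible in this deck (`hu` followed by hexadecimal digits, as the first path component after the bucket), and `participant_id` is a hypothetical helper, not part of any Google or PGP tool.

```python
import re

# Assumption: the participant ID is the first path component after the bucket
# and looks like "hu" followed by hex digits (e.g. hu011C57).
PARTICIPANT_RE = re.compile(r"^gs://[^/]+/(hu[0-9A-Fa-f]+)/")


def participant_id(object_name):
    """Return the participant ID embedded in an object name, or None."""
    match = PARTICIPANT_RE.match(object_name)
    return match.group(1) if match else None


example = (
    "gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/"
    "GS000015172-ASM/GS01669-DNA_B05/ASM/masterVarBeta-GS000015172-ASM.tsv.bz2"
)
print(participant_id(example))
```

Returning `None` for names that do not match keeps unexpected objects visible to the caller instead of silently mislabelling them.
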
  13. What we have isn’t what we want
    $ head -n 30 masterVarBeta-GS000015172-ASM.tsv | cut -f -10
    #ASSEMBLY_ID GS000015172-ASM
    #CNV_DIPLOID_WINDOW_WIDTH 2000
    #CNV_NONDIPLOID_WINDOW_WIDTH 100000
    [...]
    >locus ploidy chromosome begin end zygosity varType reference allele1Seq allele2Seq
    1 2 chr1 0 10000 no-call no-ref = ? ?
    2 2 chr1 10000 10476 no-call complex = ? ?
    3 2 chr1 10476 10481 hom ref = = =
    4 2 chr1 10481 10518 no-call complex = ? ?
    [...]
    Images by Connie Zhou
  14. Wheelbarrow or Giant Truck?
    $ ./cgi-mapper.py ./hu011C57/masterVarBeta-GS000015172-ASM.tsv
    demo-m$ ./hadoop-install/bin/hadoop jar \
        ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
        -input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
        -mapper cgi-mapper.py \
        -file cgi-mapper.py \
        --numReduceTasks 0 \
        -output gs://big-data-roadshow/output
    Images by Connie Zhou
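
The deck does not show the contents of `cgi-mapper.py`, so the following is only a hypothetical sketch of what a map-only Hadoop Streaming mapper for these files might look like. It assumes the goal is to drop the `#` and `>` header lines, keep the first ten tab-separated columns, and prefix each row with the participant ID. It also assumes Hadoop Streaming 1.x exposes the input path in the `map_input_file` environment variable. All function names are illustrative.

```python
import os
import re
import sys


def participant_from_path(path):
    """Hypothetical helper: pull the 'hu...' participant ID out of a path."""
    match = re.search(r"/(hu[0-9A-Fa-f]+)/", path)
    return match.group(1) if match else "unknown"


def map_lines(lines, participant, columns=10):
    """Yield one tab-separated record per data row, prefixed with the ID."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, '#' metadata lines and the '>' column header.
        if not line or line.startswith(("#", ">")):
            continue
        fields = line.split("\t")[:columns]
        yield "\t".join([participant] + fields)


def main():
    # Assumption: the streaming framework sets map_input_file for each task.
    path = os.environ.get("map_input_file", "")
    for record in map_lines(sys.stdin, participant_from_path(path)):
        print(record)


# Small in-memory demonstration using rows shaped like those on slide 13.
sample = [
    "#ASSEMBLY_ID\tGS000015172-ASM\n",
    "\n",
    ">locus\tploidy\tchromosome\tbegin\tend\tzygosity\tvarType\treference\tallele1Seq\tallele2Seq\n",
    "3\t2\tchr1\t10476\t10481\thom\tref\t=\t=\t=\n",
]
records = list(map_lines(sample, "hu011C57"))
print(records[0])
```

A script used as the streaming mapper would call `main()` under the usual `if __name__ == "__main__":` guard. With zero reduce tasks, as in the command on slide 14, the mapper output is written directly to the output location.
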
  15. None
  16. None
  17. None
  18. Goodbye Cluster!
    $ ./bdutil delete
    Images by Connie Zhou

  19. But what did it cost?
    434 GB, 3 billion rows
    bdutil deploy: ~3.5 minutes
    hadoop job: ~7 minutes
    bdutil delete: ~1.5 minutes
    Total time with manual entry: ~14 minutes
    n1-standard-16 instances: $1.16 / hour
  20. But what did it cost?
    Compute: $5.68
    What about storage?
    434 GB @ $0.026 / GB / month
    $0.38 / day
    $11.28 / month
    What about networking?
    No Charge!
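
The storage figures on slide 20 follow directly from the quoted 2014 price. The short Python check below reproduces them. The 30-day month is an assumption, chosen because it matches the per-day figure on the slide. The compute figure is not derived here because the deck does not state the cluster size.

```python
STORAGE_GB = 434             # data set size from the slide
PRICE_PER_GB_MONTH = 0.026   # USD, price quoted on the slide
DAYS_PER_MONTH = 30          # assumption: a 30-day month

monthly = STORAGE_GB * PRICE_PER_GB_MONTH
daily = monthly / DAYS_PER_MONTH

print(f"${monthly:.2f} / month, ${daily:.2f} / day")
```
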
  21. But there’s more...

  22. Google BigQuery

  23. BigQuery: Big Data Analytics in the Cloud
    Unrivaled Performance and Scale
    • Scan multiple TBs in seconds
    • Interactive query performance
    • No limits on amount of data
    Ease of Use and Adoption
    • No administration / provisioning
    • Convenience of SQL
    • Open interfaces (REST, WebUI, ODBC)
    • First 1 TB of data processed per month is free
    Advanced “Big Data” Storage
    • Familiar database structure
    • Easy data management and ACLs
    • Fast, atomic imports
  24. None
  25. None
  26. None
  27. None
  28. http://www.nih.gov/news/health/may2014/ninds-09.htm

  29. http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs9536314

  30. Insight: What do we expect to see
    Subject   chromosome  reference  allele1Seq  allele2Seq  Note
    huCCAFD0  chr13       =          =           =           No Variant
    huE58004  chr13       T          G           T           Variant in one Allele
    hu040C0A  chr13       T          G           G           Variant in both Alleles
    huF1DC30  chr13       =          ?           ?           Not Sure
    Legend: G = Guanine, T = Thymine
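
The four notes on slide 30 can be expressed as a small classification rule. The Python sketch below is an interpretation of that slide, not code from the deck: it assumes `=` means "same as the reference" and `?` means the allele could not be called, and `classify` is a hypothetical function name.

```python
def classify(reference, allele1, allele2):
    """Label a row using the wording of the notes on slide 30."""
    alleles = (allele1, allele2)
    # Assumption: '?' marks an allele that could not be called.
    if "?" in alleles:
        return "Not Sure"
    # Assumption: '=' (or the reference base itself) means no variant.
    variants = sum(1 for allele in alleles if allele not in ("=", reference))
    if variants == 0:
        return "No Variant"
    if variants == 1:
        return "Variant in one Allele"
    return "Variant in both Alleles"


rows = {
    "huCCAFD0": ("=", "=", "="),
    "huE58004": ("T", "G", "T"),
    "hu040C0A": ("T", "G", "G"),
    "huF1DC30": ("=", "?", "?"),
}
labels = {subject: classify(*calls) for subject, calls in rows.items()}
for subject, label in labels.items():
    print(subject, label)
```
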
  31. None
  32. None
  33. None
  34. None
  35. Where’s your Data?
    Is it in the right place?
    Is it in the right format?
    What questions do you want to ask?
    Extract, Transform, Load
    From Data to Meaning
    [Diagram labels: Data, Meaning]
  36. Coming Soon
    Apache Spark
    • bdutil support
    • Machine learning via MLlib
    • SQL via Shark
    Cloud Pub/Sub
    • Managed Service
    • Many-to-many, async messaging between apps
    • Limited Preview
    Cloud Dataflow
    • Managed Service
    • Unified Batch and Streaming
    • Automatic Pipeline Optimization
    • Private Beta
    Images by Connie Zhou
  37. Resources
    Hadoop on Google Cloud Platform - http://goo.gl/33PoKG (Covers the Hadoop Connectors and bdutil)
    Personal Genome Project - http://goo.gl/j1oTTF
    Cloud Developers console - https://console.developers.google.com
    Google BigQuery - https://developers.google.com/bigquery
    BigQuery Console - https://bigquery.cloud.google.com
  38. End