Slide 1

Slide 1 text

From Data to Meaning
A step-by-step guide
Google Cloud Platform Developer Roadshow - 2014

Slide 2

Slide 2 text

Hadoop Cluster, Anyone?
$ ./bdutil deploy
Images by Connie Zhou

Slide 3

Slide 3 text

Some Definitions

MapReduce
A distributed computing paradigm that consists of:
● A map step, performed on subsets of the input data
● A reduce step, which combines the map output
A MapReduce job can run on structured or unstructured data.

Hadoop
An open source implementation of MapReduce (Apache™ Hadoop®)
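
The two steps defined above can be sketched in a few lines of plain Python. This is only an illustration of the paradigm on a toy word count, not Hadoop code; the function names `map_step` and `reduce_step` and the sample input are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one subset of the input."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_step(pairs):
    """Reduce: combine the mapped output by summing the counts for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Two subsets of the input, as if they had been split across two workers.
chunks = [["big data", "data to meaning"], ["meaning from data"]]

# Each chunk is mapped independently; the reduce step sees all mapped pairs.
mapped = chain.from_iterable(map_step(chunk) for chunk in chunks)
counts = reduce_step(mapped)
print(counts)
```

In a real cluster the map calls run on different machines and the framework groups the pairs by key before reducing; here both happen in one process.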

Slide 4

Slide 4 text

Google Cloud Storage, BigQuery, and Datastore Connectors
[Diagram: Hadoop linked to Cloud Storage, BigQuery, and Datastore through the Cloud Storage Connector, BigQuery Connector, and Datastore Connector]

Slide 5

Slide 5 text

Deployment, Configuration, and Toolkits
bdutil - Thin wrapper around Cloud SDK command-line tools
• Deploys a Hadoop cluster on demand
• Installs and configures the GCS connector
• Extensible - add scripts to run during deployment
[Diagram: bdutil, Compute Engine API, Hadoop Master, Hadoop Workers]

Slide 6

Slide 6 text

Let’s talk briefly about Big Data...
Courtesy of Nimbus Ninety – www.nimbusninety.com

Slide 7

Slide 7 text

Some examples
● Personalised insights into your customer base
  ○ Loyalty cards, in-app purchases, point-of-sale offers
● User retention activities
  ○ For games, mobile apps, anything that has pattern-based usage
● Predicting health problems
● Website optimization through analytics
● Enabling future breakthroughs in biology and medicine
● Autonomous traffic lights and flying cars

Slide 8

Slide 8 text

https://developers.google.com/genomics/
Source: iStockPhoto

Slide 9

Slide 9 text

Your Your Business

Slide 10

Slide 10 text

Where is your data?
Personal Genome Project
Genomic, Environmental, and Human Trait data

Slide 11

Slide 11 text

Sometimes you just need to do something simple… but you need to do a lot of it!
Source: iStockPhoto

Slide 12

Slide 12 text

How Big is a Human Genome?

$ gsutil ls gs://pgp-harvard-data-public/hu*/*/*/*/ASM/master*
gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/GS000015172-ASM/GS01669-DNA_B05/ASM/masterVarBeta-GS000015172-ASM.tsv.bz2
gs://pgp-harvard-data-public/hu016B28/GS000018110-DID/GS000014561-ASM/GS01669-DNA_H03/ASM/masterVarBeta-GS000014561-ASM.tsv.bz2
+ 173 more…

$ wc masterVarBeta-GS000015172-ASM.tsv
17615429 306261586 2349556328 masterVarBeta-GS000015172-ASM.tsv

One File Per Genome
Participant Identifier Buried in Object Name

Images by Connie Zhou
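
Since the participant identifier is buried in the object name, it has to be pulled out before the rows from different genomes can be combined. A minimal sketch, assuming the identifier is always the first path component after the bucket and is made of "hu" followed by hexadecimal characters, as in the two names listed above; the helper name `participant_id` is hypothetical.

```python
import re

# ASSUMPTION: objects are laid out as gs://<bucket>/<participant>/..., and the
# participant identifier is "hu" followed by hexadecimal characters.
OBJECT_NAME = re.compile(r"^gs://[^/]+/(hu[0-9A-Fa-f]+)/")


def participant_id(object_name):
    """Return the participant identifier, or None if the name does not match."""
    match = OBJECT_NAME.match(object_name)
    return match.group(1) if match else None


name = (
    "gs://pgp-harvard-data-public/hu011C57/GS000018120-DID/"
    "GS000015172-ASM/GS01669-DNA_B05/ASM/"
    "masterVarBeta-GS000015172-ASM.tsv.bz2"
)
print(participant_id(name))  # hu011C57
```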

Slide 13

Slide 13 text

What we have isn’t what we want

$ head -n 30 masterVarBeta-GS000015172-ASM.tsv | cut -f -10
#ASSEMBLY_ID GS000015172-ASM
#CNV_DIPLOID_WINDOW_WIDTH 2000
#CNV_NONDIPLOID_WINDOW_WIDTH 100000
[...]
>locus ploidy chromosome begin end   zygosity varType reference allele1Seq allele2Seq
1      2      chr1       0     10000 no-call  no-ref  =         ?          ?
2      2      chr1       10000 10476 no-call  complex =         ?          ?
3      2      chr1       10476 10481 hom      ref     =         =          =
4      2      chr1       10481 10518 no-call  complex =         ?          ?
[...]

Images by Connie Zhou
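
The file mixes "#" metadata lines, a ">" column header, and data rows, so it needs a small parsing step before it can be loaded anywhere. The sketch below assumes the fields are tab-separated (suggested by the .tsv extension and the use of `cut -f`) and uses a short sample built from the output above; `parse_master_var` is a name invented for this example.

```python
import io

# Small sample in the same shape as the head output; tabs are an ASSUMPTION.
SAMPLE = """\
#ASSEMBLY_ID\tGS000015172-ASM
#CNV_DIPLOID_WINDOW_WIDTH\t2000

>locus\tploidy\tchromosome\tbegin\tend\tzygosity\tvarType\treference\tallele1Seq\tallele2Seq
1\t2\tchr1\t0\t10000\tno-call\tno-ref\t=\t?\t?
3\t2\tchr1\t10476\t10481\thom\tref\t=\t=\t=
"""


def parse_master_var(lines):
    """Yield one dict per data row, keyed by the names in the '>' header line."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and "#KEY value" metadata
        fields = line.split("\t")
        if line.startswith(">"):
            # Column header: strip the leading ">" from the first name.
            columns = [fields[0][1:]] + fields[1:]
            continue
        if columns is None:
            continue  # ignore rows that appear before the header
        yield dict(zip(columns, fields))


rows = list(parse_master_var(io.StringIO(SAMPLE)))
for row in rows:
    print(row["chromosome"], row["begin"], row["end"], row["zygosity"])
```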

Slide 14

Slide 14 text

Wheelbarrow or Giant Truck?

$ ./cgi-mapper.py ./hu011C57/masterVarBeta-GS000015172-ASM.tsv

demo-m$ ./hadoop-install/bin/hadoop jar \
    ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
    -mapper cgi-mapper.py \
    -file cgi-mapper.py \
    -numReduceTasks 0 \
    -output gs://big-data-roadshow/output

Images by Connie Zhou
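
The slides do not show the contents of cgi-mapper.py. The following is a hypothetical sketch of what a map-only step for this data could look like: it drops header lines and prefixes every data row with the participant identifier taken from the path. The function names and the output layout are assumptions, not the presenter's script.

```python
import re


def participant_from_path(path):
    """Pull the 'hu...' participant identifier out of a file or object path."""
    match = re.search(r"/(hu[0-9A-Fa-f]+)/", path)
    return match.group(1) if match else "unknown"


def map_lines(lines, participant):
    """Drop headers and blank lines; prefix each data row with the participant."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line[0] in "#>":
            continue
        yield participant + "\t" + line


# Local stand-in for one input file. Under Hadoop Streaming the rows would
# arrive on sys.stdin instead, and the path would have to come from the job
# environment (ASSUMED here to be the map_input_file variable in Hadoop 1.x).
path = "./hu011C57/masterVarBeta-GS000015172-ASM.tsv"
sample = [
    "#ASSEMBLY_ID\tGS000015172-ASM\n",
    "\n",
    ">locus\tploidy\tchromosome\tbegin\tend\tzygosity\tvarType\treference\tallele1Seq\tallele2Seq\n",
    "3\t2\tchr1\t10476\t10481\thom\tref\t=\t=\t=\n",
]

output = list(map_lines(sample, participant_from_path(path)))
for row in output:
    print(row)
```

Because the job runs with zero reduce tasks, whatever the mapper prints is written straight to the output location, one part file per map task.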

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Goodbye Cluster!
$ ./bdutil delete
Images by Connie Zhou

Slide 19

Slide 19 text

But what did it cost?
434 GB, 3 billion rows
bdutil deploy: ~3.5 minutes
hadoop job: ~7 minutes
bdutil delete: ~1.5 minutes
Total time with manual entry: ~14 minutes
n1-standard-16 instances: $1.16 / hour

Slide 20

Slide 20 text

But what did it cost?
Compute: $5.68
What about storage?
434 GB @ $0.026 / GB / month
$0.38 / day
$11.28 / month
What about networking?
No Charge!
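
The figures above can be checked with a few lines of arithmetic. The slides do not say how many instances were in the cluster; the value of 21 below is an assumption chosen because it reproduces the $5.68 compute figure at the quoted rate and the ~14 minute total. The 30-day month is also an assumption.

```python
HOURLY_RATE = 1.16     # n1-standard-16, USD per instance-hour (from the slide)
ELAPSED_MINUTES = 14   # total time with manual entry (from the slide)
INSTANCES = 21         # ASSUMPTION: cluster size is not stated on the slides
STORAGE_GB = 434       # from the slide
STORAGE_RATE = 0.026   # USD per GB per month (from the slide)
DAYS_PER_MONTH = 30    # ASSUMPTION: used only to derive the per-day figure

# Compute is charged per instance for the time the cluster exists.
compute = INSTANCES * HOURLY_RATE * ELAPSED_MINUTES / 60

# Storage is charged per GB per month, independent of the cluster.
storage_month = STORAGE_GB * STORAGE_RATE
storage_day = storage_month / DAYS_PER_MONTH

print(f"compute: ${compute:.2f}")
print(f"storage: ${storage_month:.2f} / month, ${storage_day:.2f} / day")
```

With these inputs the script prints $5.68 for compute, $11.28 per month and $0.38 per day for storage, matching the slide.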

Slide 21

Slide 21 text

But there’s more...

Slide 22

Slide 22 text

Google BigQuery

Slide 23

Slide 23 text

BigQuery: Big Data Analytics in the Cloud

Unrivaled Performance and Scale
● Scan multiple TBs in seconds
● Interactive query performance
● No limits on amount of data

Ease of Use and Adoption
● No administration / provisioning
● Convenience of SQL
● Open interfaces (REST, WebUI, ODBC)
● First 1 TB of data processed per month is free

Advanced “Big Data” Storage
● Familiar database structure
● Easy data management and ACLs
● Fast, atomic imports

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

http://www.nih.gov/news/health/may2014/ninds-09.htm

Slide 29

Slide 29 text

http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs9536314

Slide 30

Slide 30 text

Insight: What do we expect to see?

Subject   chromosome  reference  allele1Seq  allele2Seq  Note
huCCAFD0  chr13       =          =           =           No Variant
huE58004  chr13       T          G           T           Variant in one Allele
hu040C0A  chr13       T          G           G           Variant in both Alleles
huF1DC30  chr13       =          ?           ?           Not Sure

G = Guanine, T = Thymine
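
The Note column follows mechanically from the three sequence columns. A minimal sketch of that rule, assuming "=" means "same as the reference" and "?" means "no call", as the four example rows suggest; the function name `classify` is made up for illustration.

```python
def classify(reference, allele1, allele2):
    """Turn one row's reference and allele values into the Note text."""
    alleles = (allele1, allele2)
    if "?" in alleles:
        return "Not Sure"  # at least one allele could not be called
    # An allele is a variant if it is neither "=" nor the reference base itself.
    variants = sum(1 for allele in alleles if allele not in ("=", reference))
    notes = ("No Variant", "Variant in one Allele", "Variant in both Alleles")
    return notes[variants]


# The four example rows: (subject, reference, allele1Seq, allele2Seq).
examples = [
    ("huCCAFD0", "=", "=", "="),
    ("huE58004", "T", "G", "T"),
    ("hu040C0A", "T", "G", "G"),
    ("huF1DC30", "=", "?", "?"),
]
for subject, reference, allele1, allele2 in examples:
    print(subject, classify(reference, allele1, allele2))
```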

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

From Data to Meaning
Where’s your Data?
Is it in the right place?
Is it in the right format?
What questions do you want to ask?
Extract, Transform, Load
[Diagram: Data, Meaning]

Slide 36

Slide 36 text

Coming Soon

Apache Spark
• bdutil support
• Machine learning via MLlib
• SQL via Shark

Cloud Pub/Sub
• Managed Service
• many-to-many, async messaging between apps
• Limited Preview

Cloud Dataflow
• Managed Service
• Unified Batch and Streaming
• Automatic Pipeline Optimization
• Private Beta

Images by Connie Zhou

Slide 37

Slide 37 text

Resources
Hadoop on Google Cloud Platform - http://goo.gl/33PoKG (covers the Hadoop Connectors and bdutil)
Personal Genome Project - http://goo.gl/j1oTTF
Cloud Developers console - https://console.developers.google.com
Google BigQuery - https://developers.google.com/bigquery
BigQuery Console - https://bigquery.cloud.google.com

Slide 38

Slide 38 text

End