Slide 1

Slide 1 text

There is no magic, only awesome
Building a platform for scientific data analysis
Deepak Singh (@mndoci)
Principal Product Manager - Amazon EC2

Slide 2

Slide 2 text

Good morning!

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

bioinformatics

Slide 7

Slide 7 text

Collection

Slide 8

Slide 8 text

Curation

Slide 9

Slide 9 text

Analysis

Slide 10

Slide 10 text

so what?

Slide 11

Slide 11 text

Image: Yael Fitzpatrick (AAAS)

Slide 12

Slide 12 text

lots of data

Slide 13

Slide 13 text

lots of people

Slide 14

Slide 14 text

lots of places

Slide 15

Slide 15 text

constant change

Slide 16

Slide 16 text

we want to make our data more effective

Slide 17

Slide 17 text

versioning

Slide 18

Slide 18 text

provenance

Slide 19

Slide 19 text

filter
Image: via asklar under a CC-BY license

Slide 20

Slide 20 text

aggregate
Image: Chris Heiler

Slide 21

Slide 21 text

extend
Image: Bethan

Slide 22

Slide 22 text

human interfaces
Image: Sebastian Anthony

Slide 23

Slide 23 text

share

Slide 24

Slide 24 text

communicate
Image: Leo Reynolds

Slide 25

Slide 25 text

hard problem

Slide 26

Slide 26 text

^ Really hard problem

Slide 27

Slide 27 text

How can we be more effective with data?

Slide 28

Slide 28 text

Wrong question

Slide 29

Slide 29 text

How can we be more effective with our people?

Slide 30

Slide 30 text

Remove constraints
Image: Pieter Musterd

Slide 31

Slide 31 text

Move data to the users?

Slide 32

Slide 32 text

Move tools to the data

Slide 33

Slide 33 text

Give tools access to data

Slide 34

Slide 34 text

Give data access to tools

Slide 35

Slide 35 text

APIs

Slide 36

Slide 36 text

so how do we get there?

Slide 37

Slide 37 text

information platforms

Slide 38

Slide 38 text

the unreasonable effectiveness of data Halevy, et al. IEEE Intelligent Systems, 24, 8-12 (2009)

Slide 39

Slide 39 text

accept all data formats

Slide 40

Slide 40 text

evolve APIs

Slide 41

Slide 41 text

data is a programmable resource

Slide 42

Slide 42 text

compute is a fungible commodity

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Amazon EC2

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

ec2-run-instances

Slide 47

Slide 47 text

programmable
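
One way to read "programmable" alongside the ec2-run-instances slide: a launch request is just data that a program can assemble. The sketch below is illustrative only. It builds an argument list for the legacy `ec2-run-instances` command without executing it. The flags `-n` (instance count), `-t` (instance type), and `-k` (key pair) are given as recalled from the legacy EC2 API tools and should be checked against that tool's reference. The function name and the AMI ID are hypothetical placeholders.

```python
from typing import List, Optional


def build_run_instances_command(
    ami_id: str,
    count: int = 1,
    instance_type: Optional[str] = None,
    key_name: Optional[str] = None,
) -> List[str]:
    """Build an argv list for the legacy ec2-run-instances CLI (does not run it)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    argv = ["ec2-run-instances", ami_id, "-n", str(count)]
    if instance_type:
        argv += ["-t", instance_type]
    if key_name:
        argv += ["-k", key_name]
    return argv


# Hypothetical placeholder values: replace the AMI ID and key name with real ones.
command = build_run_instances_command(
    "ami-00000000", count=4, instance_type="m2.4xlarge", key_name="YOUR-KEY"
)
print(" ".join(command))
```

With the command-line tools installed and credentials configured, the list could be passed to `subprocess.run(command, check=True)`; that step is left out here so the sketch runs anywhere.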

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

Building Blocks

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

Sequencing data going straight to the cloud
Genomic data is assembled and analyzed in Complete Genomics’ data center and then is securely transferred over a dedicated network to Amazon Web Services (AWS) for delivery to customers either by shipping hard disk drives or electronically.
Automatic transfer of raw data to the cloud in real time during the course of the sequencing run
http://blog.basespace.illumina.com/2012/08/10/basespace-growth-the-numbers/
• 70% of all installed MiSeqs have connected to BaseSpace
• BaseSpace on HiSeq in Q4 2012

Slide 55

Slide 55 text

Analyze data in the cloud

Slide 56

Slide 56 text

Computational compound analysis
Solar panel material
Estimated computation time 264 years
156,314 core cluster across 8 regions
1.21 petaFLOPS (Rpeak)
Simulated 205,000 materials
18 hours for $33,000
16¢ per molecule
Source: c|net news, http://news.cnet.com/8301-1001_3-57611919-92/supercomputing-simulation-employs-156000-amazon-processor-cores/
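
The per-molecule figure on this slide follows directly from the other numbers. The short check below uses only values quoted on the slide. The cost per core-hour is a derived figure that does not appear in the original, and it assumes all cores ran for the full 18 hours.

```python
# Figures quoted on the slide.
total_cost_usd = 33_000
materials_simulated = 205_000
cores = 156_314
wall_clock_hours = 18

# Cost per simulated material, in cents (the slide rounds this to 16 cents).
cost_per_material_cents = total_cost_usd / materials_simulated * 100

# Derived, not on the slide: total core-hours if every core ran for the whole job.
core_hours = cores * wall_clock_hours
cost_per_core_hour_usd = total_cost_usd / core_hours

print(f"cost per material: {cost_per_material_cents:.1f} cents")
print(f"core-hours consumed: {core_hours:,}")
print(f"cost per core-hour: ${cost_per_core_hour_usd:.4f}")
```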

Slide 57

Slide 57 text

Migrated clinical trials simulations platform
Simulations in 1.2hrs vs. 60hrs!
64% reduction in costs
98% time saved for clinical trial simulations
Research Application Portfolio: Clinical Pharmacology & Pharmacometrics, Molecular Dynamics, Computational Genomics
                                                    | Internal System | Cloud
Individual Clinical Trial Simulation Run Time (Min) | 56              | 56
Total Number of Clinical Trial Simulations          | 2000            | 2000
No. Servers                                         | 2               | 256
No. CPUs                                            | 32              | 2048
Total Analysis Run Time (hr)                        | 60              | 1.2
Cost                                                | ??              | $336
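
The run times in this comparison can be sanity-checked with simple arithmetic. The sketch below assumes each simulation occupies one CPU for its full 56 minutes and that simulations run in back-to-back waves with no scheduling overhead. That is a simplification, not a description of the actual platform, so the ideal figures come out somewhat below the reported 60 and 1.2 hours.

```python
import math


def ideal_wall_clock_hours(n_simulations: int, minutes_each: float, n_cpus: int) -> float:
    """Idealized run time: one simulation per CPU, run in full waves, no overhead."""
    waves = math.ceil(n_simulations / n_cpus)
    return waves * minutes_each / 60


# Values from the slide's comparison table.
internal_ideal = ideal_wall_clock_hours(2000, 56, 32)    # reported: 60 hr
cloud_ideal = ideal_wall_clock_hours(2000, 56, 2048)     # reported: 1.2 hr

# Time saved, computed from the reported totals.
time_saved = 1 - 1.2 / 60

print(f"internal (ideal): {internal_ideal:.1f} hr")
print(f"cloud (ideal): {cloud_ideal:.2f} hr")
print(f"time saved: {time_saved:.0%}")
```

On these assumptions the speedup comes almost entirely from width: with 2048 CPUs all 2000 simulations fit in a single wave, so the floor on total run time is the length of one simulation.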

Slide 58

Slide 58 text

CHARGE Consortium
Aimed at better understanding how human genetics contributes to heart disease and aging

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

Collaborating around data: National Database for Autism Research

Slide 61

Slide 61 text

• An online environment to drive collaboration among researchers
• Synapse hosts clinical-genomic datasets
• Provides a shared compute space and suite of analysis tools for researchers

Slide 62

Slide 62 text

Global Collaboration for Global Manufacturing
Cloud provides a global, distributed, secure, and scalable environment for collaborative design and manufacturing

Slide 63

Slide 63 text

Reproduce. Reuse. Remix

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

https://github.com/awslabs/cfncluster

Slide 66

Slide 66 text

[plugin ipcluster]
setup_class = ipcluster.IPCluster
enable_notebook = true
notebook_passwd = YOUR-PASS

[cluster qiime]
node_image_id = ami-2faa7346
keyname = YOUR-KEY
cluster_size = 4
node_instance_type = m2.4xlarge
plugins = ipcluster

$ starcluster start -c qiime myqiime

Source: Justin Riley
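
The configuration on this slide is plain INI-style text, so its structure can be inspected with ordinary tooling. The sketch below uses Python's standard `configparser`, not StarCluster's own loader, to read the same settings and confirm that the plugin named by the cluster has a matching `[plugin ...]` section. The variable names are illustrative; the values are copied from the slide.

```python
import configparser

# Settings copied from the slide; YOUR-PASS and YOUR-KEY are the slide's own placeholders.
CONFIG_TEXT = """
[plugin ipcluster]
setup_class = ipcluster.IPCluster
enable_notebook = true
notebook_passwd = YOUR-PASS

[cluster qiime]
node_image_id = ami-2faa7346
keyname = YOUR-KEY
cluster_size = 4
node_instance_type = m2.4xlarge
plugins = ipcluster
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

cluster = parser["cluster qiime"]
cluster_size = cluster.getint("cluster_size")
plugin_names = [name.strip() for name in cluster["plugins"].split(",")]

# Every plugin the cluster names should have its own [plugin NAME] section.
missing_plugins = [
    name for name in plugin_names if not parser.has_section(f"plugin {name}")
]

print(f"{cluster_size} x {cluster['node_instance_type']} nodes, plugins: {plugin_names}")
print(f"missing plugin sections: {missing_plugins}")
```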

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

1000 Genomes

Slide 69

Slide 69 text

1000 Genomes Cloud BioLinux

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

Your HiSeq data Illumina BaseSpace

Slide 72

Slide 72 text

Your data GenomeSpace

Slide 73

Slide 73 text

No content

Slide 74

Slide 74 text

Reproduce. Reuse. Remix

Slide 75

Slide 75 text

Making people more effective
Image: Pieter Musterd

Slide 76

Slide 76 text

Thank You