Slide 1
Slide 1 text
The New Genomics
[email protected]
Dr. Matt Wood
Slide 2
Slide 2 text
Hello
Slide 3
Slide 3 text
Hello
Slide 4
Slide 4 text
Data
Slide 5
Slide 5 text
DNA
Slide 6
Slide 6 text
Chromosome 11 : ACTN3 : rs1815739
Slide 7
Slide 7 text
Chromosome X : rs6625163
Slide 8
Slide 8 text
Chromosome 19 : FUT2 : rs601338
Slide 9
Slide 9 text
+0.25 Chromosome 15 : rs2472297
Slide 10
Slide 10 text
Chromosome 2 : rs10427255
Slide 11
Slide 11 text
TYPE II Chromosome 10 : rs7903146
Slide 12
Slide 12 text
Chromosome 1 : rs4481887
Slide 13
Slide 13 text
I know this, because...
Slide 14
Slide 14 text
No content
Slide 15
Slide 15 text
A T C G G T C C A G G
Slide 16
Slide 16 text
DNA: A T C G G T C C A G G -> Transcription -> RNA: A G C C A G G U C C
Slide 17
Slide 17 text
DNA: A T C G G T C C A G G -> Transcription -> RNA: A G C C A G G U C C -> Translation -> Ser Gln Val
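The transcription and translation steps on slides 16 and 17 can be sketched in a few lines of Python. This is a minimal illustration and not part of the original deck: the function names `transcribe` and `translate` are hypothetical, the codon table is only the three-entry subset of the standard genetic code needed here (in which CAG reads as Gln), and the first transcribed base is skipped so that the RNA matches the sequence as drawn on the slide.

```python
# Base pairing used when an RNA strand is copied from a DNA template strand.
COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}

# Subset of the standard genetic code, just enough for this example.
CODONS = {"AGC": "Ser", "CAG": "Gln", "GUC": "Val"}


def transcribe(template_dna: str) -> str:
    """Return the RNA complement of a DNA template strand."""
    return "".join(COMPLEMENT[base] for base in template_dna)


def translate(rna: str, table: dict = CODONS) -> list:
    """Read the RNA three bases at a time; a trailing partial codon is ignored."""
    return [table[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3)]


dna = "ATCGGTCCAGG"          # sequence shown on the slide
rna = transcribe(dna)        # "UAGCCAGGUCC"
slide_rna = rna[1:]          # assumption: the slide omits the leading U
peptide = translate(slide_rna)
print(slide_rna, peptide)    # AGCCAGGUCC ['Ser', 'Gln', 'Val']
```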
Slide 18
Slide 18 text
No content
Slide 19
Slide 19 text
No content
Slide 20
Slide 20 text
Chromosome 11 : ACTN3 : rs1815739
Slide 21
Slide 21 text
Chromosome X : rs6625163
Slide 22
Slide 22 text
Chromosome 19 : FUT2 : rs601338
Slide 23
Slide 23 text
+0.25 Chromosome 15 : rs2472297
Slide 24
Slide 24 text
Chromosome 2 : rs10427255
Slide 25
Slide 25 text
TYPE II Chromosome 10 : rs7903146
Slide 26
Slide 26 text
Chromosome 1 : rs4481887
Slide 27
Slide 27 text
I know all that, because...
Slide 28
Slide 28 text
Human Genome Project
Slide 29
Slide 29 text
40 species (ensembl.org)
Slide 30
Slide 30 text
Compare species
Slide 31
Slide 31 text
Biological importance
Slide 32
Slide 32 text
Step change
Slide 33
Slide 33 text
Less time. Lower cost.
Slide 34
Slide 34 text
No content
Slide 35
Slide 35 text
No content
Slide 36
Slide 36 text
Compare individuals
Slide 37
Slide 37 text
No content
Slide 38
Slide 38 text
Data generation costs are falling (pretty much everywhere)
Slide 39
Slide 39 text
Sequencing challenge (crossed out)
Slide 40
Slide 40 text
Amazona vittata
Slide 41
Slide 41 text
Analytics challenge
Slide 42
Slide 42 text
Lots of data, lots of uses, lots of users, lots of locations
Slide 43
Slide 43 text
Cost
Slide 44
Slide 44 text
Analytics challenge (crossed out)
Slide 45
Slide 45 text
Accessibility challenge
Slide 46
Slide 46 text
The New Genomics
Slide 47
Slide 47 text
Graceful. Beautiful.
Slide 48
Slide 48 text
Impossible to re-create
Slide 49
Slide 49 text
Snowflake Science
Slide 50
Slide 50 text
Reproducibility
Slide 51
Slide 51 text
Reproducibility scales science
Slide 52
Slide 52 text
Reproduce. Reuse. Remix.
Slide 53
Slide 53 text
Value++
Slide 54
Slide 54 text
No content
Slide 55
Slide 55 text
How do we get from here to there? 5 Principles of Reproducibility
Slide 56
Slide 56 text
1. Use the gravity of data (5 Principles of Reproducibility)
Slide 57
Slide 57 text
Increasingly large data collections
Slide 58
Slide 58 text
1000 Genomes Project: 200 TB
Slide 59
Slide 59 text
Challenging to obtain and manage
Slide 60
Slide 60 text
Expensive to experiment
Slide 61
Slide 61 text
Large barrier to reproducibility
Slide 62
Slide 62 text
Data size will increase
Slide 63
Slide 63 text
Data integration will increase
Slide 64
Slide 64 text
Move data to the users
Slide 65
Slide 65 text
Move data to the users (crossed out)
Slide 66
Slide 66 text
Move tools to the data
Slide 67
Slide 67 text
Place data where it can be consumed by tools
Slide 68
Slide 68 text
Place tools where they can access data
Slide 69
Slide 69 text
No content
Slide 70
Slide 70 text
No content
Slide 71
Slide 71 text
No content
Slide 72
Slide 72 text
Canonical source
Slide 73
Slide 73 text
No content
Slide 74
Slide 74 text
More data, more users, more uses, more locations
Slide 75
Slide 75 text
Cost and complexity
Slide 76
Slide 76 text
Cost and complexity kill reproducibility
Slide 77
Slide 77 text
Utility computing
Slide 78
Slide 78 text
Availability
Slide 79
Slide 79 text
Intel Xeon E5, NVIDIA Tesla GPUs
Slide 80
Slide 80 text
90 - 120k IOPS on SSDs
Slide 81
Slide 81 text
Pay-as-you-go
Slide 82
Slide 82 text
100% Reserved capacity
Slide 83
Slide 83 text
100% Reserved capacity On-demand
Slide 84
Slide 84 text
100% Reserved capacity On-demand
Slide 85
Slide 85 text
Spot instances
Slide 86
Slide 86 text
Name-your-price
Slide 87
Slide 87 text
No content
Slide 88
Slide 88 text
2. Ease of use is a prerequisite (5 Principles of Reproducibility)
Slide 89
Slide 89 text
http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html
Slide 90
Slide 90 text
Help overcome the suck threshold
Slide 91
Slide 91 text
Easy to embrace and extend
Slide 92
Slide 92 text
Choose the right abstraction for the user
Slide 93
Slide 93 text
$ ec2-run-instances
Slide 94
Slide 94 text
$ starcluster start
Slide 95
Slide 95 text
No content
Slide 96
Slide 96 text
No content
Slide 97
Slide 97 text
Package and automate
Slide 98
Slide 98 text
Package and automate: Amazon machine images, VM import
Slide 99
Slide 99 text
Package and automate: Amazon machine images, VM import; deployment scripts, CloudFormation, Chef, Puppet
Slide 100
Slide 100 text
Expert-as-a-service
Slide 101
Slide 101 text
No content
Slide 102
Slide 102 text
No content
Slide 103
Slide 103 text
1000 Genomes, Cloud BioLinux
Slide 104
Slide 104 text
No content
Slide 105
Slide 105 text
Your HiSeq data: Illumina BaseSpace
Slide 106
Slide 106 text
DNA and RNA sequences: GenomeSpace, Broad Institute at MIT
Slide 107
Slide 107 text
Data as a programmable resource
Slide 108
Slide 108 text
3. Reuse is as important as reproduction (5 Principles of Reproducibility)
Slide 109
Slide 109 text
Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
Slide 110
Slide 110 text
Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
Slide 111
Slide 111 text
Infonauts are hackers
Slide 112
Slide 112 text
They have their own way of working
Slide 113
Slide 113 text
The ‘Big Red Button’
Slide 114
Slide 114 text
Fire-and-forget reproduction is a good first step, but it limits longer-term value.
Slide 115
Slide 115 text
Monolithic, one-stop-shop
Slide 116
Slide 116 text
Work well for intended purpose
Slide 117
Slide 117 text
Challenging to install, dependency heavy
Slide 118
Slide 118 text
Inflexible
Slide 119
Slide 119 text
Embrace infonauts as hackers
Slide 120
Slide 120 text
Small things. Loosely coupled.
Slide 121
Slide 121 text
Easier to reuse
Slide 122
Slide 122 text
Easier to integrate
Slide 123
Slide 123 text
Scale out
Slide 124
Slide 124 text
Cancer drug discovery: 50,000 cores for < $1,000 an hour (Schrödinger and CycleServer)
Slide 125
Slide 125 text
4. Build for collaboration (5 Principles of Reproducibility)
Slide 126
Slide 126 text
Workflows are memes
Slide 127
Slide 127 text
Reproduction is just the first step
Slide 128
Slide 128 text
Bill of materials: code, data, configuration, infrastructure
Slide 129
Slide 129 text
Full definition for reproduction
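The bill of materials on slide 128 can be treated as a small data structure. The following is a minimal sketch under stated assumptions, not the speaker's implementation: the class name `BillOfMaterials`, its `fingerprint` method, and all example values (the commit label, dataset names, and the `ami-EXAMPLE` image id) are hypothetical placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class BillOfMaterials:
    """The four ingredients named on the slide (field contents are illustrative)."""
    code: str             # e.g. a repository name plus commit id
    data: tuple           # identifiers of the datasets used
    configuration: dict   # parameters the analysis was run with
    infrastructure: str   # e.g. a machine image identifier

    def fingerprint(self) -> str:
        """Hash a canonical JSON form so identical definitions compare equal."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


bom = BillOfMaterials(
    code="example-pipeline@abc123",                 # hypothetical
    data=("public-dataset-1", "custom-dataset-1"),  # hypothetical
    configuration={"threads": 8},
    infrastructure="ami-EXAMPLE",                   # hypothetical
)
same = replace(bom)                                 # identical copy
changed = replace(bom, configuration={"threads": 16})

print(bom.fingerprint() == same.fingerprint())      # True
print(bom.fingerprint() == changed.fingerprint())   # False
```

Two runs with the same fingerprint started from the same definition; any change to code, data, configuration, or infrastructure produces a different one.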
Slide 130
Slide 130 text
Utility computing provides a playground for data science
Slide 131
Slide 131 text
Code + AMI + custom datasets + public datasets + databases + compute + result data
Slide 132
Slide 132 text
Code + AMI + custom datasets + public datasets + databases + compute + result data
Slide 133
Slide 133 text
Code + AMI + custom datasets + public datasets + databases + compute + result data
Slide 134
Slide 134 text
Code + AMI + custom datasets + public datasets + databases + compute + result data
Slide 135
Slide 135 text
Package, automate, contribute.
Slide 136
Slide 136 text
Utility platform provides scale for production runs
Slide 137
Slide 137 text
5. Provenance is a first class object (5 Principles of Reproducibility)
Slide 138
Slide 138 text
Versioning becomes really important
Slide 139
Slide 139 text
Especially in an active community
Slide 140
Slide 140 text
Doubly so with loosely coupled tools
Slide 141
Slide 141 text
Provenance metadata is a first class entity
Slide 142
Slide 142 text
Distributed provenance
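One way to make provenance a first class entity is to emit a record alongside every output that names the tool, its version, and the content hashes of its inputs. The sketch below is an assumption-laden illustration rather than anything shown in the talk: `ProvenanceRecord`, `run_step`, `lineage`, and the two toy steps are all hypothetical names.

```python
import hashlib
from dataclasses import dataclass


def content_id(payload: bytes) -> str:
    """Identify data by its content, so records stay valid wherever they live."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    output_id: str
    tool: str
    tool_version: str
    input_ids: tuple


def run_step(tool, tool_version, func, *inputs):
    """Run one loosely coupled tool and return its output with a provenance record."""
    output = func(*inputs)
    record = ProvenanceRecord(
        content_id(output), tool, tool_version,
        tuple(content_id(i) for i in inputs),
    )
    return output, record


def lineage(output_id, records):
    """Walk back from an output to every upstream content id."""
    by_output = {r.output_id: r for r in records}
    seen, stack = [], [output_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        record = by_output.get(current)
        if record is not None:
            stack.extend(record.input_ids)
    return seen


raw = b"ATCGGTCCAGG"
reversed_seq, r1 = run_step("reverse", "1.0", lambda b: b[::-1], raw)
lowered, r2 = run_step("lowercase", "2.1", lambda b: b.lower(), reversed_seq)

# The final output traces back through both steps to the raw input.
print(content_id(raw) in lineage(r2.output_id, [r1, r2]))  # True
```

Because each record refers to inputs only by content hash, records produced in different places can be merged later without a central registry.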
Slide 143
Slide 143 text
5 Principles of Reproducibility
Slide 144
Slide 144 text
Remove constraints (5 Principles of Reproducibility)
Slide 145
Slide 145 text
Accelerate science (5 Principles of Reproducibility)
Slide 146
Slide 146 text
Chromosome 11 : ACTN3 : rs1815739
Slide 147
Slide 147 text
Chromosome X : rs6625163
Slide 148
Slide 148 text
Chromosome 19 : FUT2 : rs601338
Slide 149
Slide 149 text
+0.25 Chromosome 15 : rs2472297
Slide 150
Slide 150 text
Chromosome 2 : rs10427255
Slide 151
Slide 151 text
TYPE II Chromosome 10 : rs7903146
Slide 152
Slide 152 text
Chromosome 1 : rs4481887
Slide 153
Slide 153 text
Thank you. aws.amazon.com, @mza
[email protected]