Introducing five principles for reproducibility.
Scaling Science
Dr. Matt Wood
[email protected]
Hello
Science
Beautiful, unique.
Impossible to re-create
Snowflake Science
Reproducibility
Reproducibility scales science
Reproduce. Reuse. Remix.
Value++
How do we get from here to there?
5 PRINCIPLES OF REPRODUCIBILITY
1. Data has gravity
Increasingly large data collections
1000 Genomes Project: 200 TB
Challenging to obtain and manage
Expensive to experiment
Large barrier to reproducibility
Data size will increase
Data integration will increase
Data dependencies will increase
✗ Move data to the users
Move tools to the data
Place data where it can be consumed by tools
Place tools where they can access data
Canonical source
More data, more users, more uses, more locations
Cost
Force multiplier
Complexity
Cost and complexity kill reproducibility
Utility computing
Availability
Pay-as-you-go
Flexibility
Performance
CPU + IO
Intel Xeon E5, NVIDIA Tesla GPUs
240 TFLOPS
90k to 120k IOPS on SSDs
Performance through productivity
On-demand access
Reserved capacity
Chart: 100%, reserved capacity, on-demand
Spot instances
Utility computing enhanced reproducibility
2. Ease of use is a pre-requisite
http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html
Help overcome the suck threshold
Easy to embrace and extend
Choose the right abstraction for the user
$ ec2-run-instances
$ starcluster start
Package and automate
Amazon machine images, VM import
Deployment scripts, CloudFormation, Chef, Puppet
Expert-as-a-service
1000 Genomes: Cloud BioLinux
Your HiSeq data: Illumina BaseSpace
Architectural freedom
Freedom of abstraction
3. Reuse is as important as reproduction
Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
Infonauts are hackers
They have their own way of working
The ‘Big Red Button’
Fire-and-forget reproduction is a good first step, but limits longer-term value.
Monolithic, one-stop-shop
Works well for its intended purpose
Challenging to install, dependency heavy
Difficult to grok
Inflexible
Infonauts are hackers: embrace it.
Small things. Loosely coupled.
Easier to grok
Easier to reuse
Easier to integrate
Lower barrier to entry
Scale out
Build for reuse. Be remix friendly. Maximize value.
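One way to picture "small things, loosely coupled" is a set of tiny steps that share a single interface and can be rearranged freely. The sketch below is only an illustration: the step names (`uppercase`, `drop_short`, `dedupe`), the `pipeline` helper, and the toy reads are all assumptions made for this example, not tools from the talk.

```python
from functools import reduce
from typing import Callable, List

# A step is any function that takes a list of reads and returns a list of reads.
# Sharing this one signature is what keeps the steps loosely coupled.
Step = Callable[[List[str]], List[str]]


def uppercase(reads: List[str]) -> List[str]:
    """Normalise reads to upper case."""
    return [r.upper() for r in reads]


def drop_short(min_len: int) -> Step:
    """Build a step that discards reads shorter than min_len."""
    return lambda reads: [r for r in reads if len(r) >= min_len]


def dedupe(reads: List[str]) -> List[str]:
    """Remove duplicate reads while keeping first-seen order."""
    return list(dict.fromkeys(reads))


def pipeline(*steps: Step) -> Step:
    """Chain steps left to right into a single callable."""
    return lambda reads: reduce(lambda acc, step: step(acc), steps, reads)


# Hypothetical toy input; real reads would come from a sequencing run.
clean = pipeline(uppercase, drop_short(4), dedupe)
result = clean(["acgt", "ac", "ACGT", "ttga"])
print(result)
```

Because each step stands alone, someone else can reuse `dedupe` by itself, swap the order, or drop in a step of their own without touching the rest.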
4. Build for collaboration
Workflows are memes
Reproduction is just the first step
Bill of materials: code, data, configuration, infrastructure
Full definition for reproduction
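A bill of materials can be as simple as a small manifest that names all four ingredients and can be fingerprinted. The following is a minimal sketch under stated assumptions: the function names, the manifest layout, the pipeline name, and the placeholder image and instance identifiers are hypothetical and do not describe any particular tool or AWS API.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def bill_of_materials(code_version, datasets, configuration, infrastructure):
    """Collect code, data, configuration and infrastructure into one manifest.

    Datasets are recorded by content hash so that a changed input is detectable.
    """
    return {
        "code": code_version,
        "data": {name: sha256_of(content) for name, content in sorted(datasets.items())},
        "configuration": configuration,
        "infrastructure": infrastructure,
    }


def fingerprint(bom: dict) -> str:
    """Hash a canonical JSON form of the manifest into a single identifier."""
    canonical = json.dumps(bom, sort_keys=True, separators=(",", ":"))
    return sha256_of(canonical.encode("utf-8"))


# All values below are hypothetical placeholders for illustration.
bom = bill_of_materials(
    code_version="example-pipeline@1.0.0",
    datasets={"reads.fastq": b"@r1\nACGT\n+\nIIII\n"},
    configuration={"min_quality": 20},
    infrastructure={"machine_image": "ami-EXAMPLE", "instance_type": "example.large"},
)
print(json.dumps(bom, indent=2))
print(fingerprint(bom))
```

Two runs with the same fingerprint used the same code, data, configuration and infrastructure description; a different fingerprint signals that at least one ingredient changed.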
Utility computing provides a playground for bioinformatics
Code + AMI + custom datasets + public datasets + databases + compute + result data
Package, automate, contribute.
Utility platform provides scale for production runs
Drug discovery on 50k cores: less than $1000
5. Provenance is a first class object
Versioning becomes really important
Especially in an active community
Doubly so with loosely coupled tools
Provenance metadata is a first class entity
Distributed provenance
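Treating provenance as a first class object could mean giving every output a small, versioned record that points back at the records it was derived from. The sketch below assumes one possible design, content-addressed records linked by ID; the `ProvenanceRecord` class, the `lineage` helper, and the tool names and versions are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Describes how one output was produced."""
    tool: str
    tool_version: str
    parameters: Tuple[Tuple[str, str], ...]
    parents: Tuple[str, ...]  # IDs of the records this output was derived from

    @property
    def record_id(self) -> str:
        """Content-derived ID: any change to tool, version, parameters or parents changes it."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lineage(record_id: str, store: Dict[str, ProvenanceRecord]) -> List[str]:
    """Walk back through parent links, returning 'tool@version' strings, nearest first."""
    seen, order, stack = set(), [], [record_id]
    while stack:
        rid = stack.pop()
        if rid in seen:
            continue
        seen.add(rid)
        rec = store[rid]
        order.append(f"{rec.tool}@{rec.tool_version}")
        stack.extend(rec.parents)
    return order


# Hypothetical two-step workflow: an alignment feeding a variant call.
align = ProvenanceRecord("aligner", "0.1.0", (("threads", "4"),), ())
call = ProvenanceRecord("variant-caller", "2.3.1", (("min_depth", "10"),), (align.record_id,))
store = {r.record_id: r for r in (align, call)}
print(lineage(call.record_id, store))
```

Since IDs are derived from content rather than assigned centrally, records created by different groups can be merged into one store without coordination, which is one reading of "distributed provenance".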
1. Data has gravity
2. Ease of use is a pre-requisite
3. Reuse is as important as reproduction
4. Build for collaboration
5. Provenance is a first class object
Thank you
aws.amazon.com
@mza
[email protected]