Scaling Science

Dr. Matt Wood [email protected]

Science

Beautiful, unique.

Impossible to re-create

Snowflake Science

Reproducibility

Reproducibility scales science

Reproduce. Reuse. Remix.

Value++

How do we get from here to there?

5 Principles of Reproducibility

1. Data has gravity

Increasingly large data collections

1000 Genomes Project: 200 TB

Challenging to obtain and manage

Expensive to experiment

Large barrier to reproducibility

Data size will increase

Data integration will increase

Data dependencies will increase

Move data to the users? No.

Move tools to the data

Place data where it can be consumed by tools

Place tools where they can access data
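A rough back-of-the-envelope sketch of why data has gravity: the time needed to copy a 200 TB collection to each user over a network link. The link speeds are illustrative assumptions, not figures from the talk. The calculation ignores protocol overhead and reads the slide's "200Tb" as decimal terabytes.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link at a sustained rate."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


DATASET_BYTES = 200e12  # 200 TB, decimal terabytes (assumed reading of the slide)

# Hypothetical sustained link speeds, chosen only for illustration.
for label, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"{label}: {transfer_days(DATASET_BYTES, rate):.1f} days")
```

Under these assumptions, a sustained 1 Gbit/s link needs about 18.5 days per copy, before any analysis starts. Running the tools next to the canonical copy avoids that cost for every additional user.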

Canonical source

More data, more users, more uses, more locations

Force multiplier

Complexity

Cost and complexity kill reproducibility

Utility computing

Availability

Pay-as-you-go

Flexibility

Performance

CPU + IO

Intel Xeon E5, NVIDIA Tesla GPUs

240 TFLOPS

90 - 120k IOPS on SSDs

Performance through productivity

On-demand access

Reserved capacity

100% reserved capacity + on-demand

Spot instances

Utility computing enhances reproducibility

2. Ease of use is a pre-requisite

http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html

Help overcome the suck threshold

Easy to embrace and extend

Choose the right abstraction for the user

$ ec2-run-instances

$ starcluster start

Package and automate: Amazon machine images, VM import, deployment scripts, CloudFormation, Chef, Puppet

Expert-as-a-service

1000 Genomes Cloud BioLinux

Your HiSeq data Illumina BaseSpace

Architectural freedom

Freedom of abstraction

3. Reuse is as important as reproduction

Seven Deadly sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics

Infonauts are hackers

They have their own way of working

The ‘Big Red Button’

Fire-and-forget reproduction is a good first step, but it limits longer-term value.

Monolithic, one-stop-shop

Work well for intended purpose

Challenging to install, dependency heavy

Difficult to grok

Inflexible

Infonauts are hackers: embrace it.

Small things. Loosely coupled.

Easier to grok

Easier to reuse

Easier to integrate

Lower barrier to entry

Scale out

Build for reuse. Be remix friendly. Maximize value.
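One way to picture "small things, loosely coupled" is a minimal sketch in which each step is a tiny function over lines of text, and a pipeline is just their composition. The step names and sample input are hypothetical, chosen only for illustration.

```python
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[Iterable[str]], Iterable[str]]


def drop_comments(lines: Iterable[str]) -> Iterable[str]:
    """Discard lines that start with '#'."""
    return (line for line in lines if not line.startswith("#"))


def strip_whitespace(lines: Iterable[str]) -> Iterable[str]:
    """Trim leading and trailing whitespace from every line."""
    return (line.strip() for line in lines)


def non_empty(lines: Iterable[str]) -> Iterable[str]:
    """Discard empty lines."""
    return (line for line in lines if line)


def compose(*steps: Step) -> Step:
    """Chain steps left to right into a single reusable step."""
    def pipeline(lines: Iterable[str]) -> Iterable[str]:
        return reduce(lambda data, step: step(data), steps, lines)
    return pipeline


clean = compose(drop_comments, strip_whitespace, non_empty)
print(list(clean(["# header", "  ACGT ", "", "TTGA"])))
```

Because each step knows nothing about the others, someone else can reuse `strip_whitespace` alone or remix the steps into a different pipeline without touching the originals.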

4. Build for collaboration

Workflows are memes

Reproduction is just the first step

Bill of materials: code, data, configuration, infrastructure

Full definition for reproduction
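The bill of materials can be sketched as a small record with one field per item on the slide, plus a fingerprint that lets two people check they are reproducing the same thing. The class name, field types, and placeholder values are assumptions for illustration, not a format from the talk.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BillOfMaterials:
    code: str            # hypothetical: a source revision identifier
    data: tuple          # hypothetical: identifiers of the input datasets
    configuration: dict  # parameters the run was started with
    infrastructure: str  # hypothetical: a machine image identifier

    def fingerprint(self) -> str:
        """SHA-256 over a canonical JSON form, independent of dict key order."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


bom = BillOfMaterials(
    code="rev-placeholder",
    data=("dataset-a", "dataset-b"),
    configuration={"threads": 4, "seed": 1},
    infrastructure="image-placeholder",
)
print(bom.fingerprint())
```

Any change to the code revision, datasets, parameters, or image yields a different fingerprint, so a matching fingerprint is a cheap signal that a run starts from the same definition.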

Utility computing provides a playground for bioinformatics

Code + AMI + custom datasets + public datasets + databases + compute + result data

Package, automate, contribute.

Utility platform provides scale for production runs

Drug discovery on 50k cores: Less than $1000

5. Provenance is a first class object

Versioning becomes really important

Especially in an active community

Doubly so with loosely coupled tools

Provenance metadata is a first class entity

Distributed provenance
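A minimal sketch of provenance as a first class object: every step returns its result together with a record of the tool, its version, and digests of the input and output, linked to the records it depends on. All names here (`Provenance`, `run`, the toy tools and version strings) are hypothetical and chosen only for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Tuple


def digest(payload: bytes) -> str:
    """Content digest used to identify inputs, outputs, and records."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class Provenance:
    tool: str
    version: str
    input_digest: str
    output_digest: str
    parents: Tuple[str, ...] = ()  # record ids of upstream steps

    @property
    def record_id(self) -> str:
        parts = (self.tool, self.version, self.input_digest,
                 self.output_digest, *self.parents)
        return digest("|".join(parts).encode("utf-8"))


def run(tool: str, version: str, func: Callable[[bytes], bytes],
        payload: bytes, parents: Tuple[Provenance, ...] = ()):
    """Run one step and return (result, provenance record)."""
    result = func(payload)
    record = Provenance(tool, version, digest(payload), digest(result),
                        tuple(p.record_id for p in parents))
    return result, record


upper, first = run("upper", "1.0", bytes.upper, b"acgt")
reverse, second = run("reverse", "0.2", lambda b: b[::-1], upper, parents=(first,))
print(second)
```

Because each record carries its tool version and the ids of its parents, records produced by separate, loosely coupled tools can be chained after the fact, and a version bump in any tool shows up as a different record id.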

5 Principles of Reproducibility

1. Data has gravity
2. Ease of use is a pre-requisite
3. Reuse is as important as reproduction
4. Build for collaboration
5. Provenance is a first class object

Thank you aws.amazon.com @mza [email protected]
