Slide 1

Slide 1 text

Reproducible Phylogenetics
Dave Lunt, Amir Szitenberg, Max John, Mark Blaxter
slides available: speakerdeck.com/davelunt
software: http://hulluni-bioinformatics.github.io/ReproPhylo

Slide 2

Slide 2 text

Reproducible Phylogenetics: talk outline in questions. What’s wrong with phylogenetics now? What are the advantages of reproducibility to me? How can I do this?

Slide 3

Slide 3 text

What’s wrong with phylogenetics now? My view is: lack of reproducibility is a problem; we don’t take advantage of advances in computing environments; genomics is going to break it. I’ll explain…

Slide 4

Slide 4 text

Phylogenetics is everywhere

Slide 5

Slide 5 text

Phylogenetics is everywhere. PubMed has ~100,000 articles with phylog* in the title or abstract, in >700 different journals.

Slide 6

Slide 6 text

Phylogenetics is everywhere. We are in the new age of phylogenomics: a scale of data we are badly prepared to analyse.

Slide 7

Slide 7 text

We are in the new age of phylogenomics: a scale of data we are badly prepared to analyse. Algorithm bottlenecks. Human bottlenecks.

Slide 8

Slide 8 text

What is reproducible phylogenetics and why is it important?

Slide 9

Slide 9 text

Can you get their data?

Slide 10

Slide 10 text

Can you get their data? Alignments as well as raw data?

Slide 11

Slide 11 text

Can you get their data? Trees?
((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,weasel:18.87953):2.09460):3.87382,dog:25.46154);

Slide 12

Slide 12 text

Can you get their data? Do alignments & trees share taxon names with figures?
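The taxon-name question can be checked mechanically. The sketch below is a hedged illustration in plain Python, not part of ReproPhylo: it pulls tip labels out of a Newick string (the tree shown on the previous slide) and sequence names out of FASTA text, then lists the names that appear in only one of them. The alignment is made up for illustration, the helper names are hypothetical, and the regular expression assumes unquoted labels.

```python
import re
from typing import Dict, List, Set


def newick_tips(newick: str) -> Set[str]:
    """Return tip labels from a Newick string with unquoted names."""
    # A tip label directly follows "(" or ","; text after ")" belongs to internal nodes.
    return set(re.findall(r"[(,]\s*([^\s():,;]+)", newick))


def fasta_names(fasta_text: str) -> Set[str]:
    """Return the first word of every non-empty FASTA header line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def name_mismatches(newick: str, fasta_text: str) -> Dict[str, List[str]]:
    """List names found in the tree only and in the alignment only."""
    tree, alignment = newick_tips(newick), fasta_names(fasta_text)
    return {
        "only_in_tree": sorted(tree - alignment),
        "only_in_alignment": sorted(alignment - tree),
    }


TREE = (
    "((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,"
    "seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,"
    "weasel:18.87953):2.09460):3.87382,dog:25.46154);"
)
# Made-up alignment for illustration; "sealion" deliberately differs from "sea_lion".
ALIGNMENT = ">raccoon\nACGT\n>bear\nACGT\n>sealion\nACGT\n"

print(name_mismatches(TREE, ALIGNMENT))
```

A check like this is cheap to run before submission, and it catches the renaming that tends to happen between analysis files and published figures.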

Slide 13

Slide 13 text

Can you get their software?

Slide 14

Slide 14 text

Can you get their software? Do you know which version they used?

Slide 15

Slide 15 text

Can you get their software? Do you know the parameters they ran?
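One way to make version and parameters recoverable is to record them at the moment a program is run. The following is a minimal sketch using only the Python standard library, not ReproPhylo's own mechanism. It assumes the program prints its version when called with `--version`, which is not true of every tool, and the function names are hypothetical. The Python interpreter itself stands in for a phylogenetics program so the example runs anywhere.

```python
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone


def program_version(program, version_flag="--version"):
    """Ask a program for its version; return "unknown" if that fails."""
    try:
        done = subprocess.run(
            [program, version_flag], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    # Some tools print the version to stderr, so look at both streams.
    lines = (done.stdout + done.stderr).strip().splitlines()
    return lines[0] if lines else "unknown"


def describe_run(program, parameters, version_flag="--version"):
    """Collect what someone would need to repeat one program call."""
    return {
        "program": program,
        "version": program_version(program, version_flag),
        "parameters": list(parameters),
        "platform": platform.platform(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# The interpreter is used here only as a stand-in for an analysis program.
record = describe_run(sys.executable, ["-c", "pass"])
print(json.dumps(record, indent=2))
```

Saving one such record next to every output file answers both questions on these slides without relying on memory.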

Slide 16

Slide 16 text

Can you exactly reproduce the figures from their paper? Or are the figures just pictures of results rather than the results themselves?

Slide 17

Slide 17 text

If you can’t reproduce the work, is it science? Science is iterative, building on previous work.

Slide 18

Slide 18 text

“If I have seen further it is by standing on the shoulders of giants” Isaac Newton

Slide 20

Slide 20 text

Reproducibility is a very hot topic in bioinformatics but has had little influence on phylogenetics

Slide 21

Slide 21 text

Reproducibility is the right thing to do. It is likely to be compulsory in the future.

Slide 22

Slide 22 text

What are the advantages of reproducibility to me?

Slide 23

Slide 23 text

Reproducibility will make your life much easier The rest of the talk looks at advantages to you

Slide 24

Slide 24 text

Reproducibility will make your life much easier Who will replicate your analysis?

Slide 25

Slide 25 text

Reproducibility will make your life much easier Who will replicate your analysis? Future you!

Slide 26

Slide 26 text

Reproducibility will make your life much easier. Manual data processing is ‘old phylogenetics’: it hinders reproducibility and does not scale. Widespread programmatic approaches are required.

Slide 27

Slide 27 text

Reproducibility will make your life much easier. Current phylogenetics is not experimental. How often have you tested the effect of Clustal parameter choices?

Slide 28

Slide 28 text

Reproducible scripted ‘pipelines’ are inherently experimental: experimental phylogenetics.

Slide 29

Slide 29 text

Reproducibility leads to experimental phylogenetics. A synthetic example: tree replicates built from alignments constructed with 10 different alignment parameters. [Figure labels: support; gap trimming ‘relaxedness’]
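Once an analysis is scripted, treating a parameter as an experimental variable is a short loop. The sketch below is an illustration only, not the analysis behind the slide: it sweeps a gap-trimming threshold over a made-up toy alignment and reports how many columns survive at each setting. The function name, the threshold values and the sequences are all assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction):
    """Drop columns whose fraction of gap characters exceeds the threshold."""
    names = list(alignment)
    rows = [alignment[name] for name in names]
    keep = [
        i
        for i in range(len(rows[0]))
        if sum(row[i] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


# Made-up alignment; all rows have the same length.
toy_alignment = {
    "taxon_a": "ACG-TA--",
    "taxon_b": "ACGGTA-C",
    "taxon_c": "AC--TAGC",
    "taxon_d": "ACG-T--C",
}

# Higher thresholds are more "relaxed": more gappy columns are kept.
columns_kept = {}
for threshold in (0.0, 0.25, 0.5, 0.75, 1.0):
    trimmed = trim_gappy_columns(toy_alignment, threshold)
    columns_kept[threshold] = len(trimmed["taxon_a"])
    print(threshold, columns_kept[threshold])
```

In a real pipeline, each trimmed alignment would go on to tree building, and a statistic such as support would be compared across thresholds.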

Slide 30

Slide 30 text

What is minimally required for reproducibility?

Slide 31

Slide 31 text

What is minimally required for reproducibility? Should we really be aiming for minimal? Archive it all.
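"Archive it all" is easier to trust when the archive can be verified. As a hedged sketch (standard library only, hypothetical function names, not ReproPhylo's archiving code), the snippet below builds a SHA-256 manifest of every file in a project folder and reports which files differ between two manifests. A temporary folder with a made-up alignment file is used so the example is self-contained.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(old, new):
    """Names that were added, removed or modified between two manifests."""
    return sorted(
        name for name in set(old) | set(new) if old.get(name) != new.get(name)
    )


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "alignment.fasta"
    target.write_text(">taxon_a\nACGT\n")
    before = build_manifest(folder)
    target.write_text(">taxon_a\nACGA\n")  # simulate an unrecorded manual edit
    after = build_manifest(folder)

print(changed_files(before, after))
```

Storing the manifest with the archive lets a later reader confirm that the files they downloaded are the files that were analysed.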

Slide 32

Slide 32 text

Computational pipelines make complete reproducibility as easy as minimal reproducibility. Only human users are concerned with minimal reproducibility.

Slide 33

Slide 33 text

How do I do this? Reproducible phylogenetics: all these challenges are solved problems for computer scientists. Computational pipelines make this trivial. All these things are done automatically: “frictionless” reproducibility.

Slide 34

Slide 34 text

ReproPhylo reproducible phylogenetics environment v1.0

Slide 35

Slide 35 text

ReproPhylo v1.0
• Open phylogenetics environment
• Uses standards
• Frictionless reproducibility
• Platform independent
• Fast
Software: http://hulluni-bioinformatics.github.io/ReproPhylo
Manual: http://goo.gl/aZeRXf
Users welcome!

Slide 36

Slide 36 text

ReproPhylo is an environment and approach, not phylogenetic tree building software. It works with GenBank sequences and metadata; your sequences, alignments and trees; and your metadata.

Slide 37

Slide 37 text

ReproPhylo is an environment and approach, not phylogenetic tree building software. Automatic archiving of ALL trees, alignments, sequences, metadata, provenance and methods, in journal-friendly zip files. Text report of all actions, analyses and results. HTML electronic lab notebook, automatically written and easy to browse. Copy-and-paste Methods section for journals.

Slide 38

Slide 38 text

ReproPhylo runs in a user-friendly IPython notebook. Analysis pipelines are provided. Edit to specify your data and modify any parameters you wish, then run, inspect, repeat.

Slide 39

Slide 39 text

ReproPhylo runs in a user-friendly IPython notebook: a mixture of user manual and analysis framework. Change a parameter and hit Run.

Slide 40

Slide 40 text

Exploratory Data Analysis example [Figure labels: code; output]

Slide 41

Slide 41 text

Exploratory Data Analysis on real data: Dunn et al 2008 doi:10.1038/nature06614 [Figure annotation: check this?]

Slide 42

Slide 42 text

Exploratory Data Analysis on real data: Dunn et al 2008 doi:10.1038/nature06614

Slide 43

Slide 43 text

Metadata is retained: the tree can be labelled, or a statistical test done, with any data that can be harvested from the original GenBank file (or any other associated data file). Example: sponge tree with morphological annotations at tips.
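Labelling tips from retained metadata can be pictured as a lookup keyed on taxon name. The sketch below is a toy illustration in plain Python, not ReproPhylo's implementation: it appends one metadata field to each tip label in a Newick string. The taxa, the `habitat` field and its values are made up, and the function assumes unquoted labels and metadata values free of Newick punctuation.

```python
import re


def annotate_tips(newick, metadata, field):
    """Append metadata[name][field] to each tip label that has a value."""

    def relabel(match):
        prefix, name = match.group(1), match.group(2)
        value = metadata.get(name, {}).get(field)
        if value is None:
            return match.group(0)  # leave tips without metadata unchanged
        return f"{prefix}{name}|{value}"

    # Tip labels directly follow "(" or ","; internal node labels follow ")".
    return re.sub(r"([(,]\s*)([^\s():,;|]+)", relabel, newick)


# Made-up tree and metadata for illustration only.
toy_tree = "((taxon_a:1.0,taxon_b:2.0):0.5,taxon_c:3.0);"
toy_metadata = {
    "taxon_a": {"habitat": "marine"},
    "taxon_c": {"habitat": "freshwater"},
}

print(annotate_tips(toy_tree, toy_metadata, "habitat"))
```

Because the labels are generated from the metadata table rather than typed by hand, the same tree can be relabelled with any other field by changing one argument.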

Slide 44

Slide 44 text

Electronic lab book: the pipeline writes a human-readable text/html file documenting the experiment and outcomes, including a Methods section. Data provenance and version control are included. Easy archiving for journal submission.
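The idea of a pipeline that documents itself can be sketched in a few lines. This is a hypothetical illustration, not ReproPhylo's reporting code: each step is recorded as it happens, and both a plain-text log and a draft Methods paragraph are generated from the same records. The class name, program names, versions and parameters are all invented.

```python
from datetime import datetime, timezone


class LabNotebook:
    """Collect one record per analysis step and render them as text."""

    def __init__(self):
        self.entries = []

    def record(self, step, program, version, parameters):
        self.entries.append(
            {
                "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
                "step": step,
                "program": program,
                "version": version,
                "parameters": dict(parameters),
            }
        )

    @staticmethod
    def _format_parameters(parameters):
        return ", ".join(f"{key}={value}" for key, value in sorted(parameters.items()))

    def report(self):
        """Tab-separated log, one line per recorded step."""
        return "\n".join(
            "\t".join(
                [e["time"], e["step"], f"{e['program']} {e['version']}",
                 self._format_parameters(e["parameters"])]
            )
            for e in self.entries
        )

    def methods_text(self):
        """Draft Methods sentences built from the same records as the log."""
        sentences = []
        for e in self.entries:
            params = self._format_parameters(e["parameters"]) or "default parameters"
            sentences.append(
                f"{e['step']} used {e['program']} version {e['version']} ({params})."
            )
        return " ".join(sentences)


notebook = LabNotebook()
# Hypothetical programs and settings, for illustration only.
notebook.record("Alignment", "example_aligner", "1.0", {"gap_open": 1.5})
notebook.record("Tree building", "example_tree_builder", "2.3", {"replicates": 100})
print(notebook.report())
print(notebook.methods_text())
```

Generating the Methods text from the log, rather than writing it afterwards, means the description cannot drift away from what was actually run.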

Slide 45

Slide 45 text

ReproPhylo writes very extensive Results automatically, e.g. alignment statistics.

Slide 46

Slide 46 text

ReproPhylo opens new doors: more than reproducibility. It allows experimental, hypothesis-testing phylogenomics. ReproPhylo is an environment and approach, not a tree building algorithm.

Slide 47

Slide 47 text

ReproPhylo and molecular evolution: a similar approach gives reproducible, comparative evolutionary genomics. Amir Szitenberg, “Comparative genomics of transposon evolution”, Friday 11.20.

Slide 48

Slide 48 text

Reproducible Phylogenetics: ReproPhylo
Dave Lunt, Amir Szitenberg, Max John, Mark Blaxter
slides available: speakerdeck.com/davelunt
software: http://hulluni-bioinformatics.github.io/ReproPhylo