Reproducible Phylogenetics

Dave Lunt
January 08, 2015

Given at the 48th Population Genetics Group meeting in Sheffield, January 2015

Transcript

  1. Reproducible
    Phylogenetics
    Dave Lunt, Amir Szitenberg, Max John, Mark Blaxter
    slides available: speakerdeck.com/davelunt
    software: http://hulluni-bioinformatics.github.io/ReproPhylo

  2. Reproducible Phylogenetics
    talk outline in questions
    What's wrong with
    phylogenetics now?
    What are the advantages of
    reproducibility to me?
    How can I do this?

  3. What's wrong with
    phylogenetics now?
    My view is:
    Lack of reproducibility is a problem
    We don’t take advantage of
    computing environment advances
    Genomics is going to break it
    I’ll explain…

  4. Phylogenetics is
    everywhere

  5. Phylogenetics is everywhere
    PubMed has ~100,000
    articles with phylog* in
    title / abstract
    in >700 different journals

  6. Phylogenetics is everywhere
    We are in the new age of
    phylogenomics
    a scale of data we are badly
    prepared to analyse

  7. We are in the new age of
    phylogenomics
    a scale of data we are badly
    prepared to analyse
    Algorithm bottlenecks
    Human bottlenecks

  8. What is reproducible
    phylogenetics and why is
    it important?

  9. Can you get their
    data?

  10. Can you get their
    data?
    Alignments as well as
    raw data?

  11. Can you get their
    data?
    Trees?
    ((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,weasel:18.87953):2.09460):3.87382,dog:25.46154);
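
A tree shared as a Newick string, like the one on this slide, can be read by a program rather than retyped from a figure. The following is a minimal sketch using only the Python standard library. The function name `tip_labels` is an illustrative assumption and is not part of ReproPhylo, and the regular expression only handles unquoted tip names that carry branch lengths, as in this example.

```python
import re

# The Newick string from the slide, joined onto logical lines.
NEWICK = (
    "((raccoon:19.19959,bear:6.80041):0.84600,"
    "((sea_lion:11.99700,seal:12.00300):7.52973,"
    "((monkey:100.85930,cat:47.14069):20.59201,weasel:18.87953):2.09460):3.87382,"
    "dog:25.46154);"
)


def tip_labels(newick):
    """Return tip names in order of appearance.

    A tip is taken to be a label that directly follows '(' or ','
    and is followed by ':' (its branch length). Branch lengths of
    internal nodes follow ')' and are therefore not matched.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)\s*:", newick)


print(tip_labels(NEWICK))
```

For real analyses a dedicated tree library is a safer choice than a regular expression, since Newick also allows quoted labels, comments and trees without branch lengths.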

  12. Can you get their
    data?
    Do alignments & trees
    share taxon names with
    figures?
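
Whether an alignment and a tree use the same taxon names is something a script can check. The sketch below assumes a FASTA alignment and a Newick tree with unquoted tip names; the sequences, the tree and the helper names (`fasta_names`, `newick_tips`, `compare_names`) are made up for illustration and do not come from the talk or from ReproPhylo.

```python
import re

# Made-up example data: the alignment says "sealion", the tree says "sea_lion".
FASTA = """>raccoon
ATGC
>bear
ATGA
>sealion
ATGG
"""
TREE = "((raccoon:1.0,bear:1.0):0.5,sea_lion:2.0);"


def fasta_names(fasta_text):
    """Collect the first word of every non-empty '>' header line."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def newick_tips(newick):
    """Collect unquoted tip names: labels that directly follow '(' or ','."""
    return set(re.findall(r"[(,]\s*([^():,;\s]+)\s*(?=[:,)])", newick))


def compare_names(fasta_text, newick):
    """Return (names only in the alignment, names only in the tree)."""
    in_alignment = fasta_names(fasta_text)
    in_tree = newick_tips(newick)
    return sorted(in_alignment - in_tree), sorted(in_tree - in_alignment)


only_in_alignment, only_in_tree = compare_names(FASTA, TREE)
print("only in alignment:", only_in_alignment)
print("only in tree:", only_in_tree)
```

Two empty lists would mean the names agree. Here the check reports the `sealion` / `sea_lion` mismatch, the kind of small difference that makes published data hard to reuse.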

  13. Can you get their
    software?

  14. Can you get their
    software?
    Do you know which
    version they used?

  15. Can you get their
    software?
    Do you know the
    parameters they ran?
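
Versions and parameters are easy to lose when they are typed by hand and easy to keep when a script writes them down. The sketch below shows one possible shape for such a record, using only the Python standard library. The program name, version string, parameter names and values are all placeholders invented for illustration; this is not the format ReproPhylo uses.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def provenance_record(program, version, parameters, input_bytes):
    """Bundle what was run, how, and on which input into one dictionary."""
    return {
        "program": program,
        "version": version,
        "parameters": parameters,
        # A checksum ties the record to the exact input data.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }


record = provenance_record(
    program="example_aligner",  # hypothetical program name
    version="0.0.0",  # placeholder; record the real version string
    parameters={"gap_open": 10.0, "gap_extend": 0.5},  # illustrative values
    input_bytes=b">seq1\nACGT\n",  # made-up input
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Saving this JSON next to the results answers both questions on these slides without relying on anyone's memory.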

  16. Can you exactly
    reproduce the figures
    from their paper?
    or are the figures just pictures of results rather than the results themselves?

  17. If you can’t reproduce
    the work, is it science?
    Science is iterative, building on
    previous work

  18. “If I have seen further it is
    by standing on the
    shoulders of giants”
    Isaac Newton

  19. [no text on this slide]

  20. Reproducibility is a very
    hot topic in
    bioinformatics
    but has had little influence on
    phylogenetics

  21. Reproducibility is the right
    thing to do
    It is likely to be
    compulsory in the future

  22. What are the
    advantages of
    reproducibility to me?

  23. Reproducibility will
    make your life much
    easier
    The rest of the talk looks at advantages to you

  24. Reproducibility will make your
    life much easier
    Who will replicate your
    analysis?

  25. Reproducibility will make your
    life much easier
    Who will replicate your
    analysis?
    Future you!

  26. Reproducibility will make your
    life much easier
    Manual data processing is ‘old phylogenetics’
    Hinders reproducibility
    Does not scale
    Widespread programmatic approaches are required

  27. Reproducibility will make your
    life much easier
    Current phylogenetics
    is not experimental
    How often have you tested the effect of
    Clustal parameter choices?

  28. Reproducible scripted
    ‘pipelines’ are
    inherently
    experimental
    experimental phylogenetics

  29. Reproducibility leads to
    experimental phylogenetics
    a synthetic example:
    tree replicates built from alignments
    constructed with 10 different
    alignment parameters
    [Plot axis labels: support; gap trimming ‘relaxedness’]
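
Once an analysis is scripted, testing the effect of a parameter becomes a loop. The sketch below shows the general pattern under stated assumptions: `run_pipeline` is a hypothetical stand-in for a real align, trim and build-tree step, and the parameter names and values are invented for illustration. It returns no real support values, so the `support` field is left empty.

```python
from itertools import product


def run_pipeline(gap_open, trim_threshold):
    """Hypothetical stand-in for one full analysis run.

    A real version would align the sequences, trim the alignment,
    build a tree, and return a summary such as mean branch support.
    """
    return {"gap_open": gap_open, "trim_threshold": trim_threshold, "support": None}


gap_open_values = [5.0, 10.0, 15.0]  # illustrative alignment parameter values
trim_thresholds = [0.1, 0.3, 0.5]  # illustrative gap trimming 'relaxedness'

# One run per combination: every treatment is recorded with its settings.
results = [
    run_pipeline(gap_open, trim)
    for gap_open, trim in product(gap_open_values, trim_thresholds)
]

for row in results:
    print(row)
```

Because each result carries the settings that produced it, the table of runs can be plotted or tested directly, which is what makes the pipeline an experiment rather than a single unrepeatable analysis.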

  30. What is minimally
    required for
    reproducibility?

  31. What is minimally required for
    reproducibility?
    Should we really be
    aiming for minimal?
    archive it all

  32. Computational pipelines
    make complete
    reproducibility as easy as
    minimal reproducibility
    Only human users are concerned with minimal
    reproducibility

  33. Computational
    pipelines make this
    trivial
    All these things are done automatically
    “Frictionless” reproducibility
    How do I do this?
    Reproducible phylogenetics
    All these challenges are solved problems for
    computer scientists

  34. ReproPhylo
    reproducible phylogenetics environment
    v1.0

  35. ReproPhylo
    v1.0
    • Open phylogenetics environment
    • Uses standards
    • Frictionless reproducibility
    • Platform independent
    • Fast
    Software: http://hulluni-bioinformatics.github.io/ReproPhylo
    Manual: http://goo.gl/aZeRXf
    Users welcome!

  36. ReproPhylo is an environment and approach, not
    phylogenetic tree building software
    GenBank sequences and metadata
    Your sequences, alignments, trees
    Your metadata

  37. ReproPhylo is an environment and approach, not
    phylogenetic tree building software
    Automatic archiving of ALL
    trees, alignments, sequences, metadata,
    provenance, methods
    Text report of all actions,
    analyses and results
    HTML electronic lab notebook automatically
    written, easy to browse
    Copy and paste Methods section for
    journals
    & journal-friendly zip files

  38. ReproPhylo runs in user-friendly
    IPython notebook
    Analysis pipelines provided
    Edit to specify your data, and modify any
    parameters you wish, then run, inspect, repeat

  39. ReproPhylo runs in user-friendly
    IPython notebook
    Mixture of user manual &
    analysis framework
    change a parameter and
    hit Run

  40. Exploratory Data Analysis
    example
    [Screenshot labels: code, output]

  41. Exploratory Data Analysis
    check this?
    real data
    Dunn et al 2008 doi:10.1038/nature06614

  42. Exploratory Data Analysis
    Dunn et al 2008 doi:10.1038/nature06614
    real data

  43. Metadata is retained
    tree can be labelled, or statistical
    test done, with any data that
    can be harvested from the
    original GenBank file (or any
    other associated data file)
    sponge tree with morphological annotations at tips

  44. Electronic lab book
    Pipeline writes a human-readable text/html file
    documenting the experiment and outcomes
    including Methods section
    Data provenance and version control included
    Easy archiving for journal submission
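
The idea of bundling everything for a journal can be sketched in a few lines. The example below is an illustration only, not ReproPhylo's implementation: it packs a set of files into a zip archive together with a checksum manifest, using the Python standard library. The file names and contents are made up, and `build_archive` and `MANIFEST.json` are hypothetical names.

```python
import hashlib
import io
import json
import zipfile


def build_archive(files):
    """Pack files (archive path -> bytes) into zip bytes with a manifest.

    The manifest maps each path to the SHA-256 of its contents so a
    reader can confirm that nothing changed after archiving.
    """
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in sorted(files.items()):
            archive.writestr(name, data)
        archive.writestr("MANIFEST.json", json.dumps(manifest, indent=2, sort_keys=True))
    return buffer.getvalue()


# Made-up example content standing in for real pipeline outputs.
example_files = {
    "alignment.fasta": b">raccoon\nATGC\n>bear\nATGA\n",
    "tree.nwk": b"(raccoon:1.0,bear:1.0);",
    "methods.txt": b"Placeholder methods text.\n",
}

archive_bytes = build_archive(example_files)
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    names = archive.namelist()
    manifest = json.loads(archive.read("MANIFEST.json"))

print(names)
```

Writing to an in-memory buffer keeps the sketch self-contained; a real pipeline would write the archive to disk alongside its report.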

  45. ReproPhylo writes very extensive Results automatically
    alignment statistics

  46. ReproPhylo opens new doors
    more than reproducibility
    Allows experimental hypothesis-testing
    phylogenomics
    ReproPhylo is an environment & approach, not
    a tree building algorithm

  47. ReproPhylo
    and molecular evolution
    Similar approach gives
    reproducible, comparative
    evolutionary genomics
    Amir Szitenberg
    Comparative genomics of
    transposon evolution
    Friday 11.20

  48. Reproducible
    Phylogenetics
    Dave Lunt, Amir Szitenberg, Max John, Mark Blaxter
    ReproPhylo
    slides available: speakerdeck.com/davelunt
    software: http://hulluni-bioinformatics.github.io/ReproPhylo
