Galaxy Community Conference 2017: Evolution of Galaxy.

A somewhat accurate look back on the Galaxy project (with Anton Nekrutenko).

James Taylor

June 29, 2017

Transcript

  1. Evolution of Galaxy:
    A rough timeline
    @jxtx / @nekrut / @galaxyproject / #usegalaxy

  2. but before we begin...

  3. Dave Clements
    Marilyne Summo
    Patricia Laplagne
    Olivier Inizan
    Christophe Caron
    Gildas Le Corguillé
    Jean François Duffayard
    Nicole Vaslievsky
    Gautier Sarah
    Frédéric de Lamotte
    Virginie Rossard

  4. Before 2005...

  5. The legend goes like this: in 1980 Webb Miller and
    Eugene Myers asked Ross Hardison if there was
    anything interesting to do in biology...
    Ross Hardison Webb Miller Eugene Myers

  6. ...by the early 2000s the big data in biology was
    genomic sequences and alignments. Penn State
    was central in developing alignment tools
    Webb Miller Ross Hardison

  7. Webb Miller Ross Hardison
    ...by the early 2000s the big data in biology was
    genomic sequences and alignments. Penn State
    was central in developing alignment tools

  8. The basic question in the early 2000s was:
    What is aligned to what and does it overlap with
    anything interesting?

  9. GALA let users query annotation information from the human genome alongside alignments with
    the mouse genome, was integrated with the UCSC browser, and allowed building up set queries from
    the results of previous queries (the birth of the History system)
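
The idea of building set queries from earlier results can be sketched in a few lines of Python. This is only an illustration: the region identifiers, the `record` and `combine` helpers, and the in-memory `history` list are all hypothetical and are not taken from GALA, which ran its queries against a database.

```python
# Hypothetical sketch: each query result is kept in an ordered history,
# and later queries are built from earlier entries with set operations.
history = []  # results of earlier queries, oldest first


def record(result):
    """Store a query result and return its position in the history."""
    history.append(frozenset(result))
    return len(history) - 1


def combine(op, first, second):
    """Build a new history entry from two earlier ones."""
    a, b = history[first], history[second]
    ops = {"and": a & b, "or": a | b, "minus": a - b}
    return record(ops[op])


# Made-up region IDs standing in for annotated genomic intervals.
exons = record({"r1", "r2", "r3"})
conserved = record({"r2", "r3", "r4"})  # e.g. regions aligned to mouse
both = combine("and", exons, conserved)
only_exons = combine("minus", exons, both)

print(sorted(history[both]), sorted(history[only_exons]))
```

Because every result stays addressable by its position, a later query can reuse any earlier step without recomputing it.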

  10. Can GALA be extended to other analyses?

  11. Galaxy as a Perl script (~50,000 lines)

  12. We threw the first one away (quickly) and rewrote
    from scratch in Python
    At this point we made several key design decisions
    that (in hindsight) determined whether we would
    succeed or fail
    (We got very lucky)

  13. 1. No longer store data in a database, but in flat
    files in various common formats
    This meant existing tools could be integrated
    easily because they did not need to change the
    data formats they work with or interact with a
    database
    It also meant that when high-throughput sequence
    data suddenly came along (2005), we were
    prepared to deal with data at that scale easily

  14. 2. Rather than build new analysis tools in the system,
    build an abstract, configuration-driven interface to
    command-line tools
    We did this to make our lives easier: we had many analysis
    tools lying around that we didn’t want to rewrite for Galaxy
    But this was equally appealing to other developers, who
    could now easily make their tools available to biologists
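
A minimal sketch of what a configuration-driven interface to a command-line tool can look like is shown below. The `SORT_TOOL` dictionary, its field names, and the `build_command` helper are assumptions made for illustration; they are not Galaxy's actual tool configuration format.

```python
import shlex
from string import Template

# Hypothetical tool description: the tool itself is untouched, and the
# system only needs this declaration to drive it.
SORT_TOOL = {
    "id": "sort_lines",
    "command": "sort -k $column $input",
    "params": {"column": int, "input": str},
}


def build_command(tool, values):
    """Check values against the declared parameters and fill the template."""
    checked = {}
    for name, kind in tool["params"].items():
        if name not in values:
            raise KeyError(f"missing parameter: {name}")
        # Convert to the declared type, then quote for safe shell use.
        checked[name] = shlex.quote(str(kind(values[name])))
    return Template(tool["command"]).substitute(checked)


command = build_command(SORT_TOOL, {"column": 2, "input": "my data.tsv"})
print(command)
```

Under this approach, adding a tool means writing a new description rather than new code, which is the property the slide credits for attracting other developers.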

  15. Early pythonic Galaxy circa 2005

  16. Dan Blankenberg and the letter P

  17. 3. Make the entire stack self-contained, allowing a
    complete Galaxy to be set up on most systems in minutes
    We primarily did this to engage tool developers, making it
    as easy as possible to develop new tool wrappers for
    contribution
    We envisioned those tools would all be made available
    through the main Galaxy service
    But it also provided a scaling strategy, making it easy for
    sites to run their own Galaxy

  18. 4. Open-source and openly developed from the first commit
    Provide everything we do under a liberal open-source license
    (no copyleft), and only support open-source tools on the main
    instance
    Our primary development repository is exposed to the public,
    initially hosted by us but later moved out to third parties
    (bitbucket.org, and then github.com)
    The software is distributed only through version control, with a
    rapid release cycle (at least monthly)

  19. Connection with UCSC

  20. 2005: NGS begins: it's raining cats and dogs

  21. The basic question in the late 2000s becomes:
    What would happen if I sequence the s****t out of
    anything?*
    *For metagenomic studies this was, in fact, precisely the question asked.

  22. 2007
    No workflows?!
    I thought Galaxy
    was a workflow
    system...

  23. 2010- The modern Galaxy era

  24. Best thing about the introduction of the ToolShed:
    Birth of the Intergalactic Utilities Commission

  25. Also this happened...

  26. The great flood of 2012

  27. The great flood of 2012
    Your data here

  28. ...In which Anton almost loses his job

  29. Stability and sustainability crisis!

  30. ...In which [names shown as images on the slide] save main
    2013

  31. The community established itself and the
    evolutionary timeline accelerated! The only bad
    thing about it is that it is hard to put things in
    chronological order from memory without using
    git log

  32. This included many things covered today and
    tomorrow including:
    - Visualizations beyond Trackster
    - Expansion beyond genomics
    - Massive tool suite contributions and updates
    - Interactive environments
    - Training & Tours
    - … uhhh … so much more

  33. More today and tomorrow!
    Stay tuned.

  34. evolution of biology

  35. The problem is that PubMed indexes Nature and
    Science, which are scientific journals with broad
    subject coverage

  36. Being in the sciences, let’s invent an index:
    G_i = true Galaxy pubs / false Galaxy pubs
    2005: G_i = 1/15
    2017: G_i = 11/14
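
As a worked example, the index (true Galaxy publications divided by false ones) can be computed for the two years quoted on the slide. The `galaxy_index` function name is made up for this sketch; only the counts 1/15 and 11/14 come from the slide.

```python
from fractions import Fraction


def galaxy_index(true_pubs, false_pubs):
    """Ratio of true Galaxy publications to false ones, kept exact."""
    return Fraction(true_pubs, false_pubs)


g_2005 = galaxy_index(1, 15)
g_2017 = galaxy_index(11, 14)

# Decimal values for each year, and the factor by which the index grew.
print(round(float(g_2005), 3), round(float(g_2017), 3))
print(g_2017 / g_2005)
```

With these counts the index rises from about 0.067 to about 0.786, roughly a twelvefold increase.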

  37. but seriously...

  38. from the beginning we tried to focus on biology

  39. ...we even invented a study just to do analyses in
    Galaxy

  40. “Just so you know, you've got a lot of really rare specimens
    preserved here”

  41. evolution of community

  42. In 2006 Ross Lazarus was the Galaxy community

  43. He suggested having a conference

  44. A conference needed T-shirts

  45. Hans-Rudolf Hotz: The European connection!

  46. First Galaxy Developer Community Conference

  47. Björn Grüning is connected to the Internet directly
    (probably born with 802.11 circuitry. His particular hardware
    version lacks sleep functionality)

  48. And now the meeting starts...

  49. Community activity grows immensely...

  50. galaxyproject/galaxy
    Contributors with ≥ 10 commits excluding galaxy team

  51. galaxyproject/tools-iuc
    Contributors with ≥ 10 commits excluding galaxy team

  52. galaxyproject/training-materials
    Contributors with ≥ 10 commits excluding galaxy team

  53. Looking forward...

  54. evolution across analysis scales

  55. Analysis Process Phase
    (exploratory) (batch)
    Analysis Scale

  56. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    2006 Galaxy:
    Batch analysis of 10s of datasets

  57. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch

  58. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch

  59. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch
    2008 Galaxy:
    Workflows: 100s of datasets

  60. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch

  61. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch
    2017 Galaxy:
    10k - 100k datasets

  62. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?

  63. We need better ways to look at, think about, and
    manage datasets at the 100k scale.
    At some point users no longer care about seeing the
    individual history or workflow, just specific results.
    New: many-workflow view, for monitoring the
    execution of many workflows in parallel
    New: reports — generate summaries of executing
    workflows, multiple workflows, from user templates
    with continuous updates

  64. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?

  65. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?

  66. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?
    Interactive Environments:
    10s of datasets, ad hoc analyses

  67. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?
    ad hoc,
    more flexible

  68. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?
    ad hoc,
    more flexible
    Visualization and analytics
    10s of datasets, highly interactive

  69. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?
    ad hoc,
    more flexible
    visual
    exploration
    ?

  70. We need to support exploratory data analysis even
    more than we do now
    Dataset complexity, heterogeneity, and dimensionality
    are all only increasing
    The analysis decision process requires more support
    for data exploration, both visual and interactive data
    manipulation

  71. The future Galaxy needs to scale seamlessly across the
    data analysis process…
    …supporting analysts as they transition from
    exploratory, to batch, to high-throughput

  72. At either end of the spectrum, there are common
    themes.
    The future Galaxy embraces real-time and continuous
    communication. From exploratory analysis to batch
    job tracking to automatic reports, Galaxy needs to be
    responsive and informative.
    The future Galaxy is increasingly interactive
    The future Galaxy better supports transitions between
    analysis modes.

  73. So what is the Galaxy team?

  74. The future of this project depends
    solely on the community, its
    openness, and continuing outreach!

  75. Thank you!
    Now please stay for talks that actually
    contain useful information!
