Galaxy Community Conference 2017: Evolution of Galaxy.

A somewhat accurate look back on the Galaxy project (with Anton Nekrutenko).

James Taylor

June 29, 2017

Transcript

  1. Evolution of Galaxy:
    A rough timeline
    @jxtx / @nekrut / @galaxyproject / #usegalaxy

  2. but before we begin...

  3. Dave Clements
    Marilyne Summo
    Patricia Laplagne
    Olivier Inizan
    Christophe Caron
    Gildas Le Corguillé
    Jean François Duffayard
    Nicole Vaslievsky
    Gautier Sarah
    Frédéric de Lamotte
    Virginie Rossard

  4. Before 2005...

  6. The legend goes like this: in 1980 Webb Miller and
    Eugene Myers asked Ross Hardison if there was
    anything interesting to do in biology...
    Ross Hardison, Webb Miller, Eugene Myers

  7. ...by the early 2000s the big data in biology was
    genomic sequences and alignments. Penn State
    was central in developing alignment tools
    Webb Miller, Ross Hardison

  9. The basic question in the early 2000s was:
    What is aligned to what and does it overlap with
    anything interesting?
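
The question on this slide boils down to an interval-overlap test. Below is a minimal Python sketch, not anything taken from GALA or Galaxy: the `Interval` type, the function names, and the coordinates are all made up for illustration, and it assumes half-open, 0-based coordinates like those used by the BED format.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval (hypothetical type, for illustration only)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def annotate(alignments, features):
    """Pair every alignment with every feature it overlaps (naive O(n*m) scan)."""
    return [(a, f) for a in alignments for f in features if overlaps(a, f)]


# Made-up example data: aligned regions and "interesting" annotated features.
alignments = [Interval("chr1", 100, 200), Interval("chr2", 50, 80)]
features = [Interval("chr1", 150, 300), Interval("chr1", 200, 250)]

hits = annotate(alignments, features)
```

With these inputs only the first alignment and the first feature are paired; the second feature starts exactly where the alignment ends, so under half-open coordinates they do not overlap.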

  10. 2003

  12. GALA enabled querying annotation information from the human genome alongside alignments with
    the mouse genome. It was integrated with the UCSC browser and allowed building up set queries from
    the results of previous queries (the birth of the History system)

  13. Can GALA be extended to other analyses?

  16. Galaxy as a Perl script (~50,000 lines)

  17. We threw the first one away (quickly) and rewrote
    from scratch in Python
    At this point we made several key design decisions
    that (in hindsight) determined whether we would
    succeed or fail
    (We got very lucky)

  18. 1. No longer store data in a database, but in flat
    files in various common formats
    This meant existing tools could be integrated
    easily because they did not need to change the
    data formats they work with or interact with a
    database
    It also meant that when high-throughput sequence
    data suddenly came along (2005), we were
    prepared to deal with data at that scale easily

  19. 2. Rather than build new analysis tools in the system,
    build an abstract, configuration-driven interface to
    command line tools
    We did this to make our lives easier: we had many analysis
    tools lying around that we didn’t want to rewrite for Galaxy
    But this was equally appealing to other developers, who
    could now easily make their tools available to biologists
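
To make the idea of a configuration-driven interface concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not Galaxy's implementation: the XML layout is a simplified, made-up tool description rather than Galaxy's actual tool schema, `load_tool` and `build_command` are hypothetical helpers, and the standard library's `string.Template` stands in for whatever templating the real system uses.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical, simplified tool description; NOT Galaxy's real schema.
TOOL_XML = """
<tool id="line_count" name="Count lines">
  <command>wc -l $input</command>
  <inputs>
    <param name="input" type="data" label="Dataset to count"/>
  </inputs>
</tool>
"""


def load_tool(xml_text):
    """Parse a tool description into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "name": root.get("name"),
        "command": root.findtext("command").strip(),
        "params": [p.get("name") for p in root.iter("param")],
    }


def build_command(tool, values):
    """Fill the command template with shell-quoted user-supplied values."""
    missing = [p for p in tool["params"] if p not in values]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    quoted = {key: shlex.quote(str(value)) for key, value in values.items()}
    return Template(tool["command"]).substitute(quoted)


tool = load_tool(TOOL_XML)
command_line = build_command(tool, {"input": "reads 1.txt"})
```

The point of the design is that the wrapped program (`wc` here) never changes: a form can be generated from the declared parameters, and the same description yields the command line to run.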

  20. Early pythonic Galaxy circa 2005

  21. Dan Blankenberg and the letter P

  22. 3. Make the entire stack self-contained, allowing a
    complete Galaxy to be set up on most systems in minutes
    We primarily did this to engage tool developers, making it
    as easy as possible to develop new tool wrappers for
    contribution
    We envisioned those tools would all be made available
    through the main Galaxy service
    But it also provided a scaling strategy, making it easy for
    sites to run their own Galaxy

  23. 4. Open-source and openly developed from the first commit
    Provide everything we do under a liberal open-source license
    (no copyleft), and only support open-source tools on the main
    instance
    Our primary development repository is exposed to the public,
    initially hosted by us but later moved to third parties
    (bitbucket.org, and then github.com)
    The software is distributed only through version control, with a
    rapid release cycle (at least monthly)

  24. Connection with UCSC

  27. 2005: NGS begins: it's raining cats and dogs

  28. The basic question in the late 2000s becomes:
    What would happen if I sequence the s****t out of
    anything?*
    *For metagenomic studies this was, in fact, precisely the question asked.

  29. 2005

  30. 2006

  31. 2007

  32. 2007
    No workflows?!
    I thought Galaxy
    was a workflow
    system...

  33. 2007

  34. 2008

  35. 2010- The modern Galaxy era

  36. 2010

  37. 2010

  38. 2010

  42. Best thing about the introduction of the ToolShed:
    Birth of the Intergalactic Utilities Commission

  43. 2012

  45. Also this happened...

  47. The great flood of 2012

  48. The great flood of 2012
    Your data here

  49. ...In which Anton almost loses his job

  50. Stability and sustainability crisis!

  51. 2013

  52. 2013
    ...In which [names missing from the transcript] save main

  53. 2013

  54. 2014

  56. The community established itself and the
    evolutionary timeline accelerated! The only bad
    thing about it is that it is hard to put things in
    chronological order from memory without using
    git log

  57. This included many things covered today and
    tomorrow including:
    - Visualizations beyond Trackster
    - Expansion beyond genomics
    - Massive tool suite contributions and updates
    - Interactive environments
    - Training & Tours
    - … uhhh … so much more

  58. More today and tomorrow!
    Stay tuned.

  59. evolution of biology

  61. The problem is that PubMed indexes Nature and
    Science, which are scientific journals with broad
    subject coverage

  62. Being in the sciences, let’s invent an index:
    G_i = true Galaxy pubs / false Galaxy pubs
    2005: G_i = 1/15
    2017: G_i = 11/14
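
For anyone who wants the index on this slide as decimals, a tiny Python sketch follows. The function name `galaxy_index` is made up; the counts are the ones quoted on the slide, and reading "true" as publications about the Galaxy platform and "false" as other search hits for the word is an assumption.

```python
from fractions import Fraction


def galaxy_index(true_pubs: int, false_pubs: int) -> Fraction:
    """Ratio of true to false Galaxy publications, kept exact as a fraction."""
    return Fraction(true_pubs, false_pubs)


g_2005 = galaxy_index(1, 15)   # counts quoted on the slide
g_2017 = galaxy_index(11, 14)  # counts quoted on the slide
growth = g_2017 / g_2005       # fold change between the two years

print(f"2005: {float(g_2005):.3f}")
print(f"2017: {float(g_2017):.3f}")
print(f"fold change: {float(growth):.1f}")
```

Taken at face value, the index rises from roughly 0.07 to roughly 0.79, a bit under a twelvefold increase.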

  63. but seriously...

  65. from the beginning we tried to focus on biology

  67. ...we even invented a study just to do analyses in
    Galaxy

  69. “Just so you know, you've got a lot of really rare specimens
    preserved here”

  71. … and

  73. evolution of community

  74. In 2006, Ross Lazarus was the Galaxy community

  75. He suggested having a conference

  76. A conference needed T-shirts

  78. Hans-Rudolf Hotz: The European connection!

  80. First Galaxy Developer Community Conference

  83. Björn Grüning is connected to the Internet directly
    (probably born with 802.11 circuitry. His particular hardware
    version lacks sleep functionality)

  91. And now the meeting starts...

  92. Community activity grows immensely...

  93. galaxyproject/galaxy
    Contributors with ≥ 10 commits, excluding the Galaxy team

  94. galaxyproject/tools-iuc
    Contributors with ≥ 10 commits, excluding the Galaxy team

  95. galaxyproject/training-materials
    Contributors with ≥ 10 commits, excluding the Galaxy team

  96. Looking forward...

  97. evolution across analysis scales

  98. Analysis Process Phase
    (exploratory) (batch)
    Analysis Scale

  99. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    2006 Galaxy:
    Batch analysis of 10s of datasets

  100. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch

  102. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch
    2008 Galaxy:
    Workflows: 100s of datasets

  103. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch

  104. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch
    2017 Galaxy:
    10k - 100k datasets

  105. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?

  106. We need better ways to look at, think about, and
    manage datasets at the 100k scale.
    At some point users no longer care about seeing the
    individual history or workflow, just specific results.
    New: many workflow view, for monitoring the
    execution of many workflows in parallel
    New: reports — generate summaries of executing
    workflows, multiple workflows, from user templates
    with continuous updates

  109. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?
    Interactive Environments:
    10s of datasets, ad hoc analyses

  110. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?
    ad hoc,
    more flexible

  111. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?
    ad hoc,
    more flexible
    Visualization and analytics
    10s of datasets, highly interactive

  112. Analysis Scale
    Analysis Process Phase
    (exploratory) (batch)
    10s, batch 100s, batch 100k, batch
    ?
    ad hoc,
    more flexible
    visual
    exploration
    ?

  113. We need to support exploratory data analysis even
    more than we do now
    Dataset complexity, heterogeneity, and dimensionality
    are all only increasing
    The analysis decision process requires more support
    for data exploration, both visual and interactive data
    manipulation

  114. The future Galaxy needs to scale seamlessly across the
    data analysis process…
    …supporting analysts as they transition from
    exploratory, to batch, to high-throughput

  115. At either end of the spectrum, there are common
    themes.
    The future Galaxy embraces real time and continuous
    communication. From exploratory analysis to batch
    job tracking to automatic reports, Galaxy needs to be
    responsive and informative.
    The future Galaxy is increasingly interactive
    The future Galaxy better supports transitions between
    analysis modes.

  116. So what is the Galaxy team?

  118. The future of this project depends
    solely on the community, its
    openness, and continuing outreach!

  119. Thank you!
    Now please stay for talks that actually
    contain useful information!
