Galaxy Community Conference 2017: Evolution of Galaxy.

A somewhat accurate look back on the Galaxy project (with Anton Nekrutenko).

James Taylor

June 29, 2017

Transcript

  1. Evolution of Galaxy: A rough timeline @jxtx / @nekrut / @galaxyproject / #usegalaxy
  2. but before we begin...

  3. Dave Clements, Marilyne Summo, Patricia Laplagne, Olivier Inizan, Christophe Caron, Gildas Le Corguillé, Jean François Duffayard, Nicole Vaslievsky, Gautier Sarah, Frédéric de Lamotte, Virginie Rossard
  4. Before 2005...

  5. None
  6. The legend goes like this: in 1980 Webb Miller and Eugene Myers asked Ross Hardison if there was anything interesting to do in biology... [Photo captions: Ross Hardison, Webb Miller, Eugene Myers]
  7. ...by the early 2000s, the big data in biology was genomic sequences and alignments. Penn State was central in developing alignment tools. [Photo captions: Webb Miller, Ross Hardison]
  8. ...by the early 2000s, the big data in biology was genomic sequences and alignments. Penn State was central in developing alignment tools. [Photo captions: Webb Miller, Ross Hardison]
  9. The basic question in the early 2000s was: What is aligned to what, and does it overlap with anything interesting?
  10. 2003

  11. None
  12. GALA enabled querying annotation information from the human genome alongside alignments with the mouse genome, was integrated with the UCSC browser, and allowed building up set queries using the results of previous queries (the birth of the History system).
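To make "set queries using the results of previous queries" concrete, here is a minimal sketch of the idea. It is not GALA's or Galaxy's actual code: the `History` class, its methods, and the feature identifiers are all hypothetical, chosen only for illustration.

```python
class History:
    """Ordered record of query results that later queries can refer to."""

    def __init__(self):
        self.items = []  # list of (label, frozenset of feature IDs)

    def add(self, label, result):
        """Store a result and return its 1-based step number."""
        self.items.append((label, frozenset(result)))
        return len(self.items)

    def get(self, step):
        """Return the stored result set for a given step number."""
        return self.items[step - 1][1]


history = History()
# Made-up feature identifiers, purely for illustration.
s1 = history.add("annotated as exon", {"f1", "f2", "f3"})
s2 = history.add("aligned with mouse", {"f2", "f3", "f4"})
# New queries built from the results of steps 1 and 2.
s3 = history.add("step 1 AND step 2", history.get(s1) & history.get(s2))
s4 = history.add("step 2 NOT step 1", history.get(s2) - history.get(s1))
print(sorted(history.get(s3)))
print(sorted(history.get(s4)))
```

Because every result stays addressable by its step number, a later query can combine earlier ones without rerunning them, which is the property the slide credits as the origin of the History system.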
  13. Can GALA be extended to other analyses?

  14. None
  15. None
  16. Galaxy as a Perl script (~50,000 lines)

  17. We threw the first one away (quickly) and rewrote from scratch in Python. At this point we made several key design decisions that (in hindsight) determined whether we would succeed or fail. (We got very lucky.)
  18. 1. No longer store data in a database, but in flat files in various common formats. This meant existing tools could be integrated easily, because they did not need to change the data formats they work with or interact with a database. It also meant that when high-throughput sequence data suddenly came along (2005), we were prepared to deal with data at that scale easily.
  19. 2. Rather than build new analysis tools in the system, build an abstract, configuration-driven interface to command-line tools. We did this to make our lives easier: we had many analysis tools lying around that we didn’t want to rewrite for Galaxy. But this was equally appealing to other developers, who could now easily make their tools available to biologists.
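As a rough illustration of what "configuration driven" can mean here, the sketch below describes a tool purely as data and uses one generic function to turn that description plus user-supplied values into a command line. This is not Galaxy's actual tool format or code: `TOOL_CONFIG`, `build_command`, and the `sort` example are assumptions made for illustration.

```python
import shlex
from string import Template

# Hypothetical tool description: pure data, no tool-specific code.
TOOL_CONFIG = {
    "id": "sort_lines",
    "command": "sort -k $column $input -o $output",
    "params": {"column": int, "input": str, "output": str},
}


def build_command(config, values):
    """Validate values against the config and return an argv list."""
    missing = set(config["params"]) - set(values)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    # Coerce each value to its declared type, then quote it for the shell.
    quoted = {
        name: shlex.quote(str(kind(values[name])))
        for name, kind in config["params"].items()
    }
    rendered = Template(config["command"]).substitute(quoted)
    return shlex.split(rendered)


argv = build_command(
    TOOL_CONFIG, {"column": "2", "input": "genes.bed", "output": "sorted.bed"}
)
print(argv)
```

Under this scheme, adding a tool means writing another description like `TOOL_CONFIG` rather than changing the system. The wrapped program still reads and writes its usual files.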
  20. Early pythonic Galaxy circa 2005

  21. Dan Blankenberg and the letter P

  22. 3. Make the entire stack self-contained, allowing a complete Galaxy to be set up on most systems in minutes. We primarily did this to engage tool developers, making it as easy as possible to develop new tool wrappers for contribution. We envisioned those tools would all be made available through the main Galaxy service. But it also provided a scaling strategy, making it easy for sites to run their own Galaxy.
  23. 4. Open-source and openly developed from the first commit. Provide everything we do under a liberal open-source license (no copyleft), and only support open-source tools on the main instance. Our primary development repository is exposed to the public, initially hosted by us but later moved out to third parties (bitbucket.org, and then github.com). The software is distributed only through version control, with a rapid release cycle (at least monthly).
  24. Connection with UCSC

  25. None
  26. None
  27. 2005: NGS begins: it's raining cats and dogs

  28. The basic question in the late 2000s becomes: What would happen if I sequence the s****t out of anything?* (*For metagenomic studies this was, in fact, precisely the question asked.)
  29. 2005

  30. 2006

  31. 2007

  32. 2007 No workflows?! I thought Galaxy was a workflow system...

  33. 2007

  34. 2008

  35. 2010- The modern Galaxy era

  36. 2010

  37. 2010

  38. 2010

  39. None
  40. None
  41. None
  42. Best thing about the introduction of the ToolShed: Birth of

    the Intergalactic Utilities Commission
  43. 2012

  44. None
  45. Also this happened...

  46. None
  47. The great flood of 2012

  48. The great flood of 2012 Your data here

  49. ...In which Anton almost loses his job

  50. Stability and sustainability crisis!

  51. 2013

  52. ...In which Save main 2013 , , and ,

  53. 2013

  54. 2014

  55. None
  56. The community established itself and the evolutionary timeline accelerated! The only bad thing about it is that it is hard to put things in chronological order from memory without using git log.
  57. This included many things covered today and tomorrow, including: visualizations beyond Trackster; expansion beyond genomics; massive tool suite contributions and updates; interactive environments; training & tours; … uhhh … so much more
  58. More today and tomorrow! Stay tuned.

  59. evolution of biology

  60. None
  61. The problem is that PubMed indexes Nature and Science, which are scientific journals with broad subject coverage
  62. Being in the sciences, let’s invent an index: G_i = true Galaxy pubs / false Galaxy pubs. 2005: G_i = 1/15. 2017: G_i = 11/14.
  63. but seriously...

  64. None
  65. from the beginning we tried to focus on biology

  66. None
  67. ...we even invented a study just to do analyses in

    Galaxy
  68. None
  69. “Just so you know, you've got a lot of really

    rare specimens preserved here”
  70. None
  71. … and

  72. None
  73. evolution of community

  74. In 2006 Ross Lazarus was the Galaxy community

  75. He suggested having a conference

  76. A conference needed T-shirts

  77. None
  78. Hans-Rudolf Hotz: The European connection!

  79. None
  80. First Galaxy Developer Community Conference

  81. None
  82. None
  83. Björn Grüning is connected to the Internet directly (probably born with 802.11 circuitry; his particular hardware version lacks sleep functionality)
  84. None
  85. None
  86. None
  87. None
  88. None
  89. None
  90. None
  91. And now the meeting starts...

  92. Community activity grows immensely...

  93. galaxyproject/galaxy Contributors with ≥ 10 commits excluding galaxy team

  94. galaxyproject/tools-iuc Contributors with ≥ 10 commits excluding galaxy team

  95. galaxyproject/training-materials Contributors with ≥ 10 commits excluding galaxy team

  96. Looking forward...

  97. evolution across analysis scales

  98. Figure axes: Analysis Scale; Analysis Process Phase (exploratory, batch)

  99. Figure: 2006 Galaxy: batch analysis of 10s of datasets
  100. Figure: 10s, batch

  101. Figure: 10s, batch

  102. Figure: 10s, batch. 2008 Galaxy: Workflows: 100s of datasets
  103. Figure: 10s, batch; 100s, batch
  104. Figure: 10s, batch; 100s, batch. 2017 Galaxy: 10k - 100k datasets
  105. Figure: 10s, batch; 100s, batch; 100k, batch ?
  106. We need better ways to look at, think about, and manage datasets at the 100k scale. At some point users no longer care about seeing the individual history or workflow, just specific results. New: many-workflow view, for monitoring the execution of many workflows in parallel. New: reports (generate summaries of executing workflows, multiple workflows, from user templates with continuous updates).
  107. Figure: 10s, batch; 100s, batch; 100k, batch ?
  108. Figure: 10s, batch; 100s, batch; 100k, batch ?
  109. Figure: 10s, batch; 100s, batch; 100k, batch ? Interactive Environments: 10s of datasets, ad hoc analyses
  110. Figure: 10s, batch; 100s, batch; 100k, batch ?; ad hoc, more flexible
  111. Figure: 10s, batch; 100s, batch; 100k, batch ?; ad hoc, more flexible. Visualization and analytics: 10s of datasets, highly interactive
  112. Figure: 10s, batch; 100s, batch; 100k, batch ?; ad hoc, more flexible; visual exploration ?
  113. We need to support exploratory data analysis even more than we do now. Dataset complexity, heterogeneity, and dimensionality are all only increasing. The analysis decision process requires more support for data exploration, both visual and interactive data manipulation.
  114. The future Galaxy needs to scale seamlessly across the data analysis process, supporting analysts as they transition from exploratory, to batch, to high-throughput
  115. At either end of the spectrum, there are common themes. The future Galaxy embraces real-time and continuous communication. From exploratory analysis to batch job tracking to automatic reports, Galaxy needs to be responsive and informative. The future Galaxy is increasingly interactive. The future Galaxy better supports transitions between analysis modes.
  116. So what is the Galaxy team?

  117. None
  118. The future of this project depends solely on the community,

    its openness, and continuing outreach!
  119. Thank you! Now please stay for talks that actually contain

    useful information!