
Galaxy Keynote at LSU

A presentation at the LSU 3rd Annual Bioinformatics Conference


Anton Nekrutenko

April 17, 2015


  1. Building data analysis ecosystem in life sciences with Galaxy @galaxyproject

    / #usegalaxy http://www.galaxyproject.org
  2. A continuing crisis in genomics research: reproducibility

  3. What is reproducibility? (for computational analyses) Reproducibility is not provenance,

    reusability/generalizability, or correctness. Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data). Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible). Missing software, versions, parameters, data…
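The items this slide lists as missing (software, versions, parameters, data) can be recorded mechanically at run time. The following is a minimal Python sketch of that idea, not Galaxy's actual implementation; the function name `capture_run_record`, the record layout, and the example tool name and version are all hypothetical.

```python
import hashlib
import json
import platform


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest that identifies the exact input data."""
    return hashlib.sha256(data).hexdigest()


def capture_run_record(tool: str, version: str, parameters: dict, inputs: dict) -> dict:
    """Build a record of what is needed to rerun one analysis step.

    `inputs` maps a dataset name to its raw bytes; only checksums are stored.
    """
    return {
        "tool": tool,
        "version": version,
        "parameters": dict(sorted(parameters.items())),
        "input_checksums": {
            name: sha256_of_bytes(data) for name, data in sorted(inputs.items())
        },
        "python": platform.python_version(),
    }


record = capture_run_record(
    tool="example-mapper",      # hypothetical tool name
    version="0.0.1",            # hypothetical version string
    parameters={"q": 15, "n": 3},
    inputs={"reads.fastq": b"@r1\nACGT\n+\nIIII\n"},
)
print(json.dumps(record, indent=2))
```

Storing checksums rather than the data keeps the record small while still letting a later reader confirm that they hold the same input.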
  4. Reproducibility ≈ Engine efficiency Schwarz 2015 (DOI: 10.1126/science.aaa3276)

  5. Reproducibility Project: Cancer Biology Independently replicating 50 “high-impact” cancer studies

    from 2010-2012 (https://osf.io/e81xl/wiki/home/)
  6. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa;

    Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 32/127 tools 6/41 papers
  7. None
  8. #METHODSMATTER

    [Figure 1: Frequency fluctuation for site 8992 across software versions 5.2 through 6.1, comparing default settings with -n 3 -q 15; frequency axis spans 0.480 to 0.510]
  9. Example: A tale of two Science papers

  10. Paper 1

  11. All you need for reproducing is here (Fig. 2)

  12. Paper 2

  13. None
  14. None
  15. None
  16. None
  17. Genomic signatures to guide the use of chemotherapeutics Anil Potti1,2,

    Holly K Dressman1,3, Andrea Bild1,3, Richard F Riedel1,2, Gina Chan4, Robyn Sayer4, Janiel Cragun4, Hope Cottrill4, Michael J Kelley2, Rebecca Petersen5, David Harpole5, Jeffrey Marks5, Andrew Berchuck1,6, Geoffrey S Ginsburg1,2, Phillip Febbo1–3, Johnathan Lancaster4 & Joseph R Nevins1–3

    Using in vitro drug sensitivity data coupled with Affymetrix microarray data, we developed gene expression signatures that predict sensitivity to individual chemotherapeutic drugs. Each signature was validated with response data from an independent set of cell line studies. We further show that many of these signatures can accurately predict clinical response in individuals treated with these drugs. Notably, signatures developed to predict response to individual agents, when combined, could also predict response to multidrug regimens. Finally, we integrated the chemotherapy response signatures with signatures of oncogenic pathway deregulation to identify new therapeutic strategies that make use of all available drugs. The development of gene expression profiles that can predict response to commonly used cytotoxic agents provides opportunities to better use these drugs, including using them in combination with existing targeted therapies.

    Numerous advances have been achieved in the development, selection and application of chemotherapeutic agents, sometimes with remarkable clinical successes, as in the case of treatment for lymphomas or platinum-based therapy for testicular cancers1. In addition, in several instances, combination chemotherapy in the postoperative (adjuvant) setting has been curative. However, most people with advanced solid tumors will relapse and die of their disease. Moreover, administration of ineffective chemotherapy increases the probability of side effects, particularly those from cytotoxic agents, and of a consequent decrease in quality of life1,2. Recent work has demonstrated the value in using biomarkers to select individuals for various targeted therapeutics, including tamoxifen, trastuzumab and imatinib mesylate. In contrast, equivalent tools to select those most likely to respond to the commonly used chemotherapeutic drugs are lacking3. With the goal of developing genomic predictors of chemotherapy sensitivity that could direct the use of cytotoxic agents to those most likely to respond, we combined in vitro drug response data, together with microarray gene expression data, to develop models that could potentially predict responses to various cytotoxic chemotherapeutic drugs4. We now show that these signatures can predict clinical or pathologic response to the corresponding drugs, including combinations of drugs. We further use the ability to predict deregulated oncogenic signaling pathways in tumors to develop a strategy that identifies opportunities for combining chemotherapeutic drugs with targeted therapeutic drugs in a way that best matches the characteristics of the individual.

    RESULTS A gene expression–based predictor of sensitivity to docetaxel To develop predictors of cytotoxic chemotherapeutic drug response, we used an approach similar to previous work analyzing the NCI-60 panel4 from the US National Cancer Institute (NCI). We first identified cell lines that were most resistant or sensitive to docetaxel (Fig. 1a,b) and then genes whose expression correlated most highly with drug sensitivity, and used Bayesian binary regression analysis to develop a model that differentiates a pattern of docetaxel sensitivity from that of resistance. A gene expression signature consisting of 50 genes was identified that classified cell lines on the basis of docetaxel sensitivity (Fig. 1b, right). In addition to leave-one-out cross-validation, we used an independent dataset derived from docetaxel sensitivity assays in a series of 30 lung and ovarian cancer cell lines for further validation. The significant correlation (P < 0.01, log-rank test) between the predicted probability of sensitivity to docetaxel (in both lung and ovarian cell lines) (Fig. 1c, left) and the respective 50% inhibitory concentration (IC50) for docetaxel confirmed the capacity of the docetaxel predictor to predict sensitivity to the drug in cancer cell … © 2011 Nature America, Inc. All rights reserved.
  18. The importance of being reproducible Starting in 2006, Potti published

    papers describing algorithms that take gene-expression data from a cancer cell and predict whether the cancer will be sensitive to a particular therapy Duke began three clinical trials based on the technology enrolling 110 patients
  19. The importance of being reproducible However, Keith Baggerly and Kevin

    Coombes demonstrate that the findings cannot be replicated. Long and difficult fight to get this acknowledged, followed by a series of investigations. So far, ten major paper retractions, all trials cancelled, two lawsuits ongoing…
  20. The importance of being reproducible NCI investigates, demands the software

    for the method be provided Not only could they not replicate the results, the software produced substantially different predictions when run again on the same data! Some scores changed from 5% to 95%, classifications changed ~25% of the time!
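The slide does not say why the software gave different predictions on identical input. One common cause in general is unseeded randomness, and a basic check is to run the same analysis several times and compare the results. The sketch below works under that assumption only; `predict_sensitivity` is a made-up stand-in and not the method that was investigated.

```python
import random


def predict_sensitivity(expression, seed=None):
    """Toy stochastic 'predictor' (hypothetical): averages a random subsample."""
    rng = random.Random(seed)  # seed=None draws entropy from the operating system
    sample = rng.sample(expression, k=3)
    return sum(sample) / len(sample)


def is_repeatable(func, *args, runs=5, **kwargs):
    """Return True if repeated calls with identical inputs give identical output."""
    first = func(*args, **kwargs)
    return all(func(*args, **kwargs) == first for _ in range(runs - 1))


data = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8]
print(is_repeatable(predict_sensitivity, data, seed=42))
```

Calling the same function without a seed will usually, though not always, fail this check, which is the kind of run-to-run drift a reviewer can only detect if the software is available to rerun.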
  21. How does this even pass peer review? DON’T TRUST BLACK

  22. Is reproducibility achievable?

  23. To answer this question we need to understand causes of

    the problem
  24. Who are we dealing with? Users Developers HPC

  25. Users' troubles: - Data logistics - HPC - Poor knowledge

    of existing tools - Inability to develop new tools - Lack of transparency and reproducibility
  26. Developers’ grief: - Limited tool exposure - Parameter picking troubles

    - Data format nightmare - High profile publications
  27. HPC providers’ challenges: - Lack of HPC utilization skills -

    Software is not optimized - HPC is heterogeneous
  28. user HPC dev

  29. user HPC dev

  30. user HPC dev

  31. user HPC dev Galaxy

  32. Galaxy: accessible analysis system

  33. A free (for everyone) web service integrating a wealth of

    tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
  34. Galaxy’s ideological goals: How best can data intensive methods be

    accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
  35. Galaxy’s practical goals: How to arm researchers with access to

    powerful compute and latest tools How to build a community of tool developers How to run Galaxy on any HPC
  36. Galaxy’s goals (an xkcd version) Galaxy no Galaxy

  37. Describe analysis tool behavior abstractly

  38. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details
  39. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically
  40. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically Pervasive sharing, and publication of documents with integrated analysis
  41. Visualization and visual analytics

  42. Ways to use Galaxy The public web service at http://usegalaxy.org

    Install locally with many compute environments Deploy on a cloud using Cloudman Atmosphere
  43. Galaxy in a world of increasingly complex analyses

  44. user HPC dev Galaxy

  45. user HPC dev

  46. We are in the age of multiple datasets

  47. Galaxy’s user interface is designed to be simple and intuitive

    for users without informatics expertise Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?
  48. Users typically use many histories when working with many samples;

    New multiple history view makes working with 100s of histories easy
  49. A not-so-new feature: mapping over multiple datasets However, this breaks

    down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)
  50. Dataset collections: complex combinations of datasets that can be treated

    as a single unit
  51. Dataset Collections Organize user data Individual Datasets Collection Collection Contents

  52. Operations over collections For “list” collections, existing tools can automatically

    be mapped across the entire collection Existing tools that support multiple inputs and one output act as reducers Many existing tools just work; but “structured” collections like “paired” need explicit support in tools
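The map and reduce behavior described on this slide can be illustrated in a few lines. This is a conceptual Python sketch, not Galaxy code: a "list" collection is modeled as a dict from element identifier to dataset contents, and `to_upper` and `concatenate` are hypothetical stand-ins for real tools.

```python
from typing import Callable, Dict, List


def map_over_collection(tool: Callable[[str], str], collection: Dict[str, str]) -> Dict[str, str]:
    """Apply a single-input tool to each element; element identifiers are kept."""
    return {identifier: tool(dataset) for identifier, dataset in collection.items()}


def reduce_collection(tool: Callable[[List[str]], str], collection: Dict[str, str]) -> str:
    """Feed every element to a multi-input, single-output tool."""
    return tool(list(collection.values()))


def to_upper(dataset: str) -> str:
    """Stand-in for a per-sample tool."""
    return dataset.upper()


def concatenate(datasets: List[str]) -> str:
    """Stand-in for a tool that merges many inputs into one output."""
    return "".join(datasets)


samples = {"sample1": "acgt", "sample2": "ttga"}
mapped = map_over_collection(to_upper, samples)
merged = reduce_collection(concatenate, mapped)
print(mapped, merged)
```

Neither stand-in tool knows anything about collections, which mirrors the slide's point that many existing tools just work.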
  53. Map/reduce in workflows More Powerful Workflows Arbitrary # of Inputs

    (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
  54. Enhanced Tuxedo Suite Workflow RNA-Seq workflow based using the Tuxedo

  55. Dataset Collections Extremely flexible for grouping collections of complex datasets,

    can be nested to arbitrary depth, structure is preserved through mapping More complex reductions, other collection operations in progress Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
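Preserving structure through mapping can be pictured as a recursive walk that only touches the leaves. The sketch below is an illustration under simplifying assumptions (nested dicts stand in for nested collections, and the identifiers are invented); it is not how Galaxy stores or maps collections internally.

```python
def map_nested(tool, collection):
    """Apply `tool` to every leaf dataset, keeping the nesting intact."""
    if isinstance(collection, dict):
        return {key: map_nested(tool, value) for key, value in collection.items()}
    return tool(collection)


# A list of paired collections (hypothetical identifiers and contents).
reads = {
    "rep1": {"forward": "ACGT", "reverse": "TGCA"},
    "rep2": {"forward": "GGCC", "reverse": "CCGGA"},
}
lengths = map_nested(len, reads)
print(lengths)
```

The output has the same shape as the input, so a downstream step can still tell which result belongs to which replicate and which read direction.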
  56. An analysis is really a workflow

  57. As analysis needs become increasingly complex, typical users have moved

    from running individual tools to primarily running workflows
  58. For research use, users need to be able to construct

    and modify workflows, not just run existing best practice pipelines The Galaxy Workflow editor supports this use case well, providing ways for users to easily construct and modify workflows
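A workflow that users can construct and modify is, at its core, a dependency graph of steps. The following Python sketch models one as a plain dict and derives a valid execution order with the standard library; the step names are hypothetical and this is not the Galaxy workflow format.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose output it consumes (hypothetical names).
workflow = {
    "qc": {"input_reads"},
    "map_reads": {"qc"},
    "call_variants": {"map_reads"},
    "report": {"qc", "call_variants"},
}

# static_order() yields every step after all of the steps it depends on.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Modifying the workflow then amounts to editing the dict, for example adding a filtering step between `map_reads` and `call_variants`, and recomputing the order.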
  59. (Goecks et al. Cancer Medicine, 2015)

  60. (Goecks et al. Cancer Medicine, 2015)

  61. However, for reproducibility, we want to be able to ensure

    that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results
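Whether a rerun in a different environment produced exactly the same results can be checked by comparing checksums of the outputs. A minimal sketch, assuming each run is available as a dict from output name to bytes; the file names and contents are invented for illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of one output."""
    return hashlib.sha256(data).hexdigest()


def compare_runs(run_a: dict, run_b: dict) -> dict:
    """Map each output name to True only if both runs produced identical bytes."""
    names = sorted(set(run_a) | set(run_b))
    return {
        name: name in run_a and name in run_b and digest(run_a[name]) == digest(run_b[name])
        for name in names
    }


# Hypothetical outputs of the same workflow run in two environments.
local = {"variants.tsv": b"site\tfreq\n1\t0.5\n", "stats.txt": b"depth=100\n"}
cloud = {"variants.tsv": b"site\tfreq\n1\t0.5\n", "stats.txt": b"depth=101\n"}
result = compare_runs(local, cloud)
print(result)
```

Byte-for-byte equality is a strict criterion; outputs that embed timestamps or paths would need to be normalized before a comparison like this is meaningful.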
  62. 1 2 3 ∞ http://usegalaxy.org http://usegalaxy.org/community ... Galaxies on private

    clouds Galaxies on public clouds ... private Galaxy installations Private Tool Sheds Galaxy Tool Shed
  63. Fostering the tool developer community

  64. Galaxy has highly expressive tool definition syntax

  65. Conditionals

  66. Conditionals

  67. Conditionals

  68. Repeats

  69. Repeats

  70. Dynamic options

  71. And many others…

  72. The Galaxy Toolshed: Sharing tools, workflows, and their dependencies

  73. Repositories are owned by the contributor, can contain tools, workflows,

    etc. Backed by version control, a complete version history is retained for everything that passes through the toolshed Galaxy instance admins can install tools directly from the toolshed using only a web UI Support for recipes for installing the underlying software that tools depend on (also versioned)
  74. None
  75. None
  76. None
  77. None
  78. None
  79. None
  80. None
  81. ToolShed Challenges Good for deployment and archiving, difficult for development

  82. New command line tools to address concerns from tool developers

    Tool Development Planemo Command-line tools to aid development. ◦ Test tools quickly without worrying about configuration files. ◦ Check tools for common bugs and best practices. ◦ Optimized publishing to the ToolShed. ◦ Testbed for new dependency management - Homebrew and Homebrew-science
  83. Move to git[hub] centric development workflow Within three weeks, four

    major community contributions to core tools
  84. Tool citations, credit and incentivization Embed DOIs in Tool Configuration,

    Galaxy resolves and provides a list of citations, with links, which can be exported for reference managers
  85. None
  86. ToolShed Challenges Complex dependency definitions, packaging dependencies is a rabbit

  87. Virtualize everything: control the host environment

  88. None
  89. None
  90. POSTER PRESENTATION Open Access CLIA-certified next-generation sequencing analysis in the

    cloud Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1 From Beyond the Genome 2012 Boston, MA, USA. 27-29 September 2012 The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2). This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis. Author details 1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA. Published: 1 October 2012 Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54 CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)
  91. Share a snapshot of this instance Current support for archiving

    instances with CloudMan Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main
  92. New approaches for dependency management Alternative approach for installing dependencies:

    Homebrew/Linuxbrew How can we run community contributed tools safely and efficiently? Support for defining dependencies as Docker containers
  93. What is Docker? [Diagram: Traditional Virtual Machine compared with Docker]

    Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient
  94. ToolShed and Docker Tools can assert their dependencies are provided

    by a Docker container Potentially tool execution is more secure due to isolation Easier for tool developers to package dependencies Much easier for end-users to get dependencies
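Running a tool whose dependencies live in a container comes down to wrapping the tool's command line in a `docker run` invocation. The sketch below only builds such a command as a list of arguments and does not execute it; the helper name, image name and paths are hypothetical, and this is not necessarily how Galaxy assembles its own commands.

```python
import shlex


def docker_command(image: str, tool_command: list, workdir: str) -> list:
    """Wrap a tool command so it runs inside `image` with `workdir` mounted."""
    return [
        "docker", "run", "--rm",       # remove the container when the tool exits
        "-v", f"{workdir}:/data",      # bind-mount the job directory into the container
        "-w", "/data",                 # run the tool from the mounted directory
        image,
        *tool_command,
    ]


cmd = docker_command(
    image="example/fastqc:0.0.1",      # hypothetical image name and tag
    tool_command=["fastqc", "reads.fastq"],
    workdir="/tmp/job_1",              # hypothetical job directory
)
print(shlex.join(cmd))
```

Because the image tag pins the tool and its libraries, the end user needs only Docker itself, which is the convenience the slide describes.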
  95. What if you need a new, ad hoc analysis within

  96. Interactive programming environments

  97. For researchers without informatics expertise, the web UI and existing

    tools are often sufficient For informaticians, Galaxy provides an extensive API and wrappers (e.g. Bioblend) But, many users can do some programming, would like the benefits of Galaxy with the flexibility to do some scripting
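As a taste of the scripting route, the sketch below lists history names through BioBlend. It assumes BioBlend is installed and that a Galaxy URL and API key are supplied; the environment variable names `GALAXY_URL` and `GALAXY_API_KEY` and the helper functions are hypothetical conveniences, not part of BioBlend.

```python
import os


def connect(url: str, api_key: str):
    """Create a BioBlend client (requires `pip install bioblend`)."""
    from bioblend.galaxy import GalaxyInstance
    return GalaxyInstance(url=url, key=api_key)


def history_names(gi) -> list:
    """Return the names of the histories visible to the authenticated user."""
    return [history["name"] for history in gi.histories.get_histories()]


if __name__ == "__main__":
    # Hypothetical variable names; nothing is contacted unless both are set.
    url = os.environ.get("GALAXY_URL")
    key = os.environ.get("GALAXY_API_KEY")
    if url and key:
        print(history_names(connect(url, key)))
```

Keeping the connection separate from the query makes the query function easy to exercise against a stub object when no Galaxy server is at hand.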
  98. Docker enables interactive environments Framework allows spinning up secure* isolated

    environments, that can interact with the Galaxy history through Galaxy’s API Initial implementation supporting iPython Notebook
  99. None
  100. None
  101. None
  102. None
  103. None
  104. None
  105. None
  106. Next steps Support for Jupyter (both Python and Julia) and

    RStudio environments Interactive programming environments as first class citizens: full provenance tracking, establish inputs and outputs, be used in workflows, etc. Databases as first class citizens, e.g. GEMINI query interface as a reusable tool
  107. Visualization as a tool to make sense of complex data

  108. Towards a pluggable interactive visualization framework

  109. None
  110. Modifying Cufflinks parameters and locally reassembling

  111. PhyloViz from Google Summer of Code student Tomithy Too

  112. Circster, interactive circos-style plots

  113. Visualization framework: Charts plugin

  114. Visualization framework: Charts plugin

  115. Enables users to quickly visualize tabular data. Screencast

  116. Stuff that’s coming Backend workflow engine improvements to support the

    much larger analyses that can now be constructed in the UI (ongoing) Increasing complexity and control over how datasets are used Federation between Galaxy instances, support for transparently accessing data from other APIs
  117. Using Galaxy main to drive scalability improvements…

  118. PSC, Pittsburgh SDSC, San Diego Galaxy Cluster • 256 cores

    • 2 TB memory Rodeo • 128 cores • 1 TB memory Corral/Stockyard • 20 PB disk Stampede • 462,462 cores • 205 TB memory Blacklight • 4,096 cores • 32 TB memory • Dedicated resources Trestles • 10,368 cores • 20.7 TB memory • Shared resources TACC Austin
  119. funded by the National Science Foundation Award #ACI-1445604

  120. A user-friendly cloud environment designed to give researchers access to

    interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream Two widely used biology platforms will be supported - Galaxy and iPlant Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.
  121. Summary Galaxy is an (obsessively) open framework for making data

    analysis accessible and reproducible Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability By supporting and leveraging tool developers the Galaxy community can collectively keep up with rapid changes in available tools
  122. Dan Blankenberg Nate Coraor Dannon Baker Jeremy Goecks Anton Nekrutenko

    James Taylor Dave Clements Jennifer Jackson Engineering Support and outreach Custodians Carl Eberhard Dave Bouvier John Chilton Sam Guerler Martin Čech Enis Afgan Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health Nitesh Turaga The “Core” Galaxy Team
  123. Björn Grüning Uni Freiburg Peter Cock TJHI Kyle Ellrott UCSC

    Eric Rasche CPT Nicola Soranzo TGAC Brad Chapman HSPH Nuwan Goonasekera VeRSI Yousef Kowsar VLSCI Extended team and other contributors… And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …
  124. Galaxy is a community! Join us on irc, mailing lists,

    Galaxy Biostar Contribute code on bitbucket, github, or the ToolShed Join us for a Hackathon or our annual conference Fifth annual Galaxy Community Conference Hackathon, training day, and two days of talks