
Galaxy and Reproducibility

Motivation behind Galaxy and how it all works


Anton Nekrutenko

June 09, 2016


  1. Biology as a data intensive science: Computation is the new

    cloning @galaxyproject / #usegalaxy http://www.galaxyproject.org
  2. A continuing crisis in genomics research: reproducibility

  3. None
  4. None
  5. What is reproducibility? (for computational analyses) Reproducibility is not provenance, reusability/generalizability, or correctness. Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data). Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible). Missing software, versions, parameters, data…
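The items slide 5 lists as commonly missing (software, versions, parameters, data) can be thought of as one record per analysis step. The following is a minimal sketch of such a record, not Galaxy's actual data model: the class `AnalysisStep`, the helper names, the tool name `example-mapper`, and the parameter values are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnalysisStep:
    """Hypothetical record of one analysis step (not Galaxy's data model)."""
    tool: str            # software name
    version: str         # exact software version
    parameters: dict     # every parameter, including ones left at defaults
    input_sha256: tuple  # checksums that identify the input data


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of some bytes (e.g. file contents)."""
    return hashlib.sha256(data).hexdigest()


def step_fingerprint(step: AnalysisStep) -> str:
    """Hash a canonical JSON form of the record.

    Two steps share a fingerprint only if tool, version, parameters,
    and input checksums all match.
    """
    canonical = json.dumps(asdict(step), sort_keys=True)
    return sha256_of(canonical.encode("utf-8"))


# Illustrative values only.
reads = b"@r1\nACGT\n+\nIIII\n"
step = AnalysisStep(
    tool="example-mapper",
    version="5.9",
    parameters={"n": 3, "q": 15},
    input_sha256=(sha256_of(reads),),
)
print(step_fingerprint(step))
```

If any of the four fields is left out of a methods section, a reader cannot reconstruct the record, and the fingerprint of a rerun cannot be compared with the original.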
  6. Reproducibility Project: Cancer Biology Independently replicating 50 “high-impact” cancer studies

    from 2010-2012 (https://osf.io/e81xl/wiki/home/)
  7. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa;

    Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 32/127 tools 6/41 papers
  8. None
  9. #METHODSMATTER [Figure 1: Frequency fluctuation for site 8992. x-axis: versions 5.2 through 6.1; y-axis: frequency, 0.480 to 0.510; series: Default and -n 3 -q 15]
  10. Example: A tale of two Science papers

  11. Paper 1

  12. All you need for reproducing is here (Fig. 2)

  13. Paper 2

  14. None
  15. None
  16. None
  17. None
  18. Genomic signatures to guide the use of chemotherapeutics

    Anil Potti1,2, Holly K Dressman1,3, Andrea Bild1,3, Richard F Riedel1,2, Gina Chan4, Robyn Sayer4, Janiel Cragun4, Hope Cottrill4, Michael J Kelley2, Rebecca Petersen5, David Harpole5, Jeffrey Marks5, Andrew Berchuck1,6, Geoffrey S Ginsburg1,2, Phillip Febbo1–3, Johnathan Lancaster4 & Joseph R Nevins1–3

    Using in vitro drug sensitivity data coupled with Affymetrix microarray data, we developed gene expression signatures that predict sensitivity to individual chemotherapeutic drugs. Each signature was validated with response data from an independent set of cell line studies. We further show that many of these signatures can accurately predict clinical response in individuals treated with these drugs. Notably, signatures developed to predict response to individual agents, when combined, could also predict response to multidrug regimens. Finally, we integrated the chemotherapy response signatures with signatures of oncogenic pathway deregulation to identify new therapeutic strategies that make use of all available drugs. The development of gene expression profiles that can predict response to commonly used cytotoxic agents provides opportunities to better use these drugs, including using them in combination with existing targeted therapies.

    Numerous advances have been achieved in the development, selection and application of chemotherapeutic agents, sometimes with remarkable clinical successes, as in the case of treatment for lymphomas or platinum-based therapy for testicular cancers1. In addition, in several instances, combination chemotherapy in the postoperative (adjuvant) setting has been curative. However, most people with advanced solid tumors will relapse and die of their disease. Moreover, administration of ineffective chemotherapy increases the probability of side effects, particularly those from cytotoxic agents, and of a consequent decrease in quality of life1,2.

    Recent work has demonstrated the value in using biomarkers to select individuals for various targeted therapeutics, including tamoxifen, trastuzumab and imatinib mesylate. In contrast, equivalent tools to select those most likely to respond to the commonly used chemotherapeutic drugs are lacking3. With the goal of developing genomic predictors of chemotherapy sensitivity that could direct the use of cytotoxic agents to those most likely to respond, we combined in vitro drug response data, together with microarray gene expression data, to develop models that could potentially predict responses to various cytotoxic chemotherapeutic drugs4. We now show that these signatures can predict clinical or pathologic response to the corresponding drugs, including combinations of drugs. We further use the ability to predict deregulated oncogenic signaling pathways in tumors to develop a strategy that identifies opportunities for combining chemotherapeutic drugs with targeted therapeutic drugs in a way that best matches the characteristics of the individual.

    RESULTS

    A gene expression–based predictor of sensitivity to docetaxel

    To develop predictors of cytotoxic chemotherapeutic drug response, we used an approach similar to previous work analyzing the NCI-60 panel4 from the US National Cancer Institute (NCI). We first identified cell lines that were most resistant or sensitive to docetaxel (Fig. 1a,b) and then genes whose expression correlated most highly with drug sensitivity, and used Bayesian binary regression analysis to develop a model that differentiates a pattern of docetaxel sensitivity from that of resistance. A gene expression signature consisting of 50 genes was identified that classified cell lines on the basis of docetaxel sensitivity (Fig. 1b, right). In addition to leave-one-out cross-validation, we used an independent dataset derived from docetaxel sensitivity assays in a series of 30 lung and ovarian cancer cell lines for further validation. The significant correlation (P < 0.01, log-rank test) between the predicted probability of sensitivity to docetaxel (in both lung and ovarian cell lines) (Fig. 1c, left) and the respective 50% inhibitory concentration (IC50) for docetaxel confirmed the capacity of the docetaxel predictor to predict sensitivity to the drug in cancer cell

    © 2011 Nature America, Inc. All rights reserved.
  19. The importance of being reproducible Starting in 2006, Potti published

    papers describing algorithms that take gene-expression data from a cancer cell and predict whether the cancer will be sensitive to a particular therapy Duke began three clinical trials based on the technology enrolling 110 patients
  20. The importance of being reproducible However, Keith Baggerly and Kevin Coombes demonstrate that the findings cannot be replicated. Long and difficult fight to get this acknowledged, followed by a series of investigations. So far, ten major paper retractions, all trials cancelled, two lawsuits ongoing…
  21. The importance of being reproducible NCI investigates, demands the software

    for the method be provided Not only could they not replicate the results, the software produced substantially different predictions when run again on the same data! Some scores changed from 5% to 95%, classifications changed ~25% of the time!
  22. How does this even pass peer review? DON’T TRUST BLACK

    BOXES! Be smart consumers!
  23. Is reproducibility achievable?

  24. To answer this question we need to understand causes of

    the problem
  25. Who are we dealing with? Users (Biologists) Developers HPC

  26. Users (Biologists) troubles: - Data logistics - HPC - Poor knowledge of existing tools - Inability to develop new tools - Lack of transparency and reproducibility
  27. Developers’ grief: - Limited tool exposure - Parameter picking troubles

    - Data format nightmare - High profile publications
  28. HPC providers’ challenges: - Lack of HPC utilization skills -

    Software is not optimized - HPC is heterogeneous
  29. user (Biologist) admin dev

  30. user admin dev

  31. admin user dev

  32. user dev Galaxy admin

  33. Galaxy: accessible analysis system

  34. Galaxy Servers Worldwide http://bit.ly/gxyServers

  35. A free (for everyone) web service integrating a wealth of

    tools, compute resources, terabytes of reference data and permanent storage Open source software that makes integrating your own tools and data and customizing for your own site simple An open extensible platform for sharing tools, datatypes, workflows, ...
  36. Galaxy’s ideological goals: How best can data intensive methods be

    accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
  37. Galaxy’s practical goals: How to arm researchers with access to

    powerful compute and latest tools How to build a community of tool developers How to run Galaxy on any HPC
  38. Describe analysis tool behavior abstractly

  39. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details
  40. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically
  41. Describe analysis tool behavior abstractly Analysis environment automatically and transparently

    tracks details Workflow system for complex analysis, constructed explicitly or automatically Pervasive sharing, and publication of documents with integrated analysis
  42. Visualization and visual analytics

  43. Ways to use Galaxy The public web service at http://usegalaxy.org

    Install locally with many compute environments Deploy on a cloud using CloudMan Atmosphere
  44. Galaxy in a world of increasingly complex analyses

  45. user HPC dev Galaxy

  46. user HPC dev

  47. We are in the age of multiple datasets

  48. Galaxy’s user interface is designed to be simple and intuitive

    for users without informatics expertise Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?
  49. Users typically use many histories when working with many samples;

    New multiple history view makes working with 100s of histories easy
  50. A not-so-new feature: mapping over multiple datasets However, this breaks

    down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)
  51. Dataset collections complex combinations of datasets that can be treated

    as a single unit
  52. Dataset Collections Organize user data Individual Datasets Collection Collection Contents

  53. Operations over collections For “list” collections, existing tools can automatically

    be mapped across the entire collection Existing tools that support multiple inputs and one output act as reducers Many existing tools just work; but “structured” collections like “paired” need explicit support in tools
  54. Map/reduce in workflows More Powerful Workflows Arbitrary # of Inputs

    (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
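The map and reduce behavior described on slides 53 and 54 can be sketched in a few lines. This is an illustration under stated assumptions, not Galaxy's implementation: a "list" collection is modeled as a plain dict of named datasets, a "tool" is any Python function, and the names `map_over_collection`, `reduce_collection`, and `count_reads` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def map_over_collection(tool: Callable, collection: Dict[str, object]) -> Dict[str, object]:
    """Run a single-input tool once per dataset, in parallel.

    The result is a collection with the same element names.
    """
    with ThreadPoolExecutor() as pool:
        results = pool.map(tool, collection.values())  # preserves input order
        return dict(zip(collection.keys(), results))


def reduce_collection(tool: Callable[[List[object]], object], collection: Dict[str, object]):
    """Run a multiple-input, single-output tool once over the whole collection."""
    return tool(list(collection.values()))


def count_reads(fastq_text: str) -> int:
    """Toy 'tool': count FASTQ records (four lines per record)."""
    return len(fastq_text.strip().splitlines()) // 4


# Illustrative data: two samples with one and two reads.
samples = {
    "sample1": "@r1\nACGT\n+\nIIII\n",
    "sample2": "@r1\nAC\n+\nII\n@r2\nGT\n+\nII\n",
}
per_sample = map_over_collection(count_reads, samples)  # one job per input
total = reduce_collection(sum, per_sample)              # merged output
print(per_sample, total)
```

The tool itself knows nothing about collections; the mapping layer supplies one dataset at a time, which is why many existing single-input tools can work unchanged.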
  55. Enhanced Tuxedo Suite Workflow RNA-Seq workflow based using the Tuxedo

  56. Dataset Collections Extremely flexible for grouping collections of complex datasets,

    can be nested to arbitrary depth, structure is preserved through mapping More complex reductions, other collection operations in progress Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming)
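To illustrate what "structure is preserved through mapping" means for nested collections, here is a small sketch. It assumes a collection can be modeled as nested dicts whose leaves are datasets, which is a simplification and not Galaxy's data model; `map_nested` is a hypothetical helper name.

```python
from typing import Callable


def map_nested(tool: Callable, collection):
    """Apply a per-dataset tool to every leaf of a nested collection.

    Dicts are treated as collections and recursed into; anything else is
    treated as a dataset. The output has the same shape as the input.
    """
    if isinstance(collection, dict):
        return {name: map_nested(tool, element) for name, element in collection.items()}
    return tool(collection)


# Illustrative "list of paired" collection: samples, each with two reads.
paired_reads = {
    "sample1": {"forward": "ACGTAC", "reverse": "GTACGT"},
    "sample2": {"forward": "ACG", "reverse": "CGT"},
}
read_lengths = map_nested(len, paired_reads)
print(read_lengths)
```

Because the output keeps the sample and forward/reverse labels, a later step can still pair results up correctly without re-deriving which dataset came from which sample.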
  57. An analysis is really a workflow

  58. As analysis needs become increasingly complex, typical users have moved from running individual tools to primarily running workflows
  59. For research use, users need to be able to construct

    and modify workflows, not just run existing best practice pipelines The Galaxy Workflow editor supports this use case well, providing ways for users to easily construct and modify workflows
  60. (Goecks et al. Cancer Medicine, 2015)

  61. (Goecks et al. Cancer Medicine, 2015)

  62. However, for reproducibility, we want to be able to ensure

    that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results
  63. 1 2 3 ∞ http://usegalaxy.org http://usegalaxy.org/community ... Galaxies on private

    clouds Galaxies on public clouds ... private Galaxy installations Private Tool Sheds Galaxy Tool Shed
  64. Fostering the tool developer community

  65. Galaxy has highly expressive tool definition syntax

  66. Conditionals

  67. Conditionals

  68. Conditionals

  69. Repeats

  70. Repeats

  71. Dynamic options

  72. And many others…

  73. The Galaxy Toolshed: Sharing tools, workflows, and their dependencies

  74. Repositories are owned by the contributor, can contain tools, workflows,

    etc. Backed by version control, a complete version history is retained for everything that passes through the toolshed Galaxy instance admins can install tools directly from the toolshed using only a web UI Support for recipes for installing the underlying software that tools depend on (also versioned)
  75. None
  76. None
  77. None
  78. None
  79. None
  80. None
  81. None
  82. ToolShed Challenges Good for deployment and archiving, difficult for development

  83. Tool citations, credit and incentivization Embed DOIs in Tool Configuration, Galaxy resolves and provides a list of citations, with links, which can be exported for reference managers
  84. None
  85. ToolShed Challenges Complex dependency definitions, packaging dependencies is a rabbit

  86. Virtualize everything: control the host environment

  87. None
  88. None
  89. POSTER PRESENTATION Open Access CLIA-certified next-generation sequencing analysis in the cloud

    Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1

    From Beyond the Genome 2012 Boston, MA, USA. 27-29 September 2012

    The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting.

    Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2).

    This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis.

    Author details 1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA.

    Published: 1 October 2012 Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54

    CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)
  90. Share a snapshot of this instance. Current support for archiving instances with CloudMan. Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main
  91. New approaches for dependency management Alternative approach for installing dependencies:

    Conda How can we run community contributed tools safely and efficiently? Support for defining dependencies as Docker containers
  92. What is Docker? Docker vs. Virtual Machines [Figure: Traditional Virtual Machine compared with Docker] Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient
  93. ToolShed and Docker Tools can assert their dependencies are provided

    by a Docker container Potentially tool execution is more secure due to isolation Easier for tool developers to package dependencies Much easier for end-users to get dependencies
  94. What if you need a new, ad hoc analysis within

  95. Interactive programming environments

  96. For researchers without informatics expertise, the web UI and existing

    tools are often sufficient For informaticians, Galaxy provides an extensive API and wrappers (e.g. Bioblend) But, many users can do some programming, would like the benefits of Galaxy with the flexibility to do some scripting
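As a rough illustration of the API route mentioned on slide 96, the sketch below uses BioBlend's `GalaxyInstance` client to list a user's histories. The helper names `connect` and `history_names` are hypothetical, and the server URL and API key in the commented usage are placeholders; the sketch assumes BioBlend is installed and that the key belongs to an account on that server.

```python
def connect(url: str, api_key: str):
    """Create a BioBlend client for a Galaxy server.

    BioBlend is imported here so that the helper below can be used
    (and tested) without the package installed.
    """
    from bioblend.galaxy import GalaxyInstance
    return GalaxyInstance(url=url, key=api_key)


def history_names(gi) -> list:
    """Return the names of the histories visible to this client.

    `gi.histories.get_histories()` returns a list of dicts, one per history.
    """
    return [history["name"] for history in gi.histories.get_histories()]


# Example usage (placeholders; requires network access and a valid key):
# gi = connect("https://usegalaxy.org", "YOUR_API_KEY")
# print(history_names(gi))
```

Scripts like this sit between the web UI and full tool development: the history, datasets, and provenance stay in Galaxy while the user drives it from code.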
  97. Docker enables interactive environments Framework allows spinning up secure* isolated

    environments, that can interact with the Galaxy history through Galaxy’s API Initial implementation supporting iPython Notebook
  98. None
  99. None
  100. None
  101. None
  102. None
  103. None
  104. None
  105. Using Galaxy main to drive scalability improvements…

  106. PSC, Pittsburgh SDSC, San Diego Galaxy Cluster • 256 cores

    • 2 TB memory Rodeo • 128 cores • 1 TB memory Corral/Stockyard • 20 PB disk Stampede • 462,462 cores • 205 TB memory Blacklight • 4,096 cores • 32 TB memory • Dedicated resources Trestles • 10,368 cores • 20.7 TB memory • Shared resources TACC Austin
  107. funded by the National Science Foundation Award #ACI-1445604

  108. A user-friendly cloud environment designed to give researchers access to

    interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream Two widely used biology platforms will be supported - Galaxy and iPlant Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.
  109. Summary Galaxy is an (obsessively) open framework for making data

    analysis accessible and reproducible Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability By supporting and leveraging tool developers the Galaxy community can collectively keep up with rapid changes in available tools
  110. Dan Blankenberg Nate Coraor Dannon Baker Jeremy Goecks Anton Nekrutenko

    James Taylor Dave Clements Jennifer Jackson Engineering Support and outreach Custodians Carl Eberhard Dave Bouvier John Chilton Sam Guerler Martin Čech Enis Afgan Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health Nitesh Turaga The “Core” Galaxy Team
  111. Björn Grüning Uni Freiburg Peter Cock TJHI Kyle Ellrott UCSC

    Eric Rasche CPT Nicola Soranzo TGAC Brad Chapman HSPH Nuwan Goonasekera VeRSI Yousef Kowsar VLSCI Extended team and other contributors… And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …
  112. Galaxy is a community! Join us on irc, mailing lists,

    Galaxy Biostar Contribute code on bitbucket, github, or the ToolShed Join us for a Hackathon or our annual conference 2016