
Galaxy Keynote at LSU

A presentation at the LSU 3rd Annual Bioinformatics Conference

Anton Nekrutenko

April 17, 2015

Transcript

  1. What is reproducibility? (for computational analyses)

    Reproducibility is not provenance, reusability/generalizability, or correctness. Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data). Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible). Missing software, versions, parameters, data…
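
The items the slide lists as commonly missing (software, versions, parameters, data) can be captured in a small machine-readable record. The following is a minimal sketch, not Galaxy's own mechanism; the tool name, version string, parameter values, and input bytes are all hypothetical placeholders.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def describe_step(tool: str, version: str, parameters: dict, input_data: bytes) -> dict:
    """Record software, version, parameters, and a checksum of the input data."""
    return {
        "tool": tool,
        "version": version,
        # Sort parameters so the same settings always serialize identically.
        "parameters": dict(sorted(parameters.items())),
        "input_sha256": sha256_of(input_data),
    }


record = describe_step(
    tool="example-mapper",         # hypothetical tool name
    version="1.0.0",               # hypothetical version string
    parameters={"q": 15, "n": 3},  # illustrative parameter values
    input_data=b"ACGT\n",          # stand-in for the contents of a real input file
)
print(json.dumps(record, indent=2))
```

Storing such a record next to each result is one way to make "described in sufficient detail" checkable: two analyses with identical records started from the same software, settings, and data.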
  2. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130

    32/127 tools; 6/41 papers
  3. #METHODSMATTER

    [Figure 1: Frequency Fluctuation for site 8992. Frequency (0.480 to 0.510) plotted against version labels 5.2 through 6.1; series: Default and -n 3 -q 15]
  4. Genomic signatures to guide the use of chemotherapeutics

    Anil Potti1,2, Holly K Dressman1,3, Andrea Bild1,3, Richard F Riedel1,2, Gina Chan4, Robyn Sayer4, Janiel Cragun4, Hope Cottrill4, Michael J Kelley2, Rebecca Petersen5, David Harpole5, Jeffrey Marks5, Andrew Berchuck1,6, Geoffrey S Ginsburg1,2, Phillip Febbo1–3, Johnathan Lancaster4 & Joseph R Nevins1–3

    Using in vitro drug sensitivity data coupled with Affymetrix microarray data, we developed gene expression signatures that predict sensitivity to individual chemotherapeutic drugs. Each signature was validated with response data from an independent set of cell line studies. We further show that many of these signatures can accurately predict clinical response in individuals treated with these drugs. Notably, signatures developed to predict response to individual agents, when combined, could also predict response to multidrug regimens. Finally, we integrated the chemotherapy response signatures with signatures of oncogenic pathway deregulation to identify new therapeutic strategies that make use of all available drugs. The development of gene expression profiles that can predict response to commonly used cytotoxic agents provides opportunities to better use these drugs, including using them in combination with existing targeted therapies.

    Numerous advances have been achieved in the development, selection and application of chemotherapeutic agents, sometimes with remarkable clinical successes, as in the case of treatment for lymphomas or platinum-based therapy for testicular cancers1. In addition, in several instances, combination chemotherapy in the postoperative (adjuvant) setting has been curative. However, most people with advanced solid tumors will relapse and die of their disease. Moreover, administration of ineffective chemotherapy increases the probability of side effects, particularly those from cytotoxic agents, and of a consequent decrease in quality of life1,2. Recent work has demonstrated the value in using biomarkers to select individuals for various targeted therapeutics, including tamoxifen, trastuzumab and imatinib mesylate. In contrast, equivalent tools to select those most likely to respond to the commonly used chemotherapeutic drugs are lacking3. With the goal of developing genomic predictors of chemotherapy sensitivity that could direct the use of cytotoxic agents to those most likely to respond, we combined in vitro drug response data, together with microarray gene expression data, to develop models that could potentially predict responses to various cytotoxic chemotherapeutic drugs4. We now show that these signatures can predict clinical or pathologic response to the corresponding drugs, including combinations of drugs. We further use the ability to predict deregulated oncogenic signaling pathways in tumors to develop a strategy that identifies opportunities for combining chemotherapeutic drugs with targeted therapeutic drugs in a way that best matches the characteristics of the individual.

    RESULTS

    A gene expression–based predictor of sensitivity to docetaxel

    To develop predictors of cytotoxic chemotherapeutic drug response, we used an approach similar to previous work analyzing the NCI-60 panel4 from the US National Cancer Institute (NCI). We first identified cell lines that were most resistant or sensitive to docetaxel (Fig. 1a,b) and then genes whose expression correlated most highly with drug sensitivity, and used Bayesian binary regression analysis to develop a model that differentiates a pattern of docetaxel sensitivity from that of resistance. A gene expression signature consisting of 50 genes was identified that classified cell lines on the basis of docetaxel sensitivity (Fig. 1b, right). In addition to leave-one-out cross-validation, we used an independent dataset derived from docetaxel sensitivity assays in a series of 30 lung and ovarian cancer cell lines for further validation. The significant correlation (P < 0.01, log-rank test) between the predicted probability of sensitivity to docetaxel (in both lung and ovarian cell lines) (Fig. 1c, left) and the respective 50% inhibitory concentration (IC50) for docetaxel confirmed the capacity of the docetaxel predictor to predict sensitivity to the drug in cancer cell

    © 2011 Nature America, Inc. All rights reserved.
  5. The importance of being reproducible

    Starting in 2006, Potti published papers describing algorithms that take gene-expression data from a cancer cell and predict whether the cancer will be sensitive to a particular therapy. Duke began three clinical trials based on the technology, enrolling 110 patients.
  6. The importance of being reproducible

    However, Keith Baggerly and Kevin Coombes demonstrate that the findings cannot be replicated. Long and difficult fight to get this acknowledged, followed by a series of investigations. So far, ten major paper retractions, all trials cancelled, two lawsuits ongoing…
  7. The importance of being reproducible

    NCI investigates, demands the software for the method be provided. Not only could they not replicate the results, the software produced substantially different predictions when run again on the same data! Some scores changed from 5% to 95%, classifications changed ~25% of the time!
  8. Users’ troubles:

    - Data logistics
    - HPC
    - Poor knowledge of existing tools
    - Inability to develop new tools
    - Lack of transparency and reproducibility
  9. Developers’ grief:

    - Limited tool exposure
    - Parameter picking troubles
    - Data format nightmare
    - High profile publications
  10. HPC providers’ challenges:

    - Lack of HPC utilization skills
    - Software is not optimized
    - HPC is heterogeneous
  11. A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage

    Open source software that makes integrating your own tools and data and customizing for your own site simple. An open extensible platform for sharing tools, datatypes, workflows, ...
  12. Galaxy’s ideological goals:

    How best can data intensive methods be accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
  13. Galaxy’s practical goals:

    How to arm researchers with access to powerful compute and the latest tools. How to build a community of tool developers. How to run Galaxy on any HPC.
  14. Describe analysis tool behavior abstractly

    Analysis environment automatically and transparently tracks details. Workflow system for complex analysis, constructed explicitly or automatically.
  15. Describe analysis tool behavior abstractly

    Analysis environment automatically and transparently tracks details. Workflow system for complex analysis, constructed explicitly or automatically. Pervasive sharing, and publication of documents with integrated analysis.
  16. Ways to use Galaxy

    - The public web service at http://usegalaxy.org
    - Install locally with many compute environments
    - Deploy on a cloud using CloudMan Atmosphere
  17. Galaxy’s user interface is designed to be simple and intuitive for users without informatics expertise

    Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?
  18. Users typically use many histories when working with many samples; the new multiple history view makes working with 100s of histories easy
  19. A not-so-new feature: mapping over multiple datasets

    However, this breaks down for complex combinations of datasets (e.g. many sets of paired end reads, in replicates)
  20. Operations over collections

    For “list” collections, existing tools can automatically be mapped across the entire collection. Existing tools that support multiple inputs and one output act as reducers. Many existing tools just work, but “structured” collections like “paired” need explicit support in tools.
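
The two behaviours described for “list” collections can be modelled in a few lines. This is a minimal sketch in plain Python rather than Galaxy code: datasets are stand-in strings, and `to_upper` and `concatenate` are hypothetical tools invented for the illustration.

```python
from typing import Callable, List


def map_over(collection: List[str], tool: Callable[[str], str]) -> List[str]:
    """Apply a one-input, one-output tool to every element of a list collection."""
    return [tool(dataset) for dataset in collection]


def reduce_over(collection: List[str], tool: Callable[[List[str]], str]) -> str:
    """Hand the whole collection to a tool that takes many inputs and yields one output."""
    return tool(collection)


def to_upper(dataset: str) -> str:
    """Hypothetical per-dataset tool."""
    return dataset.upper()


def concatenate(datasets: List[str]) -> str:
    """Hypothetical multi-input tool, acting as a reducer."""
    return "".join(datasets)


samples = ["acgt", "ttga", "ccat"]
mapped = map_over(samples, to_upper)        # one output per input, same order
merged = reduce_over(mapped, concatenate)   # collection collapsed to one dataset
print(mapped)
print(merged)
```

Neither tool knows anything about collections, which is the sense in which existing tools can "just work" on lists.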
  21. Map/reduce in workflows

    More Powerful Workflows. Arbitrary # of Inputs (... paired). Run applications in parallel (one per input). Merged output for subsequent processing.
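
The "one job per input, then merge" pattern can be sketched with Python's standard thread pool. This illustrates the scheduling idea only and is not how Galaxy dispatches jobs; `count_bases` is a hypothetical application and the inputs are stand-in strings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def count_bases(reads: str) -> int:
    """Hypothetical per-input application: count the characters in one dataset."""
    return len(reads)


def run_in_parallel(inputs: List[str]) -> List[int]:
    """Run the application once per input; results keep the order of the inputs."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(count_bases, inputs))


inputs = ["ACGT", "ACGTAC", "AC"]
per_input = run_in_parallel(inputs)  # map step: one result per input
total = sum(per_input)               # reduce step: merged output
print(per_input, total)
```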
  22. Dataset Collections

    Extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, structure is preserved through mapping. More complex reductions, other collection operations in progress. Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming).
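
"Structure is preserved through mapping" can be made concrete with a recursive map over nested containers. The sketch below assumes a simplified model in which a "list" collection is a Python list and a "paired" collection is a dict with `forward` and `reverse` keys; that representation is an assumption for the illustration, not Galaxy's internal one.

```python
def map_nested(collection, tool):
    """Apply tool to every leaf dataset while keeping the nesting unchanged."""
    if isinstance(collection, dict):
        return {key: map_nested(value, tool) for key, value in collection.items()}
    if isinstance(collection, list):
        return [map_nested(value, tool) for value in collection]
    return tool(collection)  # a leaf: an individual dataset


# A list of paired collections (two samples, each with forward and reverse reads).
samples = [
    {"forward": "ACGT", "reverse": "TGCA"},
    {"forward": "GGCCA", "reverse": "CCGG"},
]
lengths = map_nested(samples, len)
print(lengths)
```

The output has the same list-of-pairs shape as the input, so a downstream step can still tell which result belongs to which sample and which read direction.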
  23. As analysis needs become increasingly complex, typical users have moved from running individual tools to primarily running workflows
  24. For research use, users need to be able to construct and modify workflows, not just run existing best practice pipelines

    The Galaxy Workflow editor supports this use case well, providing ways for users to easily construct and modify workflows
  25. However, for reproducibility, we want to be able to ensure that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results
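
One way to test "exactly the same results" is to compare checksums of every output from two runs. The sketch below assumes each run's outputs are available as a mapping from file name to bytes; the file names and contents are hypothetical.

```python
import hashlib
from typing import Dict, List


def digests(outputs: Dict[str, bytes]) -> Dict[str, str]:
    """Map each output name to the SHA-256 digest of its contents."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in outputs.items()}


def differing_outputs(first_run: Dict[str, bytes], second_run: Dict[str, bytes]) -> List[str]:
    """List outputs that are missing from one run or not byte-identical."""
    a, b = digests(first_run), digests(second_run)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


run_a = {"variants.vcf": b"chrM\t8992\n", "report.txt": b"score=0.95\n"}
run_b = {"variants.vcf": b"chrM\t8992\n", "report.txt": b"score=0.05\n"}
print(differing_outputs(run_a, run_b))
```

An empty list means the rerun was byte-identical. Outputs that embed timestamps or other run-specific metadata would need normalizing before such a comparison is meaningful.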
  26. [Diagram labels: Galaxy Tool Shed; Private Tool Sheds; Galaxy instances 1, 2, 3, … ∞: http://usegalaxy.org, http://usegalaxy.org/community, ..., Galaxies on private clouds, Galaxies on public clouds, ..., private Galaxy installations]
  27. Repositories are owned by the contributor, can contain tools, workflows, etc.

    Backed by version control, a complete version history is retained for everything that passes through the toolshed. Galaxy instance admins can install tools directly from the toolshed using only a web UI. Support for recipes for installing the underlying software that tools depend on (also versioned).
  28. New command line tools to address concerns from tool developers

    Tool Development: Planemo. Command-line tools to aid development.
    ◦ Test tools quickly without worrying about configuration files.
    ◦ Check tools for common bugs and best practices.
    ◦ Optimized publishing to the ToolShed.
    ◦ Testbed for new dependency management: Homebrew and Homebrew-science
  29. Move to git[hub] centric development workflow

    Within three weeks, four major community contributions to core tools
  30. Tool citations, credit and incentivization

    Embed DOIs in Tool Configuration; Galaxy resolves them and provides a list of citations, with links, which can be exported for reference managers
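
Embedding a DOI in a tool configuration can be pictured as follows. The sketch assumes the tool XML carries a `<citations>` block with `<citation type="doi">` entries; treat that layout, the tool id, and the DOI itself as illustrative placeholders. The code only extracts the DOIs and builds resolver links. It does not contact any service.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical tool configuration; the DOI below is a placeholder, not a real reference.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
  <citations>
    <citation type="doi">10.1234/example.5678</citation>
  </citations>
</tool>
"""


def doi_links(tool_xml: str) -> List[str]:
    """Collect DOI citations from a tool configuration as https://doi.org/ links."""
    root = ET.fromstring(tool_xml.strip())
    return [
        "https://doi.org/" + citation.text.strip()
        for citation in root.iter("citation")
        if citation.get("type") == "doi" and citation.text
    ]


print(doi_links(TOOL_XML))
```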
  31. POSTER PRESENTATION, Open Access: CLIA-certified next-generation sequencing analysis in the cloud

    Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1

    From Beyond the Genome 2012, Boston, MA, USA. 27-29 September 2012

    The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA’s stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon with patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2). This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis.

    Author details: 1Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA.

    Published: 1 October 2012. Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54 http://www.biomedcentral.com/1753-6561/6/S6/P54

    CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute)
  32. Share a snapshot of this instance

    Current support for archiving instances with CloudMan. Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main.
  33. New approaches for dependency management

    Alternative approach for installing dependencies: Homebrew/Linuxbrew. How can we run community contributed tools safely and efficiently? Support for defining dependencies as Docker containers.
  34. What is Docker?

    [Diagram: Traditional Virtual Machine vs. Docker]

    Kernel is shared between containers; achieves the isolation and management benefits of VMs but much more lightweight and efficient
  35. ToolShed and Docker

    Tools can assert their dependencies are provided by a Docker container. Potentially tool execution is more secure due to isolation. Easier for tool developers to package dependencies. Much easier for end-users to get dependencies.
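
A tool asserting that its dependencies come from a container might look like the sketch below. It assumes the tool XML declares the image in a `<container type="docker">` element under `<requirements>`; the image name, job directory, and command are hypothetical. The `docker run` argument list is only constructed, never executed, and it is not a description of how Galaxy itself builds the command.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical tool configuration naming a placeholder image.
TOOL_XML = """
<tool id="example_tool" name="Example tool" version="0.1.0">
  <requirements>
    <container type="docker">example/image:1.0</container>
  </requirements>
</tool>
"""


def docker_image(tool_xml: str) -> Optional[str]:
    """Return the first Docker image named in the tool's requirements, if any."""
    root = ET.fromstring(tool_xml.strip())
    for container in root.iter("container"):
        if container.get("type") == "docker" and container.text:
            return container.text.strip()
    return None


def docker_command(image: str, job_dir: str, command: str) -> List[str]:
    """Build a docker run invocation that mounts only the job directory."""
    return [
        "docker", "run", "--rm",        # remove the container when the job ends
        "-v", f"{job_dir}:{job_dir}",   # expose the job directory, nothing else
        "-w", job_dir,                  # run the tool inside that directory
        image, "sh", "-c", command,
    ]


image = docker_image(TOOL_XML)
print(docker_command(image, "/tmp/job_1", "echo hello > out.txt"))
```

Mounting only the job directory is what gives the isolation benefit mentioned on the slide: the tool sees its own inputs and outputs and nothing else on the host.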
  36. For researchers without informatics expertise, the web UI and existing tools are often sufficient

    For informaticians, Galaxy provides an extensive API and wrappers (e.g. BioBlend). But many users can do some programming and would like the benefits of Galaxy with the flexibility to do some scripting.
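
A small script against the API might look like the sketch below, which uses only the Python standard library. It assumes a Galaxy server exposes a `/api/histories` endpoint that accepts the API key as a `key` query parameter and returns a JSON list of objects with a `name` field; the server URL and key are placeholders. Only the URL builder is exercised here, since `list_history_names` needs a live server.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def histories_url(base_url: str, api_key: str) -> str:
    """Build the request URL for listing histories (endpoint layout assumed)."""
    query = urllib.parse.urlencode({"key": api_key})
    return base_url.rstrip("/") + "/api/histories?" + query


def list_history_names(base_url: str, api_key: str) -> List[str]:
    """Fetch the histories visible to this key and return their names."""
    with urllib.request.urlopen(histories_url(base_url, api_key)) as response:
        histories = json.load(response)
    return [history["name"] for history in histories]


# Placeholder server and key; replace both before calling list_history_names.
print(histories_url("https://galaxy.example.org/", "MY_API_KEY"))
```

A wrapper library such as BioBlend hides this plumbing; the point of the sketch is only that a few lines of scripting are enough to reach the same data the web UI shows.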
  37. Docker enables interactive environments

    Framework allows spinning up secure* isolated environments that can interact with the Galaxy history through Galaxy’s API. Initial implementation supporting IPython Notebook.
  38. Next steps

    Support for Jupyter (both Python and Julia) and RStudio environments. Interactive programming environments as first class citizens: full provenance tracking, establish inputs and outputs, be used in workflows, etc. Databases as first class citizens, e.g. GEMINI query interface as a reusable tool.
  39. Stuff that’s coming

    Backend workflow engine improvements to support the much larger analyses that can now be constructed in the UI (ongoing). Increasing complexity and control over how datasets are used. Federation between Galaxy instances, support for transparently accessing data from other APIs.
  40. Sites: PSC, Pittsburgh; SDSC, San Diego; TACC, Austin

    - Galaxy Cluster: 256 cores, 2 TB memory
    - Rodeo: 128 cores, 1 TB memory
    - Corral/Stockyard: 20 PB disk
    - Stampede: 462,462 cores, 205 TB memory
    - Blacklight: 4,096 cores, 32 TB memory, dedicated resources
    - Trestles: 10,368 cores, 20.7 TB memory, shared resources
  41. A user-friendly cloud environment designed to give researchers access to interactive computing and data analysis resources on demand; researchers can create their own “private computing system” within Jetstream

    Two widely used biology platforms will be supported: Galaxy and iPlant. Allow users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.
  42. Summary

    Galaxy is an (obsessively) open framework for making data analysis accessible and reproducible. Nearly everything in Galaxy is “pluggable”, allowing it to be customized for myriad purposes. New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability. By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools.
  43. The “Core” Galaxy Team

    Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor, Dave Clements, Jennifer Jackson, Carl Eberhard, Dave Bouvier, John Chilton, Sam Guerler, Martin Čech, Enis Afgan, Nitesh Turaga

    Groups shown on the slide: Engineering; Support and outreach; Custodians

    Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health
  44. Extended team and other contributors…

    Björn Grüning (Uni Freiburg), Peter Cock (TJHI), Kyle Ellrott (UCSC), Eric Rasche (CPT), Nicola Soranzo (TGAC), Brad Chapman (HSPH), Nuwan Goonasekera (VeRSI), Yousef Kowsar (VLSCI)

    And many others who have contributed to the main Galaxy code, tools to the ToolShed, participated in discussions, attended the Galaxy conferences, …
  45. Galaxy is a community!

    Join us on IRC, mailing lists, Galaxy Biostar. Contribute code on Bitbucket, GitHub, or the ToolShed. Join us for a Hackathon or our annual conference. Fifth annual Galaxy Community Conference: Hackathon, training day, and two days of talks.