Galaxy Keynote at LSU

A presentation at the LSU 3rd Annual Bioinformatics Conference

Anton Nekrutenko

April 17, 2015

Transcript

  1. Building data analysis
    ecosystem in life sciences
    with Galaxy
    @galaxyproject / #usegalaxy
    http://www.galaxyproject.org

  2. A continuing crisis in genomics research:
    reproducibility

  3. What is reproducibility?
    (for computational analyses)
    Reproducibility is not provenance, reusability/generalizability,
    or correctness
    Reproducibility means that an analysis is
    described/captured in sufficient detail that it can
    be precisely reproduced (given the data)
    Yet most published analyses are not reproducible
    (see e.g. Ioannidis et al. 2009: 6/18 microarray experiments reproducible;
    Nekrutenko and Taylor 2012: 7/50 resequencing experiments reproducible)
    Missing software, versions, parameters, data…
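
The missing details listed on this slide (software, versions, parameters, data) can be captured mechanically. A minimal sketch, assuming a hypothetical `provenance_record` helper with made-up field names, tool name, and parameters; it is not how Galaxy stores its own history metadata, only an illustration of what a sufficient record might contain:

```python
import hashlib
import json


def provenance_record(tool, version, params, data):
    """Return a JSON string describing one analysis step.

    All field names here are illustrative assumptions.
    """
    return json.dumps(
        {
            "tool": tool,
            "version": version,
            "parameters": params,
            # A checksum pins down exactly which input bytes were analyzed.
            "input_sha256": hashlib.sha256(data).hexdigest(),
        },
        sort_keys=True,
    )


# Hypothetical tool name, version, and parameters, for illustration only.
record = provenance_record(
    tool="example_mapper",
    version="0.5.9",
    params={"n": 3, "q": 15},
    data=b"ACGTACGT\n",
)
print(record)
```

Sorting the keys keeps the record byte-identical across runs, so two records can be compared directly.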

  4. Reproducibility ≈ Engine efficiency
    Schwarz 2015 (DOI: 10.1126/science.aaa3276)

  5. Reproducibility Project: Cancer Biology
    Independently replicating 50 “high-impact” cancer
    studies from 2010-2012
    (https://osf.io/e81xl/wiki/home/)

  6. Vasilevsky, Nicole; Kavanagh, David J; Deusen, Amy Van; Haendel, Melissa; Iorns, Elizabeth (2014):
    Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare.
    http://dx.doi.org/10.6084/m9.figshare.987130
    32/127 tools
    6/41 papers

  8. #METHODSMATTER
    [Figure 1: Frequency fluctuation for site 8992. x-axis ticks: 5.2 through 6.1;
    y-axis range: 0.480 to 0.510; legend entries: Default; -n 3 -q 15]

  9. Example:
    A tale of two Science papers

  10. Paper 1

  11. All you need for reproducing is here
    (Fig. 2)

  12. Paper 2

  17. Genomic signatures to guide the use of
    chemotherapeutics
    Anil Potti1,2, Holly K Dressman1,3, Andrea Bild1,3, Richard F Riedel1,2, Gina Chan4, Robyn Sayer4,
    Janiel Cragun4, Hope Cottrill4, Michael J Kelley2, Rebecca Petersen5, David Harpole5, Jeffrey Marks5,
    Andrew Berchuck1,6, Geoffrey S Ginsburg1,2, Phillip Febbo1–3, Johnathan Lancaster4 &
    Joseph R Nevins1–3
    Using in vitro drug sensitivity data coupled with Affymetrix microarray data, we developed gene expression signatures that predict
    sensitivity to individual chemotherapeutic drugs. Each signature was validated with response data from an independent set of cell
    line studies. We further show that many of these signatures can accurately predict clinical response in individuals treated with
    these drugs. Notably, signatures developed to predict response to individual agents, when combined, could also predict response
    to multidrug regimens. Finally, we integrated the chemotherapy response signatures with signatures of oncogenic pathway
    deregulation to identify new therapeutic strategies that make use of all available drugs. The development of gene expression
    profiles that can predict response to commonly used cytotoxic agents provides opportunities to better use these drugs, including
    using them in combination with existing targeted therapies.
    Numerous advances have been achieved in the development, selection and application of chemotherapeutic agents, sometimes
    with remarkable clinical successes, as in the case of treatment for lymphomas or platinum-based therapy for testicular
    cancers1. In addition, in several instances, combination chemotherapy in the postoperative (adjuvant) setting has been
    curative. However, most people with advanced solid tumors will relapse and die of their disease. Moreover, administration
    of ineffective chemotherapy increases the probability of side effects, particularly those from cytotoxic agents, and of a
    consequent decrease in quality of life1,2.
    Recent work has demonstrated the value in using biomarkers to select individuals for various targeted therapeutics,
    including tamoxifen, trastuzumab and imatinib mesylate. In contrast, equivalent tools to select those most likely to
    respond to the commonly used chemotherapeutic drugs are lacking3.
    With the goal of developing genomic predictors of chemotherapy sensitivity that could direct the use of cytotoxic agents
    to those most likely to respond, we combined in vitro drug response data, together with microarray gene expression data,
    to develop models that could potentially predict responses to various cytotoxic chemotherapeutic drugs4. We now show that
    these signatures can predict clinical or pathologic response to the corresponding drugs, including combinations of drugs.
    We further use the ability to predict deregulated oncogenic signaling pathways in tumors to develop a strategy that
    identifies opportunities for combining chemotherapeutic drugs with targeted therapeutic drugs in a way that best matches
    the characteristics of the individual.
    RESULTS
    A gene expression–based predictor of sensitivity to docetaxel
    To develop predictors of cytotoxic chemotherapeutic drug response, we used an approach similar to previous work analyzing
    the NCI-60 panel4 from the US National Cancer Institute (NCI). We first identified cell lines that were most resistant or
    sensitive to docetaxel (Fig. 1a,b) and then genes whose expression correlated most highly with drug sensitivity, and used
    Bayesian binary regression analysis to develop a model that differentiates a pattern of docetaxel sensitivity from that of
    resistance. A gene expression signature consisting of 50 genes was identified that classified cell lines on the basis of
    docetaxel sensitivity (Fig. 1b, right).
    In addition to leave-one-out cross-validation, we used an independent dataset derived from docetaxel sensitivity assays in
    a series of 30 lung and ovarian cancer cell lines for further validation. The significant correlation (P < 0.01, log-rank
    test) between the predicted probability of sensitivity to docetaxel (in both lung and ovarian cell lines) (Fig. 1c, left)
    and the respective 50% inhibitory concentration (IC50) for docetaxel confirmed the capacity of the docetaxel predictor to
    predict sensitivity to the drug in cancer cell

  18. The importance of being reproducible
    Starting in 2006, Potti published papers describing
    algorithms that take gene-expression data from a
    cancer cell and predict whether the cancer will be
    sensitive to a particular therapy
    Duke began three clinical trials based on the
    technology enrolling 110 patients

  19. The importance of being reproducible
    However, Keith Baggerly and Kevin Coombes
    demonstrate that the findings cannot be replicated
    Long and difficult fight to get this acknowledged,
    followed by a series of investigations
    So far, ten major paper retractions, all trials
    cancelled, two lawsuits ongoing…

  20. The importance of being reproducible
    NCI investigates, demands the software for the
    method be provided
    Not only could they not replicate the results, the
    software produced substantially different
    predictions when run again on the same data!
    Some scores changed from 5% to 95%,
    classifications changed ~25% of the time!
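
One common way for software to give different answers on the same data is an unseeded random number generator. A minimal sketch, assuming a toy stochastic predictor (`predict_sensitivity` is hypothetical and unrelated to the software NCI examined), of how fixing the seed makes reruns comparable:

```python
import random


def predict_sensitivity(expression, seed=None):
    """Toy stochastic 'predictor': averages a random subsample of the input.

    With seed=None the subsample, and so the score, can differ between runs.
    """
    rng = random.Random(seed)
    subsample = rng.sample(expression, k=3)
    return sum(subsample) / len(subsample)


# Made-up expression values, for illustration only.
expression = [0.05, 0.10, 0.50, 0.90, 0.95, 0.20, 0.70]

# Same data, same seed: the two runs agree exactly.
first = predict_sensitivity(expression, seed=42)
second = predict_sensitivity(expression, seed=42)
print(first == second)
```

Recording the seed alongside the other parameters is what allows a reviewer to rerun the analysis and compare scores directly.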

  21. How does this even pass peer review?
    DON’T TRUST BLACK BOXES!

  22. Is reproducibility achievable?

  23. To answer this question we need to understand
    causes of the problem

  24. Who are we dealing with?
    Users Developers HPC

  25. Users’ troubles:
    - Data logistics
    - HPC
    - Poor knowledge of existing tools
    - Inability to develop new tools
    - Lack of transparency and reproducibility

  26. Developers’ grief:
    - Limited tool exposure
    - Parameter picking troubles
    - Data format nightmare
    - High profile publications

  27. HPC providers’ challenges:
    - Lack of HPC utilization skills
    - Software is not optimized
    - HPC is heterogeneous

  28. user
    HPC
    dev

  29. user
    HPC
    dev

  30. user
    HPC
    dev

  31. user
    HPC
    dev
    Galaxy

  32. Galaxy: accessible analysis system

  33. A free (for everyone) web service integrating a
    wealth of tools, compute resources, terabytes of
    reference data and permanent storage
    Open source software that makes integrating your
    own tools and data and customizing for your own
    site simple
    An open extensible platform for sharing tools,
    datatypes, workflows, ...

  34. Galaxy’s ideological goals:
    How best can data intensive methods be
    accessible to scientists?
    How best to facilitate transparent
    communication of computational analyses?
    How best to ensure that analyses are
    reproducible?

  35. Galaxy’s practical goals:
    How to arm researchers with access to powerful
    compute and latest tools
    How to build a community of tool developers
    How to run Galaxy on any HPC

  36. Galaxy’s goals (an xkcd version)
    Galaxy no Galaxy

  37. Describe analysis tool
    behavior abstractly

  38. Describe analysis tool
    behavior abstractly
    Analysis environment automatically
    and transparently tracks details

  39. Describe analysis tool
    behavior abstractly
    Analysis environment automatically
    and transparently tracks details
    Workflow system for complex analysis,
    constructed explicitly or automatically

  40. Describe analysis tool
    behavior abstractly
    Analysis environment automatically
    and transparently tracks details
    Workflow system for complex analysis,
    constructed explicitly or automatically
    Pervasive sharing, and publication
    of documents with integrated analysis

  41. Visualization and visual analytics

  42. Ways to use Galaxy
    The public web service at http://usegalaxy.org
    Install locally with many compute environments
    Deploy on a cloud using CloudMan
    Atmosphere

  43. Galaxy in a world of increasingly
    complex analyses

  44. user
    HPC
    dev
    Galaxy

  45. user
    HPC
    dev

  46. We are in the age of multiple datasets

  47. Galaxy’s user interface is designed to be
    simple and intuitive for users without
    informatics expertise
    Can we scale this user interface to the
    analysis of hundreds of samples while
    maintaining interface idioms and usability?

  48. Users typically use many histories when working with many samples;
    New multiple history view makes working with 100s of histories easy

  49. A not-so-new feature: mapping over multiple datasets
    However, this breaks down for complex combinations of datasets
    (e.g. many sets of paired end reads, in replicates)

  50. Dataset collections
    complex combinations of datasets that can
    be treated as a single unit

  51. Dataset Collections
    Organize user data
    Individual Datasets Collection Collection Contents

  52. Operations over collections
    For “list” collections, existing tools can
    automatically be mapped across the entire
    collection
    Existing tools that support multiple inputs
    and one output act as reducers
    Many existing tools just work; but
    “structured” collections like “paired” need
    explicit support in tools
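
The three cases on this slide (mapping over a "list" collection, multi-input tools acting as reducers, and "paired" collections needing explicit support) can be sketched in plain Python. This is an illustration of the idea only, assuming stand-in functions (`count_reads`, `merge_counts`, `count_pair`) and toy data; it does not use Galaxy's actual collection classes:

```python
from functools import reduce


def count_reads(dataset):
    """Stand-in for an existing single-input tool: counts reads in a dataset."""
    return len(dataset)


def merge_counts(counts):
    """Stand-in for a multi-input, single-output tool acting as a reducer."""
    return reduce(lambda total, n: total + n, counts, 0)


# A "list" collection: an ordered set of datasets treated as one unit.
list_collection = [["r1", "r2"], ["r3"], ["r4", "r5", "r6"]]

# Map: the single-input tool runs once per element of the collection.
mapped = [count_reads(dataset) for dataset in list_collection]

# Reduce: the multi-input tool collapses the mapped outputs into one.
total = merge_counts(mapped)


def count_pair(pair):
    """A tool with explicit "paired" support must know about both members."""
    return count_reads(pair["forward"]) + count_reads(pair["reverse"])


paired = {"forward": ["f1", "f2"], "reverse": ["v1", "v2"]}
print(mapped, total, count_pair(paired))
```

The unmodified `count_reads` works on the list collection, while the paired case needs a function written with the forward/reverse structure in mind.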

  53. Map/reduce in workflows
    More Powerful Workflows
    Arbitrary # of Inputs (...
    paired).
    Run applications in parallel (one per input).
    Merged output for
    subsequent processing.

  54. Enhanced Tuxedo Suite Workflow
    RNA-Seq workflow built using
    the Tuxedo suite.

  55. Dataset Collections
    Extremely flexible for grouping collections of
    complex datasets, can be nested to arbitrary
    depth, structure is preserved through
    mapping
    More complex reductions, other collection
    operations in progress
    Towards 10,000 samples: workflow
    scheduling improvements (backgrounding,
    decision points, streaming)
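
The statement that nesting can be arbitrarily deep and that structure is preserved through mapping can be made concrete with a short recursive sketch. It assumes a nested collection modelled as plain dictionaries and a hypothetical `map_collection` helper; the sample and replicate names are made up:

```python
def map_collection(tool, collection):
    """Apply `tool` to every leaf dataset, keeping the nesting intact."""
    if isinstance(collection, dict):
        return {
            name: map_collection(tool, item)
            for name, item in collection.items()
        }
    return tool(collection)


# Hypothetical nested collection: samples -> replicates -> paired reads.
samples = {
    "sample_a": {
        "rep1": {"forward": "ACGT", "reverse": "TTGA"},
        "rep2": {"forward": "GGCA", "reverse": "CCAT"},
    },
    "sample_b": {
        "rep1": {"forward": "AT", "reverse": "GC"},
    },
}

# The output has the same sample/replicate/pair layout as the input.
lengths = map_collection(len, samples)
print(lengths)
```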

  56. An analysis is really a workflow

  57. As analysis needs become increasingly
    complex, typical users have moved from
    running individual tools to primarily
    running workflows

  58. For research use, users need to be able
    to construct and modify workflows, not
    just run existing best practice pipelines
    The Galaxy Workflow editor supports
    this use case well, providing ways for
    users to easily construct and modify
    workflows

  59. (Goecks et al. Cancer Medicine, 2015)

  60. (Goecks et al. Cancer Medicine, 2015)

  61. However, for reproducibility, we want to
    be able to ensure that a workflow can
    be exactly rerun, even in a different
    compute environment, and get exactly
    the same results
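
"Exactly the same results" can be checked byte for byte by comparing checksums of each output. A minimal sketch, assuming hypothetical `digest` and `same_results` helpers and made-up output contents; how the outputs are collected from each compute environment is left out:

```python
import hashlib


def digest(output_bytes):
    """SHA-256 hex digest of one output file's contents."""
    return hashlib.sha256(output_bytes).hexdigest()


def same_results(run_a, run_b):
    """True when both runs produced the same named outputs, byte for byte."""
    digests_a = {name: digest(data) for name, data in run_a.items()}
    digests_b = {name: digest(data) for name, data in run_b.items()}
    return digests_a == digests_b


# Hypothetical outputs of one workflow in different compute environments.
local_run = {"calls.txt": b"site\tallele\n1\tA\n"}
cloud_run = {"calls.txt": b"site\tallele\n1\tA\n"}
changed_run = {"calls.txt": b"site\tallele\n1\tG\n"}

print(same_results(local_run, cloud_run))
print(same_results(local_run, changed_run))
```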

  62. 1 2 3 ∞
    http://usegalaxy.org
    http://usegalaxy.org/community
    ...
    Galaxies on
    private clouds
    Galaxies on
    public clouds
    ...
    private Galaxy installations
    Private Tool Sheds
    Galaxy Tool
    Shed

  63. Fostering the tool developer community

  64. Galaxy has highly expressive tool
    definition syntax

  65. Conditionals

  66. Conditionals

  67. Conditionals

  68. Repeats

  69. Repeats

  70. Dynamic options

  71. And many others…

  72. The Galaxy ToolShed: Sharing tools, workflows,
    and their dependencies

  73. Repositories are owned by the
    contributor, can contain tools,
    workflows, etc.
    Backed by version control; a complete
    version history is retained for everything
    that passes through the ToolShed
    Galaxy instance admins can install tools
    directly from the ToolShed using only a
    web UI
    Support for recipes for installing the
    underlying software that tools depend
    on (also versioned)

  81. ToolShed Challenges
    Good for deployment and archiving,
    difficult for development

  82. New command line tools to address concerns from
    tool developers
    Tool Development Planemo
    Command-line tools to aid development.
    ○ Test tools quickly without worrying about configuration files.
    ○ Check tools for common bugs and best practices.
    ○ Optimized publishing to the ToolShed.
    ○ Testbed for new dependency management - Homebrew and Homebrew-science

  83. Move to git[hub] centric development workflow
    Within three weeks, four major community contributions to core tools

  84. Tool citations, credit and incentivization
    Embed DOIs in the tool configuration; Galaxy resolves them and provides a list of
    citations, with links, which can be exported for reference managers
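
The idea of embedding DOIs in a tool's configuration and turning them into links can be sketched as follows. This assumes a tool configuration that lists DOIs in `<citation type="doi">` elements; the tool id and name are hypothetical, the DOI is the one cited on slide 4, and `doi_links` is an illustrative helper rather than Galaxy's own resolver:

```python
import xml.etree.ElementTree as ET

# Hypothetical tool configuration fragment; the DOI is the one cited on slide 4.
TOOL_XML = """
<tool id="example_tool" name="Example" version="0.1.0">
    <citations>
        <citation type="doi">10.1126/science.aaa3276</citation>
    </citations>
</tool>
"""


def doi_links(tool_xml):
    """Return a resolver URL for every DOI citation in a tool configuration."""
    root = ET.fromstring(tool_xml.strip())
    return [
        "https://doi.org/" + citation.text.strip()
        for citation in root.iter("citation")
        if citation.get("type") == "doi"
    ]


links = doi_links(TOOL_XML)
print(links)
```

Because the citation lives next to the tool definition, every analysis that uses the tool can report the reference without the user looking it up.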

  86. ToolShed Challenges
    Complex dependency definitions,
    packaging dependencies is a rabbit hole

  87. Virtualize everything: control the host
    environment

  90. POSTER PRESENTATION Open Access
    CLIA-certified next-generation sequencing
    analysis in the cloud
    Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4,
    Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1
    From Beyond the Genome 2012
    Boston, MA, USA. 27-29 September 2012
    The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to
    make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data
    presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the
    analysis of high volume of NGS data; and second, the stringent requirements of accuracy and patient
    information data governance in a clinical setting.
    Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory
    Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on
    researchers in working with clinical data. Combining a cloud-computing environment with CLIA certification
    presents its own challenges due to the level of control users have over the cloud environment and CLIA’s
    stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and
    validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a
    Galaxy [http://www.usegalaxy.org] environment at Amazon with
    patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in
    low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The
    NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked
    Galaxy virtual machine on Amazon’s Elastic Compute Cloud (EC2). This project illustrates our ability to carry
    out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation
    of NGS analysis involving clinical diagnosis.
    Author details
    1Research Informatics Support System, Minnesota Supercomputing Institute,
    University of Minnesota, Minneapolis, MN 55455, USA. 2Division of Genetics
    and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA.
    3Molecular Diagnostics Laboratory, University of Minnesota Medical Center-
    Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4BioMedical
    Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA.
    5Department of Laboratory Medicine and Pathology, University of Minnesota,
    Minneapolis, MN 55455, USA.
    Published: 1 October 2012
    Zhang et al. BMC Proceedings 2012, 6(Suppl 6):P54
    http://www.biomedcentral.com/1753-6561/6/S6/P54
    CLIA-certified Galaxy pipelines using virtual machines
    (Minnesota Supercomputing Institute)

  91. Share a snapshot of this instance
    Current support for archiving instances with CloudMan
    Plan to support archiving analyses both from custom
    Galaxy instances and on Galaxy main

  92. New approaches for dependency
    management
    Alternative approach for installing
    dependencies: Homebrew/Linuxbrew
    How can we run community contributed tools
    safely and efficiently?
    Support for defining dependencies as Docker
    containers

  93. What is Docker?
    [Diagram: traditional virtual machine vs. Docker]
    Kernel is shared between containers; achieves the isolation and
    management benefits of VMs but much more lightweight and efficient

  94. ToolShed and Docker
    Tools can assert their dependencies are provided by
    a Docker container
    Potentially tool execution is more secure
    due to isolation
    Easier for tool developers to package dependencies
    Much easier for end-users to get dependencies
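
What it might look like for a tool to run inside a container can be sketched by building the `docker run` command line for a job. This is an assumption-laden illustration, not Galaxy's job runner: `docker_command`, the image name, the tool invocation, and the `/job` mount point are all hypothetical, and the command is only constructed, not executed:

```python
def docker_command(image, tool_command, job_dir):
    """Build an argument list that runs `tool_command` inside `image`.

    The job directory is mounted into the container so the tool can read
    inputs and write outputs there.
    """
    return [
        "docker", "run",
        "--rm",                   # remove the container when the job ends
        "-v", f"{job_dir}:/job",  # bind-mount the job directory
        "-w", "/job",             # run the tool from the mounted directory
        image,
    ] + list(tool_command)


# Hypothetical image name and tool invocation.
cmd = docker_command(
    image="example/fastqc:0.11",
    tool_command=["fastqc", "reads.fastq"],
    job_dir="/tmp/job_1",
)
print(" ".join(cmd))
```

Only the mounted job directory is shared with the container, which is the isolation the slide refers to.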

  95. What if you need a new, ad hoc analysis
    within Galaxy?

  96. Interactive programming environments

  97. For researchers without informatics
    expertise, the web UI and existing tools are
    often sufficient
    For informaticians, Galaxy provides an
    extensive API and wrappers (e.g. BioBlend)
    But many users can do some
    programming and would like the benefits of
    Galaxy with the flexibility to do some
    scripting

  98. Docker enables interactive
    environments
    Framework allows spinning up secure*
    isolated environments that can interact
    with the Galaxy history through Galaxy’s
    API
    Initial implementation supporting IPython
    Notebook

  106. Next steps
    Support for Jupyter (both Python and Julia)
    and RStudio environments
    Interactive programming environments as
    first class citizens: full provenance tracking,
    establish inputs and outputs, be used in
    workflows, etc.
    Databases as first class citizens, e.g. GEMINI
    query interface as a reusable tool

  107. Visualization as a tool to make sense of
    complex data

  108. Towards a pluggable interactive
    visualization framework

  110. Modifying Cufflinks parameters and locally reassembling

  111. PhyloViz from Google Summer of Code student Tomithy Too

  112. Circster, interactive Circos-style plots

  113. Visualization framework: Charts plugin

  114. Visualization framework: Charts plugin

  115. Enables users to quickly visualize tabular data.
    Screencast

  116. Stuff that’s coming
    Backend workflow engine improvements to
    support the much larger analyses that can now
    be constructed in the UI (ongoing)
    Increasing complexity and control over how
    datasets are used
    Federation between Galaxy instances, support
    for transparently accessing data from other APIs

  117. Using Galaxy main to drive scalability
    improvements…

  118. PSC, Pittsburgh
    Blacklight: 4,096 cores, 32 TB memory, dedicated resources
    SDSC, San Diego
    Trestles: 10,368 cores, 20.7 TB memory, shared resources
    TACC, Austin
    Galaxy Cluster: 256 cores, 2 TB memory
    Rodeo: 128 cores, 1 TB memory
    Corral/Stockyard: 20 PB disk
    Stampede: 462,462 cores, 205 TB memory

  119. funded by the National Science Foundation
    Award #ACI-1445604

  120. A user-friendly cloud environment designed to give
    researchers access to interactive computing and
    data analysis resources on demand; researchers can
    create their own “private computing system” within
    Jetstream
    Two widely used biology platforms will be
    supported - Galaxy and iPlant
    Allow users to preserve VMs with Digital Object
    Identifiers (DOIs), which enables sharing of results,
    reproducibility of analyses, and new analyses of
    published research data.

  121. Summary
    Galaxy is an (obsessively) open framework for making
    data analysis accessible and reproducible
    Nearly everything in Galaxy is “pluggable”, allowing it
    to be customized for myriad purposes
    New UI approaches are enabling more complex
    analysis of much larger numbers of datasets without
    sacrificing usability
    By supporting and leveraging tool developers, the
    Galaxy community can collectively keep up with rapid
    changes in available tools

  122. The “Core” Galaxy Team
    Engineering / Support and outreach / Custodians
    Dan Blankenberg, Nate Coraor, Dannon Baker, Jeremy Goecks, Anton Nekrutenko, James Taylor,
    Dave Clements, Jennifer Jackson, Carl Eberhard, Dave Bouvier, John Chilton, Sam Guerler,
    Martin Čech, Enis Afgan, Nitesh Turaga
    Supported by the NHGRI (HG005542, HG004909, HG005133, HG006620), NSF (DBI-0850103),
    Penn State University, Emory University, and the Pennsylvania Department of Public Health

  123. Extended team and other contributors…
    Björn Grüning (Uni Freiburg)
    Peter Cock (TJHI)
    Kyle Ellrott (UCSC)
    Eric Rasche (CPT)
    Nicola Soranzo (TGAC)
    Brad Chapman (HSPH)
    Nuwan Goonasekera (VeRSI)
    Yousef Kowsar (VLSCI)
    And many others who have contributed to the
    main Galaxy code, tools to the ToolShed,
    participated in discussions, attended the Galaxy
    conferences, …

  124. Galaxy is a community!
    Join us on IRC, mailing lists, Galaxy Biostar
    Contribute code on Bitbucket, GitHub, or the ToolShed
    Join us for a Hackathon or our annual conference
    Fifth annual Galaxy
    Community Conference
    Hackathon, training day,
    and two days of talks
