
Molecular Medicine Tri-con 2015

James Taylor
February 18, 2015

Presentation on scaling Galaxy, particularly from the UI perspective, at the Molecular Medicine Tri-con 2015 session "Large-scale genomics data transfer, analysis, and storage".

Transcript

  1. What is reproducibility? (for computational analyses) Reproducibility is not provenance, reusability/generalizability, or correctness. Reproducibility means that an analysis is described/captured in sufficient detail that it can be precisely reproduced (given the data). Yet most published analyses are not reproducible (see e.g. Ioannidis et al. 2009 — 6/18 microarray experiments reproducible; Nekrutenko and Taylor 2012, 7/50 resequencing experiments reproducible). Missing software, versions, parameters, data…
  2. Vasilevsky, Nicole; Kavanagh, David J; Van Deusen, Amy; Haendel, Melissa; Iorns, Elizabeth (2014): Unique Identification of research resources in studies in Reproducibility Project: Cancer Biology. figshare. http://dx.doi.org/10.6084/m9.figshare.987130 — 32/127 tools, 6/41 papers
  3. #METHODSMATTER. Figure: frequency fluctuation for site 8992 across software versions (5.2 through 6.1), default settings vs. -n 3 -q 15 (Nekrutenko and Taylor, Nature Reviews Genetics, 2012)
  4. A spectrum of solutions: analysis environments (Galaxy, GenePattern, Mobyle, …); workflow systems (Taverna, Pegasus, VisTrails, …); notebook style (IPython Notebook, …); literate programming style (Sweave/knitr, …); system-level provenance capture (ReproZip, …); complete environment capture (VMs, containers, …)
  5. A spectrum of solutions: analysis environments (Galaxy, GenePattern, Mobyle, …); workflow systems (Taverna, Pegasus, VisTrails, …); notebook style (IPython Notebook, …); literate programming style (Sweave/knitr, …); system-level provenance capture (ReproZip, …); complete environment capture (VMs, containers, …)
  6. Galaxy's motivating questions: How best can data-intensive methods be made accessible to scientists? How best to facilitate transparent communication of computational analyses? How best to ensure that analyses are reproducible?
  7. A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data, and permanent storage. Open source software that makes integrating your own tools and data, and customizing for your own site, simple. An open, extensible platform for sharing tools, datatypes, workflows, ...
  8. Describe analysis tool behavior abstractly. Analysis environment automatically and transparently tracks details. Workflow system for complex analysis, constructed explicitly or automatically.
  9. Describe analysis tool behavior abstractly. Analysis environment automatically and transparently tracks details. Workflow system for complex analysis, constructed explicitly or automatically. Pervasive sharing and publication of documents with integrated analyses.
  10. Ways to use Galaxy: the public web service at http://usegalaxy.org; install locally with many compute environments; deploy on a cloud using CloudMan or Atmosphere.
  11. 1) Shift from tools to workflows. As analysis needs become increasingly complex, typical users have moved from running individual tools to primarily running workflows.
  12. For research use, users need to be able to construct and modify workflows, not just run existing best-practice pipelines. The Galaxy workflow editor supports this use case well, providing ways for users to easily construct and modify workflows.
  13. …ensures that the pipelines can evolve and incorporate new tools as they become available rather than requiring the development of new pipelines. The exome and transcriptome analysis pipelines require vastly more time and computing resources than the variant analysis pipeline: the exome/transcriptome processing pipelines require about a day to complete on a small computing cluster, while the integrated variant analysis pipeline can be run in less than an hour. Also, there are established protocols for exome and transcriptome processing but less so for variant analysis. Hence, by splitting the pipelines up as we have and putting the pipelines in Galaxy, it is simple and fast to experiment with different settings in the variant analysis pipeline and find settings that are most useful for a particular set of samples. Results. Validation using cell line data: To validate our pipelines, we analyzed targeted exome and whole transcriptome sequencing data from three well-characterized pancreatic cancer cell lines: MIA PaCa2 (MP), HPAC, and PANC-1. Exonic regions of 577 genes that are commonly included in cancer gene panels were sequenced. All three cell lines are included in the Cancer Cell Line Encyclopedia (CCLE) [15]; the CCLE includes a mutational profile for known oncogenes and drug response information for each cell line. The goal of this analysis is to use our pipelines to process the cell line … Figure 2A shows an interactive Galaxy Circos plot of data generated from analysis of the MIA PaCa2 cell line. Figure 2 caption: Galaxy Circos plot showing data produced from (A, top) exome and transcriptome analysis of the MIA PaCa2 cell line and (B, bottom) transcriptome analysis of a pancreatic adenocarcinoma tumor. Starting at the innermost track, the data are: (i) mapped read coverage; (ii) mapped read coverage after PCR duplicates removed; (iii) called variants; (iv) rare and deleterious variants; (v) rare, deleterious, and druggable variants; (vi) rare and deleterious variants … (Goecks et al., Cancer Medicine, 2015)
  14. However, for reproducibility, we want to be able to ensure

    that a workflow can be exactly rerun, even in a different compute environment, and get exactly the same results
  15. Diagram of the Galaxy ecosystem: the public service at http://usegalaxy.org, community resources at http://usegalaxy.org/community, Galaxies on private and public clouds, private Galaxy installations, private Tool Sheds, and the Galaxy Tool Shed. Greg von Kuster
  16. Repositories are owned by the contributor and can contain tools, workflows, etc. Backed by version control, a complete version history is retained for everything that passes through the Tool Shed. Galaxy instance admins can install tools directly from the Tool Shed using only a web UI (a scripted alternative is sketched below). Support for recipes for installing the underlying software that tools depend on (also versioned).
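
    A minimal, hedged sketch of the scripted alternative to the web-UI install, using BioBlend's Tool Shed client against the Galaxy API. The Galaxy URL, API key, repository name, owner, and changeset revision below are placeholders, not a specific recommended tool, and exact parameters may vary by BioBlend version.

        from bioblend.galaxy import GalaxyInstance

        # Connect to a Galaxy instance where you have admin rights (placeholder URL/key).
        gi = GalaxyInstance(url="https://my-galaxy.example.org", key="ADMIN_API_KEY")

        # Install a specific revision of a Tool Shed repository, including the
        # versioned recipes for its underlying software dependencies.
        gi.toolshed.install_repository_revision(
            tool_shed_url="https://toolshed.g2.bx.psu.edu",
            name="example_tool",                 # hypothetical repository name
            owner="example_owner",               # hypothetical repository owner
            changeset_revision="0123456789ab",   # placeholder changeset revision
            install_tool_dependencies=True,
            install_repository_dependencies=True,
            new_tool_panel_section_label="Example Tools",
        )
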
  17. Planemo: new command-line tools to address concerns from tool developers (John Chilton). Command-line tools to aid development: ◦ Test tools quickly without worrying about configuration files. ◦ Check tools for common bugs and best practices. ◦ Optimized publishing to the ToolShed. ◦ Testbed for new dependency management: Homebrew and Homebrew-Science.
  18. Move to a git[hub]-centric development workflow. Within three weeks, four major community contributions to core tools.
  19. Tool citations, credit, and incentivization: embed DOIs in the tool configuration; Galaxy resolves them and provides a list of citations, with links, which can be exported to reference managers.
  20. POSTER PRESENTATION (Open Access): CLIA-certified next-generation sequencing analysis in the cloud. Ying Zhang1*, Jesse Erdmann1, John Chilton1, Getiria Onsongo1, Matthew Bower2,3, Kenny Beckman4, Bharat Thyagarajan5, Kevin Silverstein1, Anne-Francoise Lamblin1, the Whole Galaxy Team at MSI1. From Beyond the Genome 2012, Boston, MA, USA, 27-29 September 2012. The development of next-generation sequencing (NGS) technology opens new avenues for clinical researchers to make discoveries, especially in the area of clinical diagnostics. However, combining NGS and clinical data presents two challenges: first, the accessibility to clinicians of sufficient computing power needed for the analysis of high volumes of NGS data; and second, the stringent requirements of accuracy and patient information data governance in a clinical setting. Cloud computing is a natural fit for addressing the computing power requirements, while Clinical Laboratory Improvement Amendments (CLIA) certification provides a baseline standard for meeting the demands on researchers working with clinical data. Combining a cloud-computing environment with CLIA certification presents its own challenges due to the level of control users have over the cloud environment and CLIA's stability requirements. We have bridged this gap by creating a locked virtual machine with a pre-defined and validated set of workflows. This virtual machine is created using our Galaxy VM launcher tool to instantiate a Galaxy [http://www.usegalaxy.org] environment at Amazon … patient samples were analyzed using customized hybrid-capture bait libraries to boost read coverage in low-coverage regions, followed by targeted enrichment sequencing at the BioMedical Genomics Center. The NGS data is imported to a tested Galaxy single nucleotide polymorphism (SNP) detection workflow in a locked Galaxy virtual machine on Amazon's Elastic Compute Cloud (EC2). This project illustrates our ability to carry out CLIA-certified NGS analysis in the cloud, and will provide valuable guidance in any future implementation of NGS analysis involving clinical diagnosis. Author details: 1 Research Informatics Support System, Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, MN 55455, USA. 2 Division of Genetics and Metabolism, University of Minnesota, Minneapolis, MN 55455, USA. 3 Molecular Diagnostics Laboratory, University of Minnesota Medical Center-Fairview, University of Minnesota, Minneapolis, MN 55455, USA. 4 BioMedical Genomics Center, University of Minnesota, Minneapolis, MN 55455, USA. 5 Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA. Published: 1 October 2012. Zhang et al., BMC Proceedings 2012, 6(Suppl 6):P54, http://www.biomedcentral.com/1753-6561/6/S6/P54. CLIA-certified Galaxy pipelines using virtual machines (Minnesota Supercomputing Institute).
  21. A user-friendly cloud environment designed to give researchers access to interactive computing and data analysis resources on demand; researchers can create their own "private computing system" within Jetstream. Two widely used biology platforms will be supported: Galaxy and iPlant. Allows users to preserve VMs with Digital Object Identifiers (DOIs), which enables sharing of results, reproducibility of analyses, and new analyses of published research data.
  22. Share a snapshot of this instance. Current support for archiving instances with CloudMan. Plan to support archiving analyses both from custom Galaxy instances and on Galaxy main. Enis Afgan
  23. New approaches for dependency management. Alternative approach for installing dependencies: Homebrew/Linuxbrew. How can we run community-contributed tools safely and efficiently? Support for defining dependencies as Docker containers.
  24. What is Docker? Diagram comparing a traditional virtual machine with Docker: the kernel is shared between containers, achieving the isolation and management benefits of VMs while being much more lightweight and efficient.
  25. Reproducibility advantages of Docker: a standard recipe approach for creating Docker containers, called a Dockerfile. Where VMs are typically a black box, the Dockerfile allows inspection of exactly how the container was created, leading to greater transparency.
  26. ToolShed and Docker: tools can assert that their dependencies are provided by a Docker container. Tool execution is potentially more secure due to isolation. Easier for tool developers to package dependencies; much easier for end users to get dependencies.
  27. For researchers without informatics expertise, the web UI and existing tools are often sufficient. For informaticians, Galaxy provides an extensive API and wrappers (e.g. BioBlend). But many users who can do some programming would like the benefits of Galaxy with the flexibility to do some scripting (see the sketch below).
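
    A minimal, hedged sketch of that scripting route, using BioBlend against the Galaxy API. The URL, API key, file name, and workflow input mapping are placeholders, and the exact call signatures depend on the BioBlend version.

        from bioblend.galaxy import GalaxyInstance

        # Placeholder URL and API key; any Galaxy server with the API enabled works.
        gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

        # Create a history and upload a local file into it (file name is hypothetical).
        history = gi.histories.create_history(name="scripted-analysis")
        upload = gi.tools.upload_file("reads.fastq", history["id"], file_type="fastqsanger")
        dataset_id = upload["outputs"][0]["id"]

        # Invoke an existing workflow on the uploaded dataset. The input mapping keys
        # are workflow step indices; inspect gi.workflows.show_workflow() for real ids.
        workflow_id = gi.workflows.get_workflows()[0]["id"]
        inputs = {"0": {"src": "hda", "id": dataset_id}}
        invocation = gi.workflows.invoke_workflow(workflow_id, inputs=inputs,
                                                  history_id=history["id"])
        print(invocation["id"])
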
  28. Docker enables interactive environments. Framework allows spinning up secure, isolated environments that can interact with the Galaxy history through Galaxy's API (see the sketch below). Initial implementation supporting IPython Notebook.
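
    For illustration, a hedged sketch of how code inside such a notebook could reach back into the launching Galaxy instance's history over the API. This uses BioBlend; the actual integration may expose its own helper functions, and the URL, API key, and indices are placeholders.

        from bioblend.galaxy import GalaxyInstance

        # Placeholder URL and API key for the Galaxy instance that launched the notebook.
        gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

        # Pick a history and list its datasets.
        history = gi.histories.get_histories()[0]
        datasets = gi.histories.show_history(history["id"], contents=True)

        # Download one dataset into the notebook's working directory for ad hoc analysis.
        gi.datasets.download_dataset(datasets[0]["id"], file_path=".",
                                     use_default_filename=True)
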
  29. Next steps: support for Jupyter (both Python and Julia) and RStudio environments. Interactive programming environments as first-class citizens: full provenance tracking, established inputs and outputs, usable in workflows, etc. Databases as first-class citizens, e.g. a GEMINI query interface as a reusable tool.
  30. 3) Galaxy users need to work not just with large datasets, but with large numbers of datasets.
  31. Galaxy's user interface is designed to be simple and intuitive for users without informatics expertise. Can we scale this user interface to the analysis of hundreds of samples while maintaining interface idioms and usability?
  32. Users typically use many histories when working with many samples; the new multiple-history view makes working with hundreds of histories easy. Carl Eberhard
  33. A not-so-new feature: mapping over multiple datasets. However, this breaks down for complex combinations of datasets (e.g. many sets of paired-end reads, in replicates).
  34. Operations over collections: for "list" collections, existing tools can automatically be mapped across the entire collection, and existing tools that accept multiple inputs and produce one output act as reducers. Many existing tools just work, but "structured" collections like "paired" need explicit support in tools. A conceptual sketch follows below.
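
    To make the map/reduce behavior concrete, here is a conceptual sketch in plain Python (not Galaxy's implementation); the tool functions and file names are hypothetical.

        # Conceptual sketch of how tools behave when applied to a "list" dataset collection.

        def run_single_input_tool(tool, collection):
            """A tool that takes one dataset is mapped element-wise:
            the result is a new collection with the same structure."""
            return [tool(dataset) for dataset in collection]

        def run_multi_input_tool(tool, collection):
            """A tool that accepts multiple inputs and emits one output
            acts as a reducer: the whole collection collapses to one dataset."""
            return tool(collection)

        # Hypothetical example: trim each FASTQ in a list, then merge the results.
        trimmed = run_single_input_tool(lambda fq: fq + ".trimmed", ["s1.fq", "s2.fq", "s3.fq"])
        merged = run_multi_input_tool(lambda files: "+".join(files), trimmed)
        print(trimmed)  # ['s1.fq.trimmed', 's2.fq.trimmed', 's3.fq.trimmed']
        print(merged)   # 's1.fq.trimmed+s2.fq.trimmed+s3.fq.trimmed'
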
  35. Map/reduce in workflows: more powerful workflows with an arbitrary number of inputs (… paired). Run applications in parallel (one per input). Merged output for subsequent processing. John Chilton
  36. Dataset collections are extremely flexible for grouping collections of complex datasets, can be nested to arbitrary depth, and their structure is preserved through mapping. More complex reductions and other collection operations are in progress. Towards 10,000 samples: workflow scheduling improvements (backgrounding, decision points, streaming).
  37. Pluggable visualization framework: similar to tools, new visualizations can be dropped into a Galaxy instance, typically a simple server-side template to bootstrap a client-side visualization. Framework for serving data sliced and aggregated in various ways. Adaptor for BioJS visualizations in progress. Linked visualizations on related data.
  38. Stuff that's coming: backend workflow engine improvements to support the much larger analyses that can now be constructed in the UI (ongoing); increasing complexity and control over how datasets are used; federation between Galaxy instances, with support for transparently accessing data from other APIs.
  39. Compute resources (Nate Coraor). TACC, Austin: Galaxy Cluster (256 cores, 2 TB memory), Rodeo (128 cores, 1 TB memory), Corral/Stockyard (20 PB disk), Stampede (462,462 cores, 205 TB memory). PSC, Pittsburgh: Blacklight (4,096 cores, 32 TB memory; dedicated resources). SDSC, San Diego: Trestles (10,368 cores, 20.7 TB memory; shared resources).
  40. Summary: Galaxy is an (obsessively) open framework for making data analysis accessible and reproducible. Nearly everything in Galaxy is "pluggable", allowing it to be customized for myriad purposes. New UI approaches are enabling more complex analysis of much larger numbers of datasets without sacrificing usability. By supporting and leveraging tool developers, the Galaxy community can collectively keep up with rapid changes in available tools.
  41. Galaxy is a community! Join us on IRC, the mailing lists, or Galaxy Biostar. Contribute code on Bitbucket, GitHub, or the ToolShed. Join us for a hackathon or our annual conference: the fifth annual Galaxy Community Conference, with a hackathon, a training day, and two days of talks.