gi2017: Simulation and analysis tools for single-cell RNA sequencing data

Luke Zappia
November 01, 2017

Single-cell RNA sequencing (scRNA-seq) is rapidly becoming a tool of choice for biologists who wish to investigate gene expression. In contrast to traditional bulk RNA-seq experiments, which measure expression averaged across millions of cells, single-cell experiments can be used to observe how genes are expressed in individual cells. Along with this dramatic increase in resolution comes an array of bioinformatics challenges: single-cell data is relatively sparse (for both biological and technical reasons), quality control is difficult, and it is unclear how to replicate measurements. Researchers have risen to these challenges, and there are currently more than 125 software tools available for analysing scRNA-seq data. We have catalogued these tools in the scRNA-tools database (www.scRNA-tools.org). Analysis of this database shows that methods are now available for a wide range of tasks, from pre-processing unique molecular identifiers to detecting allele-specific expression. However, the biggest areas of development have been clustering cells to identify cell types and ordering cells to understand dynamic processes. We also find that the R statistical programming language is the most popular platform for scRNA-seq analysis tools, followed by Python, and that the majority of tools have been described in peer-reviewed papers or preprints and are available under open-source software licenses.

With the ever-increasing number of analysis methods available, it is important to be able to assess and compare the performance, quality and limitations of each tool. This is often done, at least in part, by testing methods on simulated datasets where the true answers are known. Unfortunately, current scRNA-seq simulations are frequently poorly documented and not reproducible, and do not demonstrate similarity to real data or experimental designs. To address these concerns we have developed Splatter, a Bioconductor R package for reproducible simulation of scRNA-seq datasets (bioconductor.org/packages/splatter). Splatter is a simulation framework that currently includes four previously published simulation models and allows users to estimate parameters from real data in order to easily generate realistic synthetic scRNA-seq datasets. Here we discuss some of the challenges of simulating scRNA-seq data and present a comparison of the simulation methods available in Splatter. As part of Splatter we also introduce our own simulation model, Splat, capable of reproducing scRNA-seq datasets with multiple groups of cells, differentiation paths or batch effects.
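Several of the simulations listed in the slides are built on the negative binomial distribution. As a rough illustration of what such a simulation involves, here is a minimal base R sketch that draws counts from a gamma-Poisson mixture, which is equivalent to a negative binomial. It is not Splatter's implementation: the function name `simulate_nb_counts` and all default parameter values are assumptions chosen for illustration only.

```r
# Hypothetical helper, not part of Splatter: simulate a genes x cells count
# matrix from a gamma-Poisson (negative binomial) model using base R only.
simulate_nb_counts <- function(n_genes, n_cells, mean_shape = 0.6,
                               mean_rate = 0.3, dispersion = 0.1, seed = 1) {
  set.seed(seed)  # fixed seed so the simulation is reproducible

  # One underlying mean expression level per gene (assumed gamma-distributed).
  gene_means <- rgamma(n_genes, shape = mean_shape, rate = mean_rate)

  # Cell-specific means: gamma noise around each gene mean. With
  # shape = 1 / dispersion and scale = mean * dispersion, the expected value
  # of each draw equals the corresponding gene mean.
  n_total <- n_genes * n_cells
  cell_means <- rgamma(n_total, shape = 1 / dispersion,
                       scale = rep(gene_means, times = n_cells) * dispersion)

  # Poisson sampling around gamma-distributed means gives negative binomial
  # counts. The matrix is filled column by column, one column per cell.
  counts <- matrix(rpois(n_total, lambda = cell_means),
                   nrow = n_genes, ncol = n_cells)
  rownames(counts) <- paste0("Gene", seq_len(n_genes))
  colnames(counts) <- paste0("Cell", seq_len(n_cells))
  counts
}

sim <- simulate_nb_counts(n_genes = 200, n_cells = 50)
dim(sim)        # genes x cells
mean(sim == 0)  # proportion of zero counts, a rough measure of sparsity
```

Because the seed is set inside the function, calling it again with the same arguments returns the same matrix. A sketch like this ignores library size differences, dropout, groups and batch effects.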

Transcript

  1. Single-cell RNA-seq

    [Figure: sequencing reads and an expression table with columns Gene, Cell A, Cell B, Cell C, Cell D]
  2. [Figure: expression table with columns Gene, Cell A, Cell B, Cell C, Cell D]

    Bad cell? Low expression? Cell type specific? Cell cycle? Dropout?
  3. Simulations

    Provide a truth to test against, BUT:
    - Often poorly documented and explained
    - Not easily reproducible or reusable
    - Don't demonstrate similarity to real data
  4. Simulations

    - Simple: Negative binomial
    - Lun: NB with cell factors. Lun ATL, Bach K, Marioni JC. Genome Biology.
    - Lun 2: Sampled NB with batch effects. Lun ATL, Marioni JC. Biostatistics.
    - scDD: NB with bimodality. Korthauer KD, et al. Genome Biology.
    - BASiCS: NB with spike-ins. Vallejos CA, Marioni JC, Richardson S. PLoS Comp. Bio.
  5. Using Splatter

    Estimate:
        params1 <- splatEstimate(real.data)
        params2 <- simpleEstimate(real.data)
    Simulate:
        sim1 <- splatSimulate(params1, ...)
        sim2 <- simpleSimulate(params2, ...)
    Compare:
        datasets <- list(Real = real.data, Splat = sim1, Simple = sim2)
        comp <- compareSCESets(datasets)
        diff <- diffSCESets(datasets, ref = "Real")
  6. Real data

    Tung et al. iPSCs, C capture
    HapMap individuals, plates each, random cells [counts missing from the transcript]
    Tung P-Y et al. Sci. Rep.
    [Figure labels: A, B, C]
  7. New in Splatter, Bioconductor 3.6

    - SingleCellExperiment
    - Batch effects
    - Simulations: BASiCS, mfa, PhenoPath, ZINB-WaVE
  8. Summary

    - Many tools for scRNA-seq analysis
    - Catalogued in the scRNA-tools database
    - Can be tested using synthetic datasets
    - Splatter is our package for simulating scRNA-seq data
  9. @_lazappi_ oshlacklab.com

    Supervisors: Alicia Oshlack, Melissa Little
    MCRI Bioinformatics: Belinda Phipson, Breon Schmidt
    Everyone that makes tools and data available
    www.scRNA-tools.org @scRNAtools
    "Splatter: simulation of single-cell RNA sequencing data." Genome Biology
    "Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database" bioRxiv
    bioconductor.org/packages/splatter