Simplifying simulation of single-cell RNA-seq

Luke Zappia
October 31, 2016

Single-cell RNA sequencing (scRNA-seq) is rapidly becoming a tool of choice for biologists who wish to investigate gene expression, particularly in areas such as development and differentiation. In contrast to traditional bulk RNA-seq experiments, which measure expression averaged across millions of cells, single-cell experiments can be used to observe how genes are expressed in individual cells. Along with the dramatic increase in resolution provided by scRNA-seq comes an array of bioinformatics challenges. Single-cell data is relatively sparse (for both biological and technical reasons), quality control is difficult, and it is unclear how to replicate measurements. The focus of analysis is also different, with more emphasis on clustering cells to identify cell types, or ordering cells to understand dynamic processes, than on traditional tasks such as differential expression testing. Any new bioinformatics method for scRNA-seq analysis should demonstrate two things: 1) it can do what it claims and 2) it helps to produce biological insight. The first is hard to prove on real data, where there is often no known truth. Because of this, bioinformaticians turn to simulations. Unfortunately, current scRNA-seq simulations are frequently poorly documented, are not reproducible, and do not demonstrate similarity to real data or experimental designs. Here we discuss some of the problems with simulating scRNA-seq data and provide a simulation framework that addresses these concerns.

Transcript

  1. What is single-cell?
    Matthew Daniels via The Cell Image Library
    http://www.cellimagelibrary.org/images/38912
  2. [Figure: repeated sequencing reads (ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA). Labels: BULK, SINGLE-CELL, BIOLOGY]
  3. BULK

    Gene  Sample 1
    A     43
    B     3
    C     17
    D     24

    SINGLE-CELL

    Gene  Cell 1  Cell 2  Cell 3  Cell 4
    A     12      10      9       0
    B     0       0       1       4
    C     9       6       0       0
    D     7       0       4       0
  4. Analysis
    Focus on clustering, lineage tracing
    Currently > 75 available methods
    goo.gl/4wcVwn
    github.com/lazappi/single-cell-software
  5. A new analysis method should...
    1. Show that it can do what it claims
    2. Show that it produces insight
  6. Simulations
    Provide a known truth
    Allow us to test…
    • Effectiveness
    • Assumptions
    • Relative performance
  7. Current simulations
    Often poorly documented and explained
    Not easily reproducible or reusable
    Don’t demonstrate similarity to real data
  8. Splatter
    R package
    Collection of simulation methods
    Consistent, easy-to-use interface
    Functions for comparison
    github.com/Oshlack/splatter
  9. Other simulations
    Simple - Negative binomial
    Lun - NB with cell factors
      Lun ATL, Bach K, Marioni JC. Genome Biology (2016). DOI: 10.1186/s13059-016-0947-7.
    Lun 2 - Sampled NB with batch effects
      Lun ATL, Marioni JC. bioRxiv (2016). DOI: 10.1101/073973.
    scDD - NB with bimodality
      Korthauer KD, et al. bioRxiv (2015). DOI: 10.1101/035501.
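  10. Using Splatter
    # Estimate parameters from real data, then simulate with the Splat method
    params1 <- splatEstimate(real.data)
    sim1 <- splatSimulate(params1, ...)
    # Same two steps with the Simple method
    params2 <- simpleEstimate(real.data)
    sim2 <- simpleSimulate(params2, ...)
    # Compare the real and simulated datasets
    datasets <- list(Real = real.data, Splat = sim1, Simple = sim2)
    comp <- compareSCESets(datasets)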
  11. Real data
    Subset of data from study looking at design and batch effects by Tung et al.
      Tung P-Y, et al. bioRxiv (2016). DOI: 10.1101/062919.
    Single HapMap stem cell line - 3 batches - 221 cells, 13 058 genes
  12. Summary
    Single-cell RNA-seq is an exciting new technology - Lots of analysis methods
    Simulations can be used to evaluate methods - But often hard to reuse
    Splatter - R package for simulation and comparison
    Splat - Simulation method for groups or paths
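
The "Simple" negative binomial simulation listed on slide 9, together with the known truth that slide 6 says simulations provide, can be sketched in base R as follows. This is a minimal illustration only and not Splatter's implementation: the function `simulate_nb_counts`, all of its parameters and default values, and the choice of gamma-distributed gene means are assumptions introduced here.

```r
# Hypothetical helper (not part of Splatter): simulate a genes x cells count
# matrix from a negative binomial, keeping the known truth alongside it.
simulate_nb_counts <- function(n_genes = 100, n_cells = 40, de_prop = 0.1,
                               fold_change = 4, dispersion = 0.5, seed = 1) {
  set.seed(seed)

  # Baseline mean expression for each gene (assumed gamma-distributed).
  base_means <- rgamma(n_genes, shape = 0.6, rate = 0.3)

  # Known truth: the group of each cell and which genes differ between groups.
  groups <- rep(c("A", "B"), length.out = n_cells)
  n_de <- round(n_genes * de_prop)
  is_de <- seq_len(n_genes) <= n_de

  # One column of means per cell; DE genes are scaled up in group B cells.
  means <- matrix(base_means, nrow = n_genes, ncol = n_cells)
  means[is_de, groups == "B"] <- means[is_de, groups == "B"] * fold_change

  # In rnbinom's mu/size parameterisation, size is the inverse dispersion.
  counts <- matrix(
    rnbinom(n_genes * n_cells, mu = as.vector(means), size = 1 / dispersion),
    nrow = n_genes, ncol = n_cells,
    dimnames = list(paste0("Gene", seq_len(n_genes)),
                    paste0("Cell", seq_len(n_cells)))
  )

  list(counts = counts, groups = groups, is_de = is_de)
}

sim <- simulate_nb_counts()
dim(sim$counts)        # genes x cells
mean(sim$counts == 0)  # proportion of zero counts
table(sim$groups)      # known group labels to evaluate a method against
```

Because `groups` and `is_de` are returned with the counts, the output of a clustering or differential expression method run on `sim$counts` could be scored directly against them, which is what real data usually does not allow.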