
Simplifying simulation of single-cell RNA-seq

Luke Zappia
October 31, 2016

Single-cell RNA sequencing (scRNA-seq) is rapidly becoming a tool of choice for biologists who wish to investigate gene expression, particularly in areas such as development and differentiation. In contrast to traditional bulk RNA-seq experiments, which measure expression averaged across millions of cells, single-cell experiments can be used to observe how genes are expressed in individual cells.

The dramatic increase in resolution provided by scRNA-seq brings an array of bioinformatics challenges. Single-cell data is relatively sparse (for both biological and technical reasons), quality control is difficult, and it is unclear how to replicate measurements. The focus of analysis is also different: there is more emphasis on clustering cells to identify cell types, or on ordering cells to understand dynamic processes, than on traditional tasks such as differential expression testing.

Any new bioinformatics method for scRNA-seq analysis should demonstrate two things: 1) it can do what it claims and 2) it helps to produce biological insight. The first is hard to prove on real data, where there is often no known truth, so bioinformaticians turn to simulations. Unfortunately, current scRNA-seq simulations are frequently poorly documented, are not reproducible, and do not demonstrate similarity to real data or experimental designs. Here we discuss some of the problems with simulating scRNA-seq data and provide a simulation framework that addresses these concerns.
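
To make the sparsity point concrete, here is a minimal sketch in plain Python that summarises the toy single-cell count table shown on slide 4 of the transcript. The function names (`zero_fraction`, `library_sizes`, `gene_means`) are hypothetical helpers introduced for illustration only and are not part of any package mentioned in the talk.

```python
# Toy single-cell counts from slide 4: one row per gene, one column per cell.
counts = {
    "A": [12, 10, 9, 0],
    "B": [0, 0, 1, 4],
    "C": [9, 6, 0, 0],
    "D": [7, 0, 4, 0],
}


def zero_fraction(matrix):
    """Fraction of all entries in the gene-by-cell matrix that are zero."""
    values = [value for row in matrix.values() for value in row]
    return sum(1 for value in values if value == 0) / len(values)


def library_sizes(matrix):
    """Total count per cell (column sums)."""
    return [sum(column) for column in zip(*matrix.values())]


def gene_means(matrix):
    """Mean count per gene (row means)."""
    return {gene: sum(row) / len(row) for gene, row in matrix.items()}


print(zero_fraction(counts))
print(library_sizes(counts))
print(gene_means(counts))
```

Summaries like these (zeros, library sizes, per-gene means) are the same kinds of quantities the later slides use when comparing simulated data with real data.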


Transcript

  1. Simplifying simulation of single-cell RNA-seq Luke Zappia @_lazappi_

  2. What is single-cell? Matthew Daniels via The Cell Image Library

    http://www.cellimagelibrary.org/images/38912
  3. BULK / SINGLE-CELL / BIOLOGY
    (background of repeated read sequences: ACTGACTCCA TCAGTACTGA CGTGTCATAG GATTGACCTA)
  4. BULK
    Gene  Sample 1
    A     43
    B     3
    C     17
    D     24

    SINGLE-CELL
    Gene  Cell 1  Cell 2  Cell 3  Cell 4
    A     12      10      9       0
    B     0       0       1       4
    C     9       6       0       0
    D     7       0       4       0
  5. Analysis
    Focus on clustering, lineage tracing
    Currently > 75 available methods
    goo.gl/4wcVwn
    github.com/lazappi/single-cell-software
  6. A new analysis method should...
    1. Show that it can do what it claims
    2. Show that it produces insight
  7. Simulations
    http://www.cellimagelibrary.org/images/40483 M Uhlen et al. via The Cell Image Library
  8. Gene  Cell 1  Cell 2  Cell 3  Cell 4
    A
    B
    C
    D
  9. Simulations
    Provide a known truth
    Allow us to test…
    • Effectiveness
    • Assumptions
    • Relative performance
  10. Current simulations
    Often poorly documented and explained
    Not easily reproducible or reusable
    Don’t demonstrate similarity to real data
  11. (no text on slide)
  12. Splatter
    R package
    Collection of simulation methods
    Consistent, easy-to-use interface
    Functions for comparison
    github.com/Oshlack/splatter
  13. The Splat simulation
    Negative binomial
    Expression outliers
    Defined library sizes
    Mean-variance trend
    Dropout
  14. Other simulations
    Simple - Negative binomial
    Lun - NB with cell factors
      Lun ATL, Bach K, Marioni JC. Genome Biology (2016). DOI: 10.1186/s13059-016-0947-7.
    Lun 2 - Sampled NB with batch effects
      Lun ATL, Marioni JC. bioRxiv (2016). DOI: 10.1101/073973.
    scDD - NB with bimodality
      Korthauer KD, et al. bioRxiv (2015). DOI: 10.1101/035501.
  15. Using Splatter
    params1 <- splatEstimate(real.data)
    sim1 <- splatSimulate(params1, ...)
    params2 <- simpleEstimate(real.data)
    sim2 <- simpleSimulate(params2, ...)
    datasets <- list(Real = real.data, Splat = sim1, Simple = sim2)
    comp <- compareSCESets(datasets)
  16. The data http://www.cellimagelibrary.org/images/41467 Jan Schmoranzer via The Cell Image Library

  17. Real data
    Subset of data from study looking at design and batch effects by Tung et al.
      Tung P-Y, et al. bioRxiv (2016). DOI: 10.1101/062919.
    Single HapMap stem cell line
      - 3 batches
      - 221 cells, 13 058 genes
  18. Means

  19. Variance

  20. Mean-variance

  21. Library size

  22. Zeros

  23. What else? http://www.cellimagelibrary.org/images/44701 Andres J Garcia and Ankur Singh via The Cell Image Library
  24. Groups

  25. Paths

  26. Summary
    Single-cell RNA-seq is an exciting new technology
      - Lots of analysis methods
    Simulations can be used to evaluate methods
      - But often hard to reuse
    Splatter - R package for simulation and comparison
    Splat - Simulation method for groups or paths
  27. Acknowledgements
    Alicia Oshlack
    Melissa Little
    Belinda Phipson
    MCRI Bioinformatics
    oshalacklab.com

  28. github.com/Oshlack/splatter github.com/lazappi/single-cell-software @_lazappi_ oshalacklab.com

  29. Solution? http://www.cellimagelibrary.org/images/38804 Wellcome Images via The Cell Image Library

  30. Library size

  31. Means

  32. Negative binomial
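
As a rough illustration of two ingredients listed on the Splat slide (negative binomial counts and dropout), the following is a minimal sketch in plain Python. It is not the Splatter implementation. The function names, the single shared `dispersion` value and the constant `dropout_prob` are simplifying assumptions made here for illustration, and the other listed components (expression outliers, defined library sizes, the mean-variance trend) are left out. Negative binomial counts are drawn as a Gamma-Poisson mixture, which gives a variance of mean + dispersion × mean².

```python
import random


def poisson(lam, rng):
    """Sample Poisson(lam) by counting unit-rate exponential arrivals before time lam."""
    count = 0
    arrival_time = rng.expovariate(1.0)
    while arrival_time < lam:
        count += 1
        arrival_time += rng.expovariate(1.0)
    return count


def negative_binomial(mean, dispersion, rng):
    """Sample a negative binomial count as a Gamma-Poisson mixture."""
    if mean <= 0:
        return 0
    shape = 1.0 / dispersion   # Gamma shape
    scale = mean * dispersion  # Gamma scale, so shape * scale == mean
    return poisson(rng.gammavariate(shape, scale), rng)


def simulate_counts(gene_means, n_cells, dispersion=0.5, dropout_prob=0.2, seed=1):
    """Return a gene-by-cell matrix (list of rows) of simulated counts.

    Assumption: every gene shares one dispersion, and every count is set to
    zero with the same probability, independent of expression level.
    """
    rng = random.Random(seed)
    matrix = []
    for mean in gene_means:
        row = []
        for _ in range(n_cells):
            count = negative_binomial(mean, dispersion, rng)
            if rng.random() < dropout_prob:
                count = 0  # technical dropout: the transcript is not observed
            row.append(count)
        matrix.append(row)
    return matrix


# Example: three genes with different mean expression, four cells.
simulated = simulate_counts([2.0, 10.0, 50.0], n_cells=4)
for row in simulated:
    print(row)
```

Fixing the seed makes a run reproducible, which addresses one of the complaints about current simulations raised earlier in the talk. Comparing a simulation against real data would then mean computing the same summaries (means, variances, library sizes, zeros) on both.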