BiocAsia 2017

Luke Zappia
November 17, 2017

Single-cell RNA sequencing (scRNA-seq) has opened up a range of opportunities for investigating the transcriptome, but with the dramatic increase in resolution comes an array of bioinformatics challenges. Single-cell data is relatively sparse (for both biological and technical reasons), quality control is difficult, and it is unclear whether methods designed for bulk RNA-seq are appropriate for scRNA-seq data. Researchers have risen to address these challenges and there are now more than 140 scRNA-seq analysis tools available. However, with so many tools available, researchers face the difficult task of choosing which to use, making it important to be able to assess and compare the performance, quality and limitations of each tool. One common approach is to test methods on simulated datasets where the true answers are known. To aid this process we have developed Splatter, a Bioconductor R package for reproducible simulation of scRNA-seq datasets (bioconductor.org/packages/splatter).

Splatter is a simulation framework that provides access to a variety of simulation models, allowing users to estimate parameters from real data in order to easily generate realistic synthetic scRNA-seq datasets. As part of Splatter we also introduce our own simulation model, Splat, capable of reproducing scRNA-seq datasets with multiple groups of cells, differentiation paths or batch effects. Here we will discuss how Splatter can be used to develop and compare analysis tools. We will also touch on our experience developing Splatter, some of the design choices we made and how we have integrated other Bioconductor packages such as the SingleCellExperiment class.

Transcript

  1. Splatter, a package for
    simulating single-cell
    RNA sequencing data
    Luke Zappia
    @_lazappi_

  2. Single-cell RNA-seq
    [Figure: sequencing reads (ACTGACTCCA, TCAGTACTGA, CGTGTCATAG, GATTGACCTA), repeated four times]
    Gene  Cell 1  Cell 2  Cell 3  Cell 4
    A     12      10      9       0
    B     0       0       0       1
    C     9       6       0       0
    D     7       0       4       0

  3. Svensson et al. arXiv 1704.01379, 2017

  4. Gene  Cell 1  Cell 2  Cell 3  Cell 4
     A     12      10      9       0
     B     0       0       0       1
     C     9       6       0       0
     D     7       0       4       0
    Bad cell?
    Low expression?
    Cell type specific?
    Cell cycle?
    Dropout?
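
The zeros questioned on this slide can be counted directly. The following is a minimal R sketch, assuming the four-gene, four-cell example matrix shown earlier in the deck; the variable names are illustrative and are not part of Splatter.

```r
# Example count matrix from the slides: genes in rows, cells in columns.
counts <- matrix(
  c(12, 10, 9, 0,
     0,  0, 0, 1,
     9,  6, 0, 0,
     7,  0, 4, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), paste("Cell", 1:4))
)

# Overall sparsity: proportion of entries that are exactly zero.
prop_zero <- mean(counts == 0)

# Zeros per cell (columns) and per gene (rows).
zeros_per_cell <- colSums(counts == 0)
zeros_per_gene <- rowSums(counts == 0)

# Library size: total counts observed in each cell.
lib_sizes <- colSums(counts)

print(prop_zero)
print(zeros_per_cell)
print(lib_sizes)
```

Summaries like these do not say why a value is zero. Cell 4 has the most zeros and by far the smallest library size, which fits the "Bad cell?" question, but the counts alone cannot separate that from dropout or true low expression.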

  8. www.scRNA-tools.org

  9. Simulations
    Provide a truth to test against
    BUT
    - Often poorly documented and explained
    - Not easily reproducible or reusable
    - Don’t demonstrate similarity to real data

  10. “Splatter: simulation of single-cell RNA sequencing data.”
    Genome Biology (2017) DOI: 10.1186/s13059-017-1305-0

  11. The idea...
    Multiple simulations, same interface
    Real data -> [Estimation] -> Parameters -> [Simulation] -> Dataset

  12. Building a package
    Bioconductor
    Devtools
    Roxygen
    testthat
    Checkmate
    GitHub
    Codecov (covr)
    Travis/Appveyor?

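  13. Checkmate
    # Without checkmate: every argument check is written by hand
    fact <- function(n, method = "stirling") {
      if (length(n) != 1)
        stop("Argument 'n' must have length 1")
      if (!is.numeric(n))
        stop("Argument 'n' must be numeric")
      if (is.na(n))
        stop("Argument 'n' may not be NA")
      if (is.double(n)) {
        if (is.nan(n))
          stop("Argument 'n' may not be NaN")
        if (is.infinite(n))
          stop("Argument 'n' must be finite")
        if (abs(n - round(n, 0)) > sqrt(.Machine$double.eps))
          stop("Argument 'n' must be an integerish value")
        n <- as.integer(n)
      }
      # (the slide stops here; checks on 'method' and the calculation would follow)
    }

    # With checkmate: the same checks written as two assertions
    library(checkmate)
    fact <- function(n, method = "stirling") {
      assertCount(n)
      assertChoice(method, c("stirling", "factorial"))
      if (method == "factorial")
        factorial(n)
      else
        sqrt(2 * pi * n) * (n / exp(1))^n
    }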

  14. Counts -> [Estimation] -> Parameters

  15. Storing parameters
    Custom object
    Same structure, names
    - One object?
    Different objects

  16. Params (Virtual)
    - SimpleParams
    - SplatParams
    - LunParams
    - SCDDParams
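
A class hierarchy like this one can be sketched with S4 classes from base R's methods package. This is an illustration under stated assumptions, not Splatter's source: the class names `DemoParams` and `DemoSimpleParams`, the slot names, the default values and the `demoGetParam` accessor are all hypothetical.

```r
library(methods)

# Virtual parent class: holds the slots every simulation shares.
# It cannot be instantiated directly.
setClass("DemoParams",
         contains = "VIRTUAL",
         slots = c(nGenes = "numeric", nCells = "numeric", seed = "numeric"),
         prototype = prototype(nGenes = 10000, nCells = 100, seed = 1))

# Concrete child class: adds model-specific slots (hypothetical names).
setClass("DemoSimpleParams",
         contains = "DemoParams",
         slots = c(meanShape = "numeric", meanRate = "numeric"),
         prototype = prototype(meanShape = 0.4, meanRate = 0.3))

# One accessor defined on the parent works for every subclass.
setGeneric("demoGetParam",
           function(object, name) standardGeneric("demoGetParam"))
setMethod("demoGetParam", "DemoParams",
          function(object, name) slot(object, name))

# Create a concrete object, overriding one default.
params <- new("DemoSimpleParams", nGenes = 500)
print(demoGetParam(params, "nGenes"))
print(demoGetParam(params, "meanShape"))

# Instantiating the virtual parent fails; capture the error message.
virtual_error <- tryCatch(new("DemoParams"),
                          error = function(e) conditionMessage(e))
print(virtual_error)
```

Separate classes let each simulation carry its own parameters, while the shared virtual parent keeps names and accessors the same across them.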

  19. Parameters -> [Simulation] -> Dataset

  20. Storing simulations
    Simulated counts
    Intermediate values
    Parameters?

  21. SCESet (extends ExpressionSet)
    assayData: Matrix
    featureData: AnnotatedDataFrame
    phenoData: AnnotatedDataFrame

  22. SingleCellExperiment (extends SummarizedExperiment)
    assays: Matrix
    rowData: DataFrame
    colData: DataFrame
    metadata: List

  24. Science!

  25. Splat
    Negative binomial
    Expression outliers
    Defined library sizes
    Mean-variance trend
    Dropout
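
The ingredients listed on this slide can be combined in a few lines of base R. The sketch below is a simplified illustration, not the Splat implementation: every parameter value is an assumption chosen for the example, and the mean-variance trend omits the per-gene variation that a fuller model would include.

```r
set.seed(42)
n_genes <- 200
n_cells <- 50

# Gene means drawn from a gamma distribution (illustrative shape and rate).
gene_means <- rgamma(n_genes, shape = 0.6, rate = 0.3)

# Expression outliers: a few genes get a mean well above the median.
is_outlier <- rbinom(n_genes, 1, prob = 0.05) == 1
outlier_factor <- rlnorm(n_genes, meanlog = 4, sdlog = 0.5)
gene_means[is_outlier] <- median(gene_means) * outlier_factor[is_outlier]

# Defined library sizes: expected total count per cell (log-normal).
lib_sizes <- rlnorm(n_cells, meanlog = 9, sdlog = 0.3)

# Cell-specific means: gene proportions scaled to each library size.
cell_means <- outer(gene_means / sum(gene_means), lib_sizes)

# Mean-variance trend: variability (BCV) shrinks as the mean grows.
bcv <- 0.1 + 1 / sqrt(cell_means)

# Negative binomial counts as a gamma-Poisson mixture.
noisy_means <- matrix(
  rgamma(n_genes * n_cells, shape = 1 / bcv^2, scale = cell_means * bcv^2),
  nrow = n_genes
)
true_counts <- matrix(rpois(n_genes * n_cells, lambda = noisy_means),
                      nrow = n_genes)

# Dropout: probability of an added zero falls with the log mean.
drop_prob <- plogis(-1 * (log(cell_means) - 0))
keep <- matrix(rbinom(n_genes * n_cells, 1, prob = 1 - drop_prob),
               nrow = n_genes)
counts <- true_counts * keep

print(dim(counts))
print(mean(counts == 0))
```

Keeping `true_counts` alongside the final `counts` matters for benchmarking, because it records which zeros were added by dropout and which were already present.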

  26. Simulations
    Simple - Negative binomial
    Lun - NB with cell factors
      Lun ATL, Bach K, Marioni JC. Genome Biology (2016). DOI: 10.1186/s13059-016-0947-7.
    Lun 2 - Sampled NB with batch effects
      Lun ATL, Marioni JC. Biostatistics (2017). DOI: 10.1093/biostatistics/kxw055.
    scDD - NB with bimodality
      Korthauer KD, et al. Genome Biology (2016). DOI: 10.1186/s13059-016-1077-y.
    BASiCS - NB with spike-ins
      Vallejos CA, Marioni JC, Richardson S. PLoS Comp. Bio. (2015). DOI: 10.1371/journal.pcbi.1004333.

  27. Using Splatter
    1. Estimate: params1, params2
    2. Simulate: sim1, sim2
    3. Compare: datasets (Splat = sim1, Simple = sim2), comp, diff
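
The code on this slide did not survive transcription, so the following is a hedged reconstruction of the three steps rather than the slide's actual code. It assumes the splatter functions `splatEstimate`, `simpleEstimate`, `splatSimulate`, `simpleSimulate`, `compareSCEs` and `diffSCEs` as found in releases that use SingleCellExperiment; the wrapper `compare_simulations` and its `counts` argument are hypothetical, and arguments may differ between Splatter versions.

```r
# Sketch of the estimate / simulate / compare workflow.
# `counts` is assumed to be a genes x cells matrix of raw counts.
compare_simulations <- function(counts) {
  if (!requireNamespace("splatter", quietly = TRUE) ||
      !requireNamespace("SingleCellExperiment", quietly = TRUE)) {
    stop("This sketch needs the 'splatter' and 'SingleCellExperiment' packages")
  }

  # 1. Estimate parameters for two simulation models from the same data.
  params1 <- splatter::splatEstimate(counts)
  params2 <- splatter::simpleEstimate(counts)

  # 2. Simulate a dataset from each set of parameters.
  sim1 <- splatter::splatSimulate(params1, verbose = FALSE)
  sim2 <- splatter::simpleSimulate(params2, verbose = FALSE)

  # 3. Compare the simulations with each other and with the real data.
  real <- SingleCellExperiment::SingleCellExperiment(
    assays = list(counts = counts)
  )
  datasets <- list(Real = real, Splat = sim1, Simple = sim2)
  comp <- splatter::compareSCEs(datasets)
  diff <- splatter::diffSCEs(datasets, ref = "Real")

  list(datasets = datasets, comp = comp, diff = diff)
}
```

With both packages installed, calling `compare_simulations(my_counts)` on a real count matrix (here the hypothetical `my_counts`) should return the datasets together with the comparison and difference summaries.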

  28. Means Difference in Means

  29. Zeros per cell Difference in zeros

  30. Mean-zeros Difference

  31. Rank 1 8

  32. Rank 1 8
    Full-length
    Full-length

  33. Complex simulations
    Groups Batches Paths
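
As a rough sketch of how these three kinds of simulation might be requested, the wrapper below calls `splatSimulate` with its `method`, `group.prob` and `batchCells` options. The wrapper name `simulate_complex`, its `n_genes` argument and the chosen cell numbers are assumptions for illustration only.

```r
# Sketch: request groups, batches and paths from the Splat simulation.
simulate_complex <- function(n_genes = 1000) {
  if (!requireNamespace("splatter", quietly = TRUE)) {
    stop("This sketch needs the 'splatter' package")
  }

  # Groups: two equally likely groups of cells.
  groups <- splatter::splatSimulate(nGenes = n_genes, batchCells = 100,
                                    group.prob = c(0.5, 0.5),
                                    method = "groups", verbose = FALSE)

  # Batches: two batches of 50 cells each.
  batches <- splatter::splatSimulate(nGenes = n_genes,
                                     batchCells = c(50, 50),
                                     verbose = FALSE)

  # Paths: cells placed along a differentiation path.
  paths <- splatter::splatSimulate(nGenes = n_genes, batchCells = 100,
                                   method = "paths", verbose = FALSE)

  list(groups = groups, batches = batches, paths = paths)
}
```

Each element of the returned list is expected to be a SingleCellExperiment, so the same downstream tools can be applied to all three.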

  34. New in Splatter 1.2.0
    Bioconductor 3.6
    SingleCellExperiment
    Batch effects
    Simulations
    - BASiCS
    - mfa
    - PhenoPath
    - ZINB-WaVE

  35. Summary
    Many tools for scRNA-seq analysis
    Catalogued in the scRNA-tools database
    Can be tested using synthetic datasets
    Splatter is our package for simulating scRNA-seq data
    Making a package is not as hard as you think

  36. @_lazappi_
    oshlacklab.com
    Supervisors
    Alicia Oshlack
    Melissa Little
    MCRI Bioinformatics
    Belinda Phipson
    Breon Schmidt
    Everyone that makes tools and data available
    www.scRNA-tools.org
    @scRNAtools
    “Splatter: simulation of single-cell RNA sequencing data.”
    Genome Biology (2017) DOI: 10.1186/s13059-017-1305-0
    “Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database”
    bioRxiv (2017) DOI: 10.1101/206573
    bioconductor.org/packages/splatter

  37. Negative binomial

  38. Real data
    Tung et al. iPSCs, C1 capture
    3 HapMap individuals (A, B, C)
    3 plates each (A1-A3, B1-B3, C1-C3)
    200 random cells
    Tung P-Y et al. Sci. Rep. (2017) DOI:10.1038/srep39921
