Slide 1

Slide 1 text

Welcome to the World of Single-Cell RNA-Sequencing
Stephanie Hicks, Dana-Farber Cancer Institute / Harvard SPH (@stephaniehicks)
Spring 2017 Single-Cell Sequencing Nanocourse, March 7, 2017
If you want to follow along, my slides & code are available here: https://github.com/stephaniehicks/singlecellnano2017

Slide 2

Slide 2 text

Game plan:
•  scRNA-Seq versus bulk RNA-Seq?
•  Technologies used to sequence single cells
•  Applications of scRNA-Seq data
•  Biological versus technical variability
•  Raw, noisy data → clean data? (e.g. quality control, normalization)
•  Intro to experimental design (from the statistical perspective)
•  How batch effects can occur in single-cell RNA-Seq data
•  A case study using R/Bioconductor

Slide 3

Slide 3 text

Single-cell RNA-Seq (scRNA-Seq)
Tissue (e.g. tumor) → Isolate and sequence individual cells → Read counts → Compare gene expression profiles of single cells
Read counts (genes × cells):
          Cell 1   Cell 2   …
Gene 1        18        0
Gene 2      1010      506
Gene 3         0       49
Gene 4        22        0
…
[Figure: cells plotted along Principal Component 1 and Principal Component 2]

Slide 4

Slide 4 text

scRNA-Seq vs bulk RNA-Seq
Korthauer et al. (2016). Genome Biology

Slide 5

Slide 5 text

Kolodziejczyk et al. (2015). Molecular Cell 58

Slide 6

Slide 6 text

Kolodziejczyk et al. (2015). Molecular Cell 58

Slide 7

Slide 7 text

Common types of scRNA-Seq data
Adapted from Kolodziejczyk et al. (2015). Molecular Cell 58
•  Heterogeneous cell populations
•  Purified cell populations
Image labels: Battle droids; Super battle droids!; Mixed bag of R-Series droids

Slide 8

Slide 8 text

Common applications using scRNA-Seq data
Adapted from Kolodziejczyk et al. (2015). Molecular Cell 58
•  Characterization of cell type populations
•  Identify cell type populations (e.g. dim reduction or clustering)
•  Differential splicing between populations
•  Identify allele-specific expression
•  Identify genes that drive a process across time

Slide 9

Slide 9 text

Variability in scRNA-Seq data
Kolodziejczyk et al. (2015). Molecular Cell 58
Signal / Noise
•  overdispersion, batch effects, sequencing depth, GC bias, amplification bias
•  Extrinsic noise (regulation by transcription factors) vs intrinsic noise (stochastic bursting/firing, cell cycle)
•  capture efficiency (starting amount of mRNA)
Figure labels: Variability visible in bulk RNA-Seq; Additional variability in scRNA-Seq data (and var from bulk)

Slide 10

Slide 10 text

Going from “raw” data to “clean” data
Taken from Davis McCarthy’s slides at Genome Informatics 2016
https://speakerdeck.com/davismcc/what-do-we-need-computationally-to-make-the-most-of-single-cell-rna-seq-data

Slide 11

Slide 11 text

Quality Control
Adapted from Stegle et al. (2015) Nature Reviews Genetics 16: 133-145; Lun et al. (2016) F1000
•  Cell-level quality control
•  Gene-level quality control
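The two levels of quality control can be sketched on a toy count matrix. This is a minimal Python sketch, not the R/Bioconductor workflow from the cited papers; the counts and the thresholds (`min_library_size`, `min_detected_genes`, `min_cells_expressing`) are hypothetical and chosen only for illustration.

```python
# Toy genes x cells count matrix (hypothetical values, for illustration only).
cells = ["Cell1", "Cell2", "Cell3", "Cell4"]
counts = {
    "Gene1": [18, 0, 0, 5],
    "Gene2": [1010, 506, 1, 300],
    "Gene3": [0, 49, 0, 12],
    "Gene4": [22, 0, 0, 0],
}


def cell_level_qc(counts, cells, min_library_size=50, min_detected_genes=2):
    """Return the cells that pass both (hypothetical) cell-level thresholds."""
    kept = []
    for j, cell in enumerate(cells):
        column = [row[j] for row in counts.values()]
        library_size = sum(column)                  # total reads in the cell
        detected = sum(1 for c in column if c > 0)  # genes with a nonzero count
        if library_size >= min_library_size and detected >= min_detected_genes:
            kept.append(cell)
    return kept


def gene_level_qc(counts, cells, kept_cells, min_cells_expressing=2):
    """Return the genes detected in at least `min_cells_expressing` kept cells."""
    kept_idx = [j for j, cell in enumerate(cells) if cell in kept_cells]
    return [
        gene
        for gene, row in counts.items()
        if sum(1 for j in kept_idx if row[j] > 0) >= min_cells_expressing
    ]


kept_cells = cell_level_qc(counts, cells)
kept_genes = gene_level_qc(counts, cells, kept_cells)
print(kept_cells)
print(kept_genes)
```

Filtering cells first and genes second means a gene is judged only on cells that passed QC; real thresholds should be chosen from the data rather than fixed in advance.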

Slide 12

Slide 12 text

So… what about normalization and dealing with other technical variation in scRNA-Seq data?
Much to learn, you still have ….

Slide 13

Slide 13 text

Normalization
•  Without Spike-ins or UMIs
   –  Between-sample normalization methods
      •  Global scaling factors mostly developed for bulk RNA-Seq
      •  Number of zeros (see Lun et al., 2016. Genome Biology)
•  With Spike-ins or UMIs
   –  Spike-ins: theoretically a good idea, but many challenges still remain for scRNA-Seq (see Stegle et al., 2015; Tung et al., 2016); conflicting viewpoints on whether ERCCs are appropriate
   –  UMIs: reduce amplification bias; not appropriate for isoform or allele-specific expression
•  Biological (nuisance?) variability
   –  differences among cells in cell-cycle stage or cell size
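The simplest global scaling factor is the library size of each cell. The sketch below rescales counts to counts per million (CPM) in plain Python, using the four example genes from the read-count table on Slide 3 (the table there is truncated, so these library sizes are illustrative only). The function names are hypothetical, and this is a bulk-style method shown for orientation, not a recommendation for scRNA-Seq.

```python
# Read counts for the four genes shown on Slide 3 (genes x cells).
counts = {
    "Gene 1": {"Cell 1": 18, "Cell 2": 0},
    "Gene 2": {"Cell 1": 1010, "Cell 2": 506},
    "Gene 3": {"Cell 1": 0, "Cell 2": 49},
    "Gene 4": {"Cell 1": 22, "Cell 2": 0},
}


def library_sizes(counts):
    """Total read count per cell, used here as the global scaling factor."""
    sizes = {}
    for row in counts.values():
        for cell, c in row.items():
            sizes[cell] = sizes.get(cell, 0) + c
    return sizes


def counts_per_million(counts):
    """Divide each count by its cell's library size and rescale to one million."""
    sizes = library_sizes(counts)
    return {
        gene: {cell: c / sizes[cell] * 1e6 for cell, c in row.items()}
        for gene, row in counts.items()
    }


cpm = counts_per_million(counts)
print(library_sizes(counts))
print(round(cpm["Gene 2"]["Cell 1"], 1), round(cpm["Gene 2"]["Cell 2"], 1))
```

Note that zeros stay zero under any scaling factor, which is one reason the number of zeros matters for single-cell data.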

Slide 14

Slide 14 text

“Hey, someone told me of this thing called batch effects…. Should I be worried about them in my scRNA-Seq data?”

Slide 15

Slide 15 text

Cells cluster by tumor
Patel et al. (2014) Science

Slide 16

Slide 16 text

Verhaak et al. (2010). Cancer Cell

Slide 17

Slide 17 text

Batch effects in genomics data
Leek et al. (2010) Nat Reviews Genetics

Slide 18

Slide 18 text

Batch effects in single-cell RNA-Seq data
Confounded study design: When cells from one biological group or condition are cultured, captured and sequenced separate from cells in a second condition
[Figure: "The Problem of Confounding Biological Variation and Batch Effects". Schematic of a completely confounded study design (bad) next to a balanced study design (good), showing biological groups 1-3, replicates 1-2, and processing batches 1-3. Panels show observed differences as Principal Component 1 vs Principal Component 2 and as the proportion of detected genes per group, with points marked by batch. Annotation: "We cannot determine if variation is driven by biology or batch effects."]

Slide 19

Slide 19 text

Batch effects in single-cell RNA-Seq data
Confounded study design: When cells from one biological group or condition are cultured, captured and sequenced separate from cells in a second condition
More balanced study design: Cells from different biological groups processed in the same batch
[Figure: same confounded vs balanced study design schematic as on Slide 18]
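Whether a design is confounded can be checked before any analysis by cross-tabulating biological group against processing batch. The following is a minimal Python sketch; the function names and the two toy designs are hypothetical, and "completely confounded" is taken here to mean that every batch contains cells from only one biological group.

```python
from collections import Counter


def cross_tabulate(groups, batches):
    """Count cells for every (group, batch) combination."""
    return Counter(zip(groups, batches))


def is_completely_confounded(groups, batches):
    """True if every batch contains cells from only one biological group."""
    groups_per_batch = {}
    for group, batch in zip(groups, batches):
        groups_per_batch.setdefault(batch, set()).add(group)
    return all(len(gs) == 1 for gs in groups_per_batch.values())


# Toy designs with six cells each (hypothetical labels).
confounded_groups = ["G1", "G1", "G2", "G2", "G3", "G3"]
confounded_batches = ["B1", "B1", "B2", "B2", "B3", "B3"]

balanced_groups = ["G1", "G2", "G3", "G1", "G2", "G3"]
balanced_batches = ["B1", "B1", "B1", "B2", "B2", "B2"]

print(is_completely_confounded(confounded_groups, confounded_batches))
print(is_completely_confounded(balanced_groups, balanced_batches))
print(cross_tabulate(balanced_groups, balanced_batches))
```

In the confounded toy design the cross-tabulation has one nonzero cell per batch, so a group effect and a batch effect cannot be separated; in the balanced one every group appears in every batch.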

Slide 20

Slide 20 text

Batch effects in single-cell RNA-Seq data
Confounded study design: When cells from one biological group or condition are cultured, captured and sequenced separate from cells in a second condition
More balanced study design: Cells from different biological groups processed in the same batch
Proposed by Tung et al. (2017) Scientific Reports
[Figure: same confounded vs balanced study design schematic as on Slide 18]

Slide 21

Slide 21 text

Batch effects in single-cell RNA-Seq data
Confounded study design: When cells from one biological group or condition are cultured, captured and sequenced separate from cells in a second condition
More balanced study design: Cells from different biological groups processed in the same batch
Implemented by Tung et al. (2017) Scientific Reports
[Figure: same confounded vs balanced study design schematic as on Slide 18]

Slide 22

Slide 22 text

Use FASTQ header as a surrogate for batch
http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm
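One way to derive such a surrogate is to read the instrument, run, flowcell and lane out of the first header line of each FASTQ file. The sketch below is in Python and assumes the header follows the Illumina CASAVA 1.8 layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`); older or non-Illumina headers will not parse. The function names and the choice to join four fields into one batch label are hypothetical.

```python
def parse_illumina_header(header):
    """Split a CASAVA 1.8-style FASTQ header into its leading fields."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    read_id = header[1:].split()[0]      # drop the part after the space
    fields = read_id.split(":")
    if len(fields) < 7:
        raise ValueError("header does not look like CASAVA 1.8 format")
    instrument, run, flowcell, lane = fields[:4]
    return {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": lane,
    }


def batch_surrogate(header):
    """Join instrument, run, flowcell and lane into one batch label."""
    info = parse_illumina_header(header)
    return "_".join(info[k] for k in ("instrument", "run", "flowcell", "lane"))


example = "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
print(parse_illumina_header(example))
print(batch_surrogate(example))
```

Cells whose reads share the same label were sequenced together, which makes the label a rough stand-in when the true processing batch was never recorded.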

Slide 23

Slide 23 text

[Figure: same confounded vs balanced study design schematic as on Slide 18]
Hicks et al. (2015) bioRxiv

Slide 24

Slide 24 text

Cells cluster by tumor
Hicks et al. (2015) bioRxiv

Slide 25

Slide 25 text

Cells cluster by batch (tumors are confounded with batch)
Hicks et al. (2015) bioRxiv

Slide 26

Slide 26 text

Hicks et al. (2015) bioRxiv

Slide 27

Slide 27 text

Different batches have different detection rates
Hicks et al. (2015) bioRxiv
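The detection rate of a cell is the proportion of genes with at least one read. A minimal Python sketch of comparing it across batches follows; the count vectors and batch labels are made-up toy values, and the function names are hypothetical.

```python
from statistics import mean


def detection_rate(cell_counts):
    """Proportion of genes with a nonzero count in one cell."""
    return sum(1 for c in cell_counts if c > 0) / len(cell_counts)


def mean_detection_rate_by_batch(cells, batches):
    """Average the per-cell detection rate within each batch."""
    rates = {}
    for cell_counts, batch in zip(cells, batches):
        rates.setdefault(batch, []).append(detection_rate(cell_counts))
    return {batch: mean(values) for batch, values in rates.items()}


# Toy data: one count vector (four genes) per cell, plus its batch label.
cells = [
    [3, 0, 1, 0],
    [5, 2, 0, 0],
    [1, 4, 2, 0],
    [2, 1, 1, 3],
]
batches = ["batch1", "batch1", "batch2", "batch2"]

print(mean_detection_rate_by_batch(cells, batches))
```

A large gap between batches in this summary is a warning sign that downstream clustering may separate cells by batch rather than by biology.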

Slide 28

Slide 28 text

Bad news: Batch effects can be a big problem in scRNA-Seq data (but not always).
Good news: Batch effects and methods to correct for batch effects have been around for many years (lots of places to start).
Bad news: Poor experimental design is a big limiting factor. …. also, more complicated because of sparsity (biology and technology), capture efficiency, etc.
Good news: Increase awareness about good experimental design. New methods specific for scRNA-Seq are being developed.

Slide 29

Slide 29 text

Batch Correction for scRNA-Seq data
•  Methods for microarrays or bulk RNA-Seq
   –  linear mixed models (requires technical replicates)
   –  ComBat
•  Methods for scRNA-Seq:
   –  Differential expression (just a few):
      •  SCDE, MAST, scDD, BASiCS, M3Drop
   –  More generalized
      •  scone, scater

Slide 30

Slide 30 text

Demonstration on how to correct for batch effects in an unconfounded study design
Data from Tung et al. (2017) Scientific Reports
Complete analysis in R Markdown on GitHub here: https://github.com/stephaniehicks/singlecellnano2017
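To give a feel for why an unconfounded design makes correction possible, here is a deliberately simplified Python sketch that centers one gene's (log-scale) expression within each batch. It is a stand-in for illustration only: it is not the method used in the linked R Markdown analysis, nor ComBat or a linear mixed model, and the values, labels and function name are hypothetical.

```python
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's mean and add back the overall mean."""
    overall = mean(values)
    batch_means = {
        b: mean([v for v, vb in zip(values, batches) if vb == b])
        for b in set(batches)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]


# Toy log-expression of one gene in four cells (hypothetical values).
# Both biological groups appear in both batches, so the design is balanced.
values = [1.0, 2.0, 3.0, 4.0]
batches = ["b1", "b1", "b2", "b2"]
groups = ["g1", "g2", "g1", "g2"]

corrected = center_by_batch(values, batches)
print(corrected)

# Group difference (g2 minus g1) before and after correction.
def group_difference(vals):
    g1 = mean([v for v, g in zip(vals, groups) if g == "g1"])
    g2 = mean([v for v, g in zip(vals, groups) if g == "g2"])
    return g2 - g1

print(group_difference(values), group_difference(corrected))
```

Because each group is present in each batch, removing the batch means leaves the group difference intact; in a completely confounded design the same step would remove the biological signal along with the batch effect.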

Slide 31

Slide 31 text

Plot cells along first two Principal Components
[Figure: PCA of cells; Component 1 (3% variance) vs Component 2 (2% variance); legend: replicate (r1, r2, r3), total_features (7000 to 9500)]

Slide 32

Slide 32 text

t-SNE
How to use t-SNE effectively: http://distill.pub/2016/misread-tsne/
[Figure: four t-SNE panels (Dimension 1 vs Dimension 2) at Perplexity = 2, Perplexity = 5, Perplexity = 25, and the default perplexity; legend: replicate (r1, r2, r3), total_features (7000 to 9500)]

Slide 33

Slide 33 text

PCA (post-batch correction)
[Figure: two PCA panels. "PCA - no normalisation": Component 1 (3% variance) vs Component 2 (2% variance). "PCA - after batch correction and normalization": Component 1 (2% variance) vs Component 2 (1% variance). Legend: replicate (r1, r2, r3), total_features (7000 to 9500)]

Slide 34

Slide 34 text


Slide 35

Slide 35 text

Support from NIH R01 grants GM083084, RR021967/GM103552, HG005220
NIH K99/R00 grant HG009007
Rafael Irizarry
Questions?