SOBP 2017 - Speaker Deck

Slide 1

Slide 1 text

11 RNA-seq samples beyond the known transcriptome with derfinder available via recount Leonardo Collado-Torres @fellgernon #SOBP2017

Slide 2

Slide 2 text

Genome Transcripts Reads slide adapted from Jeff Leek

Slide 3

Slide 3 text

Genome slide adapted from Jeff Leek

Slide 4

Slide 4 text

GTEx TCGA slide adapted from Shannon Ellis

Slide 5

Slide 5 text

SRA

Slide 6

Slide 6 text

Slide adapted from Ben Langmead

Slide 7

Slide 7 text

http://rail.bio/ Slide adapted from Ben Langmead

Slide 8

Slide 8 text

http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

Slide 9

Slide 9 text

Obstacle: our research moves (spot) markets
Spike in market price due to preprocessing job flows
slide adapted from Jeff Leek

Slide 10

Slide 10 text

Obstacle: our research moves (spot) markets
Weekday market volatility
Weekend EC2 inactivity
slide adapted from Jeff Leek

Slide 11

Slide 11 text

https://jhubiostatistics.shinyapps.io/recount/

Slide 12

Slide 12 text

slide adapted from Andrew Jaffe

Slide 13

Slide 13 text

slide adapted from Andrew Jaffe

Slide 14

Slide 14 text

> library('recount')
> download_study('ERP001942', type = 'rse-gene')
> load(file.path('ERP001942', 'rse_gene.Rdata'))
> rse <- scale_counts(rse_gene)
https://github.com/leekgroup/recount-analyses/

Slide 15

Slide 15 text

slide adapted from Jeff Leek

Slide 16

Slide 16 text

> library('recount')
> download_study('SRP029880', type = 'rse-gene')
> download_study('SRP059039', type = 'rse-gene')
> ## Each file defines an object named rse_gene, so keep a copy after each load
> load(file.path('SRP029880', 'rse_gene.Rdata'))
> dat <- list(rse_gene)
> load(file.path('SRP059039', 'rse_gene.Rdata'))
> dat <- c(dat, list(rse_gene))
> mdat <- do.call(cbind, dat)
https://github.com/leekgroup/recount-analyses/

Slide 17

Slide 17 text

Collado-Torres et al, Nat. Biotech, 2017

Slide 18

Slide 18 text

slide adapted from Andrew Jaffe

Slide 19

Slide 19 text

slide adapted from Andrew Jaffe

Slide 20

Slide 20 text

coverage vector 2 6 0 11 6 Genome (DNA) RNA-Sequencing: Alignment using Rail-RNA Nellore et al. (2016) Bioinformatics
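The coverage vector on this slide holds, for each genome position, the number of aligned reads overlapping that position. A minimal base R sketch of that idea follows. The `reads` table and `genome_length` are made-up toy values, and this is not how Rail-RNA computes coverage internally.

```r
# Toy alignments: one row per read, with 1-based inclusive start and end
# positions on a hypothetical 5-base genome.
reads <- data.frame(
    start = c(1, 1, 2, 4, 4),
    end   = c(2, 3, 2, 5, 4)
)
genome_length <- 5

# Start with zero coverage everywhere, then add 1 at every base a read covers.
coverage <- integer(genome_length)
for (i in seq_len(nrow(reads))) {
    idx <- reads$start[i]:reads$end[i]
    coverage[idx] <- coverage[idx] + 1L
}

print(coverage)
```

For real data, Bioconductor packages provide far more efficient coverage calculations than this loop.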

Slide 21

Slide 21 text

Collado-Torres et al, NAR, 2017

Slide 22

Slide 22 text

Postmortem Human Brain Samples
Discovery data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
Replication data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
Jaffe et al, Nat. Neuroscience, 2015

Slide 23

Slide 23 text

Jaffe et al, Nat. Neuroscience, 2015

Slide 24

Slide 24 text

DERs outside of “known genes” Jaffe et al, Nat. Neuroscience, 2015

Slide 25

Slide 25 text

BrainSpan data
Total N samples: 487
CBC: 28, MD: 24, STR: 28, AMY: 31, HIP: 32, DFC: 34
VFC: 30, MFC: 32, OFC: 30, M1C: 25, S1C: 26, IPC: 33, A1C: 30, STC: 35, ITC: 33, V1C: 33
Coverage Data from BrainSpan: http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/

Slide 26

Slide 26 text

BrainSpan data Jaffe et al, Nat. Neuroscience, 2015

Slide 27

Slide 27 text

Percent Expressed Mean reads across GTEx

Slide 28

Slide 28 text

> library('recount')
> library('BiocParallel')
> ## chrs (chromosome names) and bp (a BiocParallel backend) must be defined beforehand
> regions_list <- bplapply(chrs, function(chr) {
+     regs <- expressed_regions('SRP012682', chr, cutoff = 5L)
+     return(regs)
+ }, BPPARAM = bp)
> names(regions_list) <- chrs
> regions <- unlist(GRangesList(regions_list))
https://github.com/leekgroup/recount-analyses/
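The `cutoff = 5L` argument suggests that expressed regions (ERs) are runs of consecutive bases whose mean coverage passes a threshold. The toy sketch below illustrates that idea in base R only. The `mean_cov` vector is invented, the strict `>` comparison is an assumption, and this is not the derfinder or recount implementation.

```r
# Hypothetical mean coverage across samples at 11 consecutive bases.
mean_cov <- c(0, 2, 6, 7, 9, 3, 0, 11, 6, 8, 1)
cutoff <- 5

# Run-length encode the above-cutoff indicator to find contiguous stretches.
runs <- rle(mean_cov > cutoff)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1L

# Keep only the runs that are above the cutoff: these are the toy ERs.
ers <- data.frame(
    start = starts[runs$values],
    end   = ends[runs$values]
)
print(ers)
```

Each row of `ers` is one contiguous stretch of bases above the cutoff, which is the kind of interval that the `regions` object on the slide stores as a GRanges.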

Slide 29

Slide 29 text

> library('recount')
> covMat <- bplapply(chrs, function(chr) {
+     coverageMatrix <- coverage_matrix('SRP012682', chr, regions_list[[chr]])
+     return(coverageMatrix)
+ }, BPPARAM = bp)
> covMat <- do.call(rbind, covMat)
https://github.com/leekgroup/recount-analyses/

Slide 30

Slide 30 text

expression data for ~70,000 human samples
GTEx N=9,962, TCGA N=11,284, SRA N=49,848
samples x expression estimates (gene, exon, junctions, ERs)
slide adapted from Shannon Ellis

Slide 31

Slide 31 text

expression data for ~70,000 human samples
GTEx N=9,962, TCGA N=11,284, SRA N=49,848
samples x expression estimates (gene, exon, junctions, ERs)
Answer meaningful questions about human biology and expression
slide adapted from Shannon Ellis

Slide 32

Slide 32 text

expression data for ~70,000 human samples
GTEx N=9,962, TCGA N=11,284, SRA N=49,848
samples x expression estimates (gene, exon, junctions, ERs)
samples x phenotypes: ?
Answer meaningful questions about human biology and expression
slide adapted from Shannon Ellis

Slide 33

Slide 33 text

sra_meta$Sex

Category  Frequency
F         95
female    2036
Female    51
M         77
male      1240
Male      141
Total     3640

Even when information is provided, it's not always clear…
"1 Male, 2 Female", "2 Male, 1 Female", "3 Female", "DK", "male and female", "Male (note: ….)", "missing", "mixed", "mixture", "N/A", "Not available", "not applicable", "not collected", "not determined", "pooled male and female", "U", "unknown", "Unknown"
slide adapted from Shannon Ellis
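The frequency table above shows the same two values spelled six different ways. One way to collapse them is sketched below in base R. `normalize_sex()` is a hypothetical helper written for this illustration, not a recount or phenopredict function, and mapping every ambiguous label to NA is an assumption.

```r
# Hypothetical helper: map free-text sex labels onto "female", "male" or NA.
normalize_sex <- function(x) {
    key <- tolower(trimws(x))
    out <- rep(NA_character_, length(key))
    out[key %in% c("f", "female")] <- "female"
    out[key %in% c("m", "male")] <- "male"
    # Anything else ("mixed", "unknown", "N/A", ...) stays NA.
    out
}

# Rebuild the reported labels using the frequencies from the slide.
reported <- rep(
    c("F", "female", "Female", "M", "male", "Male"),
    times = c(95, 2036, 51, 77, 1240, 141)
)
harmonized <- table(normalize_sex(reported), useNA = "ifany")
print(harmonized)
```

Labels such as "pooled male and female" or "1 Male, 2 Female" describe mixtures, which is why this sketch leaves them as NA rather than guessing.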

Slide 34

Slide 34 text

SRA phenotype information is far from complete

      SubjectID  Sex     Tissue  Race  Age
6620  NA         female  liver   NA    NA
6621  NA         female  liver   NA    NA
6622  NA         female  liver   NA    NA
6623  NA         female  liver   NA    NA
6624  NA         female  liver   NA    NA
6625  NA         male    liver   NA    NA
6626  NA         male    liver   NA    NA
6627  NA         male    liver   NA    NA
6628  NA         male    liver   NA    NA
6629  NA         male    liver   NA    NA
6630  NA         male    liver   NA    NA
6631  NA         NA      blood   NA    NA
6632  NA         NA      blood   NA    NA
6633  NA         NA      blood   NA    NA
6634  NA         NA      blood   NA    NA
6635  NA         NA      blood   NA    NA
6636  NA         NA      blood   NA    NA

slide adapted from Shannon Ellis

Slide 35

Slide 35 text

slide adapted from Jeff Leek

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Goal: to accurately predict critical phenotype information for all samples in recount
gene, exon, exon-exon junction and expressed region RNA-Seq data
SRA (Sequence Read Archive), N=49,848
GTEx (Genotype Tissue Expression Project), N=9,662
TCGA (The Cancer Genome Atlas), N=11,284
divide samples
training set: build and optimize phenotype predictor
test set: test accuracy of predictor
slide adapted from Shannon Ellis

Slide 38

Slide 38 text

Goal: to accurately predict critical phenotype information for all samples in recount
gene, exon, exon-exon junction and expressed region RNA-Seq data
SRA (Sequence Read Archive), N=49,848
GTEx (Genotype Tissue Expression Project), N=9,662
TCGA (The Cancer Genome Atlas), N=11,284
divide samples
training set: build and optimize phenotype predictor
test set: test accuracy of predictor
predict phenotypes across samples in TCGA
slide adapted from Shannon Ellis

Slide 39

Slide 39 text

Goal: to accurately predict critical phenotype information for all samples in recount
gene, exon, exon-exon junction and expressed region RNA-Seq data
SRA (Sequence Read Archive), N=49,848
GTEx (Genotype Tissue Expression Project), N=9,662
TCGA (The Cancer Genome Atlas), N=11,284
divide samples
training set: build and optimize phenotype predictor
test set: test accuracy of predictor
predict phenotypes across samples in TCGA
predict phenotypes across SRA samples
slide adapted from Shannon Ellis

Slide 40

Slide 40 text

phenopredict
Input Data: Expression Data, Covariate Information, Genomic Region Information, Pheno of Interest (n, p; regions x individuals)
functions: select_regions(), build_predictor(), test_predictor(), extract_data(), predict_pheno()
slide adapted from Shannon Ellis

Slide 41

Slide 41 text

select_regions() Output: Coverage matrix (data.frame) Region information (GRanges) slide adapted from Shannon Ellis

Slide 42

Slide 42 text

Sex prediction is accurate across data sets

Number of Regions      20     20     20      20
Number of Samples (N)  4,769  4,769  11,245  3,640
Accuracy               99.8%  99.6%  99.4%   88.5%

slide adapted from Shannon Ellis

Slide 43

Slide 43 text

Sex prediction is accurate across data sets

Number of Regions      20     20     20      20
Number of Samples (N)  4,769  4,769  11,245  3,640
Accuracy               99.8%  99.6%  99.4%   88.5%

slide adapted from Shannon Ellis

Slide 44

Slide 44 text

http://www.rna-seqblog.com/ Can we use expression data to predict tissue? slide adapted from Shannon Ellis

Slide 45

Slide 45 text

Tissue prediction is accurate across data sets

Number of Regions      589    589    589    589
Number of Samples (N)  4,769  4,769  7,193  8,951
Accuracy               97.3%  96.5%  71.9%  50.6%

slide adapted from Shannon Ellis

Slide 46

Slide 46 text

Prediction is more accurate in healthy tissue

Number of Regions      589    589    589    589    589
Number of Samples (N)  4,769  4,769  613    6,579  8,951
Accuracy               97.3%  96.5%  91.0%  70.2%  50.6%

slide adapted from Shannon Ellis

Slide 47

Slide 47 text

> library('recount')
> download_study('ERP001942', type = 'rse-gene')
> load(file.path('ERP001942', 'rse_gene.Rdata'))
> rse <- scale_counts(rse_gene)
> rse_with_pred <- add_predictions(rse_gene)
https://github.com/leekgroup/recount-analyses/

Slide 48

Slide 48 text

Slide 49

Slide 49 text

[Figure: tissue labels, reported vs. predicted. Tissues shown: adipose tissue, adrenal gland, bladder, blood, blood vessel, bone, bone marrow, brain, breast, cervix, cervix uteri, colon, epithelium, esophagus, fallopian tube, heart, intestine, kidney, liver, lung, melanoma, monocytes, muscle, nerve, ovary, pancreas, penis, pituitary, placenta, prostate, salivary gland, skin, small intestine, spinal cord, spleen, stem cell, stomach, testis, thyroid, tonsil, umbilical cord, urinary bladder, uterus, vagina]

Slide 50

Slide 50 text

bioconductor.org/packages/derfinder
bioconductor.org/packages/recount
> biocLite('derfinder')
> biocLite('recount')
http://rail.bio
$ ./install-rail-rna-V

Slide 51

Slide 51 text

https://github.com/leekgroup/recount-contributions

Slide 52

Slide 52 text

LIBD RNA-seq pipeline
STEPS
1. Quality check (QC) on raw reads
2. Failed QC? Then trim reads
3. Align reads to the genome
4. Count features
5. Calculate coverage
6. Transcript level quantification
7. Create count tables
8. Call variants for identifying swaps
Work with Emily Burke

Slide 53

Slide 53 text

Collaborators
The Leek Group: Jeff Leek, Shannon Ellis
Hopkins: Ben Langmead, Chris Wilks, Kai Kammers, Kasper Hansen, Margaret Taub
OHSU: Abhinav Nellore
LIBD: Andrew Jaffe, Emily Burke, Stephen Semick, Carrie Wright, Badoi Phan, Amanda Price, Nina Rajpurohit
Funding: NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education, Seven Bridges, IDIES SciServer