
SOBP 2017

RNA-seq samples beyond the known transcriptome with derfinder available via recount

Transcript

  1. 11 RNA-seq samples beyond the known transcriptome with derfinder available via recount
    Leonardo Collado-Torres, @fellgernon, #SOBP2017
  2. Genome Transcripts Reads slide adapted from Jeff Leek

  3. Genome slide adapted from Jeff Leek

  4. GTEx TCGA slide adapted from Shannon Ellis

  5. SRA

  6. Slide adapted from Ben Langmead

  7. http://rail.bio/ Slide adapted from Ben Langmead

  8. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  9. Obstacle: our research moves (spot) markets. Spike in market price due to preprocessing job flows. Slide adapted from Jeff Leek
  10. Obstacle: our research moves (spot) markets. Weekday market volatility; weekend EC2 inactivity. Slide adapted from Jeff Leek
  11. https://jhubiostatistics.shinyapps.io/recount/

  12. slide adapted from Andrew Jaffe

  13. slide adapted from Andrew Jaffe

  14. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/
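As a rough illustration of the scale_counts() step on slide 14: by default the function is understood to rescale each sample's counts to a common library size (40 million reads) using that sample's total coverage (AUC). The sketch below mimics that arithmetic in base R. The count matrix, the AUC values and the helper name scale_counts_sketch are assumptions made up for illustration; this is not recount's implementation.

```r
# Toy gene-by-sample count matrix (made-up numbers, not recount data).
counts <- matrix(c(100, 40, 10, 5), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))

# Hypothetical total coverage (area under the coverage curve) per sample.
auc <- c(s1 = 8e7, s2 = 2e7)

# Hypothetical helper: rescale every sample to the same target library size.
scale_counts_sketch <- function(counts, auc, target_size = 4e7) {
  scale_factor <- target_size / auc
  # sweep() multiplies each column by the factor of the matching sample.
  round(sweep(counts, 2, scale_factor, `*`))
}

scaled <- scale_counts_sketch(counts, auc)
print(scaled)
```

After scaling, samples sequenced at different depths can be compared on the same footing, which is why the slide applies this step before any downstream analysis.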
  15. slide adapted from Jeff Leek

  16. > library('recount')
    > download_study('SRP029880', type = 'rse-gene')
    > download_study('SRP059039', type = 'rse-gene')
    > # `dat` is not defined on the slide; here it collects the object from each load()
    > load(file.path('SRP029880', 'rse_gene.Rdata'))
    > dat <- list(rse_gene)
    > load(file.path('SRP059039', 'rse_gene.Rdata'))
    > dat <- c(dat, list(rse_gene))
    > mdat <- do.call(cbind, dat)
    https://github.com/leekgroup/recount-analyses/
  17. Collado Torres et al. Nat. Biotech 2017

  18. slide adapted from Andrew Jaffe

  19. slide adapted from Andrew Jaffe

  20. RNA-Sequencing: reads aligned to the genome (DNA) using Rail-RNA, Nellore et al. (2016) Bioinformatics
    Coverage vector: 2 6 0 11 6
  21. Collado-Torres et al, NAR, 2017

  22. Postmortem Human Brain Samples. Jaffe et al, Nat. Neuroscience, 2015
    Discovery data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
    Replication data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
  23. Jaffe et al, Nat. Neuroscience, 2015

  24. DERs outside of “known genes”. Jaffe et al, Nat. Neuroscience, 2015
  25. BrainSpan data. Total N samples: 487
    Samples per brain region: CBC: 28, MD: 24, STR: 28, AMY: 31, HIP: 32, DFC: 34, VFC: 30, MFC: 32, OFC: 30, M1C: 25, S1C: 26, IPC: 33, A1C: 30, STC: 35, ITC: 33, V1C: 33
    Coverage data from BrainSpan: http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/
  26. BrainSpan data Jaffe et al, Nat. Neuroscience, 2015

  27. Percent Expressed Mean reads across GTEx

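  28. > library('recount')
    > library('BiocParallel')  # provides bplapply(); not loaded on the slide
    > # chrs (chromosome names) and bp (a BiocParallel backend) are not defined on the slide
    > regions_list <- bplapply(chrs, function(chr) {
          regs <- expressed_regions('SRP012682', chr, cutoff = 5L)
          return(regs)
      }, BPPARAM = bp)
    > names(regions_list) <- chrs
    > regions <- unlist(GRangesList(regions_list))
    https://github.com/leekgroup/recount-analyses/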
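To make the cutoff argument on slide 28 more concrete, here is a minimal base R sketch of the idea behind expressed regions: contiguous runs of bases whose (mean) coverage is above a cutoff. It reuses the toy coverage vector from slide 20 and the cutoff of 5 from the call above. The helper name find_expressed_regions is hypothetical, and the real derfinder and recount functions work on genome-wide coverage files rather than on a short vector.

```r
# Toy per-base coverage vector (the values shown on slide 20).
coverage <- c(2, 6, 0, 11, 6)
cutoff <- 5

# Hypothetical helper: return start/end of each run of bases above the cutoff.
find_expressed_regions <- function(cov, cutoff) {
  runs <- rle(cov > cutoff)            # run-length encode the TRUE/FALSE mask
  ends <- cumsum(runs$lengths)         # last base of every run
  starts <- ends - runs$lengths + 1    # first base of every run
  data.frame(start = starts[runs$values], end = ends[runs$values])
}

regions <- find_expressed_regions(coverage, cutoff)
print(regions)
```

With this toy input, base 2 forms one region on its own and bases 4 to 5 form a second one, because base 3 has zero coverage and splits them.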
  29. > library('recount')
    > covMat <- bplapply(chrs, function(chr) {
          coverageMatrix <- coverage_matrix('SRP012682', chr, regions_list[[chr]])
          return(coverageMatrix)
      }, BPPARAM = bp)
    > covMat <- do.call(rbind, covMat)
    https://github.com/leekgroup/recount-analyses/
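The coverage matrix built on slide 29 has one row per region and one column per sample. A simplified way to picture each entry is the sum of the base-level coverage inside that region, as in the base R sketch below. The two toy samples, the region coordinates and the variable names are assumptions for illustration; recount's coverage_matrix() reads the coverage from BigWig files and also scales it.

```r
# Toy base-level coverage for two hypothetical samples (rows are bases).
base_cov <- cbind(sample1 = c(2, 6, 0, 11, 6),
                  sample2 = c(0, 3, 1, 4, 8))

# Two hypothetical regions, given as first and last base.
regions <- data.frame(start = c(2, 4), end = c(2, 5))

# For every region, sum the coverage of each sample over the region's bases.
region_cov <- t(vapply(seq_len(nrow(regions)), function(i) {
  idx <- regions$start[i]:regions$end[i]
  colSums(base_cov[idx, , drop = FALSE])
}, numeric(ncol(base_cov))))

print(region_cov)
```

Running the per-chromosome results through rbind, as the slide does, then stacks these region-by-sample blocks into one matrix.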
  30. Expression data for ~70,000 human samples: GTEx N=9,962; TCGA N=11,284; SRA N=49,848. Expression estimates: gene, exon, junctions, ERs. Slide adapted from Shannon Ellis
  31. Expression data for ~70,000 human samples: GTEx N=9,962; TCGA N=11,284; SRA N=49,848. Expression estimates: gene, exon, junctions, ERs. Answer meaningful questions about human biology and expression. Slide adapted from Shannon Ellis
  32. Expression data for ~70,000 human samples: GTEx N=9,962; TCGA N=11,284; SRA N=49,848. Expression estimates: gene, exon, junctions, ERs. Sample phenotypes: ? Answer meaningful questions about human biology and expression. Slide adapted from Shannon Ellis
  33. Even when information is provided, it’s not always clear… sra_meta$Sex
    Category   Frequency
    F          95
    female     2036
    Female     51
    M          77
    male       1240
    Male       141
    Total      3640
    Other reported values: “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
    Slide adapted from Shannon Ellis
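The table on slide 33 shows the same two categories spelled six different ways. One way to harmonize such labels is sketched below in base R, using the frequencies from the slide. The helper normalize_sex is hypothetical and deliberately conservative: anything it does not recognize (pooled, mixed, unknown) is left as NA rather than guessed.

```r
# Reported sex labels and their frequencies, copied from slide 33.
reported <- c(F = 95, female = 2036, Female = 51,
              M = 77, male = 1240, Male = 141)

# Hypothetical helper: map free-text labels to "female", "male" or NA.
normalize_sex <- function(x) {
  key <- tolower(trimws(x))
  out <- rep(NA_character_, length(key))
  out[key %in% c("f", "female")] <- "female"
  out[key %in% c("m", "male")] <- "male"
  out
}

# Sum the frequencies within each harmonized category.
harmonized <- tapply(reported, normalize_sex(names(reported)), sum)
print(harmonized)

# Ambiguous labels stay missing instead of being assigned a sex.
print(normalize_sex(c("pooled male and female", "DK", " Male ")))
```

String cleanup like this only helps when a label was reported at all, which is the motivation for predicting phenotypes from the expression data instead.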
  34. SRA phenotype information is far from complete
          SubjectID  Sex     Tissue  Race  Age
    6620  NA         female  liver   NA    NA
    6621  NA         female  liver   NA    NA
    6622  NA         female  liver   NA    NA
    6623  NA         female  liver   NA    NA
    6624  NA         female  liver   NA    NA
    6625  NA         male    liver   NA    NA
    6626  NA         male    liver   NA    NA
    6627  NA         male    liver   NA    NA
    6628  NA         male    liver   NA    NA
    6629  NA         male    liver   NA    NA
    6630  NA         male    liver   NA    NA
    6631  NA         NA      blood   NA    NA
    6632  NA         NA      blood   NA    NA
    6633  NA         NA      blood   NA    NA
    6634  NA         NA      blood   NA    NA
    6635  NA         NA      blood   NA    NA
    6636  NA         NA      blood   NA    NA
    Slide adapted from Shannon Ellis
  35. slide adapted from Jeff Leek

  36. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data)
    SRA, Sequence Read Archive, N=49,848
    TCGA, The Cancer Genome Atlas, N=11,284
    GTEx, Genotype Tissue Expression Project, N=9,662
    Slide adapted from Shannon Ellis
  37. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data)
    GTEx, Genotype Tissue Expression Project, N=9,662: divide samples; training set: build and optimize phenotype predictor; test set: test accuracy of predictor
    TCGA, The Cancer Genome Atlas, N=11,284
    SRA, Sequence Read Archive, N=49,848
    Slide adapted from Shannon Ellis
  38. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data)
    GTEx, Genotype Tissue Expression Project, N=9,662: divide samples; training set: build and optimize phenotype predictor; test set: test accuracy of predictor
    TCGA, The Cancer Genome Atlas, N=11,284: predict phenotypes across samples in TCGA
    SRA, Sequence Read Archive, N=49,848
    Slide adapted from Shannon Ellis
  39. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data)
    GTEx, Genotype Tissue Expression Project, N=9,662: divide samples; training set: build and optimize phenotype predictor; test set: test accuracy of predictor
    TCGA, The Cancer Genome Atlas, N=11,284: predict phenotypes across samples in TCGA
    SRA, Sequence Read Archive, N=49,848: predict phenotypes across SRA samples
    Slide adapted from Shannon Ellis
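The workflow on slides 37 to 39 (divide samples, build a predictor on the training set, check accuracy on the test set) can be sketched in a few lines of base R. Everything below is an assumption for illustration: the data are simulated, the single "Y-linked region" is hypothetical, and the midpoint threshold is a stand-in classifier, not the method used by the phenopredict package.

```r
# Simulated toy data (not from recount): coverage of one hypothetical
# Y-linked region in 20 samples whose sex is known.
sex <- rep(c("male", "female"), each = 10)
y_cov <- ifelse(sex == "male", 30, 1) + rep(c(0, 2), times = 10)

# Divide samples: odd positions for training, even positions for testing.
train <- seq(1, 20, by = 2)
test <- seq(2, 20, by = 2)

# Build a one-parameter predictor: the midpoint between the class means.
class_means <- tapply(y_cov[train], sex[train], mean)
threshold <- mean(class_means)
predict_sex <- function(cov) ifelse(cov > threshold, "male", "female")

# Test the accuracy of the predictor on the held-out samples.
accuracy <- mean(predict_sex(y_cov[test]) == sex[test])
print(accuracy)
```

The same fitted predictor would then be applied to samples with no reported sex, which is the step the slides describe for TCGA and SRA.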
  40. phenopredict
    Input data: Expression Data (regions x individuals), Covariate Information, Genomic Region Information, Pheno of Interest
    Functions: select_regions(), build_predictor(), test_predictor(), extract_data(), predict_pheno()
    Slide adapted from Shannon Ellis
  41. select_regions() Output: Coverage matrix (data.frame), Region information (GRanges). Slide adapted from Shannon Ellis
  42. Sex prediction is accurate across data sets
    Number of Regions        20      20      20       20
    Number of Samples (N)    4,769   4,769   11,245   3,640
    Accuracy                 99.8%   99.6%   99.4%    88.5%
    Slide adapted from Shannon Ellis
  43. Sex prediction is accurate across data sets
    Number of Regions        20      20      20       20
    Number of Samples (N)    4,769   4,769   11,245   3,640
    Accuracy                 99.8%   99.6%   99.4%    88.5%
    Slide adapted from Shannon Ellis
  44. Can we use expression data to predict tissue? http://www.rna-seqblog.com/ Slide adapted from Shannon Ellis
  45. Tissue prediction is accurate across data sets
    Number of Regions        589     589     589     589
    Number of Samples (N)    4,769   4,769   7,193   8,951
    Accuracy                 97.3%   96.5%   71.9%   50.6%
    Slide adapted from Shannon Ellis
  46. Prediction is more accurate in healthy tissue
    Number of Regions        589     589     589     589     589
    Number of Samples (N)    4,769   4,769   613     6,579   8,951
    Accuracy                 97.3%   96.5%   91.0%   70.2%   50.6%
    Slide adapted from Shannon Ellis
  47. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    > rse_with_pred <- add_predictions(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  48. Expression data for ~70,000 human samples: GTEx N=9,962; TCGA N=11,284; SRA N=49,848. Expression estimates: gene, exon, junctions, ERs. Answer meaningful questions about human biology and expression
    Sample phenotypes:
    sex   tissue
    M     Blood
    F     Heart
    F     Liver
    Slide adapted from Shannon Ellis
  49. [Figure: number of samples per tissue, reported vs. predicted. Tissues shown: adipose tissue, adrenal gland, bladder, blood, blood vessel, bone, bone marrow, brain, breast, cervix, cervix uteri, colon, epithelium, esophagus, fallopian tube, heart, intestine, kidney, liver, lung, melanoma, monocytes, muscle, nerve, ovary, pancreas, penis, pituitary, placenta, prostate, salivary gland, skin, small intestine, spinal cord, spleen, stem cell, stomach, testis, thyroid, tonsil, umbilical cord, urinary bladder, uterus, vagina]
  50. bioconductor.org/packages/derfinder
    bioconductor.org/packages/recount
    > biocLite("derfinder")
    > biocLite("recount")
    http://rail.bio
    $ ./install-rail-rna-V

  51. https://github.com/leekgroup/recount-contributions

  52. LIBD RNA-seq pipeline: steps (work with Emily Burke)
    1. Quality check (QC) on raw reads
    2. Failed QC? Then trim reads
    3. Align reads to the genome
    4. Count features
    5. Calculate coverage
    6. Transcript level quantification
    7. Create count tables
    8. Call variants for identifying swaps
  53. Collaborators
    The Leek Group: Jeff Leek, Shannon Ellis
    Hopkins: Ben Langmead, Chris Wilks, Kai Kammers, Kasper Hansen, Margaret Taub
    OHSU: Abhinav Nellore
    LIBD: Andrew Jaffe, Emily Burke, Stephen Semick, Carrie Wright, Badoi Phan, Amanda Price, Nina Rajpurohit
    Funding: NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education, Seven Bridges, IDIES SciServer