Reproducible RNA-seq analysis with recount and recount-brain

Remote class/talk for LCG-UNAM 2018 on recount and recount-brain.

Transcript

  1. Reproducible RNA-seq analysis with recount and recount-brain. Leonardo Collado-Torres, @fellgernon, #LCG2018

  2. Reference genome Reads

  3. None
  4. GTEx, TCGA. Slide adapted from Shannon Ellis

  5. SRA

  6. Slide adapted from Ben Langmead

  7. http://rail.bio/ Slide adapted from Ben Langmead

  8. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  9. https://jhubiostatistics.shinyapps.io/recount/

  10. Figure: reads and coverage along a gene with isoform 1, isoform 2 and potential isoform 3; exons 1 to 4; exon-exon junctions jx 1 to jx 6; expressed region 1 (potential exon 5).
  11. None
  12. exon 1 exon 2 exon 3

  13. disjoint exon 1 disjoint exon 2 disjoint exon 3
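
Slides 12 and 13 contrast annotated exons with disjoint exons. The following is a minimal base R sketch of that splitting step, assuming made-up exon coordinates and a hypothetical helper named `disjoin_intervals()`; neither comes from the talk. With Bioconductor, `GenomicRanges::disjoin()` performs the same operation on `GRanges` objects.

```r
## Hypothetical exon coordinates (1-based, inclusive); exons 1 and 2 overlap
exons <- data.frame(start = c(100, 150, 400), end = c(300, 250, 500))

## Hypothetical helper: split overlapping intervals into non-overlapping pieces
disjoin_intervals <- function(start, end) {
  ## A new piece begins at every start and right after every end
  breaks <- sort(unique(c(start, end + 1)))
  pieces <- data.frame(
    start = head(breaks, -1),
    end = tail(breaks, -1) - 1
  )
  ## Keep only the pieces covered by at least one input interval
  covered <- vapply(seq_len(nrow(pieces)), function(i) {
    any(start <= pieces$start[i] & end >= pieces$end[i])
  }, logical(1))
  pieces[covered, , drop = FALSE]
}

disjoint <- disjoin_intervals(exons$start, exons$end)
print(disjoint)
```

Each base of the original exons ends up in exactly one disjoint piece, so coverage summed over the pieces is not counted twice where exons overlap.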

  14. None
  15. None
  16. None
  17. Figure: coverage along the genome (x-axis: Genome, y-axis: Coverage). Coverage per base: 3 3 5 4 4 2 2 3 1 3 3 1 4 4 2 1. AUC = area under coverage = 45
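
The AUC on slide 17 is the sum of the per-base coverage. The sketch below recomputes it from the values on the slide and then rescales one region's coverage sum to a common library size, which is the general idea behind scaling by AUC. The region positions (`gene_positions`) and `target_size` are toy values chosen for illustration, not values from the talk or recount defaults.

```r
## Per-base coverage values from slide 17
coverage <- c(3, 3, 5, 4, 4, 2, 2, 3, 1, 3, 3, 1, 4, 4, 2, 1)

## Area under coverage: total coverage summed over all bases
auc <- sum(coverage)

## Hypothetical gene spanning positions 3 to 8 (assumption for illustration)
gene_positions <- 3:8
gene_coverage <- sum(coverage[gene_positions])

## Toy target library size (assumption); rescale so the sample's AUC equals it
target_size <- 90
scaled_count <- gene_coverage * target_size / auc

print(c(auc = auc, gene_coverage = gene_coverage, scaled_count = scaled_count))
```

Dividing by the AUC makes samples with different sequencing depths comparable before they are analyzed together.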
  18. None
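  19. library('recount')
      ## Download the gene-level RangedSummarizedExperiment for study ERP001942
      download_study('ERP001942', type = 'rse-gene')
      ## Load it; this creates the object rse_gene
      load(file.path('ERP001942', 'rse_gene.Rdata'))
      ## Scale the coverage counts before analysis
      rse <- scale_counts(rse_gene)

      https://github.com/leekgroup/recount-analyses/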
  20. slide adapted from Jeff Leek

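  21. library('recount')
      ## Download the gene-level data for both studies
      download_study('SRP029880', type = 'rse-gene')
      download_study('SRP059039', type = 'rse-gene')
      ## The slide does not define dat; assumed here to be a list of the two
      ## loaded rse_gene objects
      dat <- lapply(c('SRP029880', 'SRP059039'), function(project) {
          load(file.path(project, 'rse_gene.Rdata'))
          rse_gene
      })
      ## Merge the samples from both studies
      mdat <- do.call(cbind, dat)

      https://github.com/leekgroup/recount-analyses/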
  22. Collado-Torres et al, Nat. Biotech, 2017

  23. None
  24. None
  25. None
  26. Figure: reads and coverage along a gene with isoform 1, isoform 2 and potential isoform 3; exons 1 to 4; exon-exon junctions jx 1 to jx 6; expressed region 1 (potential exon 5).
  27. Collado-Torres et al, NAR, 2017

  28. Postmortem Human Brain Samples (Jaffe et al, Nat. Neuroscience, 2015). Discovery data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36. Replication data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36.
  29. Jaffe et al, Nat. Neuroscience, 2015

  30. BrainSpan data Jaffe et al, Nat. Neuroscience, 2015

  31. Expression data for ~70,000 human samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848. Expression estimates for each sample: gene, exon, junctions, ERs. Slide adapted from Shannon Ellis.
  32. Expression data for ~70,000 human samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848. Expression estimates for each sample: gene, exon, junctions, ERs. Answer meaningful questions about human biology and expression. Slide adapted from Shannon Ellis.
  33. Expression data for ~70,000 human samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848. Expression estimates for each sample: gene, exon, junctions, ERs. Sample phenotypes: ? Answer meaningful questions about human biology and expression. Slide adapted from Shannon Ellis.
  34. Category  Frequency
      F         95
      female    2036
      Female    51
      M         77
      male      1240
      Male      141
      Total     3640

      Even when information is provided, it’s not always clear… sra_meta$Sex: “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”. Slide adapted from Shannon Ellis.
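
The frequency table on slide 34 shows the same two categories spelled six different ways. A minimal base R sketch of collapsing such labels follows; the helper name `normalize_sex()` and the rule of mapping everything ambiguous to `NA` are assumptions for illustration, not the approach taken in the talk.

```r
## Frequencies of the reported labels, copied from slide 34
freq <- c(F = 95, female = 2036, Female = 51, M = 77, male = 1240, Male = 141)

## Hypothetical helper: map free-text labels to "female", "male" or NA
normalize_sex <- function(x) {
  cleaned <- tolower(trimws(x))
  out <- rep(NA_character_, length(cleaned))
  out[cleaned %in% c("f", "female")] <- "female"
  out[cleaned %in% c("m", "male")] <- "male"
  out
}

## Collapse the six spellings into two categories
collapsed <- tapply(freq, normalize_sex(names(freq)), sum)
print(collapsed)

## Ambiguous or pooled labels from the slide are left as NA
ambiguous <- normalize_sex(c("pooled male and female", "3 Female", "unknown", "N/A"))
print(ambiguous)
```

Leaving unclear entries as `NA` rather than guessing keeps them available as candidates for prediction later.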
  35. SRA phenotype information is far from complete

            SubjectID  Sex     Tissue  Race  Age
      6620  NA         female  liver   NA    NA
      6621  NA         female  liver   NA    NA
      6622  NA         female  liver   NA    NA
      6623  NA         female  liver   NA    NA
      6624  NA         female  liver   NA    NA
      6625  NA         male    liver   NA    NA
      6626  NA         male    liver   NA    NA
      6627  NA         male    liver   NA    NA
      6628  NA         male    liver   NA    NA
      ...

      Slide adapted from Shannon Ellis.
  36. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data). SRA (Sequence Read Archive) N=49,848; TCGA (The Cancer Genome Atlas) N=11,284; GTEx (Genotype Tissue Expression Project) N=9,662. Slide adapted from Shannon Ellis.
  37. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data). SRA (Sequence Read Archive) N=49,848; TCGA (The Cancer Genome Atlas) N=11,284; GTEx (Genotype Tissue Expression Project) N=9,662. Workflow: divide the GTEx samples into a training set and a test set; build and optimize the phenotype predictor on the training set; test the accuracy of the predictor on the test set. Slide adapted from Shannon Ellis.
  38. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data). SRA (Sequence Read Archive) N=49,848; TCGA (The Cancer Genome Atlas) N=11,284; GTEx (Genotype Tissue Expression Project) N=9,662. Workflow: divide the GTEx samples into a training set and a test set; build and optimize the phenotype predictor on the training set; test the accuracy of the predictor on the test set; predict phenotypes across samples in TCGA. Slide adapted from Shannon Ellis.
  39. Goal: to accurately predict critical phenotype information for all samples in recount (gene, exon, exon-exon junction and expressed region RNA-Seq data). SRA (Sequence Read Archive) N=49,848; TCGA (The Cancer Genome Atlas) N=11,284; GTEx (Genotype Tissue Expression Project) N=9,662. Workflow: divide the GTEx samples into a training set and a test set; build and optimize the phenotype predictor on the training set; test the accuracy of the predictor on the test set; predict phenotypes across samples in TCGA; predict phenotypes across SRA samples. Slide adapted from Shannon Ellis.
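
The workflow on slides 37 to 39 (split the labeled samples, build the predictor on the training half, check accuracy on the held-out half) can be sketched in base R. Everything below is an assumption for illustration: the data are simulated, the three regions are hypothetical, and a simple nearest-centroid classifier is swapped in for the predictor used in the talk, which is not reproduced here.

```r
set.seed(2018)

## Simulated samples with known sex (stand-in for labeled samples)
n <- 200
sex <- rep(c("female", "male"), each = n / 2)

## Simulated coverage for three hypothetical regions; only region1 is informative
expr <- cbind(
  region1 = rnorm(n, mean = ifelse(sex == "male", 10, 0)),
  region2 = rnorm(n),
  region3 = rnorm(n)
)

## Divide samples into a training set and a test set
train <- sort(sample(n, n / 2))
test <- setdiff(seq_len(n), train)

## Build the predictor on the training set: one centroid per class
classes <- c("female", "male")
centroids <- sapply(classes, function(s) {
  colMeans(expr[train[sex[train] == s], , drop = FALSE])
})

## Predict by assigning each sample to the closest centroid
predict_sex <- function(x) {
  dist2 <- apply(centroids, 2, function(center) rowSums(sweep(x, 2, center)^2))
  classes[max.col(-dist2, ties.method = "first")]
}

## Test the accuracy of the predictor on the held-out test set
predicted <- predict_sex(expr[test, , drop = FALSE])
accuracy <- mean(predicted == sex[test])
print(accuracy)
```

Once the test-set accuracy is acceptable, the same `predict_sex()` call would be applied to samples whose labels are missing, which mirrors the TCGA and SRA steps on the slides.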
  40. select_regions(). Output: coverage matrix (data.frame) and region information. Slide adapted from Shannon Ellis.
  41. Sex prediction is accurate across data sets (figure: four data sets, values in slide order)
      Number of Regions:      20      20      20      20
      Number of Samples (N):  4,769   4,769   11,245  3,640
      Accuracy:               99.8 %  99.6 %  99.4 %  88.5 %
      Slide adapted from Shannon Ellis.
  42. Sex prediction is accurate across data sets (figure: four data sets, values in slide order)
      Number of Regions:      20      20      20      20
      Number of Samples (N):  4,769   4,769   11,245  3,640
      Accuracy:               99.8 %  99.6 %  99.4 %  88.5 %
      Slide adapted from Shannon Ellis.
  43. http://www.rna-seqblog.com/ Can we use expression data to predict tissue? Slide adapted from Shannon Ellis.
  44. Tissue prediction is accurate across data sets (figure: four data sets, values in slide order)
      Number of Regions:      589     589     589     589
      Number of Samples (N):  4,769   4,769   7,193   8,951
      Accuracy:               97.3 %  96.5 %  71.9 %  50.6 %
      Slide adapted from Shannon Ellis.
  45. Prediction is more accurate in healthy tissue (figure: five data sets, values in slide order)
      Number of Regions:      589     589     589     589     589
      Number of Samples (N):  4,769   4,769   613     6,579   8,951
      Accuracy:               97.3 %  96.5 %  91.0 %  70.2 %  50.6 %
      Slide adapted from Shannon Ellis.
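  46. library('recount')
      ## Download and load the gene-level data for study ERP001942
      download_study('ERP001942', type = 'rse-gene')
      load(file.path('ERP001942', 'rse_gene.Rdata'))
      ## Scale the coverage counts before analysis
      rse <- scale_counts(rse_gene)
      ## Add the predicted phenotypes to the sample metadata
      rse_with_pred <- add_predictions(rse_gene)

      https://github.com/leekgroup/recount-analyses/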
  47. Expression data for ~70,000 human samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848. Expression estimates for each sample: gene, exon, junctions, ERs. Sample phenotypes (sex, tissue): M, Blood; F, Heart; F, Liver. Answer meaningful questions about human biology and expression. Slide adapted from Shannon Ellis.
  48. None
  49. Can combine with genotype data to identify eQTLs. Slide adapted from Kai Kammers.
  50. biorxiv.org/content/early/2018/01/12/247346

  51. Expression data for ~70,000 human samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848. Expression estimates for each sample: gene, exon, junctions, ERs. Sample phenotypes (sex, tissue): M, Blood; F, Heart; F, Liver. Answer meaningful questions about human biology and expression. Slide adapted from Shannon Ellis.
  52. Sex: Female, Male
      Age/Development: Fetus, Child, Adolescent, Adult
      Race/Ethnicity: Asian, Black, Hispanic, White
      Tissue Site 1: Cerebral cortex, Hippocampus, Brainstem, Cerebellum
      Tissue Site 2: Frontal lobe, Temporal lobe, Midbrain, Basal ganglia
      Tissue Site 3: Dorsolateral prefrontal cortex, Superior temporal gyrus, Substantia nigra, Caudate
      Hemisphere: Left, Right
      Brodmann Area: 1-52
      Disease Status: Disease, Neurological control
      Disease: Brain tumor, Alzheimer’s disease, Parkinson’s disease, Bipolar disorder
      Tumor Type: Glioblastoma, Astrocytoma, Oligodendroglioma, Ependymoma
      Clinical Stage 1: Grade I, Grade II, Grade III, Grade IV
      Clinical Stage 2: Primary, Secondary, Recurrent
      Viability: Postmortem, Biopsy
      Preparation: Frozen, Thawed
  53. Ashkaun Razmara, in prep.

  54. None
  55. None
  56. Code example: research.libd.org/recount-brain/example_PMI/example_PMI.html and research.libd.org/recount-brain/example_PMI/example_PMI.Rmd. Replicates part of the GTEx PMI paper by Ferreira et al., doi.org/10.1038/s41467-017-02772-x. Ashkaun Razmara, in prep.
  57. The recount2 team. Hopkins: Kai Kammers, Shannon Ellis, Margaret Taub, Kasper Hansen, Jeff Leek, Ben Langmead. OHSU: Abhinav Nellore. LIBD: Leonardo Collado-Torres, Andrew Jaffe. recount-brain: Ashkaun Razmara. Funding and hosting: NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education, Seven Bridges, IDIES SciServer.
  58. Expression data for ~70,000 human samples. (Multiple) postdoc positions available to (1) develop methods to process and analyze data from recount2 and (2) use recount2 to address specific biological questions. This project involves the Hansen, Leek, Langmead and Battle labs at JHU. Contact: Kasper D. Hansen (khansen@jhsph.edu | www.hansenlab.org)