jgm2018

Leonardo Collado-Torres

November 29, 2018

Transcript

  1. 11 Reproducible RNA-seq analysis with Leonardo Collado-Torres @fellgernon Data Science

    I with Andrew Jaffe and
  2. Lieber Institute for Brain Development

    • Role: Staff Scientist
    • PI: Andrew Jaffe
    • Team:
      • Staff Scientist: Emily Burke
      • Postdocs: Carrie Wright, Amanda Price
      • Grad students: Matt Nguyen, Brianna Barry
      • Research Associate: Anandita Rajpurohit
      • Research Assistant: Nick Eagles, Stephen Semick*
    • Role details: like a postdoc (major role in some projects) with some support projects
    • https://www.libd.org/careers/ & you can always email Andrew andrew.jaffe@libd.org for job inquiries
    * Now at University of Maryland in med school
  3. Lieber Institute for Brain Development

    • miRNA kit prep comparison biorxiv.org/content/early/2018/10/30/445437
    • DNAm and gene DE in Alzheimer’s disease biorxiv.org/content/early/2018/09/29/430603
    • DNAm on WGBS across development & cell types biorxiv.org/content/early/2018/09/29/428391
    • RNA-seq DE in Schizophrenia disorder & 2 brain regions biorxiv.org/content/early/2018/09/26/426213
    • RNA-seq from stem cells biorxiv.org/content/early/2018/07/31/380758
    Peer reviewed:
    • RNA-seq smoking during pregnancy doi.org/10.1038/s41380-018-0223-1
    • RNA-seq DE in Schizophrenia disorder on DLPFC doi.org/10.1038/s41593-018-0197-y
    • Histamine signaling in autism spectrum disorder doi.org/10.1038/tp.2017.87
  4. https://jhubiostatistics.shinyapps.io/recount/

  5. http://rail.bio/ Slide adapted from Ben Langmead by Abhinav Nellore

  6. http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

  7. GTEx TCGA slide adapted from Shannon Ellis

  8. SRA

  9. [Figure: schematic of a gene with isoform 1, isoform 2 and potential isoform 3; exons 1 to 4; exon-exon junctions jx 1 to jx 6; reads and coverage; expressed region 1 as potential exon 5]

    doi.org/10.12688/f1000research.12223.1
  10. doi.org/10.12688/f1000research.12223.1

  11. > library('recount')

    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/
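A rough sketch of the idea behind the `scale_counts()` step, using base R only. It assumes the default behavior as I understand it: each sample's coverage-based counts are rescaled by that sample's area under coverage (AUC) to a common target library size of 40 million reads. The toy matrix, the AUC values and the variable names are made up for illustration; this is not the package's code.

```r
# Toy coverage-based counts: 2 genes x 2 samples (hypothetical numbers)
counts <- matrix(
    c(2e6, 6e6, 1e6, 3e6),
    nrow = 2,
    dimnames = list(c("gene1", "gene2"), c("A", "B"))
)

# Hypothetical total coverage (AUC) per sample
auc <- c(A = 8e7, B = 4e7)

# Assumed target library size: 40 million reads
target_size <- 4e7

# One scaling factor per sample, applied column by column
scale_factor <- target_size / auc
scaled <- round(sweep(counts, 2, scale_factor, `*`))

scaled
```

After scaling, samples A and B have identical values in this toy case because they only differed in sequencing depth.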
  12. slide adapted from Jeff Leek

  13. related projects

    • Bioconductor recountWorkflow: doi.org/10.12688/f1000research.12223.1
    • Shannon Ellis & Leek: phenotype prediction doi.org/10.1093/nar/gky102
    • Jack Fu & Taub: transcript estimations biorxiv.org/content/early/2018/05/25/247346
    • Madugundu & Pandey (JHU): proteomics
    • Luidi-Imada & Marchionni (JHU): cancer and FANTOM
    • Kuri-Magaña & Martínez-Barnetche (INSP Mexico): immune expression doi.org/10.3389/fimmu.2018.02679
    • Ryten (UCL):
      Guelfi: validating expressed regions (ERs) eQTLs
      Zhang: improving the detection of ERs
  14. expression data for ~70,000 human samples

    (Multiple) Postdoc positions available to
    - develop methods to process and analyze data from recount2
    - use recount2 to address specific biological questions
    This project involves the Hansen, Leek, Langmead and Battle labs at JHU
    Contact: Kasper D. Hansen (khansen@jhsph.edu | www.hansenlab.org)
  15. expression data for ~70,000 human samples

    [Diagram: samples by phenotypes (?) and samples by expression estimates (gene, exon, junctions, ERs) for GTEx N=9,662, TCGA N=11,284 and SRA N=49,848]
    Answer meaningful questions about human biology and expression
    slide adapted from Shannon Ellis
  16. Even when information is provided, it’s not always clear…

    sra_meta$Sex
    Category   Frequency
    F                 95
    female          2036
    Female            51
    M                 77
    male            1240
    Male             141
    Total           3640

    “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis
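To make the problem concrete, here is a minimal base R sketch that collapses the six clear-cut spellings from the table into two categories and leaves everything ambiguous as `NA`. The helper name `harmonize_sex()` is hypothetical and the mapping rules are an assumption for illustration, not the approach used for recount2.

```r
# Frequencies as listed on the slide for sra_meta$Sex
reported <- c(F = 95, female = 2036, Female = 51, M = 77, male = 1240, Male = 141)

# Hypothetical helper: map free-text labels to "female" / "male",
# returning NA for anything that is not an exact, unambiguous match
harmonize_sex <- function(x) {
    x <- tolower(trimws(x))
    out <- rep(NA_character_, length(x))
    out[x %in% c("f", "female")] <- "female"
    out[x %in% c("m", "male")] <- "male"
    out
}

# Add up the frequencies per harmonized label
harmonized <- tapply(reported, harmonize_sex(names(reported)), sum)
harmonized

# Ambiguous entries stay missing instead of being guessed
harmonize_sex(c("mixed", "DK", "pooled male and female"))
```

Keeping ambiguous values as `NA` is a deliberately conservative choice: entries such as “1 Male, 2 Female” describe pooled samples and cannot be assigned to a single category.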
  17. Goal: to accurately predict critical phenotype information for all samples in recount

    RNA-Seq data: gene, exon, exon-exon junction and expressed region
    • SRA (Sequence Read Archive) N=49,848
    • TCGA (The Cancer Genome Atlas) N=11,284
    • GTEx (Genotype Tissue Expression Project) N=9,662
    slide adapted from Shannon Ellis
  18. Goal: to accurately predict critical phenotype information for all samples in recount

    RNA-Seq data: gene, exon, exon-exon junction and expressed region
    • SRA (Sequence Read Archive) N=49,848
    • GTEx (Genotype Tissue Expression Project) N=9,662
      - divide samples
      - training set: build and optimize phenotype predictor
      - test set: test accuracy of predictor
    • TCGA (The Cancer Genome Atlas) N=11,284
    slide adapted from Shannon Ellis
  19. Goal: to accurately predict critical phenotype information for all samples in recount

    RNA-Seq data: gene, exon, exon-exon junction and expressed region
    • SRA (Sequence Read Archive) N=49,848
    • GTEx (Genotype Tissue Expression Project) N=9,662
      - divide samples
      - training set: build and optimize phenotype predictor
      - test set: test accuracy of predictor
    • TCGA (The Cancer Genome Atlas) N=11,284
      - predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis
  20. Goal: to accurately predict critical phenotype information for all samples in recount

    RNA-Seq data: gene, exon, exon-exon junction and expressed region
    • SRA (Sequence Read Archive) N=49,848
      - predict phenotypes across SRA samples
    • GTEx (Genotype Tissue Expression Project) N=9,662
      - divide samples
      - training set: build and optimize phenotype predictor
      - test set: test accuracy of predictor
    • TCGA (The Cancer Genome Atlas) N=11,284
      - predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis
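The "divide samples" and "test accuracy of predictor" steps can be sketched in base R as below. The 50/50 split, the seed, the sample IDs and the `accuracy()` helper are all assumptions for illustration; the slides do not state how the split was done, and the predictor itself is not shown here.

```r
set.seed(20181129)  # arbitrary seed so the split is reproducible

# GTEx sample count from the slide; the IDs are hypothetical placeholders
n_gtex <- 9662
sample_ids <- sprintf("gtex_%04d", seq_len(n_gtex))

# Assumed 50/50 random split into training and test sets
train_idx <- sample(n_gtex, size = floor(n_gtex / 2))
training_set <- sample_ids[train_idx]
test_set <- sample_ids[-train_idx]

# Hypothetical helper: fraction of samples where the predicted
# phenotype agrees with the reported one
accuracy <- function(reported, predicted) {
    mean(reported == predicted)
}

# Toy check with made-up labels (3 of 4 agree)
accuracy(
    reported = c("female", "male", "male", "female"),
    predicted = c("female", "male", "female", "female")
)
```

The predictor would be built and tuned using only `training_set`, and `accuracy()` would then be evaluated once on `test_set` before predicting on TCGA and SRA.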
  21. Sex prediction is accurate across data sets

    Number of Regions:       20    |    20 |     20 |    20
    Number of Samples (N): 4,769   | 4,769 | 11,245 | 3,640
    Accuracy:               99.8 % | 99.6 % | 99.4 % | 88.5 %
    slide adapted from Shannon Ellis
  22. Prediction is more accurate in healthy tissue

    Number of Regions:      589    |   589 |   589 |   589 |   589
    Number of Samples (N): 4,769   | 4,769 |   613 | 6,579 | 8,951
    Accuracy:               97.3 % | 96.5 % | 91.0 % | 70.2 % | 50.6 %
    slide adapted from Shannon Ellis
  23. > library('recount')

    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    > rse_with_pred <- add_predictions(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  24. [Figure: reported vs. predicted tissue, one row per tissue]

    Tissues: adipose tissue, adrenal gland, bladder, blood, blood vessel, bone, bone marrow, brain, breast, cervix, cervix uteri, colon, epithelium, esophagus, fallopian tube, heart, intestine, kidney, liver, lung, melanoma, monocytes, muscle, nerve, ovary, pancreas, penis, pituitary, placenta, prostate, salivary gland, skin, small intestine, spinal cord, spleen, stem cell, stomach, testis, thyroid, tonsil, umbilical cord, urinary bladder, uterus, vagina
  25. Ashkaun Razmara

    • MPH student
    • MD candidate
    • Interested in becoming a neurosurgeon
    • Also interested in reproducibility
    https://ashkaunrazmara.com/
    Ashkaun Razmara, in prep.
  26. recount-brain hosts sample metadata for >4,000 human brain tissue samples from >60 projects from the SRA

    Of these, 3,214 (72.5%) samples have expression data available from recount2 (Collado-Torres et al. 2017c). recount-brain supports search and filtering by tissue phenotype, including:
    • 3,600 neurological controls, 2,900 of them from SRP025982 (SEQC/MAQC-III Consortium 2014)
    • 15 neurological diseases and information on brain tumor subtype, grade, and stage
    • 3 levels of detailed anatomic tissue site information
    • 5 developmental stages (Fetus, Infant, Child, Adolescent, Adult)
    • demographic data (age, sex, race)
    • technical sequencing information (RIN, PMI, sequencing layout, library source, etc.)
    Size of the table:
    • 62 SRA studies
    • 4,431 rows by 48 columns
    Ashkaun Razmara, in prep.
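A small base R sketch of what filtering by phenotype could look like. The data frame below is a made-up stand-in, and its column names (`development`, `disease_status`, `tissue_site_1`, `present_in_recount`) are assumptions for illustration that may not match the actual recount-brain columns.

```r
# Toy stand-in for the metadata table; rows and column names are hypothetical
brain_meta <- data.frame(
    run = c("run1", "run2", "run3", "run4"),
    development = c("Adult", "Fetus", "Adult", "Child"),
    disease_status = c("Neurological control", "Neurological control",
                       "Disease", "Disease"),
    tissue_site_1 = c("Cerebral cortex", "Cerebral cortex",
                      "Hippocampus", "Cerebellum"),
    present_in_recount = c(TRUE, FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE
)

# Adult neurological controls that also have expression data
adult_controls <- subset(
    brain_meta,
    development == "Adult" &
        disease_status == "Neurological control" &
        present_in_recount
)
adult_controls$run

# Sanity check of the percentage quoted on the slide: 3,214 of 4,431 samples
pct_with_expression <- round(100 * 3214 / 4431, 1)
pct_with_expression  # 72.5
```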
  27. Sex: Female, Male

    Age/Development: Fetus, Child, Adolescent, Adult
    Race/Ethnicity: Asian, Black, Hispanic, White
    Tissue Site 1: Cerebral cortex, Hippocampus, Brainstem, Cerebellum
    Tissue Site 2: Frontal lobe, Temporal lobe, Midbrain, Basal ganglia
    Tissue Site 3: Dorsolateral prefrontal cortex, Superior temporal gyrus, Substantia nigra, Caudate
    Hemisphere: Left, Right
    Brodmann Area: 1-52
    Disease Status: Disease, Neurological control
    Disease: Brain tumor, Alzheimer’s disease, Parkinson’s disease, Bipolar disorder
    Tumor Type: Glioblastoma, Astrocytoma, Oligodendroglioma, Ependymoma
    Clinical Stage 1: Grade I, Grade II, Grade III, Grade IV
    Clinical Stage 2: Primary, Secondary, Recurrent
    Viability: Postmortem, Biopsy
    Preparation: Frozen, Thawed
    Ashkaun Razmara, in prep.
  28. Reproducibility document

    https://github.com/LieberInstitute/recount-brain/tree/master/metadata_reproducibility
    • Overall curation steps: start by downloading the SRA Run Table info, then add info from the publications
    • Details for each SRA study
    Ashkaun Razmara, in prep.
  32. [Figure with panels (A) and (B), by Shannon Ellis]

    Principal components (gene level), log2(normalized(RPKM) + 0.5): PC1 vs. PC2, PC3 vs. PC4, PC5 vs. PC6, PC7 vs. PC8, for SRP027383, SRP044668 and TCGA
    Concordance: ordered genes in reference study vs. ordered genes in new study, for SRP027383_SRP044668, SRP027383_TCGA, SRP044668_TCGA and brain_kidney
  33. Code Example

    research.libd.org/recount-brain/example_PMI/example_PMI.html
    research.libd.org/recount-brain/example_PMI/example_PMI.Rmd
    Replicates part of the GTEx PMI paper by Ferreira et al. doi.org/10.1038/s41467-017-02772-x
    Ashkaun Razmara, in prep.
  34. recount_brain_v2

    * Jeff Leek presented Shannon Ellis’ prediction work in Toronto (around April 2018) https://docs.google.com/presentation/d/1FgUZZU6pW91J7zH0OqrEgxfnV1Py_ZGL3ZKHfbOZskY/edit#slide=id.g2f831fd4ae_0_306
    * Dustin J. Sokolowski from Michael D. Wilson’s lab is using recount2
    * Dustin joins the project and merges recount-brain with GTEx and TCGA
    The SRA samples in recount-brain are complemented by 1,409 GTEx (GTEx Consortium 2015) and 707 TCGA (Brennan et al. 2013; Cancer Genome Atlas Research Network et al. 2015) samples covering 13 healthy regions of the brain and 2 tumor types, respectively. In total, there are 6,547 samples with metadata in recount-brain with 5,330 (81.4%) present in recount2
    Ashkaun Razmara, in prep.
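The totals quoted for recount_brain_v2 can be checked with a few lines of base R. The sketch assumes the SRA contribution is the 4,431 rows of the original recount-brain table; the variable names are made up.

```r
# Sample counts quoted in the slides
n_sra <- 4431   # rows in the SRA-based recount-brain table (assumed to be the SRA part of v2)
n_gtex <- 1409  # GTEx brain samples
n_tcga <- 707   # TCGA brain tumor samples

# Total samples with metadata in recount_brain_v2
n_total <- n_sra + n_gtex + n_tcga
n_total  # 6547

# Percentage of samples that are present in recount2
n_in_recount2 <- 5330
pct_in_recount2 <- round(100 * n_in_recount2 / n_total, 1)
pct_in_recount2  # 81.4
```

Both results agree with the 6,547 samples and 81.4% stated on the slide.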
  35. What about maintaining/growing it?

    * Discussed with Sean Davis (NIH) at Biological Data Science #biodata18
    * Potential for contributing recount-brain to SRAdbV2 github.com/seandavi/SRAdbV2
    * Currently working with him to make this happen
    * Might map some columns to ontologies
    Ashkaun Razmara, in prep.
  36. * I’ve been slow at finishing this project!!!

    * Have benefited a bit from being slow: Dustin’s and Sean’s input has improved recount-brain
    * Full draft has been checked by authors, got LIBD’s permission to post on bioRxiv
    * How valuable is this resource?
      - other curated metadata projects include expression (here expression is already in recount2)
    * Advice from LW: pre-print, grow user base interest, then submit to a journal quoting usage numbers
    Ashkaun Razmara, in prep.
    Lieber Institute for Brain Development
  37. The recount-brain team

    Hopkins: Ashkaun Razmara, Shannon E. Ellis, Jeff T. Leek
    University of Toronto: Dustin J. Sokolowski, Michael D. Wilson
    NIH: Sean Davis
    LIBD: Andrew E. Jaffe
    Funding: NIH R01 GM105705, NIH 1R21MH109956, NIH R01 GM121459, CIHR, NSERC, Ontario Ministry of Research
    Hosting recount2: IDIES SciServer
    github.com/LieberInstitute/recount-brain