
CDSBMexico

From learning to using to teaching to developing R

Leonardo Collado Torres
@fellgernon #rstats #teaching #CDSBMexico

Leonardo Collado-Torres

July 30, 2018

Transcript

  1. TIB.2018 (R para todos) & Latin American R/Bioconductor Developers Workshop
    From learning to using to teaching to developing R
    Leonardo Collado Torres @fellgernon #rstats #teaching #CDSBMexico
    https://speakerdeck.com/lcolladotor/CDSBMexico
  2. Who knows about ?
    Sandrine Dudoit: She’s one of the @Bioconductor project founders! @cendrinou
  3. Alejandro Reyes (first BioC: 2009) @areyesq http://alejandroreyes.org/
    BioC2009 + BioC2010 + BioC2011: Developer’s day + 2 conference days
    + Europe Bioc 2010 http://www-huber.embl.de/biocdeveleurope2010/
    With support from: @Bioconductor, @lcgunam, @WINTERGENOMICS
  4. @fellgernon #rstats #teaching #educollab
    http://lcolladotor.github.io/courses/Courses/B/ (has videos of me teaching :P, it was a pilot for OpenCourseware)
    TAs:
    • Alejandro Reyes @areyesq
    • José Víctor Moreno Mayar https://geogenetics.ku.dk/staff/?pure=en/persons/475726
    • José Reyes http://sysbiophd.harvard.edu/people/student-profiles/jose-reyes
  5. Always ask for support!
    • Support for traveling or registration or lodging
    • Support for teaching: Robert Gentleman gave me free copies of books he had in his office (authors normally get several free copies of books)
    • Support for community building: almost had Bioconductor’s support in 2010ish for 1 visit, we didn’t give up! #CDSBMexico
    • Feel free to ask for help! We all started somewhere!!
    Check your spam box and filters:
    • Almost lost a scholarship for useR! 2013 that way :P
    Check the dates for applying for support!
    Ask for emails and keep in touch:
    • I asked Davis McCarthy @davisjmcc for PhD application and career advice in 2010
    • That’s how I got into my PhD
    Socialize! Take advantage of opportunities offered to you!
  6. BioC2010
    First time presenting a poster about an R package (BacterialTranscription): Transcription initiation mapping and transcription unit identification in E. coli
    Rafael Irizarry https://rafalab.github.io/ @rafalab
    Ingo Ruczinski http://www.biostat.jhsph.edu/~iruczins/
    Them: Have you heard about Johns Hopkins?
    Me: Johns???? No idea
    Them: come join us at @jhubiostat !!
  7. SRA

  8. [Figure: coverage and reads along a gene. Labels: Coverage; Reads; Gene; Isoform 1; Isoform 2; Potential isoform 3; exon 1 to exon 4; junctions jx 1 to jx 6; Expressed region 1: potential exon 5]
  9. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    https://github.com/leekgroup/recount-analyses/
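The `scale_counts()` step on this slide can be pictured with a toy matrix. The sketch below is a simplified base R illustration, not recount's implementation: it assumes the counts are scaled by each sample's total coverage (AUC) to a target library size of 40 million reads. The names `toy_cov`, `toy_auc` and `scale_by_auc()` are hypothetical and exist only for this example.

```r
## Toy base-pair coverage counts: 3 genes (rows) x 2 samples (columns).
## sampleB was sequenced twice as deeply as sampleA.
toy_cov <- matrix(
    c(1000, 3000, 6000,
      2000, 6000, 12000),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("sampleA", "sampleB"))
)

## Total coverage per sample. In this toy example it is just the column sum;
## for real data it would be the coverage summed over the whole genome.
toy_auc <- colSums(toy_cov)

## Hypothetical helper: divide each column by its AUC, then multiply by the
## target library size and round to whole counts.
scale_by_auc <- function(counts, auc, target_size = 4e7) {
    round(sweep(counts, 2, auc, "/") * target_size)
}

toy_scaled <- scale_by_auc(toy_cov, toy_auc)
print(toy_scaled)
```

After scaling, the two samples are directly comparable even though one had twice the sequencing depth, which is the point of running this step before any downstream analysis.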
  10. Postmortem Human Brain Samples (Jaffe et al, Nat. Neuroscience, 2015) @andrewejaffe
    Discovery data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
    Replication data: Fetal, Infant, Child, Teen, Adult, 50+; 6 / group, N = 36
  11. expression data for ~70,000 human samples
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples; expression estimates: gene, exon, junctions, ERs
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  12. expression data for ~70,000 human samples
    Answer meaningful questions about human biology and expression
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples; expression estimates: gene, exon, junctions, ERs
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  13. expression data for ~70,000 human samples
    Answer meaningful questions about human biology and expression
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples; expression estimates: gene, exon, junctions, ERs
    samples; phenotypes: ?
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  14. Even when information is provided, it’s not always clear…
    sra_meta$Sex

    Category  Frequency
    F                95
    female         2036
    Female           51
    M                77
    male           1240
    Male            141
    Total          3640

    Other values: “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
    slide adapted from Shannon Ellis @Shannon_E_Ellis
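The inconsistent labels on this slide can be harmonized before any analysis. The following is a minimal base R sketch, not the approach used by the recount team: the helper `normalize_sex()` is hypothetical, and it assumes that only the unambiguous one-sex spellings should be mapped, with everything else set to `NA`.

```r
## Hypothetical helper: map the spelling variants of a single sex to one
## label and leave ambiguous or missing values as NA.
normalize_sex <- function(x) {
    key <- tolower(trimws(x))
    out <- rep(NA_character_, length(key))
    out[key %in% c("f", "female")] <- "female"
    out[key %in% c("m", "male")] <- "male"
    out
}

## Frequencies copied from the slide.
slide_counts <- c(F = 95, female = 2036, Female = 51,
                  M = 77, male = 1240, Male = 141)

## Collapse the six categories into two.
collapsed <- tapply(slide_counts, normalize_sex(names(slide_counts)), sum)
print(collapsed)

## Ambiguous entries from the slide stay NA instead of being guessed.
print(normalize_sex(c("mixed", "pooled male and female", "N/A", "U")))
```

Leaving the ambiguous entries as `NA` is a deliberate choice in this sketch: those are the samples for which a prediction from the expression data is most useful.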
  15. SRA phenotype information is far from complete

          SubjectID  Sex     Tissue  Race  Age
    6620  NA         female  liver   NA    NA
    6621  NA         female  liver   NA    NA
    6622  NA         female  liver   NA    NA
    6623  NA         female  liver   NA    NA
    6624  NA         female  liver   NA    NA
    6625  NA         male    liver   NA    NA
    6626  NA         male    liver   NA    NA
    6627  NA         male    liver   NA    NA
    6628  NA         male    liver   NA    NA

    slide adapted from Shannon Ellis @Shannon_E_Ellis
  16. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive) N=49,848
    TCGA (The Cancer Genome Atlas) N=11,284
    GTEx (Genotype Tissue Expression Project) N=9,662
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  17. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive) N=49,848
    TCGA (The Cancer Genome Atlas) N=11,284
    GTEx (Genotype Tissue Expression Project) N=9,662: divide samples
    • training set: build and optimize phenotype predictor
    • test set: test accuracy of predictor
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  18. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive) N=49,848
    TCGA (The Cancer Genome Atlas) N=11,284
    GTEx (Genotype Tissue Expression Project) N=9,662: divide samples
    • training set: build and optimize phenotype predictor
    • test set: test accuracy of predictor
    • predict phenotypes across samples in TCGA
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  19. Goal: to accurately predict critical phenotype information for all samples in recount
    gene, exon, exon-exon junction and expressed region RNA-Seq data
    SRA (Sequence Read Archive) N=49,848
    TCGA (The Cancer Genome Atlas) N=11,284
    GTEx (Genotype Tissue Expression Project) N=9,662: divide samples
    • training set: build and optimize phenotype predictor
    • test set: test accuracy of predictor
    • predict phenotypes across samples in TCGA
    • predict phenotypes across SRA samples
    slide adapted from Shannon Ellis @Shannon_E_Ellis
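The divide, build and test workflow on these slides can be sketched with simulated data. This is an illustration under loud assumptions, not the predictor built by Shannon Ellis and colleagues: the data are random numbers, the classifier is a plain nearest-centroid rule swapped in for simplicity, and the names `expr`, `centroids` and `predict_sex()` are hypothetical.

```r
set.seed(20180730)

## Simulated stand-in for expression at 20 regions in 200 samples.
n_samples <- 200
n_regions <- 20
sex <- sample(c("female", "male"), n_samples, replace = TRUE)
expr <- matrix(rnorm(n_samples * n_regions), nrow = n_samples)
## Shift 5 regions in the male samples to mimic sex-specific expression.
expr[sex == "male", 1:5] <- expr[sex == "male", 1:5] + 3

## 1. Divide samples into a training set and a test set.
train_idx <- sample(n_samples, size = n_samples / 2)
test_idx <- setdiff(seq_len(n_samples), train_idx)

## 2. Build the predictor on the training set only:
##    one mean expression profile (centroid) per class.
train_expr <- expr[train_idx, , drop = FALSE]
train_sex <- sex[train_idx]
centroids <- sapply(c("female", "male"), function(s) {
    colMeans(train_expr[train_sex == s, , drop = FALSE])
})

## Hypothetical helper: assign each sample to the closest centroid
## (squared Euclidean distance). Expects a matrix with at least two rows.
predict_sex <- function(x, centroids) {
    d <- sapply(colnames(centroids), function(s) {
        rowSums(sweep(x, 2, centroids[, s], "-")^2)
    })
    colnames(d)[max.col(-d, ties.method = "first")]
}

## 3. Test the accuracy of the predictor on the held-out test set.
pred_test <- predict_sex(expr[test_idx, , drop = FALSE], centroids)
accuracy <- mean(pred_test == sex[test_idx])
print(accuracy)
```

Keeping the test samples out of step 2 is what makes the accuracy in step 3 an honest estimate; the same fitted predictor would then be applied to other data sets, as the later slides do with TCGA and SRA.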
  20. Sex prediction is accurate across data sets
    Number of regions: 20; 20; 20; 20
    Number of samples (N): 4,769; 4,769; 11,245; 3,640
    Accuracy: 99.8 %; 99.6 %; 99.4 %; 88.5 %
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  21. Sex prediction is accurate across data sets
    Number of regions: 20; 20; 20; 20
    Number of samples (N): 4,769; 4,769; 11,245; 3,640
    Accuracy: 99.8 %; 99.6 %; 99.4 %; 88.5 %
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  22. Tissue prediction is accurate across data sets
    Number of regions: 589; 589; 589; 589
    Number of samples (N): 4,769; 4,769; 7,193; 8,951
    Accuracy: 97.3 %; 96.5 %; 71.9 %; 50.6 %
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  23. Prediction is more accurate in healthy tissue
    Number of regions: 589; 589; 589; 589; 589
    Number of samples (N): 4,769; 4,769; 613; 6,579; 8,951
    Accuracy: 97.3 %; 96.5 %; 91.0 %; 70.2 %; 50.6 %
    slide adapted from Shannon Ellis @Shannon_E_Ellis
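As a quick arithmetic check, the two middle groups on this slide can be pooled and compared with the single 71.9 % figure reported for 7,193 samples on the previous slide. The labels `healthy` and `other` below are assumptions: the transcript does not name the groups, and the reading that the 613-sample group is the healthy tissue comes only from the slide title.

```r
## Sample sizes and accuracies copied from the slide; group labels assumed.
n <- c(healthy = 613, other = 6579)
acc <- c(healthy = 0.910, other = 0.702)

## Sample-size weighted average of the two accuracies.
pooled <- sum(n * acc) / sum(n)
print(round(100 * pooled, 1))
```

The pooled value lands within rounding distance of the figure on the previous slide, which suggests the two groups are a split of that same data set.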
  24. > library('recount')
    > download_study('ERP001942', type = 'rse-gene')
    > load(file.path('ERP001942', 'rse_gene.Rdata'))
    > rse <- scale_counts(rse_gene)
    > rse_with_pred <- add_predictions(rse_gene)
    https://github.com/leekgroup/recount-analyses/
  25. expression data for ~70,000 human samples
    Answer meaningful questions about human biology and expression
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples; expression estimates: gene, exon, junctions, ERs
    samples; phenotypes: ?
    sex, tissue: M, Blood; F, Heart; F, Liver
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  26. expression data for ~70,000 human samples
    Answer meaningful questions about human biology and expression
    GTEx N=9,662; TCGA N=11,284; SRA N=49,848
    samples; expression estimates: gene, exon, junctions, ERs
    samples; phenotypes: ?
    sex, tissue: M, Blood; F, Heart; F, Liver
    slide adapted from Shannon Ellis @Shannon_E_Ellis
  27. • Sex: Female, Male
    • Age/Development: Fetus, Child, Adolescent, Adult
    • Race/Ethnicity: Asian, Black, Hispanic, White
    • Tissue Site 1: Cerebral cortex, Hippocampus, Brainstem, Cerebellum
    • Tissue Site 2: Frontal lobe, Temporal lobe, Midbrain, Basal ganglia
    • Tissue Site 3: Dorsolateral prefrontal cortex, Superior temporal gyrus, Substantia nigra, Caudate
    • Hemisphere: Left, Right
    • Brodmann Area: 1-52
    • Disease Status: Disease, Neurological control
    • Disease: Brain tumor, Alzheimer’s disease, Parkinson’s disease, Bipolar disorder
    • Tumor Type: Glioblastoma, Astrocytoma, Oligodendroglioma, Ependymoma
    • Clinical Stage 1: Grade I, Grade II, Grade III, Grade IV
    • Clinical Stage 2: Primary, Secondary, Recurrent
    • Viability: Postmortem, Biopsy
    • Preparation: Frozen, Thawed
  28. Code Example:
    research.libd.org/recount-brain/example_PMI/example_PMI.html
    research.libd.org/recount-brain/example_PMI/example_PMI.Rmd
    Replicates part of the GTEx PMI paper by Ferreira et al. doi.org/10.1038/s41467-017-02772-x
    Ashkaun Razmara, in prep. http://research.libd.org/recount-brain/ @ashkaun_razmara
  29. The recount2 team
    • Hopkins: Kai Kammers, Shannon Ellis, Margaret Taub, Kasper Hansen, Jeff Leek, Ben Langmead
    • OHSU: Abhinav Nellore
    • LIBD: Leonardo Collado-Torres, Andrew Jaffe
    • recount-brain: Ashkaun Razmara
    Funding and hosting: NIH R01 GM105705, NIH 1R21MH109956, CONACyT 351535, AWS in Education, Seven Bridges, IDIES SciServer
  30. This is where it starts for you and us: #CDSBMexico @CDSBMexico
    It’s your home now! Help us build it and maintain it! Submit your blog posts too!
  31. expression data for ~70,000 human samples
    (Multiple) Postdoc positions available to
    - develop methods to process and analyze data from recount2
    - use recount2 to address specific biological questions
    This project involves the Hansen, Leek, Langmead and Battle labs at JHU
    Contact: Kasper D. Hansen ([email protected] | www.hansenlab.org)
    @KasperDHansen @jtleek @BenLangmead @alexisjbattle