Slide 1

Slide 1 text

TIB.2018 (R para todos) & Latin American R/Bioconductor Developers Workshop
From learning to using to teaching to developing R
Leonardo Collado Torres @fellgernon
#rstats #teaching #CDSBMexico
https://speakerdeck.com/lcolladotor/CDSBMexico

Slide 2

Slide 2 text

@lcgunam

Slide 3

Slide 3 text

@cendrinou https://www.stat.berkeley.edu/users/sandrine/ Sandrine Dudoit Fall 2007

Slide 4

Slide 4 text

Who knows about ? Sandrine Dudoit: She’s one of the @Bioconductor project founders! @cendrinou

Slide 5

Slide 5 text

http://www.wholebiome.com/team.html#james-bullard James Bullard January 2008 1 week intense course

Slide 6

Slide 6 text

@AlexielMedyna http://liigh.unam.mx/profile/dra-alejandra-medina-rivera/ Alejandra Medina Rivera BioC2008 Developer’s day + 2 conference days Supported by @lcgunam

Slide 7

Slide 7 text

@Bioconductor https://bioconductor.org/help/course-materials/2008/BioC2008/

Slide 8

Slide 8 text

#lattice Deepayan Sarkar https://github.com/deepayan

Slide 9

Slide 9 text

#ShortRead @mt_morgan @Bioconductor http://bioconductor.org/packages/ShortRead

Slide 10

Slide 10 text

@fellgernon & Osam http://lcolladotor.github.io/courses/Courses/R/ Fall 2008

Slide 11

Slide 11 text

@areyesq http://alejandroreyes.org/ Alejandro Reyes (first BioC: 2009)
BioC2009 + BioC2010 + BioC2011: Developer’s day + 2 conference days
+ Europe Bioc 2010 http://www-huber.embl.de/biocdeveleurope2010/
With support from: @Bioconductor, @lcgunam, @WINTERGENOMICS

Slide 12

Slide 12 text

@fellgernon #rstats #teaching #educollab http://lcolladotor.github.io/ 2008-2011

Slide 13

Slide 13 text

@fellgernon #rstats #teaching #educollab
http://lcolladotor.github.io/courses/Courses/B/ (has videos of me teaching :P, it was a pilot for OpenCourseware)
TAs:
• Alejandro Reyes @areyesq
• José Víctor Moreno Mayar https://geogenetics.ku.dk/staff/?pure=en/persons/475726
• José Reyes http://sysbiophd.harvard.edu/people/student-profiles/jose-reyes

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Always ask for support!
• Support for traveling or registration or lodging
• Support for teaching: Robert Gentleman gave me free copies of books he had in his office (authors normally get several free copies of their books)
• Support for community building: we almost had Bioconductor’s support around 2010 for one visit, and we didn’t give up! #CDSBMexico
• Feel free to ask for help! We all started somewhere!!
Check your spam box and filters:
• I almost lost a scholarship for useR! 2013 that way :P
Check the dates for applying for support!
Ask for emails and keep in touch:
• I asked Davis McCarthy @davisjmcc for PhD application and career advice in 2010
• That’s how I got into my PhD
Socialize!
Take advantage of opportunities offered to you!

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

BioC2010
First time presenting a poster about an R package (BacterialTranscription): Transcription initiation mapping and transcription unit identification in E. coli
Rafael Irizarry https://rafalab.github.io/ @rafalab
Ingo Ruczinski http://www.biostat.jhsph.edu/~iruczins/
Them: Have you heard about Johns Hopkins?
Me: Johns???? No idea
Them: Come join us at @jhubiostat!!

Slide 20

Slide 20 text

11 Reproducible RNA-seq analysis with Leonardo Collado-Torres @fellgernon #CDSBMexico and

Slide 21

Slide 21 text

Reference genome Reads

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

GTEx TCGA slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 24

Slide 24 text

SRA

Slide 25

Slide 25 text

Slide adapted from Ben Langmead @BenLangmead

Slide 26

Slide 26 text

http://rail.bio/ Slide adapted from Ben Langmead @AbhiNellore @BenLangmead

Slide 27

Slide 27 text

http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

Slide 28

Slide 28 text

https://jhubiostatistics.shinyapps.io/recount/

Slide 29

Slide 29 text

[Figure: reads and coverage along a gene. Labels: Isoform 1, Isoform 2, Potential isoform 3; exon 1 to exon 4; exon-exon junctions jx 1 to jx 6; Expressed region 1: potential exon 5]

Slide 30

Slide 30 text

Uses the #SummarizedExperiment @Bioconductor package

Slide 31

Slide 31 text

> library('recount')
> download_study('ERP001942', type = 'rse-gene')
> load(file.path('ERP001942', 'rse_gene.Rdata'))
> rse <- scale_counts(rse_gene)
https://github.com/leekgroup/recount-analyses/
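To give a feel for why a scaling step follows the download: samples differ in how much was sequenced, so their values need to be put on a common scale before they can be compared. The base R sketch below is not the recount implementation. It assumes a made-up coverage matrix and rescales each sample to a hypothetical common total; the numbers, the `target_size` value and the `scale_to_target()` helper are all illustrative assumptions.

```r
# Toy coverage sums: rows are genes, columns are samples (made-up numbers).
coverage <- matrix(
    c(100, 300, 600,
      2000, 6000, 12000),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("sampleA", "sampleB"))
)

# Hypothetical helper: rescale every sample so its total equals target_size.
scale_to_target <- function(x, target_size = 1000) {
    totals <- colSums(x)
    # sweep() multiplies each column by its own scaling factor.
    sweep(x, 2, target_size / totals, `*`)
}

scaled <- scale_to_target(coverage)
colSums(scaled)
```

In this toy case sampleB was sequenced 20 times deeper than sampleA, and after rescaling the two samples have identical profiles.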

Slide 32

Slide 32 text

slide adapted from Jeff Leek @jtleek

Slide 33

Slide 33 text

Collado-Torres et al, NAR, 2017

Slide 34

Slide 34 text

Postmortem Human Brain Samples
Discovery data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
Replication data: Fetal, Infant, Child, Teen, Adult, 50+ (6 / group, N = 36)
Jaffe et al, Nat. Neuroscience, 2015 @andrewejaffe

Slide 35

Slide 35 text

Jaffe et al, Nat. Neuroscience, 2015 @andrewejaffe

Slide 36

Slide 36 text

BrainSpan data Jaffe et al, Nat. Neuroscience, 2015 Method implemented in the #derfinder @Bioconductor package

Slide 37

Slide 37 text

expression data for ~70,000 human samples
samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848
expression estimates: gene, exon, junctions, ERs
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 38

Slide 38 text

expression data for ~70,000 human samples
Answer meaningful questions about human biology and expression
samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848
expression estimates: gene, exon, junctions, ERs
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 39

Slide 39 text

expression data for ~70,000 human samples
samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848
expression estimates: gene, exon, junctions, ERs
phenotypes: ?
Answer meaningful questions about human biology and expression
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 40

Slide 40 text

Even when information is provided, it’s not always clear…
sra_meta$Sex
Category   Frequency
F                 95
female         2,036
Female            51
M                 77
male           1,240
Male             141
Total          3,640
Other reported values: “1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
slide adapted from Shannon Ellis @Shannon_E_Ellis
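The six unambiguous labels in the frequency table can be collapsed into two categories with a small lookup. This is only a sketch in base R, not the code used for recount: the frequencies are copied from the slide, while the `lookup` vector and variable names are hypothetical.

```r
# Frequencies of the six clearly male/female labels from the slide.
sex_counts <- c(F = 95, female = 2036, Female = 51,
                M = 77, male = 1240, Male = 141)

# Hypothetical lookup: map each lower-cased label to one harmonized value.
lookup <- c(f = "female", female = "female", m = "male", male = "male")
harmonized <- unname(lookup[tolower(names(sex_counts))])

# Add up the frequencies within each harmonized category.
totals <- tapply(sex_counts, harmonized, sum)
totals
```

Labels that are not in the lookup, such as “pooled male and female” or “unknown”, would come back as NA and still need a manual decision.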

Slide 41

Slide 41 text

SRA phenotype information is far from complete
      SubjectID  Sex     Tissue  Race  Age
6620  NA         female  liver   NA    NA
6621  NA         female  liver   NA    NA
6622  NA         female  liver   NA    NA
6623  NA         female  liver   NA    NA
6624  NA         female  liver   NA    NA
6625  NA         male    liver   NA    NA
6626  NA         male    liver   NA    NA
6627  NA         male    liver   NA    NA
6628  NA         male    liver   NA    NA
...
slide adapted from Shannon Ellis @Shannon_E_Ellis
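How incomplete a phenotype table is can be summarized as the fraction of missing values per column. The sketch below rebuilds the nine rows shown on the slide in base R, assuming the leading number in each row is a row index and the remaining five values are the named columns; the object names are hypothetical.

```r
# Rebuild the nine rows shown on the slide (values copied from it).
sra_pheno <- data.frame(
    SubjectID = rep(NA, 9),
    Sex = c(rep("female", 5), rep("male", 4)),
    Tissue = rep("liver", 9),
    Race = rep(NA, 9),
    Age = rep(NA, 9)
)

# Fraction of missing values per phenotype column.
missing_fraction <- colMeans(is.na(sra_pheno))
missing_fraction
```

For these rows, SubjectID, Race and Age are entirely missing while Sex and Tissue are complete.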

Slide 42

Slide 42 text

Goal: to accurately predict critical phenotype information for all samples in recount
gene, exon, exon-exon junction and expressed region RNA-Seq data
• SRA (Sequence Read Archive): N=49,848
• TCGA (The Cancer Genome Atlas): N=11,284
• GTEx (Genotype Tissue Expression Project): N=9,662
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 43

Slide 43 text

Goal: to accurately predict critical phenotype information for all samples in recount
gene, exon, exon-exon junction and expressed region RNA-Seq data
• SRA (Sequence Read Archive): N=49,848
• GTEx (Genotype Tissue Expression Project): N=9,662; divide samples into a training set (build and optimize phenotype predictor) and a test set (test accuracy of predictor)
• TCGA (The Cancer Genome Atlas): N=11,284
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 44

Slide 44 text

Goal: to accurately predict critical phenotype information for all samples in recount
gene, exon, exon-exon junction and expressed region RNA-Seq data
• SRA (Sequence Read Archive): N=49,848
• GTEx (Genotype Tissue Expression Project): N=9,662; divide samples into a training set (build and optimize phenotype predictor) and a test set (test accuracy of predictor)
• TCGA (The Cancer Genome Atlas): N=11,284; predict phenotypes across samples in TCGA
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 45

Slide 45 text

Goal: to accurately predict critical phenotype information for all samples in recount
gene, exon, exon-exon junction and expressed region RNA-Seq data
• SRA (Sequence Read Archive): N=49,848; predict phenotypes across SRA samples
• GTEx (Genotype Tissue Expression Project): N=9,662; divide samples into a training set (build and optimize phenotype predictor) and a test set (test accuracy of predictor)
• TCGA (The Cancer Genome Atlas): N=11,284; predict phenotypes across samples in TCGA
slide adapted from Shannon Ellis @Shannon_E_Ellis
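The workflow in this diagram (divide samples, build a predictor on the training set, test its accuracy on the test set) can be sketched in base R. This is not the predictor used for recount. It is a minimal nearest-mean classifier on simulated expression values for a single hypothetical region, and the sample size, effect size, seed and all names are assumptions made for illustration.

```r
set.seed(1)

# Simulate one hypothetical region whose expression differs strongly by sex.
n <- 200
sex <- rep(c("female", "male"), each = n / 2)
expression <- rnorm(n, mean = ifelse(sex == "male", 10, 0), sd = 1)

# Divide samples: half for training, half for testing.
train_idx <- sample(n, n / 2)
test_idx <- setdiff(seq_len(n), train_idx)

# Build the predictor: the mean expression of each sex in the training set.
centers <- sapply(split(expression[train_idx], sex[train_idx]), mean)

# Predict by assigning each sample to the closest training mean.
predict_sex <- function(x, centers) {
    distances <- abs(outer(x, centers, `-`))
    names(centers)[apply(distances, 1, which.min)]
}

# Test the accuracy of the predictor on the held-out test set.
predicted <- predict_sex(expression[test_idx], centers)
accuracy <- mean(predicted == sex[test_idx])
accuracy
```

Because the test samples play no part in building the predictor, the accuracy measured on them is a fair estimate of how it would behave on new samples drawn from the same source. Applying it to other data sets, as on the slide, is a harder test.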

Slide 46

Slide 46 text

select_regions()
Output:
• Coverage matrix (data.frame)
• Region information (GRanges)
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 47

Slide 47 text

Sex prediction is accurate across data sets (one column per data set)
Number of Regions:      20     20     20      20
Number of Samples (N):  4,769  4,769  11,245  3,640
Accuracy:               99.8%  99.6%  99.4%   88.5%
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 48

Slide 48 text

Sex prediction is accurate across data sets (one column per data set)
Number of Regions:      20     20     20      20
Number of Samples (N):  4,769  4,769  11,245  3,640
Accuracy:               99.8%  99.6%  99.4%   88.5%
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 49

Slide 49 text

http://www.rna-seqblog.com/ Can we use expression data to predict tissue? slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 50

Slide 50 text

Tissue prediction is accurate across data sets (one column per data set)
Number of Regions:      589    589    589    589
Number of Samples (N):  4,769  4,769  7,193  8,951
Accuracy:               97.3%  96.5%  71.9%  50.6%
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 51

Slide 51 text

Prediction is more accurate in healthy tissue (one column per data set)
Number of Regions:      589    589    589    589    589
Number of Samples (N):  4,769  4,769  613    6,579  8,951
Accuracy:               97.3%  96.5%  91.0%  70.2%  50.6%
slide adapted from Shannon Ellis @Shannon_E_Ellis
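As a quick consistency check, the two new columns on this slide can be recombined into one overall accuracy. This base R sketch assumes that the 613-sample and 6,579-sample columns are the two subsets of the 7,193-sample data set shown on the previous slide; the slide text itself does not label the columns, so that pairing is an assumption.

```r
# Sample sizes and accuracies for the two subsets, copied from the slide.
n_samples <- c(613, 6579)
accuracy <- c(0.910, 0.702)

# Overall accuracy is the sample-size-weighted mean of the subset accuracies.
overall <- weighted.mean(accuracy, w = n_samples)
round(100 * overall, 1)
```

The result is about 72%, which is close to the 71.9% reported for the combined data set. The larger subset dominates the overall figure, which is why splitting it out is informative.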

Slide 52

Slide 52 text

> library('recount')
> download_study('ERP001942', type = 'rse-gene')
> load(file.path('ERP001942', 'rse_gene.Rdata'))
> rse <- scale_counts(rse_gene)
> rse_with_pred <- add_predictions(rse_gene)
https://github.com/leekgroup/recount-analyses/

Slide 53

Slide 53 text

expression data for ~70,000 human samples
samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848
expression estimates: gene, exon, junctions, ERs
phenotypes: ?
sex, tissue: M, Blood; F, Heart; F, Liver
Answer meaningful questions about human biology and expression
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

slide adapted from Kai Kammers Can combine with genotype data to identify eQTLs @KaiKammers

Slide 56

Slide 56 text

biorxiv.org/content/early/2018/01/12/247346 @JFuBiostats @biorxivpreprint

Slide 57

Slide 57 text

expression data for ~70,000 human samples
samples: GTEx N=9,662, TCGA N=11,284, SRA N=49,848
expression estimates: gene, exon, junctions, ERs
phenotypes: ?
sex, tissue: M, Blood; F, Heart; F, Liver
Answer meaningful questions about human biology and expression
slide adapted from Shannon Ellis @Shannon_E_Ellis

Slide 58

Slide 58 text

• Sex: Female, Male
• Age/Development: Fetus, Child, Adolescent, Adult
• Race/Ethnicity: Asian, Black, Hispanic, White
• Tissue Site 1: Cerebral cortex, Hippocampus, Brainstem, Cerebellum
• Tissue Site 2: Frontal lobe, Temporal lobe, Midbrain, Basal ganglia
• Tissue Site 3: Dorsolateral prefrontal cortex, Superior temporal gyrus, Substantia nigra, Caudate
• Hemisphere: Left, Right
• Brodmann Area: 1-52
• Disease Status: Disease, Neurological control
• Disease: Brain tumor, Alzheimer’s disease, Parkinson’s disease, Bipolar disorder
• Tumor Type: Glioblastoma, Astrocytoma, Oligodendroglioma, Ependymoma
• Clinical Stage 1: Grade I, Grade II, Grade III, Grade IV
• Clinical Stage 2: Primary, Secondary, Recurrent
• Viability: Postmortem, Biopsy
• Preparation: Frozen, Thawed

Slide 59

Slide 59 text

Ashkaun Razmara, in prep. @ashkaun_razmara

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

Code Example:
• research.libd.org/recount-brain/example_PMI/example_PMI.html
• research.libd.org/recount-brain/example_PMI/example_PMI.Rmd
Replicates part of the GTEx PMI paper by Ferreira et al. doi.org/10.1038/s41467-017-02772-x
Ashkaun Razmara, in prep. http://research.libd.org/recount-brain/ @ashkaun_razmara

Slide 63

Slide 63 text

The recount2 team
• Hopkins: Kai Kammers, Shannon Ellis, Margaret Taub, Kasper Hansen, Jeff Leek, Ben Langmead
• OHSU: Abhinav Nellore
• LIBD: Leonardo Collado-Torres, Andrew Jaffe
• recount-brain: Ashkaun Razmara
Funding and hosting
• NIH R01 GM105705
• NIH 1R21MH109956
• CONACyT 351535
• AWS in Education
• Seven Bridges
• IDIES SciServer

Slide 64

Slide 64 text

There are many communities you can join! Ask for help / support / #rstats love ^_^

Slide 65

Slide 65 text

#Rladies @RLadiesGlobal

Slide 66

Slide 66 text

Check #runconf18 @rOpenSci

Slide 67

Slide 67 text

This is where it starts for you and us: #CDSBMexico @CDSBMexico It’s your home now! Help us build it and maintain it! Submit your blog posts too!

Slide 68

Slide 68 text

expression data for ~70,000 human samples
(Multiple) Postdoc positions available to
- develop methods to process and analyze data from recount2
- use recount2 to address specific biological questions
This project involves the Hansen, Leek, Langmead and Battle labs at JHU
Contact: Kasper D. Hansen (khansen@jhsph.edu | www.hansenlab.org)
@KasperDHansen @jtleek @BenLangmead @alexisjbattle

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

https://speakerdeck.com/lcolladotor/CDSBMexico