Slide 1

Slide 1 text

ANALYZING GENOMICS DATA IN R WITH BIOCONDUCTOR
Stephanie Hicks
Assistant Professor, Biostatistics
Johns Hopkins Bloomberg School of Public Health
#rstatsdc Conference
November 8, 2018

Slide 2

Slide 2 text

ABOUT ME
Teaching: Data Science
Research: Genomics
• R/Bioconductor developer
Other fun things about me:
• Co-founded R-Ladies Baltimore
• Creating a children’s book featuring women statisticians and data scientists

Slide 3

Slide 3 text

COMPREHENSIVE R ARCHIVE NETWORK (CRAN)

Slide 4

Slide 4 text

#rstatsdc #rladies #rstats

Slide 5

Slide 5 text

https://www.r-project.org/other-projects.html

Slide 6

Slide 6 text

https://www.r-project.org/other-projects.html

Slide 7

Slide 7 text

CRAN, MEET YOUR COUSIN BIOCONDUCTOR!

Slide 8

Slide 8 text

• Open-source, open development software project
• Began in 2001
• Big priorities: reproducible research and high-quality documentation
• Vignettes
• Diverse community support
• Workflows (super helpful for n00bs)
• Teaching resources and open development

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

EXPLORE BIOCONDUCTOR STATISTICS WITH BIOCPKGTOOLS
Functions to access metadata around Bioc packages and usage in a tidy data format.

library(BiocPkgTools)
pkgs <- biocDownloadStats()
head(pkgs)
## # A tibble: 6 x 6
##   Package  Year Month Nb_of_distinct_IPs Nb_of_downloads repo
## 1 ABarray  2018 Jan                  117             150 Software
## 2 ABarray  2018 Feb                   97             125 Software
## 3 ABarray  2018 Mar                  102             121 Software
## 4 ABarray  2018 Apr                  229             359 Software
## 5 ABarray  2018 May                   99             134 Software
## 6 ABarray  2018 Jun                  133             209 Software

Slide 11

Slide 11 text

HOW MANY PACKAGES IN BIOCONDUCTOR?

pkgs %>%
  filter(Year == 2018) %>%
  select(Package, repo) %>%
  distinct() %>%
  group_by(repo) %>%
  summarize(total_packages = n())
## # A tibble: 3 x 2
##   repo           total_packages
## 1 AnnotationData           1124
## 2 ExperimentData            400
## 3 Software                 1733

• Annotation packages = streamline tedious bookkeeping
• Experiment data packages = contain processed data; useful for teaching

Slide 12

Slide 12 text

BIOCONDUCTOR SOFTWARE PACKAGES OVER TIME
[Figure: Number of Bioconductor Software Packages. x-axis: Year (2010 to 2018); y-axis: Package count (600 to 1800).]
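One way to approximate a per-year package count from download statistics is sketched below. This is not necessarily how the slide's figure was produced. The helper name count_software_by_year and the toy input are hypothetical; the only assumption carried over from the earlier slides is that the table has Package, Year, and repo columns, as in the output of biocDownloadStats().

```r
library(dplyr)

# Hypothetical helper: count distinct software packages per year in a table
# shaped like the output of biocDownloadStats() (columns Package, Year, repo).
count_software_by_year <- function(stats) {
  stats %>%
    filter(repo == "Software") %>%
    distinct(Year, Package) %>%            # one row per package per year
    count(Year, name = "package_count") %>%
    arrange(Year)
}

# Toy input with made-up package names, for illustration only.
toy <- data.frame(
  Package = c("pkgA", "pkgA", "pkgB", "pkgA", "pkgB", "pkgC", "dataX"),
  Year    = c(2016, 2016, 2017, 2017, 2017, 2018, 2018),
  repo    = c(rep("Software", 6), "ExperimentData"),
  stringsAsFactors = FALSE
)

by_year <- count_software_by_year(toy)
by_year
```

With real data, the same helper could be applied to pkgs from the BiocPkgTools slide. A package only appears in a given year if it was downloaded that year, so the counts are an approximation of the catalogue size.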

Slide 13

Slide 13 text

STANDARD BIOC DATA STRUCTURE
GenomicRanges (GRanges)
Lee et al. (2018), bioRxiv

Slide 14

Slide 14 text

CREATE GRANGES OBJECT

library(GenomicRanges)
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))
gr
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames        ranges strand |   gene_id     score
##   [1]     chr1 102012-120303      + |      1001        10
##   [2]     chr1 520211-526211      - |      2151        25
##   -------
##   seqinfo: 1 sequence from an unspecified genome

Slide 15

Slide 15 text

THINGS YOU CAN DO WITH GRANGES OBJECTS

width(gr)
## [1] 18292  6001
gr[gr$score > 15, ]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames        ranges strand |   gene_id     score
##   [1]     chr1 520211-526211      - |      2151        25
##   -------
##   seqinfo: 1 sequence from an unspecified genome
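A few more operations in the same spirit, as a minimal sketch. It rebuilds the gr object from the previous slide so that it runs on its own. The query region is a hypothetical range chosen only so that it overlaps the first entry; it is not from the talk.

```r
library(GenomicRanges)

# Same two ranges as on the "CREATE GRANGES OBJECT" slide.
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Coordinate accessors return plain integer vectors.
start(gr)
end(gr)

# Move every range 1000 bases toward higher coordinates; widths are unchanged.
shifted <- shift(gr, 1000)

# Hypothetical query region (assumption, not from the slides).
query <- GRanges("chr1", IRanges(start = 110000, end = 115000))

# Keep the ranges in gr that overlap the query. The query has strand "*",
# which matches ranges on either strand.
hits <- subsetByOverlaps(gr, query)
hits
```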

Slide 16

Slide 16 text

GENOMIC VERBS/ACTIONS + TIDY DATA = PLYRANGES
• Goal: Write human readable analysis workflows
• Idea: Define an API (i.e. extend dplyr) that maps relational genomic algebra to “verbs” that act on “tidy” genomic data
• Another great idea: Borrow dplyr’s syntax and design principles
• And another great idea: Compose verbs together with pipe operator from magrittr
Stuart Lee, Di Cook, Michael Lawrence

Slide 17

Slide 17 text

library(plyranges)
gr %>% filter(score > 15)
## GRanges object with 1 range and 2 metadata columns:
##       seqnames        ranges strand |   gene_id     score
##   [1]     chr1 520211-526211      - |      2151        25
##   -------
##   seqinfo: 1 sequence from an unspecified genome
gr %>% filter(score > 15) %>% width()
## [1] 6001
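The same piping style extends to other dplyr-like verbs. The sketch below is an illustration rather than part of the talk. It assumes plyranges exposes the range width to mutate() and filter() under the name width, and that summarise() on grouped ranges returns a table with one row per group. The column names width_kb and mean_score are made up for the example.

```r
library(GenomicRanges)
library(plyranges)

# Same two ranges as on the "CREATE GRANGES OBJECT" slide.
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Add a metadata column derived from the range widths, then keep only
# ranges longer than 10 kb.
long_ranges <- gr %>%
  mutate(width_kb = width / 1000) %>%
  filter(width_kb > 10)

# Average the score within each strand. The result is a table of
# per-group summaries, not a GRanges object.
by_strand <- gr %>%
  group_by(strand) %>%
  summarise(mean_score = mean(score))

long_ranges
by_strand
```

Each step reads as one verb acting on the ranges, which is the "human readable workflow" goal stated on the plyranges slide.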

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Feel free to send comments/questions:
Twitter: @stephaniehicks
Email: shicks19@jhu.edu
#rstatsdc #rladies
Thank you!