ANALYZING GENOMICS DATA IN R WITH BIOCONDUCTOR
Stephanie Hicks
Assistant Professor, Biostatistics
Johns Hopkins Bloomberg School of Public Health
#rstatsdc Conference
November 8, 2018

ABOUT ME
Teaching: Data Science
Research: Genomics
• R/Bioconductor developer
Other fun things about me:
• Co-founded R-Ladies Baltimore
• Creating a children’s book featuring women statisticians and data scientists

• Open-source, open development software project
• Began in 2001
• Big priorities: reproducible research and high-quality documentation
• Vignettes
• Diverse community support
• Workflows (super helpful for n00bs)
• Teaching resources and open development

EXPLORE BIOCONDUCTOR STATISTICS WITH BIOCPKGTOOLS
Functions to access metadata around Bioc packages and usage in a tidy data format.

library(BiocPkgTools)
pkgs <- biocDownloadStats()
head(pkgs)
## # A tibble: 6 x 6
##   Package  Year Month Nb_of_distinct_IPs Nb_of_downloads repo
## 1 ABarray  2018 Jan                  117             150 Software
## 2 ABarray  2018 Feb                   97             125 Software
## 3 ABarray  2018 Mar                  102             121 Software
## 4 ABarray  2018 Apr                  229             359 Software
## 5 ABarray  2018 May                   99             134 Software
## 6 ABarray  2018 Jun                  133             209 Software

HOW MANY PACKAGES IN BIOCONDUCTOR?

library(dplyr)  # provides %>%, filter(), select(), distinct(), group_by(), summarize()
pkgs %>%
  filter(Year == 2018) %>%
  select(Package, repo) %>%
  distinct() %>%
  group_by(repo) %>%
  summarize(total_packages = n())
## # A tibble: 3 x 2
##   repo           total_packages
## 1 AnnotationData           1124
## 2 ExperimentData            400
## 3 Software                 1733

• Annotation packages = streamline tedious bookkeeping
• Experiment data packages = contain processed data; useful for teaching
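Why the distinct() step matters: the download statistics hold one row per package per month, so counting rows directly would count months, not packages. Here is a minimal offline sketch of the same pipeline. The `toy` table, its package names, and its numbers are invented stand-ins for the real biocDownloadStats() output.

```r
library(dplyr)

# Toy stand-in shaped like the download statistics (all values are made up)
toy <- tibble(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  Year    = c(2018, 2018, 2018, 2018, 2017),
  Month   = c("Jan", "Feb", "Jan", "Jan", "Dec"),
  repo    = c("Software", "Software", "Software",
              "ExperimentData", "ExperimentData")
)

counts <- toy %>%
  filter(Year == 2018) %>%        # keep only 2018 rows
  select(Package, repo) %>%       # drop the per-month columns
  distinct() %>%                  # collapse monthly rows to one row per package
  group_by(repo) %>%
  summarize(total_packages = n()) # count packages within each repo

counts
```

Without distinct(), pkgA would be counted twice because it has two monthly rows in 2018.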

BIOCONDUCTOR SOFTWARE PACKAGES OVER TIME
[Figure: Number of Bioconductor Software Packages. x-axis: Year (2010 to 2018); y-axis: Package count (600 to 1800)]

STANDARD BIOC DATA STRUCTURE
GenomicRanges (GRanges)
Lee et al. (2018), bioRxiv

CREATE GRANGES OBJECT

library(GenomicRanges)
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))
gr
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames        ranges strand |   gene_id     score
##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr1 102012-120303      + |      1001        10
##   [2]     chr1 520211-526211      - |      2151        25
##   -------
##   seqinfo: 1 sequence from an unspecified genome

THINGS YOU CAN DO WITH GRANGES OBJECTS

width(gr)
## [1] 18292  6001

gr[gr$score > 15, ]
## GRanges object with 1 range and 2 metadata columns:
##       seqnames        ranges strand |   gene_id     score
##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr1 520211-526211      - |      2151        25
##   -------
##   seqinfo: 1 sequence from an unspecified genome
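A few more things, sketched on the same toy object: coordinate accessors, the metadata table, and an overlap query. The `query` region is hypothetical and chosen only for illustration; everything else is a standard GenomicRanges call.

```r
library(GenomicRanges)

# Same toy object as on the earlier slide
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Accessors for the coordinate columns
start(gr)
end(gr)
as.character(strand(gr))

# Metadata columns (gene_id, score) come back as a DataFrame
mcols(gr)

# Hypothetical query region, invented for this example.
# Its strand defaults to "*", which matches ranges on either strand.
query <- GRanges("chr1", IRanges(start = 110000, end = 115000))

# Find which ranges in gr overlap the query, then subset gr to those ranges
hits <- findOverlaps(query, gr)
overlapping <- gr[subjectHits(hits)]
overlapping
```

Only the first range (102012-120303) spans the query, so `overlapping` holds the range with gene_id 1001.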

GENOMIC VERBS/ACTIONS + TIDY DATA = PLYRANGES
• Goal: Write human readable analysis workflows
• Idea: Define an API (i.e. extend dplyr) that maps relational genomic algebra to “verbs” that act on “tidy” genomic data
• Another great idea: Borrow dplyr’s syntax and design principles
• And another great idea: Compose verbs together with pipe operator from magrittr
Stuart Lee, Di Cook, Michael Lawrence

library(plyranges)
gr %>% filter(score > 15)
## GRanges object with 1 range and 2 metadata columns:
##       seqnames        ranges strand |   gene_id     score
##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr1 520211-526211      - |      2151        25
##   -------
##   seqinfo: 1 sequence from an unspecified genome

gr %>% filter(score > 15) %>% width()
## [1] 6001
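Composing more than one verb might look like the sketch below. It chains mutate(), filter(), and select() on the metadata columns, then uses filter_by_overlaps() against a second set of ranges. The `log_score` column, the 3.5 cutoff, and the `query` region are all invented for illustration and are not from the talk.

```r
library(GenomicRanges)
library(plyranges)

# Same toy object as on the earlier slides
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Add a derived metadata column, keep the high-scoring ranges,
# then keep only the gene_id metadata column
res <- gr %>%
  mutate(log_score = log2(score)) %>%
  filter(log_score > 3.5) %>%
  select(gene_id)
res

# Hypothetical query region: keep the ranges in gr that overlap it
query <- GRanges("chr1", IRanges(start = 110000, end = 115000))
near_query <- gr %>% filter_by_overlaps(query)
near_query
```

Each step takes a GRanges and returns a GRanges, which is what lets the verbs chain with the pipe.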

Thank you!