Analyzing Genomics Data in R with Bioconductor

Slide deck by Stephanie Hicks from the 2018 DC R Conference (https://dc.rstats.ai)

Stephanie Hicks

November 08, 2018

Transcript

  1. ANALYZING GENOMICS DATA IN R WITH BIOCONDUCTOR
    Stephanie Hicks, Assistant Professor, Biostatistics, Johns Hopkins Bloomberg School of Public Health
    #rstatsdc Conference, November 8, 2018
  2. ABOUT ME
    Teaching: Data Science
    Research: Genomics
    • R/Bioconductor developer
    Other fun things about me:
    • Co-founded R-Ladies Baltimore
    • Creating a children’s book featuring women statisticians and data scientists
  3. COMPREHENSIVE R ARCHIVE NETWORK (CRAN)

  4. #rstatsdc #rladies #rstats

  5. https://www.r-project.org/other-projects.html

  6. https://www.r-project.org/other-projects.html

  7. CRAN, MEET YOUR COUSIN BIOCONDUCTOR!

  8. • Open-source, open development software project
    • Began in 2001
    • Big priorities: reproducible research and high-quality documentation
    • Vignettes
    • Diverse community support
    • Workflows (super helpful for n00bs)
    • Teaching resources and open development
  9. None
  10. EXPLORE BIOCONDUCTOR STATISTICS WITH BIOCPKGTOOLS
    Functions to access metadata around Bioc packages and usage in a tidy data format.

        library(BiocPkgTools)
        pkgs <- biocDownloadStats()
        head(pkgs)
        ## # A tibble: 6 x 6
        ##   Package  Year Month Nb_of_distinct_IPs Nb_of_downloads repo
        ##   <fct>   <int> <fct>              <int>           <int> <chr>
        ## 1 ABarray  2018 Jan                  117             150 Software
        ## 2 ABarray  2018 Feb                   97             125 Software
        ## 3 ABarray  2018 Mar                  102             121 Software
        ## 4 ABarray  2018 Apr                  229             359 Software
        ## 5 ABarray  2018 May                   99             134 Software
        ## 6 ABarray  2018 Jun                  133             209 Software
  11. HOW MANY PACKAGES IN BIOCONDUCTOR?

        library(dplyr)  # provides %>%, filter(), select(), distinct(), group_by(), summarize()
        pkgs %>%
          filter(Year == 2018) %>%
          select(Package, repo) %>%
          distinct() %>%
          group_by(repo) %>%
          summarize(total_packages = n())
        ## # A tibble: 3 x 2
        ##   repo           total_packages
        ##   <chr>                   <int>
        ## 1 AnnotationData           1124
        ## 2 ExperimentData            400
        ## 3 Software                 1733

    • Annotation packages = streamline tedious bookkeeping
    • Experiment data packages = contain processed data; useful for teaching
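The counting step on slide 11 can be tried without downloading anything. What follows is a minimal sketch, assuming a small hand-made table with the same `Package`, `Year`, `Month`, and `repo` columns; the package names (`pkgA`, `pkgB`, ...) and all rows are invented for illustration and are not real download statistics. It also swaps the `select()` / `distinct()` / `group_by()` / `summarize()` chain for the shorter `distinct()` plus `count()`, which should give the same per-repository totals.

```r
library(dplyr)

# Toy stand-in for the output of biocDownloadStats(); every row is invented.
toy_pkgs <- tibble(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC", "pkgD"),
  Year    = c(2018L, 2018L, 2018L, 2018L, 2017L, 2017L),
  Month   = c("Jan", "Feb", "Jan", "Jan", "Dec", "Dec"),
  repo    = c("Software", "Software", "Software",
              "AnnotationData", "AnnotationData", "ExperimentData")
)

toy_counts <- toy_pkgs %>%
  filter(Year == 2018) %>%               # keep one year of records
  distinct(Package, repo) %>%            # one row per package, not per month
  count(repo, name = "total_packages")   # packages per repository

toy_counts
```

The `distinct()` call matters because the download table has one row per package per month, so counting rows directly would count months rather than packages.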
  12. BIOCONDUCTOR SOFTWARE PACKAGES OVER TIME
    [Figure: Number of Bioconductor Software Packages. x-axis: Year (2010 to 2018); y-axis: Package count (600 to 1800)]
  13. STANDARD BIOC DATA STRUCTURE
    GenomicRanges (GRanges)
    Lee et al. (2018), bioRxiv
  14. CREATE GRANGES OBJECT

        library(GenomicRanges)
        gr <- GRanges(seqnames = "chr1",
                      strand = c("+", "-"),
                      ranges = IRanges(start = c(102012, 520211),
                                       end = c(120303, 526211)),
                      gene_id = c(1001, 2151),
                      score = c(10, 25))
        gr
        ## GRanges object with 2 ranges and 2 metadata columns:
        ##       seqnames        ranges strand |   gene_id     score
        ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
        ##   [1]     chr1 102012-120303      + |      1001        10
        ##   [2]     chr1 520211-526211      - |      2151        25
        ##   -------
        ##   seqinfo: 1 sequence from an unspecified genome
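Genomic intervals often start life as an ordinary table, so it may help to see the same two ranges built from a data frame. This is a minimal sketch, assuming the table already uses the column names `seqnames`, `start`, `end`, and `strand`; `df`, `gr_from_df`, and `df_again` are names chosen here for illustration and do not appear in the slides.

```r
library(GenomicRanges)

# The same two ranges as on slide 14, stored as a plain data frame.
df <- data.frame(
  seqnames = "chr1",
  start    = c(102012, 520211),
  end      = c(120303, 526211),
  strand   = c("+", "-"),
  gene_id  = c(1001, 2151),
  score    = c(10, 25)
)

# Columns not needed for the ranges (gene_id, score) become metadata columns.
gr_from_df <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)

# Round trip back to a data frame; a width column is added.
df_again <- as.data.frame(gr_from_df)
```

Without `keep.extra.columns = TRUE`, the `gene_id` and `score` columns would be dropped from the result.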
  15. THINGS YOU CAN DO WITH GRANGES OBJECTS

        width(gr)
        ## [1] 18292  6001

        gr[gr$score > 15, ]
        ## GRanges object with 1 range and 2 metadata columns:
        ##       seqnames        ranges strand |   gene_id     score
        ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
        ##   [1]     chr1 520211-526211      - |      2151        25
        ##   -------
        ##   seqinfo: 1 sequence from an unspecified genome
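Beyond `width()` and subsetting, a few other GenomicRanges operations may be worth trying on the same object. The sketch below rebuilds `gr` from slide 14 so that it runs on its own; the `query` region is a hypothetical interval picked for illustration and is not from the slides.

```r
library(GenomicRanges)

# Same object as on slide 14.
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

start(gr)                   # start coordinate of each range
as.character(strand(gr))    # strand of each range
mcols(gr)                   # metadata columns (gene_id, score)

# Hypothetical query region (an assumption for this example).
query <- GRanges("chr1", IRanges(start = 110000, end = 115000))

# Which ranges in gr overlap the query?
hits <- findOverlaps(query, gr)
overlapping <- gr[subjectHits(hits)]

# Move every range 100 bases to the right; widths stay the same.
shifted <- shift(gr, 100)
```

Because the query has no strand set, it can match ranges on either strand; here only the first range of `gr` overlaps it.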
  16. GENOMIC VERBS/ACTIONS + TIDY DATA = PLYRANGES
    • Goal: Write human-readable analysis workflows
    • Idea: Define an API (i.e., extend dplyr) that maps relational genomic algebra to “verbs” that act on “tidy” genomic data
    • Another great idea: Borrow dplyr’s syntax and design principles
    • And another great idea: Compose verbs together with the pipe operator from magrittr
    Stuart Lee, Di Cook, Michael Lawrence
  17.
        library(plyranges)
        gr %>% filter(score > 15)
        ## GRanges object with 1 range and 2 metadata columns:
        ##       seqnames        ranges strand |   gene_id     score
        ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
        ##   [1]     chr1 520211-526211      - |      2151        25
        ##   -------
        ##   seqinfo: 1 sequence from an unspecified genome

        gr %>% filter(score > 15) %>% width()
        ## [1] 6001
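The idea of composing verbs with the pipe extends past `filter()`. A minimal sketch, assuming the plyranges verbs `mutate()`, `group_by()`, and `summarise()` behave on a GRanges object much as their dplyr counterparts do on a data frame; `double_score` and `mean_score` are column names invented for this example.

```r
library(GenomicRanges)
library(plyranges)

# Same object as on slide 14.
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Add a metadata column derived from an existing one.
gr_scaled <- gr %>%
  mutate(double_score = score * 2)

# Summarise a metadata column within each strand.
by_strand <- gr %>%
  group_by(strand) %>%
  summarise(mean_score = mean(score))
```

Note that `mutate()` returns a GRanges object, while `summarise()` returns a table of summaries rather than ranges, since a summary row no longer corresponds to a single interval.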
  18. None
  19. Feel free to send comments/questions:
    Twitter: @stephaniehicks
    Email: shicks19@jhu.edu
    #rstatsdc #rladies
    Thank you!