Analyzing Genomics Data in R with Bioconductor

Slide deck for Stephanie Hicks from the 2018 DC R Conference (https://dc.rstats.ai)

Stephanie Hicks

November 08, 2018

Transcript

  1. 1.

    ANALYZING GENOMICS DATA IN R WITH BIOCONDUCTOR
    Stephanie Hicks, Assistant Professor, Biostatistics, Johns Hopkins Bloomberg School of Public Health
    #rstatsdc Conference, November 8, 2018
  2. 2.

    ABOUT ME
    Teaching: Data Science
    Research: Genomics
    • R/Bioconductor developer
    Other fun things about me:
    • Co-founded R-Ladies Baltimore
    • Creating a children’s book featuring women statisticians and data scientists
  3. 8.

    • Open-source, open development software project
    • Began in 2001
    • Big priorities: reproducible research and high-quality documentation
    • Vignettes
    • Diverse community support
    • Workflows (super helpful for n00bs)
    • Teaching resources and open development
  4. 9.
  5. 10.

    EXPLORE BIOCONDUCTOR STATISTICS WITH BIOCPKGTOOLS
    Functions to access metadata around Bioc packages and usage in a tidy data format.

        library(BiocPkgTools)
        pkgs <- biocDownloadStats()
        head(pkgs)
        ## # A tibble: 6 x 6
        ##   Package  Year Month Nb_of_distinct_IPs Nb_of_downloads repo
        ##   <fct>   <int> <fct>              <int>           <int> <chr>
        ## 1 ABarray  2018 Jan                  117             150 Software
        ## 2 ABarray  2018 Feb                   97             125 Software
        ## 3 ABarray  2018 Mar                  102             121 Software
        ## 4 ABarray  2018 Apr                  229             359 Software
        ## 5 ABarray  2018 May                   99             134 Software
        ## 6 ABarray  2018 Jun                  133             209 Software
  6. 11.

    HOW MANY PACKAGES IN BIOCONDUCTOR?

        library(dplyr)  # supplies %>% and the verbs used below
        pkgs %>%
          filter(Year == 2018) %>%
          select(Package, repo) %>%
          distinct() %>%
          group_by(repo) %>%
          summarize(total_packages = n())
        ## # A tibble: 3 x 2
        ##   repo           total_packages
        ##   <chr>                   <int>
        ## 1 AnnotationData           1124
        ## 2 ExperimentData            400
        ## 3 Software                 1733

    • Annotation packages: streamline tedious bookkeeping
    • Experiment data packages: contain processed data; useful for teaching
  7. 12.

    BIOCONDUCTOR SOFTWARE PACKAGES OVER TIME
    [Figure: Number of Bioconductor Software Packages. x-axis: Year (2010 to 2018); y-axis: Package count (600 to 1800).]
  8. 14.

    CREATE GRANGES OBJECT

        library(GenomicRanges)
        gr <- GRanges(seqnames = "chr1",
                      strand = c("+", "-"),
                      ranges = IRanges(start = c(102012, 520211),
                                       end = c(120303, 526211)),
                      gene_id = c(1001, 2151),
                      score = c(10, 25))
        gr
        ## GRanges object with 2 ranges and 2 metadata columns:
        ##       seqnames        ranges strand |   gene_id     score
        ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
        ##   [1]     chr1 102012-120303      + |      1001        10
        ##   [2]     chr1 520211-526211      - |      2151        25
        ##   -------
        ##   seqinfo: 1 sequence from an unspecified genome
  9. 15.

    THINGS YOU CAN DO WITH GRANGES OBJECTS

        width(gr)
        ## [1] 18292  6001

        gr[gr$score > 15, ]
        ## GRanges object with 1 range and 2 metadata columns:
        ##       seqnames        ranges strand |   gene_id     score
        ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
        ##   [1]     chr1 520211-526211      - |      2151        25
        ##   -------
        ##   seqinfo: 1 sequence from an unspecified genome
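Beyond `width()` and subsetting, a few other common GenomicRanges operations can be sketched on the same object. This is an illustration and not taken from the slides: the `query` region and the 100 bp shift are made-up values chosen only to show the calls.

```r
library(GenomicRanges)

# Rebuild the two-range object from the slides so this block runs on its own
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Accessors for the core range components
start(gr)
end(gr)
strand(gr)

# Metadata columns (gene_id, score) as a DataFrame
mcols(gr)

# Move both ranges 100 bp toward higher coordinates; widths are unchanged
gr_shifted <- shift(gr, 100)

# Hypothetical query region (an assumption for illustration only)
query <- GRanges("chr1", IRanges(start = 110000, end = 115000))

# Find which ranges in gr overlap the query; the result is a Hits object
hits <- findOverlaps(query, gr)

# Pull out the overlapping ranges from gr
gr[subjectHits(hits)]
```

The query here has no strand set, so it can match ranges on either strand.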
  10. 16.

    GENOMIC VERBS/ACTIONS + TIDY DATA = PLYRANGES
    • Goal: Write human readable analysis workflows
    • Idea: Define an API (i.e. extend dplyr) that maps relational genomic algebra to “verbs” that act on “tidy” genomic data
    • Another great idea: Borrow dplyr’s syntax and design principles
    • And another great idea: Compose verbs together with pipe operator from magrittr
    Stuart Lee, Di Cook, Michael Lawrence
  11. 17.

        library(plyranges)
        gr %>% filter(score > 15)
        ## GRanges object with 1 range and 2 metadata columns:
        ##       seqnames        ranges strand |   gene_id     score
        ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
        ##   [1]     chr1 520211-526211      - |      2151        25
        ##   -------
        ##   seqinfo: 1 sequence from an unspecified genome

        gr %>% filter(score > 15) %>% width()
        ## [1] 6001
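As a further sketch of composing verbs with the pipe, the same `gr` object can be passed through `mutate()`, `filter()`, `group_by()` and `summarise()`. This goes beyond what the slides show: the column names `score_scaled`, `n_ranges` and `mean_score` and the 0.5 cutoff are invented for illustration.

```r
library(GenomicRanges)
library(plyranges)

# Rebuild the two-range object from the slides so this block runs on its own
gr <- GRanges(seqnames = "chr1",
              strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Add a derived metadata column, then keep ranges above an arbitrary cutoff.
# The result is still a GRanges object.
high <- gr %>%
  mutate(score_scaled = score / max(score)) %>%
  filter(score_scaled > 0.5)
high

# Group by strand and summarise the metadata.
# The result is a DataFrame with one row per strand, not a GRanges.
by_strand <- gr %>%
  group_by(strand) %>%
  summarise(n_ranges = n(), mean_score = mean(score))
by_strand
```

Each step takes a GRanges and returns an object the next verb can act on, which is what lets the pipeline read top to bottom like a dplyr workflow.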
  12. 18.