
Analyzing Genomics Data in R with Bioconductor


Slide deck for Stephanie Hicks from the 2018 DC R Conference (https://dc.rstats.ai)

Stephanie Hicks

November 08, 2018



Transcript

  1. ANALYZING GENOMICS DATA IN R WITH BIOCONDUCTOR
    Stephanie Hicks
    Assistant Professor, Biostatistics
    Johns Hopkins Bloomberg School of Public Health
    #rstatsdc Conference
    November 8, 2018


  2. ABOUT ME
    Teaching: Data Science
    Research: Genomics
    • R/Bioconductor developer
    Other fun things about me:
    • Co-founded R-Ladies Baltimore
    • Creating a children’s book featuring women statisticians and data scientists


  3. COMPREHENSIVE R ARCHIVE NETWORK (CRAN)


  4. #rstatsdc
    #rladies
    #rstats


  5. https://www.r-project.org/other-projects.html


  6. https://www.r-project.org/other-projects.html


  7. CRAN, MEET YOUR COUSIN BIOCONDUCTOR!


  8. • Open-source, open development software project
    • Began in 2001
    • Big priorities: reproducible research and high-quality documentation
    • Vignettes
    • Diverse community support
    • Workflows (super helpful for n00bs)
    • Teaching resources


  9. View Slide

  10. EXPLORE BIOCONDUCTOR STATISTICS WITH BIOCPKGTOOLS
    Functions to access metadata about Bioconductor packages and their usage in a tidy data format.
    library(BiocPkgTools)
    pkgs <- biocDownloadStats()
    head(pkgs)
    ## # A tibble: 6 x 6
    ##   Package  Year Month Nb_of_distinct_IPs Nb_of_downloads repo
    ## 1 ABarray  2018 Jan                  117             150 Software
    ## 2 ABarray  2018 Feb                   97             125 Software
    ## 3 ABarray  2018 Mar                  102             121 Software
    ## 4 ABarray  2018 Apr                  229             359 Software
    ## 5 ABarray  2018 May                   99             134 Software
    ## 6 ABarray  2018 Jun                  133             209 Software


  11. HOW MANY PACKAGES IN BIOCONDUCTOR?
    library(dplyr)  # provides filter(), select(), distinct(), group_by(), summarize()
    pkgs %>%
      filter(Year == 2018) %>%
      select(Package, repo) %>%
      distinct() %>%
      group_by(repo) %>%
      summarize(total_packages = n())
    ## # A tibble: 3 x 2
    ##   repo           total_packages
    ## 1 AnnotationData           1124
    ## 2 ExperimentData            400
    ## 3 Software                 1733
    • Annotation packages: streamline tedious bookkeeping
    • Experiment data packages: contain processed data; useful for teaching
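
The download-stats table has one row per package per month, which is why the pipeline above calls distinct() before counting. A minimal base R sketch of the same logic on a toy table follows; the package names and rows are invented for illustration and are not Bioconductor data.

```r
# Toy stand-in for the download-stats table: one row per package per month.
# Package names and rows are hypothetical, made up for illustration only.
toy <- data.frame(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC", "pkgC"),
  Year    = c(2018, 2018, 2018, 2018, 2018, 2017),
  Month   = c("Jan", "Feb", "Jan", "Jan", "Feb", "Dec"),
  repo    = c("Software", "Software", "Software",
              "ExperimentData", "ExperimentData", "ExperimentData"),
  stringsAsFactors = FALSE
)

# Keep 2018 rows, then collapse to one row per (Package, repo) pair.
pairs_2018 <- unique(toy[toy$Year == 2018, c("Package", "repo")])

# Count distinct packages per repository.
total_packages <- table(pairs_2018$repo)
total_packages

# Without the de-duplication step, monthly rows are counted instead of packages.
rows_2018 <- table(toy$repo[toy$Year == 2018])
rows_2018
```

Comparing total_packages with rows_2018 shows how skipping the de-duplication step would inflate the counts by the number of months each package appears in.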


  12. BIOCONDUCTOR SOFTWARE PACKAGES OVER TIME
    [Figure: Number of Bioconductor Software Packages. x-axis: Year (2010 to 2018); y-axis: Package count (600 to 1800).]


  13. STANDARD BIOC DATA STRUCTURE
    GenomicRanges (GRanges)
    Lee et al. (2018), bioRxiv


  14. CREATE GRANGES OBJECT
    library(GenomicRanges)
    gr <- GRanges(seqnames = "chr1", strand = c("+", "-"),
                  ranges = IRanges(start = c(102012, 520211),
                                   end = c(120303, 526211)),
                  gene_id = c(1001, 2151),
                  score = c(10, 25))
    gr
    ## GRanges object with 2 ranges and 2 metadata columns:
    ##       seqnames        ranges strand |   gene_id     score
    ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
    ##   [1]     chr1 102012-120303      + |      1001        10
    ##   [2]     chr1 520211-526211      - |      2151        25
    ##   -------
    ##   seqinfo: 1 sequence from an unspecified genome


  15. THINGS YOU CAN DO WITH GRANGES OBJECTS
    width(gr)
    ## [1] 18292 6001
    gr[gr$score > 15, ]
    ## GRanges object with 1 range and 2 metadata columns:
    ##       seqnames        ranges strand |   gene_id     score
    ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
    ##   [1]     chr1 520211-526211      - |      2151        25
    ##   -------
    ##   seqinfo: 1 sequence from an unspecified genome
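
The widths reported above follow from the ranges being closed intervals, so both endpoints are counted. A small base R sketch of that arithmetic, using plain vectors rather than a GRanges object (the variable names are just for illustration):

```r
# Start/end coordinates and scores copied from the GRanges example above.
start <- c(102012, 520211)
end   <- c(120303, 526211)
score <- c(10, 25)

# Closed intervals: both endpoints count, hence the + 1.
widths <- end - start + 1
widths

# Logical subsetting, mirroring gr[gr$score > 15, ].
keep <- score > 15
widths[keep]
```

This is only meant to show where the numbers come from; in practice width(gr) and the GRanges subsetting shown above do this work while keeping the ranges and their metadata together.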


  16. GENOMIC VERBS/ACTIONS + TIDY DATA = PLYRANGES
    • Goal: Write human-readable analysis workflows
    • Idea: Define an API (i.e., extend dplyr) that maps relational genomic algebra to “verbs” that act on “tidy” genomic data
    • Another great idea: Borrow dplyr’s syntax and design principles
    • And another great idea: Compose verbs together with the pipe operator from magrittr
    Stuart Lee
    Di Cook
    Michael Lawrence


  17. library(plyranges)
    gr %>%
      filter(score > 15)
    ## GRanges object with 1 range and 2 metadata columns:
    ##       seqnames        ranges strand |   gene_id     score
    ##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
    ##   [1]     chr1 520211-526211      - |      2151        25
    ##   -------
    ##   seqinfo: 1 sequence from an unspecified genome
    gr %>%
      filter(score > 15) %>%
      width()
    ## [1] 6001
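
The same piping style extends to other dplyr-like verbs. The sketch below chains filter(), mutate(), and select() on the GRanges object from the earlier slide. It assumes GenomicRanges and plyranges are installed and that plyranges exposes the range width to mutate() as `width`; the column name `score_per_kb` is a hypothetical name chosen for illustration.

```r
library(GenomicRanges)
library(plyranges)

# Same two-range object as in the earlier slide.
gr <- GRanges(seqnames = "chr1", strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

res <- gr %>%
  filter(score > 15) %>%                          # keep high-scoring ranges
  mutate(score_per_kb = score / width * 1000) %>% # hypothetical derived column
  select(gene_id, score_per_kb)                   # keep two metadata columns

res
```

Each step returns a GRanges object, so the verbs can be reordered or extended without leaving the genomic data structure.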


  18. View Slide

  19. Feel free to send comments/questions:
    Twitter: @stephaniehicks
    Email: [email protected]
    #rstatsdc
    #rladies
    Thank you!
