Analyzing Genomics Data in R with Bioconductor

Slide deck for Stephanie Hicks from the 2018 DC R Conference (https://dc.rstats.ai)

Stephanie Hicks

November 08, 2018

Transcript

  1. ANALYZING
    GENOMICS DATA IN R
    WITH
    BIOCONDUCTOR
    Stephanie Hicks
    Assistant Professor, Biostatistics
    Johns Hopkins Bloomberg School of Public Health
    #rstatsdc Conference
    November 8, 2018

  2. ABOUT ME
    Teaching: Data Science
    Research: Genomics
    • R/Bioconductor developer
    Other fun things about me:
    • Co-founded R-Ladies Baltimore
    • Creating a children’s book
    featuring women statisticians
    and data scientists

  3. COMPREHENSIVE R ARCHIVE NETWORK (CRAN)

  4. #rstatsdc
    #rladies
    #rstats

  5. https://www.r-project.org/other-projects.html

  6. https://www.r-project.org/other-projects.html

  7. CRAN, MEET YOUR COUSIN
    BIOCONDUCTOR!

  8. • Open-source, open development software project
    • Began in 2001
    • Big priorities: reproducible research and high-quality
    documentation
    • Vignettes
    • Diverse community support
    • Workflows (super helpful for n00bs)
    • Teaching resources

  9. EXPLORE BIOCONDUCTOR
    STATISTICS WITH BIOCPKGTOOLS
    Functions to access metadata around Bioc packages and
    usage in a tidy data format.
    library(BiocPkgTools)
    pkgs <- biocDownloadStats()
    head(pkgs)
    ## # A tibble: 6 x 6
    ##   Package  Year Month Nb_of_distinct_IPs Nb_of_downloads repo
    ##
    ## 1 ABarray  2018 Jan                  117             150 Software
    ## 2 ABarray  2018 Feb                   97             125 Software
    ## 3 ABarray  2018 Mar                  102             121 Software
    ## 4 ABarray  2018 Apr                  229             359 Software
    ## 5 ABarray  2018 May                   99             134 Software
    ## 6 ABarray  2018 Jun                  133             209 Software
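
Because the download statistics come back in a tidy format, ordinary dplyr verbs apply to them directly. A minimal sketch, which retypes the six ABarray rows printed on this slide by hand instead of downloading them; the names `abarray` and `totals` are chosen here for illustration only:

```r
library(dplyr)

# Hand-typed copy of the six rows shown in the slide output (no network needed).
abarray <- data.frame(
  Package = "ABarray",
  Year = 2018L,
  Month = c("Jan", "Feb", "Mar", "Apr", "May", "Jun"),
  Nb_of_distinct_IPs = c(117L, 97L, 102L, 229L, 99L, 133L),
  Nb_of_downloads = c(150L, 125L, 121L, 359L, 134L, 209L),
  repo = "Software",
  stringsAsFactors = FALSE
)

# Total downloads and the busiest month for each package in the table.
totals <- abarray %>%
  group_by(Package) %>%
  summarize(total_downloads = sum(Nb_of_downloads),
            busiest_month = Month[which.max(Nb_of_downloads)])

totals
```

The same pipeline should work unchanged on the full `pkgs` table, since it relies only on the column names visible in the printed header.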

  10. HOW MANY PACKAGES IN
    BIOCONDUCTOR?
    library(dplyr)
    pkgs %>%
      filter(Year == 2018) %>%
      select(Package, repo) %>%
      distinct() %>%
      group_by(repo) %>%
      summarize(total_packages = n())
    ## # A tibble: 3 x 2
    ##   repo           total_packages
    ##
    ## 1 AnnotationData           1124
    ## 2 ExperimentData            400
    ## 3 Software                 1733
    • Annotation packages = streamline tedious bookkeeping
    • Experiment data packages = contain processed data; useful for teaching

  11. BIOCONDUCTOR SOFTWARE
    PACKAGES OVER TIME
    [Figure: "Number of Bioconductor Software Packages"; x-axis: Year (2010 to 2018), y-axis: Package count (600 to 1800)]
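
One plausible way to get per-year counts like these is to count distinct Software packages per year in a download-statistics table. Whether the slide's figure was produced this way is an assumption, and appearing in download statistics is not quite the same as being in a release. The sketch uses a tiny made-up table (`toy`, with hypothetical package names `pkgA`, `pkgB`, `pkgC`, `annoX`) in place of real data:

```r
library(dplyr)

# Made-up stand-in for a download-statistics table; not real Bioconductor data.
toy <- data.frame(
  Package = c("pkgA", "pkgA", "pkgB", "pkgA", "pkgB", "pkgC", "annoX"),
  Year    = c(2016, 2016, 2016, 2017, 2017, 2017, 2017),
  repo    = c(rep("Software", 6), "AnnotationData"),
  stringsAsFactors = FALSE
)

# Keep software packages, collapse the monthly rows, then count per year.
per_year <- toy %>%
  filter(repo == "Software") %>%
  distinct(Package, Year) %>%
  count(Year, name = "package_count")

per_year
```

A table shaped like `per_year` (one row per year, one count column) is what a line or point plot of package count against year would take as input.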

  12. STANDARD BIOC DATA
    STRUCTURE
    GenomicRanges (GRanges)
    Lee et al. (2018), bioRxiv

  13. CREATE GRANGES OBJECT
    library(GenomicRanges)
    gr <- GRanges(seqnames = "chr1", strand = c("+", "-"),
                  ranges = IRanges(start = c(102012, 520211),
                                   end = c(120303, 526211)),
                  gene_id = c(1001, 2151),
                  score = c(10, 25))
    gr
    ## GRanges object with 2 ranges and 2 metadata columns:
    ##       seqnames        ranges strand | gene_id score
    ##                                     |
    ##   [1]     chr1 102012-120303      + |    1001    10
    ##   [2]     chr1 520211-526211      - |    2151    25
    ##   -------
    ##   seqinfo: 1 sequence from an unspecified genome
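
The printed object has two halves: genomic coordinates to the left of the `|` and metadata columns to the right. A short sketch of the standard GenomicRanges accessors for each half, rebuilding the same `gr` so the block runs on its own; the variable names on the left-hand sides are illustrative:

```r
library(GenomicRanges)

# Same object as on the slide.
gr <- GRanges(seqnames = "chr1", strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Coordinate side (left of the "|").
chroms  <- as.character(seqnames(gr))  # chromosome name per range
starts  <- start(gr)                   # start positions
ends    <- end(gr)                     # end positions
strands <- as.character(strand(gr))    # "+", "-" or "*"

# Metadata side (right of the "|").
meta <- mcols(gr)    # all metadata columns as a DataFrame
ids  <- gr$gene_id   # a single metadata column via `$`
```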

  14. THINGS YOU CAN DO WITH
    GRANGES OBJECTS
    width(gr)
    ## [1] 18292 6001
    gr[gr$score > 15, ]
    ## GRanges object with 1 range and 2 metadata columns:
    ##       seqnames        ranges strand | gene_id score
    ##                                     |
    ##   [1]     chr1 520211-526211      - |    2151    25
    ##   -------
    ##   seqinfo: 1 sequence from an unspecified genome
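
Beyond widths and subsetting, GRanges objects also support range arithmetic and overlap queries. A minimal sketch using the same `gr`; the `query` region (chr1:110000-115000, unstranded) is a hypothetical interval picked for illustration so that it falls inside the first range:

```r
library(GenomicRanges)

gr <- GRanges(seqnames = "chr1", strand = c("+", "-"),
              ranges = IRanges(start = c(102012, 520211),
                               end = c(120303, 526211)),
              gene_id = c(1001, 2151),
              score = c(10, 25))

# Hypothetical region of interest; with no strand given it is "*",
# which is compatible with ranges on either strand.
query <- GRanges("chr1", IRanges(start = 110000, end = 115000))

hits    <- subsetByOverlaps(gr, query)  # ranges in gr that overlap the query
n_hits  <- countOverlaps(query, gr)     # how many ranges in gr the query hits
shifted <- shift(gr, 1000)              # move every range 1000 bp to the right
```

Shifting changes start and end together, so `width(shifted)` is the same as `width(gr)`.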

  15. GENOMIC VERBS/ACTIONS +
    TIDY DATA = PLYRANGES
    • Goal: Write human readable analysis
    workflows
    • Idea: Define an API (i.e. extend dplyr) that
    maps relational genomic algebra to “verbs”
    that act on “tidy” genomic data
    • Another great idea: Borrow dplyr’s
    syntax and design principles
    • And another great idea: Compose verbs
    together with pipe operator from magrittr
    Stuart Lee
    Di Cook
    Michael Lawrence

  16. library(plyranges)
    gr %>%
      filter(score > 15)
    ## GRanges object with 1 range and 2 metadata columns:
    ##       seqnames        ranges strand | gene_id score
    ##                                     |
    ##   [1]     chr1 520211-526211      - |    2151    25
    ##   -------
    ##   seqinfo: 1 sequence from an unspecified genome
    gr %>%
      filter(score > 15) %>%
      width()
    ## [1] 6001
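
The other dplyr-style verbs compose in the same way. A small sketch, assuming a plain data frame as the starting point (the slide builds `gr` with `GRanges()` instead); `df`, `result` and the `high_score` column are names introduced here for illustration:

```r
library(plyranges)

# Start from an ordinary data frame with the same values as the slide's gr.
df <- data.frame(seqnames = "chr1",
                 start = c(102012, 520211),
                 end = c(120303, 526211),
                 strand = c("+", "-"),
                 gene_id = c(1001, 2151),
                 score = c(10, 25))

gr <- as_granges(df)  # data frame -> GRanges

result <- gr %>%
  mutate(high_score = score > 15) %>%  # add a metadata column
  filter(high_score) %>%               # keep ranges where it is TRUE
  select(gene_id, high_score)          # keep only these metadata columns

result
```

Note that `select()` here only narrows the metadata columns; the genomic coordinates stay attached to each range.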

  17. Feel free to send comments/questions:
    Twitter: @stephaniehicks
    Email: [email protected]
    #rstatsdc
    #rladies
    Thank you!
