LIBD_DS_TLDR

Leonardo Collado-Torres

April 20, 2022

Transcript

  1. R/Bioconductor-powered Team Data Science
    Leonardo Collado-Torres
    Louise A. Huuki-Myers
    Joshua M. Stolz
    Nicholas J. Eagles
    + Geo Pertea
    https://lcolladotor.github.io/bioc_team_ds/
    April 20, 2022
    https://speakerdeck.com/lcolladotor/libd-ds-tldr

  2. DNA Genotyping: PopTop
    Joshua M. Stolz

  3. Benefit of TOPMed
    ● TOPMed offers a reference panel with far more SNPs (~300 million)
    ● The Rsq value at lower MAF is preserved in populations of African ancestry.
    ● Can this increased power be used to include lower MAFs?
    ● Should we just filter by Rsq?
    MAF: minor allele frequency
    Rsq: R squared
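
    The questions above about filtering on MAF versus Rsq can be explored with a few lines of code. Below is a minimal Python sketch, not part of PopTop: the variant records, field names, and thresholds (MAF >= 0.01, Rsq >= 0.8) are assumptions chosen only for illustration.

```python
# Hypothetical imputed-variant records; all values are made up for illustration.
variants = [
    {"id": "var1", "maf": 0.200, "rsq": 0.95},
    {"id": "var2", "maf": 0.004, "rsq": 0.90},  # rare but well imputed
    {"id": "var3", "maf": 0.050, "rsq": 0.40},  # common but poorly imputed
    {"id": "var4", "maf": 0.002, "rsq": 0.30},  # rare and poorly imputed
]


def filter_variants(records, min_maf=0.0, min_rsq=0.0):
    """Return the IDs of variants that meet both thresholds."""
    return [
        v["id"]
        for v in records
        if v["maf"] >= min_maf and v["rsq"] >= min_rsq
    ]


# Assumed thresholds: filter on MAF and Rsq together ...
maf_and_rsq = filter_variants(variants, min_maf=0.01, min_rsq=0.8)
# ... versus filtering on Rsq alone, which keeps well-imputed rare variants.
rsq_only = filter_variants(variants, min_rsq=0.8)

print("MAF + Rsq filter:", maf_and_rsq)
print("Rsq-only filter: ", rsq_only)
```

    In this toy data the Rsq-only filter retains one additional low-MAF variant, which is the trade-off the slide asks about.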

  4. PopTop: Nextflow Pipeline
    ● Parallelizes computationally expensive tasks
    ● Allows for the automation of large jobs
    ● Documentation is available at:
    ○ https://research.libd.org/Topmed-Imputation-Pipeline/

  5. TOPMed Output: VCF Format

  6. Future Work
    ● Working to make this modular with the upcoming LIBD Data Portal.
    ● Writing scripts to make delivering subsets of samples less time-consuming.
    ● Continuing to maintain and update the documentation website to make it more robust and user-friendly.
    @geo_pertea
    Geo Pertea
    @Nick-Eagles (GH)
    Nicholas J Eagles

  7. Bulk RNA-seq Processing: SPEAQeasy
    Nicholas J. Eagles

  8. Bulk RNA-seq Processing: Motivation and Challenges
    - Data processing should be uniform across time/datasets, documented, and reproducible
      - What aligner was used for this dataset?
      - Did we use hg19 or hg38? What GENCODE release was used?
      - How did we decide which samples to trim, if any?
      - Which version of FastQC was used?
    - While many computational steps are involved before analyses (e.g. DE) are possible, data pre-processing should ideally not require technical expertise to apply

  9. SPEAQeasy Workflow
    https://github.com/LieberInstitute/SPEAQeasy
    Raw sequencing reads → R objects, ready for statistical analysis

  10. Manuscript and Documentation
    https://doi.org/10.1186/s12859-021-04142-3
    http://research.libd.org/SPEAQeasy/

  11. Example Analysis
    - Demonstrate how to run SPEAQeasy on real data and work with its outputs
    - Use variant calling results to find and resolve identity issues originating from labelling mistakes
    - Perform a differential analysis after attaching experiment-specific sample metadata
    http://research.libd.org/SPEAQeasy-example
    @lahuuki
    Louise A Huuki-Myers
    @JoshStolz2
    Joshua M Stolz

  12. Configuration
    - Each user is encouraged to install SPEAQeasy separately
      - git clone git@github.com:LieberInstitute/SPEAQeasy.git
      - cd SPEAQeasy
      - bash install_software.sh "jhpce"
    - Control:
      - exact annotation files used (GENCODE/Ensembl versions)
      - command-line arguments to software
      - how to trim samples, if at all
      - aligner (HISAT2/STAR) and pseudo-aligner (kallisto/salmon)
    /dcs04/lieber/ds2a/Data/CMC/Data/RNAseq/Raw/SPEAQeasy

  13. Recent Improvements and Future Work
    Post-publication improvements:
    - Alignment-related optimizations that reduce disk space and computational time
    - Support for Singularity, which brought in new Cardiff users and greatly expands the pool of possible users (collaboration with Nick Clifton)
    - Added raw counts for transcripts in addition to TPM
    Future improvements:
    - Allow a user to only perform alignment, or only quantify transcripts
    - Reduce required memory when counting junctions to produce R objects
    @geo_pertea
    Geo Pertea

  14. WGBS Processing: BiocMAP
    Nicholas J. Eagles

  15. WGBS: Motivation and Challenges
    - Profile DNA methylation, a critical epigenetic modification, across the entire human genome
      - Differentially methylated regions (DMRs), e.g. between schizophrenia and controls
      - Methylation quantitative trait loci (meQTLs)
    - Large data size (> 1B cytosines measured, ~2TB disk space per sample)
      - Careful choices must be made to fit files on JHPCE and to generate/load results with available memory
    - Several steps are required before raw sequenced reads yield methylation proportions ready for analysis
      - Trimming, alignment to the reference genome, extraction of methylation proportions, import to R

  16. TL;DR
    Raw sequencing reads → R objects, ready for statistical analysis
    https://github.com/LieberInstitute/BiocMAP

  17. Manuscript and Documentation
    http://research.libd.org/BiocMAP/

  18. Example Vignette
    https://github.com/LieberInstitute/BiocMAP/blob/master/documentation/example_analysis/age_neun_analysis.pdf
    https://github.com/LieberInstitute/BiocMAP/blob/master/documentation/example_analysis/example_analysis.pdf
    Price et al., Genome Biology, 2019
    https://doi.org/10.1186/s13059-019-1805-1

  19. Workflow in Practice
    - Datasets where we’ve applied BiocMAP:
      - 664 PsychENCODE schizophrenia/control samples (Hippocampus, DLPFC, Caudate)
      - 20 PsychENCODE fetal samples
      - 2597 VA PTSD samples (year 1 and 2)
    - How long does it take to run?
      - ~3 months for 648 samples (second module)
      - ~2 weeks for 43 samples (both modules at JHPCE)
    - How much disk space is required?
      - ~2TB disk space per sample while generating; 1TB outputs

  20. LIBD Data Portal
    Geo Pertea

  21. LIBD Data integration
    • relational database tracking data assets at LIBD
    • linking LIMS to processed data
    • flexible database back-end & indexed file storage
    • unified web interface for data queries

  22. Experiment data flow
    [Diagram] Stages shown: brains, histological samples, extraction, sequencing, sequencing samples, processing
    Tables shown: subjects, samples, exp_metadata, exp_data
    Fields shown: id, brnum, brint, age, sex, race, dx_id; id, name, subj_id, region, sdate; id, dataset_id, s_id, s_name, sample_id, protocol, restricted, numReads, numMapped, totalMapped, overallMapRate, ..., mitoRate, rRNA_rate, totalAssignedGene; exp_id, dtype, ftype (G / E / J / T), f_set_id, f_data real [ ], version
    Storage: PostgreSQL database; assay data in H5 filesystem / Parquet

  23. PostgreSQL relational database
    [Diagram] Components: demographic data, experiment metadata, histological sample metadata, genomic features (annotations), assay data
    genotype table fields: id, subj_id, dnum, sample_id, panel_id, batch_id, call_rate, p10gc, p50gc, nPennCNV [ ], SUM16, SUM20, imputation, data_path (location of data files on file system storage)

  24. Integration of PostgreSQL and R from back-end to front-end
    Leveraging R’s data processing and visualization capabilities
    [Diagram]
    Front-end (web application): client selects dataset (sample metadata only); client receives results & plot data
    Middleware (nodejs)
    Back-end PostgreSQL server, SQL + R code: retrieve sample data, process large data, output results
    SQL / R server returns results & baked plot data (plotly JSON)
    srv16

  25. sc/snRNA-seq
    Louise A. Huuki-Myers
    Joshua M. Stolz
    With help from:
    @mattntran
    Matthew N Tran

  26. https://bioconductor.org/packages/3.14/SingleCellExperiment

  27. https://doi.org/10.1038/s41592-019-0654-x
    https://bioconductor.org/books/release/OSCA
    @stephaniehicks
    Stephanie C Hicks

  28. Quality control + normalization
    ● emptyDrops() from DropletUtils
    ○ Determine the empty droplets
    ● isOutlier() from scuttle (re-exported by scater)
    ○ Identify outlier cells/nuclei based on mitochondrial expression and other metrics
    ● devianceFeatureSelection() + nullResiduals() from scry
    ○ GLM-PCA approximation by Townes, Hicks, Aryee, and Irizarry
    https://doi.org/10.1186/s13059-019-1861-6
    ● reducedMNN() from batchelor
    ○ Batch correction since sc/snRNA-seq has strong sample effects
    ● + much more before you get to annotated clusters of cells
    @mattntran
    Matthew N Tran
    @Erik-D-Nelson (GH)
    Erik D Nelson
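
    Outlier detection of the kind isOutlier() performs is commonly based on the median absolute deviation (MAD). The following is a minimal Python sketch of that idea, not the Bioconductor implementation: the function name, the example values, and the default of 3 MADs are assumptions for illustration.

```python
from statistics import median


def mad_outliers(values, nmads=3.0, direction="higher"):
    """Flag values lying more than `nmads` MADs from the median.

    direction: "higher", "lower", or "both".
    """
    med = median(values)
    # 1.4826 rescales the MAD so it is comparable to a standard deviation
    # when the data are roughly normal.
    mad = 1.4826 * median(abs(v - med) for v in values)
    upper = med + nmads * mad
    lower = med - nmads * mad

    flags = []
    for v in values:
        if direction == "higher":
            flags.append(v > upper)
        elif direction == "lower":
            flags.append(v < lower)
        else:
            flags.append(v > upper or v < lower)
    return flags


# Hypothetical per-nucleus mitochondrial percentages (made-up values).
mito_percent = [1.0, 1.5, 2.0, 2.5, 3.0, 20.0]
flags = mad_outliers(mito_percent)
print(flags)
```

    Using the median and MAD rather than the mean and standard deviation keeps the threshold from being dragged upward by the very outliers it is meant to detect.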

  29. 1vAll Markers vs. Mean Ratio Markers
    https://research.libd.org/DeconvoBuddies/
    @lahuuki
    Louise A Huuki-Myers

  30. Deconvolution
    Louise A. Huuki-Myers

  31. What is Deconvolution?
    ● Inferring the composition of different cell types in bulk RNA-seq data
    [Figure: tissue profiled by bulk RNA-seq and snRNA-seq; deconvolution yields estimated proportions]
    Get single-cell-like information from bulk RNA-seq
    [Cost labels in figure: $$$, $, Free!]
    https://twitter.com/BoXia7/status/1261464021322137600
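
    To make the idea concrete, here is a toy Python sketch of deconvolution for a mixture of just two cell types. It is not Bisque, MuSiC, or any published method: it assumes bulk expression is a simple weighted average of two known reference profiles, and the profiles and mixing fraction are made up.

```python
def estimate_proportion(bulk, profile_a, profile_b):
    """Least-squares estimate of the fraction of cell type A in a two-type mix.

    Models bulk[g] ~ p * profile_a[g] + (1 - p) * profile_b[g] for each gene g
    and clips the estimate of p to the interval [0, 1].
    """
    num = sum(
        (b - pb) * (pa - pb) for b, pa, pb in zip(bulk, profile_a, profile_b)
    )
    den = sum((pa - pb) ** 2 for pa, pb in zip(profile_a, profile_b))
    if den == 0:
        raise ValueError("profiles are identical; proportion is not identifiable")
    return min(1.0, max(0.0, num / den))


# Hypothetical mean expression of three marker genes in each cell type.
neuron = [10.0, 1.0, 5.0]
glia = [2.0, 8.0, 5.0]

# Simulate a bulk sample that is 70% neurons and 30% glia.
bulk = [0.7 * n + 0.3 * g for n, g in zip(neuron, glia)]

p_neuron = estimate_proportion(bulk, neuron, glia)
print(f"estimated neuron fraction: {p_neuron:.2f}")
```

    Note that the third gene, which is expressed equally in both profiles, contributes nothing to the estimate; this is one reason marker gene selection matters for real methods.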

  32. Mean Proportions By Region: Tran et al, bioRxiv, 2020 (5 donors, 6 cell types)

  33. Mean Proportions By Region: Tran et al, Neuron, 2021 (8 donors, 10 cell types)
    Peric = Mural + Endo

  34. Bisque & MuSiC vs SPLITR
    Different deconvolution methods, bulk RNA-seq data source, marker genes, and reference snRNA-seq data
    ● Bisque has a more similar pattern of composition over regions vs. SPLITR
    ● MuSiC predicts large proportions of Endo + Mural (Peric)
    ● Both estimate lower proportions of Excit
    ○ MuSiC is more extreme and also predicts a low proportion of Inhib

  35. Method Sensitivity to Marker Set: 25 vs. 20 Genes
    ● Run with sets of 20 & 25 marker genes per cell type
    ● Bisque is more robust to changes in the marker set than MuSiC
    Currently Bisque is our method of choice

  36. Dataset | Regions | Samples | Case | Control | Analysis | Publication
    BipSeq | sACC + AMY | 511 | 247 BPD | 264 | Revisions | Zandi et al., Nat. Neurosci, 2022
    Suicide Genomics | DLPFC | 329 | 226 | 103 | Revisions | Punzi et al., American Journal of Psychiatry, 2022
    BrainSeq Phase III | Caudate | 464 | 298 SCZD | 266 | Revisions | Benjamin et al., Nature Neuroscience, 2022
    MDDseq | sACC + AMY | 1091 | 704 MDD/BPD | 387 | Main | In Progress
    AANRI | DG, Caudate, Hippo, DLPFC | 1647 (263, 464, 447, 453) | - | - | Main | In Progress
    Astellas AD | | | | | Main | In Progress
    BrainSeq Phase I | DLPFC | 727 | 395 SCZD | 332 | Exploratory | -
    BrainSeq Phase II | DLPFC | 453 | 153 SCZD | 300 | Exploratory | -
    GTEx | 13 Regions | 2670 | - | - | Exploratory | -
    Degradation | AMY, Caudate, DLPFC, HIPPO, mPFC, sACC | 119 | - | - | Exploratory | -

  37. Upcoming: Deconvolution Methods Benchmark
    ● Goal: determine the most accurate deconvolution method for brain bulk RNA-seq data
    ○ Test available software (Bisque, MuSiC, and others) over a variety of conditions
    ■ Reference set qualities
    ■ Marker gene selection
    ■ Preparation of the bulk data
    ● Requires: a “gold standard” cell type composition reference to measure performance
    ○ snRNA-seq can be enriched for certain cell types
    ○ smFISH + RNAscope allows “direct” measurement from intact tissue and will be used to establish true composition

  38. Goals for RNAscope Experiment
    ● Deconvolution R01 MH123183
    ○ Kristen Maynard, Stephanie C Hicks
    ● Use six slices of DLPFC to generate corresponding RNA-seq & RNAscope data
    ● This information will be useful to evaluate and design deconvolution algorithms
    [Figure labels: DLPFC, bulk RNA-seq, snRNA-seq, spatial, RNAscope, polyA, RiboZero]
    @kr_maynard
    Kristen R Maynard
    @stephaniehicks
    Stephanie C Hicks
    Kelsey D Montgomery

  39. What is a TREG?
    ● Total RNA Expression Gene
    ● Expression is proportional to the overall RNA expression in a cell
    ● In smFISH, the count of TREG puncta in a cell can estimate the RNA content
    ○ Linking RNA content to nucleus size
    http://research.libd.org/TREG/
    http://bioconductor.org/packages/TREG/

  40. eQTLs
    Louise A. Huuki-Myers

  41. Key inputs
    ● Genotype Data
    ○ Consider minor allele frequency
    ○ Full TOPMed imputed SNP data set
    ○ Risk SNP subset
    ● Expression Data
    ○ Gene, exon, junction, transcript
    ○ Position of the feature
    ● Covariates Data
    ○ Phenotype data: Dx, Age, Sex
    ○ Feature PCs
    ● Interaction Data
    ○ Example: cell fractions from deconvolution
    ● Parameters
    ○ Window size
    ○ Minor allele frequency
    [Workflow diagram]
    PopTop genotype data → Plink files containing SNPs of interest
    SPEAQeasy-generated SummarizedExperiment → feature position + expression matrix; covariate data as matrix
    Deconvolution or other analysis → interaction vector (only for interaction analysis)
    Inputs + parameters → tensorQTL → eQTL results

  42. MatrixEQTL vs tensorQTL (fastQTL)
    MatrixEQTL
    ● R package
    ● Many Andrew E Jaffe analyses:
    ○ BrainSeq Phase II
    ○ Burke et al stem cell
    ○ …
    ○ BipSeq by Zandi et al
    tensorQTL
    ● Python, GPU enabled
    ● Currently utilized in MDDseq project
    ● Recommended upgrade by Andrew Jaffe, utilized by other LIBD researchers
    ● github.com/broadinstitute/tensorqtl
    https://youtu.be/zOMUXYHtVJM

  43. Genome-wide eQTLs: several flavors
    ● Nominal: evaluate all pairs
    ● Cis: find most significant pair per feature
    ● Independent: conditionally independent cis-QTLs using stepwise regression
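
    The difference between the nominal and cis flavors can be sketched in a few lines of Python. This is only an illustration with made-up feature names, variant names, and p-values; it is not tensorQTL code, and as far as I understand the real cis mode also applies permutation-based correction, which is omitted here.

```python
# Hypothetical nominal results: one row per (feature, variant) pair tested.
nominal = [
    ("geneA", "var1", 0.04),
    ("geneA", "var2", 0.0001),
    ("geneB", "var1", 0.2),
    ("geneB", "var3", 0.03),
]


def top_pair_per_feature(pairs):
    """Keep only the most significant variant for each feature."""
    best = {}
    for feature, variant, pval in pairs:
        if feature not in best or pval < best[feature][1]:
            best[feature] = (variant, pval)
    return best


# "Nominal" keeps every pair; "cis" summarizes to one lead variant per feature.
cis = top_pair_per_feature(nominal)
for feature, (variant, pval) in cis.items():
    print(feature, variant, pval)
```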

  44. tensorQTL at JHPCE (GPU-powered)
    Data Formatting
    ● Genotype Data
    ○ Needs PLINK .bed/.bim/.fam files
    ● Expression Data
    ○ Gene, exon, junction, transcript
    ○ As .bed.gz
    ● Covariates Data
    ○ Phenotype data: Dx, Age, Sex, Feature PCs
    ■ Categorical variables must be converted to numeric
    ○ File type flexible, need to read in as pandas.DataFrame
    How to Run on GPU
    ● Can be used as a function in a Python script or as a command line tool
    ○ Requires conversion to correct data formats
    ● Fast when run on GPU
    ○ Completed MDDseq Amygdala Gene analysis in 2.52 min vs 51.21 min on CPU (vs. 288 min with MatrixEQTL)
    ■ 540 samples x 53.6M pairs
    ● Use GPU queue when submitting job
    ○ Example sh file
    #$ -l gpu,mem_free=50G,h_vmem=50G,h_fsize=100G
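
    The requirement that categorical covariates be converted to numeric can be handled before the table is loaded. Below is a minimal standard-library Python sketch; the sample IDs, values, and the choice of reference levels ("Control", "F") are assumptions for illustration, not the MDDseq data or code.

```python
# Hypothetical phenotype records; values are made up for illustration.
samples = [
    {"id": "S1", "Dx": "Control", "Sex": "F", "Age": 45.2},
    {"id": "S2", "Dx": "MDD", "Sex": "M", "Age": 51.0},
    {"id": "S3", "Dx": "MDD", "Sex": "F", "Age": 38.7},
]


def encode_binary(records, column, reference):
    """Encode a two-level categorical column as 0 (reference level) or 1."""
    return [0 if r[column] == reference else 1 for r in records]


# Every covariate column is now numeric, as the slide requires.
covariates = {
    "Dx": encode_binary(samples, "Dx", reference="Control"),
    "Sex": encode_binary(samples, "Sex", reference="F"),
    "Age": [r["Age"] for r in samples],
}
print(covariates)
```

    If pandas is installed, `pandas.DataFrame(covariates, index=[s["id"] for s in samples])` would turn this into a numeric table with one row per sample; whether samples should be rows or columns for a given tensorQTL entry point is best checked against its documentation.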

  45. GWAS-loci eQTL analysis
    ● Subset genotype dataset to SNPs identified as risk loci by GWAS
    ● Check for association with cellular fractions predicted by deconvolution
    ○ Run nominal analysis w/ addition of interaction vector
    ○ Adds interaction term to the model
    ■ p ~ g + i + gi
    PGC Major Depressive Disorder GWAS
    Wray et al. Nature Genetics, 2018
    Deconvolution Results
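
    Written out as a regression equation, the formula p ~ g + i + gi corresponds to a linear model of roughly the following form. The coefficient symbols and the error term are notation introduced here for illustration; any additional covariates in the real analysis are left out.

```latex
p = \beta_0 + \beta_g \, g + \beta_i \, i + \beta_{gi} \, (g \times i) + \varepsilon
```

    Here p is the expression phenotype, g the genotype, and i the interaction variable (for example a cell type fraction from deconvolution). A nonzero \beta_{gi} means the genotype effect on expression changes with the cell type fraction.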

  46. Interaction eQTLs with cell type proportions
    https://github.com/LieberInstitute/goesHyde_mdd_rnaseq/tree/master/eqtl/code

  47. Quality Surrogate Variable Analysis (qSVA)
    Joshua M. Stolz

  48. Differential expression is confounded by degradation
    The t-statistics for SCZ vs Control and for degradation time DE are correlated.
    Traditional methods (like RIN) fail to remove this effect.
    Jaffe AE, Tao R, Norris AL, Kealhofer M, Nellore A, Shin JH, et al. qSVA framework for RNA quality correction in differential expression analysis. Proc Natl Acad Sci U S A. 2017;114:7130–5.

  49. qSVA Original Process
    Each sample was allowed to degrade on a bench for 0, 15, 30, or 60 minutes.
    From this we get the top 1000 expressed regions associated with degradation.
    Peterson, Amy. “Quality Surrogate Variable Analysis.” LIBD Rstats Club, LIBD Rstats Club, 11 Dec. 2018, research.libd.org/rstatsclub/2018/12/11/quality-surrogate-variable-analysis/

  50. Updated pipeline
    2000

  51. Degradation is confounded by Region

  52. Deconvolution
    @lahuuki
    Louise A Huuki-Myers

  53. Deconvolution in Degradation Matrix
    ● Identify 2,976 degradation-associated transcripts with cell proportion terms in the model (vs. 1,792)
    ● Controlling expression for qSVs predicted with this set of transcripts shows lower correlations between DE results and the degradation statistic (desired result)
    Cor = -0.091 Cor = -0.051
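
    The "Cor" values above are correlations between two sets of per-gene t-statistics. As a reminder of how such a number is computed, here is a minimal Python sketch of a Pearson correlation; the t-statistics are made up and far fewer than in a real analysis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical per-gene t-statistics (made-up values).
t_dx = [2.1, -0.5, 1.3, -1.8, 0.2]           # case vs control DE
t_degradation = [-1.9, 0.7, -1.0, 2.2, 0.1]  # degradation time DE

r = pearson(t_dx, t_degradation)
print(f"correlation of t-statistics: {r:.3f}")
```

    A correlation closer to zero after adjusting for qSVs suggests that less of the case-control signal is explained by RNA degradation.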

  54. http://research.libd.org/qsvaR
    http://bioconductor.org/packages/qsvaR/
    With ongoing feedback on the documentation from:
    @HeenaDivecha
    Heena R Divecha

  55. Differential Gene Expression
    Louise A. Huuki-Myers

  56. Key inputs
    ● Quality Controlled Expression Data
    ● Model & corresponding data
    ○ Primary Dx (explanatory variable)
    ○ Phenotype data (AgeDeath, Sex)
    ○ Quality control metrics
    ■ mitoRate, rRNA_rate, totalAssignedGene, RIN, ERCC
    ○ snpPCs
    ■ from DNA Genotyping
    ○ qSVs
    ■ from qSVA v1 or v2 (qsvaR)
    ○ Other Analysis
    ■ e.g. Deconvolution cell fractions
    Model: ~Dx + pd + QC + snpPC + qSVs + ?
    [Workflow diagram]
    Inputs to the model matrix (model.matrix()): PopTop genotype data, qsvaR degradation matrix, deconvolution or other analysis, SPEAQeasy-generated SummarizedExperiment
    Normalized counts: calcNormFactors()
    limma + voom process: lmFit() → eBayes() → topTable() → DE Results

  57. Modeling with limma: quick overview
    ● calcNormFactors() from edgeR
    ○ For normalization of the bulk RNA-seq counts
    ● model.matrix() from stats
    ○ Define how you want to model gene expression
    ○ Covariates like qSVs, ancestry PCs, SPEAQeasy QC metrics, sex, age, diagnosis, …
    ● lmFit()
    ○ Fit the linear regression model for all genes
    ● eBayes()
    ○ Use empirical Bayes to compute the statistics
    ● topTable()
    ○ Extract results for downstream analyses
    More details at
    http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html
    edgeR, DESeq2, and dream are good alternatives
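
    One piece of this workflow that is easy to show in isolation is the log-CPM transformation that voom applies to counts before the linear model is fit. The Python sketch below covers only that transformation as I understand it from the voom paper, log2((count + 0.5) / (library size + 1) x 10^6); it leaves out TMM normalization factors, precision weights, and the empirical Bayes step, and the counts are made up.

```python
from math import log2


def voom_log_cpm(counts, lib_size):
    """log2 counts-per-million with voom-style offsets of 0.5 and 1."""
    return [log2((c + 0.5) / (lib_size + 1.0) * 1e6) for c in counts]


# Hypothetical raw counts for three genes in one sample.
counts = [0, 10, 1000]
lib_size = sum(counts)  # here simply the column sum, with no scaling factor

log_cpm = voom_log_cpm(counts, lib_size)
for c, v in zip(counts, log_cpm):
    print(c, round(v, 2))
```

    The 0.5 offset keeps zero counts finite on the log scale, which is why the first gene still gets a value.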

  58. Why limma + voom?
    ● limma + voom fits linear models, whereas DESeq2 fits negative binomial models
    ○ Comparable results
    ○ limma is less computationally expensive (faster)
    ● Other methods:
    ○ dream: linear mixed effect model (Hoffman et al. Bioinformatics, 2021)
    ■ More precise but computationally expensive
    ■ May be better for analysis with small sample sizes

  59. Example DE code with limma
    https://github.com/LieberInstitute/goesHyde_mdd_rnaseq/blob/master/differential_expression/code/run_DE.R

  60. Interactively explore your model.matrix
    http://bioconductor.org/packages/ExploreModelMatrix
    https://doi.org/10.12688/f1000research.24187.2
    ● Important for exploring the DE model

  61. Adding Cell Fraction to DE Model
    ● Including deconvolution results can give a more conservative model
    ○ For the most part similar t-statistics
    ○ Fewer significant DE genes

  62. Spatially-resolved transcriptomics
    Leonardo Collado-Torres
    With help from:
    @abspangler
    Abby Spangler
    @lmwebr
    Lukas M Weber
    @stephaniehicks
    Stephanie C Hicks
    @MadhaviTippani
    Madhavi Tippani
    @sowmyapartybun
    Sowmya Parthiban
    @HeenaDivecha
    Heena R Divecha
    @PardoBree
    Brenda Pardo
    @Nick-Eagles (GH)
    Nicholas J Eagles
    @martinowk
    Keri Martinowich
    @kr_maynard
    Kristen R Maynard
    @CerceoPage
    Stephanie C Page

  63. SpatialExperiment: infrastructure for spatially resolved transcriptomics data in R using Bioconductor
    Righelli, Weber, Crowell, et al, bioRxiv, 2021
    Accepted at Oxford Bioinformatics on 04/19/2022
    DOI 10.1101/2021.01.27.428431
    @drighelli
    Dario Righelli
    @CrowellHL
    Helena L Crowell
    @lmwebr
    Lukas M Weber

  64. bioconductor.org/packages/spatialLIBD
    Pardo et al, bioRxiv, 2021 DOI 10.1101/2021.04.29.440149
    Accepted at BMC Genomics on 04/20/2022
    Maynard, Collado-Torres, Nat Neuro, 2021
    @PardoBree
    Brenda Pardo
    @abspangler
    Abby Spangler

  65. http://research.libd.org/spatialLIBD/articles/TenX_data_download.html

  66. OSTA:
    https://lmweber.org/OSTA-book/
    @lmwebr
    Lukas M Weber
    @lcolladotor
    Leonardo Collado-Torres
    @abspangler
    Abby Spangler
    @HeenaDivecha
    Heena R Divecha
    @MadhaviTippani
    Madhavi Tippani
    @stephaniehicks
    Stephanie C Hicks

  67. Data-driven clustering: BayesSpace
    Zhao et al, Nature Biotechnology, 2021 https://doi.org/10.1038/s41587-021-00935-2

  68. Spatial registration of your sc/snRNA-seq data
    Your sc/snRNA-seq data
    Our spatial data
    Hodge et al, Nature, 2019
    Maynard, Collado-Torres, Nat Neuro, 2021

  69. Spot deconvolution: Tangram
    https://www.nature.com/articles/s41592-021-01264-7/figures/1 @Nick-Eagles (GH)
    Nicholas J Eagles

  70. Our Philosophy + Getting Help

  71. Share knowledge openly
    ● As an independent researcher, my team and I are not a data science core, yet we share our knowledge openly so others can get up to speed if needed
    ○ Share research results early through pre-prints (bioRxiv)
    ○ Share code on GitHub with others
    ■ GitHub is widely used as a social coding platform
    ○ Share code snippets that might be useful to others
    ○ Share our experiences
    ○ Maintain and share information on several Slack communication channels
    ○ People are free to adapt what we have done, and we would love to learn about what others have come up with, since we might need to update/change our own work
    ■ We do not impose solutions or make decisions for others

  72. Different types of help
    ○ Things we can do
    ■ Guidance, feasibility, and/or brainstorming
    ■ Data processing, such as bulk RNA-seq with SPEAQeasy, DNA genotyping with PopTop, WGBS with BiocMAP, …
    ● We would strongly prefer that others learn how to run these tools
    ○ Aka, please use the documentation we wrote =)
    ■ Sharing data with external collaborators
    ● After internal LIBD approval by Rujuta Narurkar
    ○ Things that are beyond what we can typically do
    ■ Lead analysis
    ■ Develop and/or maintain custom solutions
    ■ Write papers

  73. Data Science guidance sessions (DSgs)
    ● https://lcolladotor.github.io/bioc_team_ds/data-science-guidance-sessions.html
    ○ JHPCE
    ○ R
    ○ Bioconductor
    ○ Understanding code we wrote
    ○ Training others on how they can more effectively get help from us or others
    ■ Providing reproducible examples: reprex in R (https://reprex.tidyverse.org/), which provides a solution to “help me help you”
    ■ Framing questions and software bug reports
    ● The DSgs system works best over the long term
    ○ It’s based on my 3-year experience as a JHBSPH MPH capstone teaching assistant

  74. https://jhpce.jhu.edu/knowledge-base/knowledge-base-articles-from-lieber-institute/
    Join us Fridays at 9 AM (check the code of conduct please!)

  75. https://www.youtube.com/c/LeonardoColladoTorres/playlists
    Videos allow us to multiply ourselves
    We can make you custom selections of videos for a specific problem during DSgs

  76. https://github.com/

  77. https://github.com/LieberInstitute
    Email Bill Ulrich your GitHub username to get added
    @ckbehemoth (GH)
    William S Ulrich

  78. https://github.com/search?q=org%3ALieberInstitute
    Example question:
    How do you use aggregateAcrossCells() ?

  79. https://github.com/search?q=org%3ALieberInstitute+aggregateAcrossCells&type=code
    GitHub is our library / encyclopedia
    It could be yours / LIBD’s too!
    Code is the ultimate documentation
    Git commit messages remind you of what you were thinking when you made a change
    Bill Ulrich or Leo can give you access

  80. Spatial registration code example
    Code is constantly adapted and improved, both within and across projects
    Project 1
    ● https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/layer_specificity.R
    ● https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/asd_snRNAseq_recast.R
    Project 2
    ● https://github.com/LieberInstitute/spatialDLPFC/blob/main/code/analysis/07_spatial_registration/07_spatial_registration.R
    Project 3
    ● https://github.com/LieberInstitute/Visium_IF_AD/blob/master/code/10_spatial_registration/01_spatial_registration.R
    On the horizon: a new function at https://github.com/LieberInstitute/spatialLIBD
    @sowmyapartybun
    Sowmya Parthiban
    @abspangler
    Abby Spangler
    It’s hard to keep track of code evolution
    ● Try to include comments linking back to where you adapted it from
    Basically: divide and conquer ^^

  81. @lcolladotor
    Leonardo Collado-Torres
    @lahuuki
    Louise A Huuki-Myers
    @JoshStolz2
    Joshua M Stolz
    @Nick-Eagles (GH)
    Nicholas J Eagles
    @geo_pertea
    Geo Pertea
    @abspangler
    Abby Spangler
    @mattntran
    Matthew N Tran
    @lmwebr
    Lukas M Weber
    @stephaniehicks
    Stephanie C Hicks
    @MadhaviTippani
    Madhavi Tippani
    @sowmyapartybun
    Sowmya Parthiban
    + Many more LIBD, JHU, and external collaborators
    @PardoBree
    Brenda Pardo
    @HeenaDivecha
    Heena R Divecha
    @ckbehemoth (GH)
    William S Ulrich
    @martinowk
    Keri Martinowich
    @kr_maynard
    Kristen R Maynard
    @CerceoPage
    Stephanie C Page
