Incorporating biological information into genomic prediction models

Andrea Rau
February 07, 2023

VistaMilk Artificial Intelligence in Agriculture Masterclass
February 8, 2023 (online)
https://www.vistamilk.ie/event/artificial-intelligence-in-agriculture-masterclass/

Transcript

  1. Incorporating biological information into genomic prediction models

    Fanny Mollandin, Pascal Croiseau, Andrea Rau
    VistaMilk Artificial Intelligence in Agriculture Masterclass @ Zoom, February 8, 2023
    [email protected] Biological priors in genomic prediction models 1 / 21
  5. Introduction / Context: Genomic selection overview

    Objective: select the best animals for reproduction to obtain genetic improvement of the population on traits of interest
    Low- to high-density genotyping chips (10k-100k SNPs) → whole genome sequencing (10M SNPs)
    Image: F. Mollandin
  7. Introduction / Context: Prediction models for genomic selection

    Goal: given a training set of data $(Y_i, X_i, Z_i)$ for $i = 1, \ldots, n$ individuals, where
    $Y_i$ = trait
    $X_i$ = vector of (usually genome-wide) genotypes
    $Z_i$ = vector of covariates (age, location, sex, ...)
    ... predict the unobserved trait $Y^\star$ of a future individual with corresponding $X^\star$ and $Z^\star$
    Introduced by Meuwissen et al. (2001)
    Successfully implemented in many plant/animal breeds for traits related to production, health, climate adaptation, ...
    Modest gains in predictions can have large economic impacts (reduced generation interval, reduced cost and labor for phenotyping)
  9. Introduction / Context: Challenges of genomic prediction models

    Non-random association between alleles at neighboring loci (linkage disequilibrium, LD)
    Polygenic nature of complex traits
    Many more SNPs (variables) than individuals (observations) ⇒ curse of dimensionality
    Including too many predictors in a model risks over-fitting, poor generalizability, and problems with model estimation
    ... but including only a small pre-identified subset of SNPs (e.g., significant GWAS hits) usually leads to poor predictions
    → Balance computational/statistical feasibility and biologically realistic assumptions
    Can genomic prediction models be improved by better accounting for our knowledge about the function of certain regions of the genome?
  11. Introduction / Functional annotations: Context: H2020 GENE-SWitCH project

    The regulatory GENomE of Swine & Chicken: functional annotation during development
    High-quality richly annotated maps of pig and chicken genomes:
    Development: early/late organogenesis, newborn/hatched, adult
    Sexes: {M, F} × 3 biological replicates
    Tissues: liver, skeletal muscle, small intestine, cerebellum, dorsal epidermis, lung, kidney
    Assays: RNA-seq, ATAC-seq, ChIP-seq, smRNA-seq, methylation, Hi-C
    But how?
  12. Introduction / Models for genomic prediction: First, back to basics: the linear model

    The workhorse of genomic prediction is the multiple linear regression model:
    $Y = Z\theta + X\beta + \varepsilon$
    $Y$ = $n$-vector of traits
    $Z$ = $n \times m$ matrix of covariates
    $\theta$ = $m$-vector of covariate effect parameters
    $X$ = $n \times p$ matrix of (suitably coded) genotypes
    $\beta$ = $p$-vector of genetic effect parameters
    $\varepsilon$ = $n$-vector of errors representing noise, assumed to be iid and (usually) normally distributed
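As a rough illustration of the linear model on the slide above, the following sketch simulates a small data set and evaluates the linear predictor for one individual. The function names, the 0/1/2 genotype coding, the allele frequency of 0.3, and the number of causal SNPs are all assumptions chosen for the example, not details from the talk.

```python
import random


def predict(z_row, x_row, theta, beta):
    """Linear predictor z'theta + x'beta for a single individual."""
    covariate_part = sum(z * t for z, t in zip(z_row, theta))
    genetic_part = sum(x * b for x, b in zip(x_row, beta))
    return covariate_part + genetic_part


def simulate_linear_model(n=50, m=2, p=200, n_causal=5, noise_sd=1.0, seed=1):
    """Simulate Y = Z theta + X beta + eps (all settings are illustrative)."""
    rng = random.Random(seed)
    # Covariates: an intercept column plus (m - 1) standard-normal columns.
    Z = [[1.0] + [rng.gauss(0, 1) for _ in range(m - 1)] for _ in range(n)]
    theta = [rng.gauss(0, 1) for _ in range(m)]
    # Genotypes coded as allele counts 0/1/2 (assumed allele frequency 0.3).
    X = [[int(rng.random() < 0.3) + int(rng.random() < 0.3) for _ in range(p)]
         for _ in range(n)]
    # Sparse genetic effects: only n_causal of the p SNPs are non-zero.
    beta = [0.0] * p
    for j in rng.sample(range(p), n_causal):
        beta[j] = rng.gauss(0, 1)
    # iid normal errors added to the linear predictor.
    Y = [predict(Z[i], X[i], theta, beta) + rng.gauss(0, noise_sd)
         for i in range(n)]
    return Y, Z, X, theta, beta


Y, Z, X, theta, beta = simulate_linear_model()
print(len(Y), "individuals,", len(X[0]), "SNPs")
```

With the default settings there are four times as many SNPs as individuals, which is the "many more variables than observations" situation that motivates the priors discussed next.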
  14. Introduction / Models for genomic prediction: Bayesian methods for genomic prediction

    Image: 10.1007/s10681-007-9516-1
    likelihood × prior:
    $\prod_{i=1}^{n} N\left(Y_i \,\middle|\, \mu + \sum_{j=1}^{p} X_{ij}\beta_j,\ \sigma^2\right) \times p(\sigma^2) \prod_{j=1}^{p} p(\beta_j \mid \Psi)$
    $\sigma^2$ often assigned a $\chi^{-2}$ (inverse chi-squared) prior distribution
    Choice of prior for $\beta_j$ should ideally reflect a trait's genetic architecture (and be computationally feasible...)
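To make the likelihood × prior factorization above concrete, here is a minimal sketch that evaluates the unnormalized log posterior for given parameter values. The function names and the example hyperparameters are assumptions for illustration; the priors are passed in as callables so that any of the choices discussed in the talk could be plugged in.

```python
import math


def log_normal_pdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def log_scaled_inv_chi2_pdf(x, nu, s2):
    """Log density of a scaled inverse chi-squared(nu, s2) evaluated at x > 0."""
    half_nu = nu / 2
    return (half_nu * math.log(half_nu * s2) - math.lgamma(half_nu)
            - (half_nu + 1) * math.log(x) - half_nu * s2 / x)


def log_posterior(Y, X, mu, beta, sigma2, log_prior_beta, log_prior_sigma2):
    """Unnormalized log posterior: log likelihood plus log priors."""
    log_lik = 0.0
    for y_i, x_i in zip(Y, X):
        mean_i = mu + sum(x * b for x, b in zip(x_i, beta))
        log_lik += log_normal_pdf(y_i, mean_i, sigma2)
    log_prior = log_prior_sigma2(sigma2) + sum(log_prior_beta(b) for b in beta)
    return log_lik + log_prior


# Toy usage with a Gaussian prior on each effect (hyperparameters are arbitrary).
Y = [1.2, -0.3, 0.8]
X = [[0, 1], [2, 0], [1, 1]]
value = log_posterior(
    Y, X, mu=0.1, beta=[0.2, 0.5], sigma2=1.0,
    log_prior_beta=lambda b: log_normal_pdf(b, 0.0, 0.1),
    log_prior_sigma2=lambda s: log_scaled_inv_chi2_pdf(s, 4.0, 1.0),
)
print(value)
```

A sampler such as Gibbs or Metropolis-Hastings only needs this quantity up to a constant, which is why the normalizing term of the posterior is never computed.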
  18. Introduction / Models for genomic prediction: Which prior to use for $\beta_j$?

    Image: 10.1534/genetics.112.143313
    GBLUP: $\beta_j \sim N(0, \sigma^2_\beta)$
    BayesA: $\beta_j \sim N(0, \sigma^2_{\beta_j})$, $\sigma^2_{\beta_j} \sim \text{Inv-}\chi^2(\nu, S^2)$
    BayesB: $\beta_j \sim N(0, \sigma^2_{\beta_j})$, $\sigma^2_{\beta_j} \sim \pi\,\delta(0) + (1 - \pi)\,\text{Inv-}\chi^2(\nu, S^2)$, $\pi$ fixed
    BayesC: $\beta_j \sim \pi\,\delta(0) + (1 - \pi)\,N(0, \sigma^2_\beta)$, $\sigma^2_\beta \sim \text{Inv-}\chi^2(\nu, S^2)$, $\pi$ fixed
    BayesC$\pi$: BayesC with $\pi \sim \text{Unif}(0, 1)$
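The priors listed above differ mainly in how much shrinkage and sparsity they impose. The sketch below draws a single SNP effect from each of them; the function names and default hyperparameter values are assumptions for illustration only, and the shared variance in GBLUP and BayesC is treated as already drawn.

```python
import random


def draw_scaled_inv_chi2(rng, nu, s2):
    """Draw from a scaled inverse chi-squared(nu, s2): nu * s2 / chi2_nu."""
    return nu * s2 / rng.gammavariate(nu / 2, 2.0)


def draw_beta(rng, prior, sigma2_beta=0.01, nu=4.0, s2=0.01, pi=0.9):
    """Draw one SNP effect from the named prior (hyperparameters are arbitrary)."""
    if prior == "GBLUP":
        # Common variance for all SNPs.
        return rng.gauss(0, sigma2_beta ** 0.5)
    if prior == "BayesA":
        # SNP-specific variance with an inverse chi-squared prior.
        return rng.gauss(0, draw_scaled_inv_chi2(rng, nu, s2) ** 0.5)
    if prior == "BayesB":
        # SNP-specific variance is exactly zero with probability pi.
        if rng.random() < pi:
            return 0.0
        return rng.gauss(0, draw_scaled_inv_chi2(rng, nu, s2) ** 0.5)
    if prior == "BayesC":
        # Effect is exactly zero with probability pi, else common variance.
        if rng.random() < pi:
            return 0.0
        return rng.gauss(0, sigma2_beta ** 0.5)
    raise ValueError(f"unknown prior: {prior}")


rng = random.Random(42)
for name in ("GBLUP", "BayesA", "BayesB", "BayesC"):
    draws = [draw_beta(rng, name) for _ in range(1000)]
    zero_share = sum(d == 0.0 for d in draws) / len(draws)
    print(f"{name}: share of exact zeros = {zero_share:.2f}")
```

Only the spike-and-slab priors (BayesB, BayesC) produce effects that are exactly zero, which corresponds to assuming that most SNPs have no effect on the trait.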
  19. Introduction / Models for genomic prediction: BayesR (Erbe et al., 2012)

    $\pi \sim \text{Dirichlet}(\alpha)$, with $\alpha = (1, 1, 1, 1)$
    Gibbs sampler for estimation
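The mixture itself appears only as an image on the slide above, so the following sketch fills it in from the BayesR paper rather than from this transcript: it assumes four zero-mean normal classes with variances 0, 1e-4, 1e-3 and 1e-2 times the genetic variance. The function names are hypothetical, and the `gibbs_update_pi` step shows only the standard conjugate Dirichlet update for the mixing proportions, not a full sampler.

```python
import random

# Assumed class variances (as multiples of the genetic variance), per BayesR.
VARIANCE_SCALES = (0.0, 1e-4, 1e-3, 1e-2)


def draw_dirichlet(rng, alpha):
    """Draw from Dirichlet(alpha) by normalizing independent gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def draw_bayesr_effects(rng, p, pi, sigma2_g):
    """Assign each of p SNPs to a mixture class and draw its effect."""
    classes = rng.choices(range(len(pi)), weights=pi, k=p)
    effects = [
        0.0 if k == 0 else rng.gauss(0, (VARIANCE_SCALES[k] * sigma2_g) ** 0.5)
        for k in classes
    ]
    return classes, effects


def gibbs_update_pi(rng, classes, alpha=(1, 1, 1, 1)):
    """Conjugate update: pi | classes ~ Dirichlet(alpha + class counts)."""
    counts = [classes.count(k) for k in range(len(alpha))]
    return draw_dirichlet(rng, [a + c for a, c in zip(alpha, counts)])


rng = random.Random(7)
pi = draw_dirichlet(rng, (1, 1, 1, 1))
classes, effects = draw_bayesr_effects(rng, p=1000, pi=pi, sigma2_g=1.0)
print([round(x, 3) for x in gibbs_update_pi(rng, classes)])
```

In a real Gibbs sampler the class memberships, effects, and variances would be updated in turn at every iteration; this sketch isolates the step that involves the Dirichlet prior named on the slide.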
  20. Introduction / Incorporating disjoint annotations: Back to annotations: BayesRC (MacLeod et al., 2016)

    SNPs assigned to disjoint “annotations”; the model is a factorized BayesR
    $\pi_c \sim \text{Dirichlet}(\alpha)$, with $\alpha = (1, 1, 1, 1)$
    Gibbs sampler for estimation
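One way to read "factorized BayesR" is that each annotation category gets its own mixing proportions, updated only from the SNPs it contains. The sketch below shows that per-category Dirichlet update under this assumption; the function names, the category labels, and the toy class assignments are hypothetical.

```python
import random
from collections import Counter


def draw_dirichlet(rng, alpha):
    """Draw from Dirichlet(alpha) by normalizing independent gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def update_pi_per_annotation(rng, annotation, classes, n_classes=4, alpha=1.0):
    """Draw pi_c ~ Dirichlet(alpha + counts_c) separately for each category c.

    annotation[j] is the (single) category of SNP j and classes[j] is its
    current mixture class, so counts are tallied within each category.
    """
    counts = Counter(zip(annotation, classes))
    return {
        c: draw_dirichlet(rng, [alpha + counts[(c, k)] for k in range(n_classes)])
        for c in sorted(set(annotation))
    }


rng = random.Random(3)
annotation = ["coding", "coding", "other", "other", "other"]  # toy labels
classes = [3, 3, 0, 0, 1]  # toy mixture-class assignments
for category, pi_c in update_pi_per_annotation(rng, annotation, classes).items():
    print(category, [round(x, 2) for x in pi_c])
```

Because the categories are disjoint, each SNP contributes to exactly one count vector; handling SNPs that fall into several categories is the problem that BayesRCO addresses.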
  21. BayesRCO models / Overview: From BayesR to BayesRC ... and beyond
  26. BayesRCO models / Model definition: BayesRCO: BayesRC for Overlapping annotations

    Two hypotheses = two models!
    1. Multi-annotations represent added confidence → BayesRC+
    2. Multi-annotations represent uncertainty → BayesRCπ
  27. Simulations / Results: BayesRCπ assigns informative annotations to QTLs

    $h^2 = 0.5$, $k = 1\%$, scenario A
    PAIP = posterior annotation inclusion probability (BayesRCπ output)
  28. Simulations / Results: BayesRC+ assigns more weight to multi-annotated variants

    $h^2 = 0.5$, $k = 1\%$, scenario C
  29. Real data analysis / Description: Application in backcross population of growing pigs

    n = 1297 backcross pigs (3/4 Large-White, 1/4 Creole), born to genetically related sows and sired by 10 boars
    Genotyped with Illumina Porcine 60k BeadChip array
    Sibling-structured 10-fold cross-validation procedure
    Traits pre-corrected for age, sex, farm
    Focus on average daily weight gain (ADG) and backfat thickness (BFT) at 23 weeks
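The talk does not spell out how the sibling-structured folds were built. One plausible reading, sketched below, is that all members of a sibling family are kept in the same fold so that no family is split between training and validation data, with prediction quality then summarized by the correlation between predicted and observed traits. The function names and the toy family layout are assumptions.

```python
import random


def family_kfold(family_ids, k=10, seed=0):
    """Split individuals into k folds, keeping each family in a single fold."""
    rng = random.Random(seed)
    families = sorted(set(family_ids))
    rng.shuffle(families)
    # Deal the shuffled families round-robin across the k folds.
    fold_of_family = {fam: idx % k for idx, fam in enumerate(families)}
    folds = [[] for _ in range(k)]
    for i, fam in enumerate(family_ids):
        folds[fold_of_family[fam]].append(i)
    return folds


def pearson_correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Toy layout: 30 families of 4 siblings each (not the real pig pedigree).
family_ids = [i // 4 for i in range(120)]
folds = family_kfold(family_ids, k=10)
print([len(fold) for fold in folds])
```

Each fold would serve once as validation data while a model is fit on the other nine, and `pearson_correlation` would be applied to the predicted and observed traits of the held-out individuals.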
  30. Real data analysis / Results: Correlation of predicted traits in pig validation data

    Annotations constructed using pigQTLdb for 11 trait sub-hierarchies:
    anatomy, behavioral, blood parameters, conformation, fatness, fatty acid content, feed conversion, growth, immune capacity, litter, reproductive organs
    Nearest up- and downstream neighboring markers also annotated
  32. Wrapping up... / Conclusions: incorporating annotations with BayesRCO

    BayesRCO:
    → BayesRCπ can assign informative annotations to multi-annotated SNPs to account for uncertainty in prior knowledge
    → BayesRC+ upweights multi-annotated SNPs and is robust to various annotation scenarios
    Fairly modest improvements in prediction (∼1-2 points) observed when incorporating biological annotations
    Improved predictions and rankings of large QTLs in simulations, especially for highly informative annotations
    Slight improvement in predictions for some traits in real data
    Strategies for constructing annotation categories impact results
  35. Wrapping up... / Take home messages

    Can genomic prediction models be improved by better accounting for our knowledge about the function of certain regions of the genome? Yes, sometimes.
    Models → BayesRCO for overlapping annotation categories; extensions in progress to handle quantitative annotations
    Genotyping data → Capitalizing on annotation maps likely requires WGS resolution
    Validation data → Greater potential gains when prediction is performed on genetically distant populations
    Traits → Heritability, genetic architecture, link with annotations, ...
    Annotations → Which molecular assays, in which tissues?
  36. Thank you!

    Mollandin et al. (2022) Accounting for overlapping annotations in genomic prediction models of complex traits, BMC Bioinformatics, 23:65.
    https://github.com/FAANG/BayesRCO