ML for Population Genetics

We built an app that could predict the geographic origin of a sea cucumber sample using machine learning. The app could have a huge impact on the mission towards eradicating wildlife exploitation through forensic science.

Shanelle Recheta

March 18, 2021

Transcript

  1. Holothuria scabra
    • “Sandfish”
    • One of the most extensively studied sea cucumbers
      ◦ Extensive distribution throughout the Indo-Pacific
      ◦ High economic value: “beche-de-mer”
      ◦ Can be mass produced in hatcheries
    • IUCN Red List classification: Endangered
    • In the Philippines, trade is regulated by the Department of Agriculture; collecting, trading, or buying undersized sea cucumbers is prohibited and punishable by law
    Photo by Ria Tan from Singapore [CC BY-SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0)]
  2. Population assignment
    • One of the most explored concepts in conservation management and wildlife forensics
    • Answers the investigative question: “Is this sample from a specific population or geographic region?”
    • Can be used to monitor population-specific exploitation
    • Uses multilocus genotypes (e.g. microsatellite data)
    • Based on statistical methods
      ◦ Frequency based - most similar allele frequencies
      ◦ Genetic distance based - closest genetic distance
      ◦ Bayesian based - similar patterns of variation (optimizes Hardy-Weinberg equilibrium (HWE) and loci independence)
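
As a rough illustration of the frequency-based approach listed above, the sketch below assigns a diploid multilocus genotype to the reference population under which it has the highest likelihood, assuming Hardy-Weinberg proportions and independent loci. The function names, the toy reference data, and the small placeholder frequency for unseen alleles are all illustrative assumptions; this is not the GeneClass2 implementation.

```python
import math
from collections import Counter

def allele_frequencies(genotypes):
    """Per-locus allele frequencies from a list of diploid genotypes.

    Each genotype is a list of (allele_a, allele_b) tuples, one per locus.
    """
    n_loci = len(genotypes[0])
    freqs = []
    for locus in range(n_loci):
        counts = Counter()
        for g in genotypes:
            counts.update(g[locus])
        total = sum(counts.values())
        freqs.append({allele: c / total for allele, c in counts.items()})
    return freqs

def log_likelihood(genotype, freqs, unseen=0.01):
    """Log-likelihood of one genotype under HWE with independent loci.

    `unseen` is an assumed placeholder frequency for alleles that are
    absent from the reference population.
    """
    ll = 0.0
    for (a, b), locus_freqs in zip(genotype, freqs):
        p = locus_freqs.get(a, unseen)
        q = locus_freqs.get(b, unseen)
        # Homozygote probability is p^2; heterozygote probability is 2pq.
        ll += math.log(p * p if a == b else 2 * p * q)
    return ll

def assign(genotype, reference):
    """Return the reference population with the highest log-likelihood."""
    scores = {pop: log_likelihood(genotype, allele_frequencies(gs))
              for pop, gs in reference.items()}
    return max(scores, key=scores.get)

# Toy reference data: two hypothetical populations, two loci each.
reference = {
    "pop_A": [[(100, 100), (200, 204)], [(100, 102), (200, 200)]],
    "pop_B": [[(104, 104), (208, 208)], [(104, 102), (208, 204)]],
}
sample = [(100, 100), (200, 200)]
print(assign(sample, reference))
```

With this toy data the sample carries only alleles common in `pop_A`, so that population receives the higher likelihood.
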
  3. Population genetics and conservation
    • We expand towards population assignment, i.e. using available genetic information, we assign individual sandfish to their respective source populations
      ◦ Implications: conservation management, wildlife forensics
      ◦ Assignment tests may be used to monitor population-specific exploitation
    Photo by Brian Jones, courtesy of Blue Ventures
  4. Machine learning: a new paradigm in population genetics
    • New studies have hinted at the use of supervised machine learning in different fields of population genetics (Schrider & Kern, 2018)
    • Here, we explore whether machine learning is applicable to genetic population assignment
  5. Methods
    Data description
    • Dataset consists of 565 Holothuria scabra samples from 15 sites, genotyped at 13 microsatellite loci
    Quality control
    • Samples with 30% missing data were removed from the dataset and excluded from further analysis (528 samples remaining)
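
The quality-control step above can be sketched as a simple filter on the fraction of missing allele calls per sample. The sample names and genotypes below are made up, missing calls are assumed to be coded as `None`, and treating the 30% cut-off as inclusive is an assumption, since the slide does not say which side of the threshold is dropped.

```python
def missing_fraction(genotype):
    """Fraction of allele calls that are missing (None) in one sample."""
    return sum(allele is None for allele in genotype) / len(genotype)

def filter_samples(samples, max_missing=0.30):
    """Keep samples whose missing fraction is below max_missing.

    Assumption: samples at or above the threshold are dropped; the slide
    does not say whether the 30% cut-off is inclusive.
    """
    return {name: g for name, g in samples.items()
            if missing_fraction(g) < max_missing}

# Hypothetical samples: 10 allele calls each (5 loci x 2 alleles).
samples = {
    "s1": [100, 102, 200, 200, 150, 152, 98, 98, 120, 124],
    "s2": [100, None, None, 200, None, 152, 98, None, 120, 124],  # 40% missing
    "s3": [None, 102, 200, 200, 150, 152, 98, 98, 120, 124],      # 10% missing
}
kept = filter_samples(samples)
print(sorted(kept))
```
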
  6. Methods
    Data analysis (Population structure)
    • GenAlEx version 6.503: allele frequencies and principal coordinates analysis
    • Arlequin version 3.5.2.2: F_ST calculation and tests for Hardy-Weinberg equilibrium
    Population genetic assignment
    • GeneClass2 version 2.0: traditional population assignment using frequency, genetic distance, and Bayesian based methods
  7. Methods
    Machine learning
    • Prior to testing the machine learning algorithms, missing alleles were imputed using several methods
      ◦ Manual imputation using MicroDrop version 1.1
      ◦ K nearest neighbors (kNN)
      ◦ Missing forest
    • To check whether reducing the number of loci affects assignment accuracy, 2 loci (Hsc11 and Hsc31) were removed from the analysis
    • We also reduced the number of samples to see whether this affects accuracy
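
To make the kNN imputation idea concrete, here is a minimal sketch in plain Python. The deck does not state which implementation, distance measure, or value of k was used, so everything here is an assumption: missing calls are coded as `None`, distance is the mean mismatch over positions observed in both samples, and each gap is filled with the most common call among the k nearest donors.

```python
from collections import Counter

def distance(a, b):
    """Mean mismatch over positions observed in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sum(x != y for x, y in shared) / len(shared)

def knn_impute(data, k=3):
    """Fill each missing call with the most common call among the k
    nearest samples that have that position observed."""
    imputed = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples with position j observed.
            donors = [(distance(row, other), idx)
                      for idx, other in enumerate(data)
                      if idx != i and other[j] is not None]
            donors.sort()
            nearest = [data[idx][j] for _, idx in donors[:k]]
            if nearest:
                imputed[i][j] = Counter(nearest).most_common(1)[0][0]
    return imputed

# Toy data: 4 hypothetical samples, 2 loci x 2 alleles, one missing call.
data = [
    [100, 100, 200, 204],
    [100, 100, 200, None],
    [100, 102, 200, 204],
    [104, 104, 208, 208],
]
print(knn_impute(data, k=2)[1])
```

Using the mode rather than a mean keeps imputed values within the set of observed allele sizes, which seems a reasonable choice for microsatellite calls.
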
  8. Results and discussion
    Genetic structure
    • Calculated global F_ST is 0.026
    • Populations may be grouped into 5 metapopulations based on population structure

    | Group   | Population(s)                     |
    |---------|-----------------------------------|
    | Group 1 | STA, SOR, GUI                     |
    | Group 2 | MAS, ROM, CON, TIG, CEB, BOH, DUM |
    | Group 3 | SAM, GEN, TWI                     |
    | Group 4 | COR                               |
    | Group 5 | ELN                               |
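
The grouping above amounts to relabelling each sample's site code with its metapopulation before running an assignment method. A minimal sketch follows; the site codes and groups come from the slide, while the variable and function names are illustrative.

```python
# Site codes per metapopulation, as listed on the slide.
GROUPS = {
    "Group 1": ["STA", "SOR", "GUI"],
    "Group 2": ["MAS", "ROM", "CON", "TIG", "CEB", "BOH", "DUM"],
    "Group 3": ["SAM", "GEN", "TWI"],
    "Group 4": ["COR"],
    "Group 5": ["ELN"],
}

# Invert the mapping: site code -> group label.
SITE_TO_GROUP = {site: group
                 for group, sites in GROUPS.items()
                 for site in sites}

def relabel(site_labels):
    """Replace 15-class site labels with 5-class group labels."""
    return [SITE_TO_GROUP[site] for site in site_labels]

print(relabel(["STA", "COR", "BOH"]))
```
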
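  9. Results and discussion
    Assignment accuracies of traditional methods

    |                | Frequency | Genetic distance (Nei’s DA) | Bayesian (unsupervised) |
    |----------------|-----------|-----------------------------|-------------------------|
    | 15 populations | 0.3598    | 0.3617                      | 0.3068                  |
    | 5 populations  | 0.5800    | 0.6023                      | 0.6190                  |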
  10. Results and discussion
    Populations = 15

    | Dataset | Logistic Regression | Decision Tree | Support Vector Classifier | Naïve Bayes | kNN | Multilayer Perceptron | Random Forest with GridSearchCV | TPOT |
    |---|---|---|---|---|---|---|---|---|
    | Original dataset | 0.1604 | 0.1604 | 0.2547 | 0.0849 | 0.1792 | 0.1415 | 0.2830 | 0.2830 |
    | With manual imputation | 0.1604 | 0.1981 | 0.1604 | 0.1038 | 0.0943 | 0.1981 | 0.3302 | 0.3113 |
    | With kNN imputation | 0.1792 | 0.1698 | 0.1981 | 0.0849 | 0.0943 | 0.1887 | 0.3019 | 0.2736 |
    | With missing forest imputation | 0.1792 | 0.1887 | 0.1698 | 0.0849 | 0.1038 | 0.2358 | 0.2736 | 0.2170 |
    | Removed Hsc11 and Hsc31 | 0.1509 | 0.1226 | 0.1981 | 0.0755 | 0.1792 | 0.1226 | 0.2358 | 0.1604* |
    | Reduced markers; threshold = 20 | 0.1250 | 0.0577 | 0.1827 | 0.1154 | 0.1250 | 0.0763 | 0.2019 | 0.1827* |
    | Reduced markers; threshold = 22 | 0.1250 | 0.1354 | 0.1458 | 0.0938 | 0.1458 | 0.0833 | 0.2604 | 0.2083* |
    | Reduced markers; threshold = 24 | 0.1029 | 0.1176 | 0.1176 | 0.1176 | 0.2059 | 0.0441 | 0.1765 | 0.1912* |
  11. Results and discussion
    Populations = 5

    | Dataset | Logistic regression | Decision tree | Support vector classifier | Naïve Bayes | kNN | Multilayer perceptron | Random forest with GridSearchCV | TPOT |
    |---|---|---|---|---|---|---|---|---|
    | Original dataset | 0.3962 | 0.3868 | 0.4434 | 0.1604 | 0.3868 | 0.3962 | 0.5189 | 0.5943 |
    | With manual imputation | 0.3868 | 0.3868 | 0.4245 | 0.3208 | 0.4443 | 0.4340 | 0.5472 | 0.5849 |
    | With kNN imputation | 0.4151 | 0.3491 | 0.4245 | 0.3302 | 0.3962 | 0.4906 | 0.5755 | 0.6226 |
    | With missing forest imputation | 0.4245 | 0.4151 | 0.4245 | 0.3396 | 0.4057 | 0.4623 | 0.5566 | 0.5472 |
    | Removed Hsc11 and Hsc31 | 0.3774 | 0.3962 | 0.4340 | 0.2642 | 0.3879 | 0.3491 | 0.5660 | 0.5660 |
  12. Results and discussion
    • Assignment using traditional statistical methods did not perform significantly better than in other studies
      ◦ Larrain et al. (2014): 50% accuracy with F_ST = 0.042
    • Larrain et al. also highlighted that better results are found in farmed organisms, owing to reported differences in allele frequencies between farms as a result of artificial selection and reduced gene flow
      ◦ Since the sandfish in the study were sampled from natural populations, this may partially explain the low assignment accuracies
  13. Results and discussion
    • Assignment performance of machine learning algorithms was not better than that of traditional statistical methods at 15 populations and did not differ significantly at 5 populations
    • Combining related groups improved the accuracy of both traditional and machine learning methods
      ◦ Reduced complexity, increased sample size
      ◦ Increased prior likelihood of assigning samples correctly (from 1/15 to 1/5)
    • Excluding the traditional methods, the model with the highest accuracy on the original dataset was TPOT at 59.43%
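
The change in prior likelihood noted above (1/15 to 1/5) can be made explicit by comparing each reported accuracy with the accuracy expected from uniform random guessing. This is only a small arithmetic sketch: the accuracies are taken from the result tables in this deck, while the helper names and the assumption of uniform guessing over equally likely populations are illustrative.

```python
def chance_accuracy(n_populations):
    """Expected accuracy of uniform random guessing (assumed baseline)."""
    return 1 / n_populations

def lift_over_chance(accuracy, n_populations):
    """How many times better than the chance baseline an accuracy is."""
    return accuracy / chance_accuracy(n_populations)

# Accuracies reported in the deck for the original dataset.
results = {
    "Random forest with GridSearchCV, 15 populations": (0.2830, 15),
    "TPOT, 5 populations": (0.5943, 5),
}
for label, (accuracy, n_populations) in results.items():
    lift = lift_over_chance(accuracy, n_populations)
    print(f"{label}: {lift:.2f}x chance")
```

Reading accuracies against this baseline helps separate the gain that comes from having fewer classes from any gain in discriminating power.
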
  14. Conclusion
    • Population genetic assignment is an important tool in conservation management and wildlife forensics
    • Organisms with very low F_ST (less than 0.1) are more challenging to assign
    • Holothuria scabra in the Philippines has a very low global F_ST (0.026), which resulted in poor assignment power using both traditional and machine learning models
    • Reducing the complexity of the data resulted in better accuracies
    • Reducing the number of samples and loci resulted in lower accuracies
    • Assignment power may be improved by increasing the number of samples, the number of loci, or both