ML for Population Genetics

Genetic population assignment of Holothuria scabra using supervised machine learning
Carmen | Recheta | Soliven

Holothuria scabra
• “Sandfish”
• One of the most extensively studied sea cucumbers
  ◦ Extensive distribution throughout the Indo-Pacific
  ◦ High economic value: “beche-de-mer”
  ◦ Can be mass produced in hatcheries
• IUCN Red List classification: endangered
• In the Philippines, trade is regulated by the Department of Agriculture; collecting, trading, or buying undersized sea cucumbers is prohibited and punishable by law
Photo by Ria Tan from Singapore [CC BY-SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0)]

Population assignment
• One of the most explored concepts in conservation management and wildlife forensics
• Answers the investigative question: “Is this sample from a specific population or geographic region?”
• Can be used to monitor population-specific exploitation
• Uses multilocus genotypes (e.g. microsatellite data)
• Based on statistical methods
  ◦ Frequency-based: most similar allele frequencies
  ◦ Genetic distance-based: closest genetic distance
  ◦ Bayesian-based: similar patterns of variation (optimizes HWE and loci independence)
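The frequency-based idea can be sketched in a few lines. This is a minimal illustration, not the GeneClass2 implementation: it assumes Hardy-Weinberg proportions and independent loci, and the population names, locus names, allele sizes, and the stand-in frequency `eps` for unseen alleles are all made up for the example.

```python
import math

def genotype_log_likelihood(genotype, allele_freqs, eps=0.01):
    """Log-likelihood of a multilocus genotype in one population.

    genotype:     {locus: (allele_a, allele_b)}
    allele_freqs: {locus: {allele: frequency}}
    eps:          assumed stand-in frequency for alleles unseen in the population
    """
    total = 0.0
    for locus, (a, b) in genotype.items():
        freqs = allele_freqs[locus]
        p = freqs.get(a, eps)
        q = freqs.get(b, eps)
        # Hardy-Weinberg: p^2 for homozygotes, 2pq for heterozygotes
        prob = p * p if a == b else 2 * p * q
        # Independent loci: per-locus log-probabilities add up
        total += math.log(prob)
    return total

def assign(genotype, populations):
    """Return the population with the highest log-likelihood, plus all scores."""
    scores = {name: genotype_log_likelihood(genotype, freqs)
              for name, freqs in populations.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical allele frequencies for two populations at two loci
populations = {
    "pop_A": {"locus1": {100: 0.7, 102: 0.3}, "locus2": {200: 0.5, 204: 0.5}},
    "pop_B": {"locus1": {100: 0.2, 102: 0.8}, "locus2": {200: 0.9, 204: 0.1}},
}
sample = {"locus1": (100, 100), "locus2": (200, 204)}

best, scores = assign(sample, populations)
print(best, scores)
```

When allele frequencies barely differ between populations, the scores end up close together and the assignment becomes unreliable.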

Population genetics and conservation
• We extend this work to population assignment: using available genetic information, we assign individual sandfish to their respective source populations
  ◦ Implications: conservation management, wildlife forensics
  ◦ Assignment tests may be used to monitor population-specific exploitation
Photo by Brian Jones, courtesy of Blue Ventures

Machine learning: a new paradigm in population genetics
• Recent studies have hinted at the use of supervised machine learning in different fields of population genetics (Schrider & Kern, 2018)
• Here, we explore whether machine learning is applicable to genetic population assignment

Objective
• To apply machine learning to microsatellite data for genetic population assignment

Methods
Data description
• Dataset consists of 565 Holothuria scabra samples from 15 sites, genotyped at 13 microsatellite loci
Quality control
• Samples with 30% missing data were removed from the dataset and excluded from further analysis (528 samples remaining)
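The quality-control step can be sketched as a simple filter. This is a minimal illustration under stated assumptions: missing alleles are coded as `0`, the 30% cutoff is treated as inclusive (a sample with exactly 30% missing is dropped), and the sample IDs and allele sizes are invented for the example.

```python
MISSING = 0  # assumed code for a missing allele call

def missing_fraction(alleles):
    """Fraction of allele calls that are missing for one sample."""
    return sum(1 for a in alleles if a == MISSING) / len(alleles)

def filter_samples(samples, threshold=0.30):
    """Keep only samples whose missing fraction is below the threshold."""
    return {sid: alleles for sid, alleles in samples.items()
            if missing_fraction(alleles) < threshold}

# Hypothetical genotypes: 5 loci x 2 alleles = 10 allele calls per sample
samples = {
    "s1": [100, 102, 200, 204, 150, 150, 98, 98, 120, 122],  # 0% missing
    "s2": [100, 0, 200, 0, 150, 0, 98, 98, 120, 122],        # 30% missing
    "s3": [0, 0, 200, 204, 0, 0, 98, 98, 120, 122],          # 40% missing
    "s4": [100, 102, 0, 204, 150, 150, 98, 98, 120, 122],    # 10% missing
}

kept = filter_samples(samples)
print(sorted(kept))
```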

Methods
Data analysis (population structure)
• GenAlEx version 6.503: allele frequencies and principal coordinates analysis
• Arlequin version 3.5.2.2: F_ST calculation and tests for Hardy-Weinberg equilibrium
Population genetic assignment
• GeneClass2 version 2.0: traditional population assignment using frequency-, genetic distance-, and Bayesian-based methods

Methods
Machine learning
• Before testing the machine learning algorithms, missing alleles were imputed using several methods
  ◦ Manual imputation using MicroDrop version 1.1
  ◦ K nearest neighbors (kNN)
  ◦ Missing forest
• To check whether reducing the number of loci affects assignment accuracy, 2 loci (Hsc11 and Hsc31) were removed from the analysis
• We also reduced the number of samples to see whether this affects accuracy
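The kNN imputation idea can be sketched as follows. This is a hedged, pure-Python illustration and not the procedure used in the study: it assumes missing alleles are coded as `None`, measures distance as the fraction of mismatched alleles over positions observed in both samples, and fills each gap with the most common allele among the `k` nearest donors. The genotypes are invented.

```python
from collections import Counter

MISSING = None  # assumed code for a missing allele call

def distance(a, b):
    """Fraction of mismatched alleles over positions observed in both samples."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return float("inf")
    return sum(x != y for x, y in shared) / len(shared)

def knn_impute(rows, k=2):
    """Return a copy of rows with each missing allele filled from its neighbors."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        # Rank all other samples from nearest to farthest
        neighbors = sorted((j for j in range(len(rows)) if j != i),
                           key=lambda j: distance(row, rows[j]))
        for col, value in enumerate(row):
            if value is not MISSING:
                continue
            # Take the k nearest samples that have this allele observed
            donors = [rows[j][col] for j in neighbors
                      if rows[j][col] is not MISSING][:k]
            if donors:
                imputed[i][col] = Counter(donors).most_common(1)[0][0]
    return imputed

# Hypothetical allele calls (2 loci x 2 alleles per sample)
rows = [
    [100, 102, 200, 204],
    [100, 102, 200, None],
    [100, 102, 200, 204],
    [110, 112, 210, 214],
]

filled = knn_impute(rows, k=2)
print(filled[1])
```

Voting on the most common allele, rather than averaging, keeps imputed values within the set of observed allele sizes.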

Results and discussion
Genetic structure
• Calculated global F_ST is 0.026
• Populations may be grouped into 5 metapopulations based on population structure

Group    Population(s)
Group 1  STA, SOR, GUI
Group 2  MAS, ROM, CON, TIG, CEB, BOH, DUM
Group 3  SAM, GEN, TWI
Group 4  COR
Group 5  ELN
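Relabeling samples from 15 sites to the 5 groups can be sketched as a lookup. The site codes and groupings come from the slide; the helper name `relabel` and the example labels are illustrative only.

```python
# Site-to-group mapping taken from the grouping on the slide
GROUPS = {
    "Group 1": ["STA", "SOR", "GUI"],
    "Group 2": ["MAS", "ROM", "CON", "TIG", "CEB", "BOH", "DUM"],
    "Group 3": ["SAM", "GEN", "TWI"],
    "Group 4": ["COR"],
    "Group 5": ["ELN"],
}

# Invert to a flat lookup: site code -> group name
SITE_TO_GROUP = {site: group
                 for group, sites in GROUPS.items()
                 for site in sites}

def relabel(site_labels):
    """Map per-sample site labels to their metapopulation group."""
    return [SITE_TO_GROUP[site] for site in site_labels]

# Hypothetical per-sample site labels
labels = ["STA", "CEB", "ELN", "TWI"]
print(relabel(labels))
```

The groups are unequal in size (Group 2 covers 7 of the 15 sites), which is worth keeping in mind when reading the 5-population accuracies.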

Results and discussion
Assignment accuracies of traditional methods

                Frequency  Genetic distance (Nei’s DA)  Bayesian (unsupervised)
15 populations  0.3598     0.3617                       0.3068
5 populations   0.5800     0.6023                       0.6190

Results and discussion
Populations = 15

Dataset                          LR      DT      SVC     NB      kNN     MLP     RF+GS   TPOT
Original dataset                 0.1604  0.1604  0.2547  0.0849  0.1792  0.1415  0.2830  0.2830
With manual imputation           0.1604  0.1981  0.1604  0.1038  0.0943  0.1981  0.3302  0.3113
With kNN imputation              0.1792  0.1698  0.1981  0.0849  0.0943  0.1887  0.3019  0.2736
With missing forest imputation   0.1792  0.1887  0.1698  0.0849  0.1038  0.2358  0.2736  0.2170
Removed Hsc11 and Hsc31          0.1509  0.1226  0.1981  0.0755  0.1792  0.1226  0.2358  0.1604*
Reduced markers; threshold = 20  0.1250  0.0577  0.1827  0.1154  0.1250  0.0763  0.2019  0.1827*
Reduced markers; threshold = 22  0.1250  0.1354  0.1458  0.0938  0.1458  0.0833  0.2604  0.2083*
Reduced markers; threshold = 24  0.1029  0.1176  0.1176  0.1176  0.2059  0.0441  0.1765  0.1912*

LR = logistic regression; DT = decision tree; SVC = support vector classifier; NB = Naïve Bayes; kNN = k nearest neighbors; MLP = multilayer perceptron; RF+GS = random forest with GridSearchCV

Results and discussion
Populations = 5

Dataset                          LR      DT      SVC     NB      kNN     MLP     RF+GS   TPOT
Original dataset                 0.3962  0.3868  0.4434  0.1604  0.3868  0.3962  0.5189  0.5943
With manual imputation           0.3868  0.3868  0.4245  0.3208  0.4443  0.4340  0.5472  0.5849
With kNN imputation              0.4151  0.3491  0.4245  0.3302  0.3962  0.4906  0.5755  0.6226
With missing forest imputation   0.4245  0.4151  0.4245  0.3396  0.4057  0.4623  0.5566  0.5472
Removed Hsc11 and Hsc31          0.3774  0.3962  0.4340  0.2642  0.3879  0.3491  0.5660  0.5660

LR = logistic regression; DT = decision tree; SVC = support vector classifier; NB = Naïve Bayes; kNN = k nearest neighbors; MLP = multilayer perceptron; RF+GS = random forest with GridSearchCV

Results and discussion
• Assignment using traditional statistical methods did not perform significantly better than in other studies
  ◦ Larrain et al. (2014): 50% accuracy with F_ST = 0.042
• Larrain et al. also highlighted that better results are found in farmed organisms, due to reported differences in allele frequencies between farms as a result of artificial selection and reduced gene flow
  ◦ Since the sandfish in this study were sampled from natural populations, this may partially explain the low assignment accuracies

Results and discussion
• Assignment performance of machine learning algorithms was not better than traditional statistical methods at 15 populations and did not differ significantly at 5 populations
• Combining related groups improved the accuracy of both traditional and machine learning methods
  ◦ Reduced complexity, increased sample size per group
  ◦ Increased prior probability of assigning a sample correctly (from 1/15 to 1/5)
• Excluding the traditional methods, the model with the highest accuracy on the original dataset was TPOT at 59.43%
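To put the change in chance level (1/15 vs 1/5) next to the reported accuracies, a small calculation helps. This sketch assumes a uniform chance baseline of 1/n, which ignores unequal group sizes; the accuracy values are copied from the result tables, and the helper name `lift_over_chance` is illustrative.

```python
def lift_over_chance(accuracy, n_classes):
    """Ratio of observed accuracy to a uniform 1/n chance baseline (assumed)."""
    return accuracy / (1.0 / n_classes)

# Accuracies taken from the result tables
reported = {
    ("traditional, genetic distance", 15): 0.3617,
    ("random forest, manual imputation", 15): 0.3302,
    ("traditional, Bayesian", 5): 0.6190,
    ("TPOT, original dataset", 5): 0.5943,
}

for (method, n), acc in reported.items():
    lift = lift_over_chance(acc, n)
    print(f"{method:34s} n={n:2d}  acc={acc:.4f}  chance={1 / n:.4f}  lift={lift:.2f}x")
```

Under this uniform-chance assumption, the 15-population accuracies sit further above chance in relative terms, even though the absolute accuracies are lower.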

Conclusion
• Population genetic assignment is an important tool in conservation management and wildlife forensics
• Organisms with very low F_ST (less than 0.1) are more challenging to assign
• Holothuria scabra in the Philippines has a very low global F_ST (0.026), which resulted in poor assignment power with both traditional and machine learning models
• Reducing the complexity of the data resulted in better accuracies
• Reducing the number of samples and loci resulted in lower accuracies
• Assignment power may be improved by increasing the number of samples, the number of loci, or both

Don’t live in the Stone Age. The future is AI

Thank you


Shanelle Recheta
