ML for Population Genetics

by Shanelle Recheta

Slide 1

Slide 1 text

Genetic population assignment of Holothuria scabra using supervised machine learning
Carmen | Recheta | Soliven

Slide 2

Slide 2 text

Holothuria scabra
● “Sandfish”
● One of the most extensively studied sea cucumbers
  ○ Extensive distribution throughout the Indo-Pacific
  ○ High economic value: “beche-de-mer”
  ○ Can be mass produced in hatcheries
● IUCN Red List classification: Endangered
● In the Philippines, trade is regulated by the Department of Agriculture; collecting, trading, or buying undersized sea cucumbers is prohibited and punishable by law
Photo by Ria Tan from Singapore [CC BY-SA 2.0 (https://creativecommons.org/licenses/by-sa/2.0)]

Slide 3

Slide 3 text

Population assignment
● One of the most explored concepts in conservation management and wildlife forensics
● Answers the investigative question: “Is this sample from a specific population or geographic region?”
● Can be used to monitor population-specific exploitation
● Uses multilocus genotypes (e.g. microsatellite data)
● Based on statistical methods
  ○ Frequency-based: most similar allele frequencies
  ○ Genetic distance-based: closest genetic distance
  ○ Bayesian-based: similar patterns of variation (optimizes HWE and loci independence)
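The frequency-based idea listed above can be illustrated with a minimal sketch. It is not the GeneClass2 implementation: the function names, the toy genotypes, and the floor frequency used for alleles unseen in a reference population are all assumptions made for illustration. Each candidate population is scored by the log-likelihood of the query genotype under Hardy-Weinberg proportions, and the sample is assigned to the best-scoring population.

```python
import math
from collections import Counter

def allele_counts(genotypes):
    """Count alleles per locus; each genotype is a list of (a1, a2) tuples."""
    counts = [Counter() for _ in genotypes[0]]
    for genotype in genotypes:
        for locus, (a1, a2) in enumerate(genotype):
            counts[locus].update([a1, a2])
    return counts

def log_likelihood(genotype, counts, floor=0.01):
    """Log-likelihood of a genotype under Hardy-Weinberg proportions.

    Alleles never seen in the population get the frequency `floor`
    (an illustrative assumption, so the likelihood never becomes zero).
    """
    total = 0.0
    for (a1, a2), locus_counts in zip(genotype, counts):
        n = sum(locus_counts.values())
        p = max(locus_counts[a1] / n, floor)
        q = max(locus_counts[a2] / n, floor)
        prob = p * p if a1 == a2 else 2 * p * q
        total += math.log(prob)
    return total

def assign(genotype, reference):
    """Return the reference population with the highest log-likelihood."""
    scores = {pop: log_likelihood(genotype, allele_counts(samples))
              for pop, samples in reference.items()}
    return max(scores, key=scores.get)

# Toy reference data: two hypothetical populations, two loci each.
reference = {
    "A": [[(100, 100), (200, 202)], [(100, 102), (200, 200)], [(100, 100), (200, 200)]],
    "B": [[(104, 104), (206, 206)], [(104, 102), (206, 204)], [(104, 104), (204, 206)]],
}
query_a = [(100, 100), (200, 200)]
query_b = [(104, 104), (206, 204)]
print(assign(query_a, reference), assign(query_b, reference))
```

With very low differentiation between populations, allele frequencies are nearly identical across candidates, so the log-likelihoods are close and assignments become unreliable.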

Slide 4

Slide 4 text

Population genetics and conservation
● We extend the analysis to population assignment, i.e. using available genetic information to assign individual sandfish to their respective source populations
  ○ Implications: conservation management, wildlife forensics
  ○ Assignment tests may be used to monitor population-specific exploitation
Photo by Brian Jones, courtesy of Blue Ventures

Slide 5

Slide 5 text

Machine learning: a new paradigm in population genetics
● Recent studies have hinted at the use of supervised machine learning in different areas of population genetics (Schrider & Kern, 2018)
● Here, we explore whether machine learning is applicable to genetic population assignment

Slide 6

Slide 6 text

Objective
● To apply machine learning to microsatellite data for genetic population assignment

Slide 7

Slide 7 text

Methods
Data description
● Dataset consists of 565 Holothuria scabra samples from 15 sites, genotyped at 13 microsatellite loci
Quality control
● Samples with 30% missing data were removed from the dataset and excluded from further analysis (528 samples remaining)
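The quality-control step above can be sketched as follows. This is an illustration only: the slide does not say whether the 30% cutoff is inclusive, so the sketch assumes samples with 30% or more missing alleles are dropped. The data layout (one list of 26 allele calls per sample, `None` for a missing call) and the names `missing_fraction` and `filter_samples` are hypothetical.

```python
MISSING = None  # placeholder for a missing allele call (assumption)

def missing_fraction(alleles):
    """Fraction of allele calls that are missing for one sample."""
    return sum(a is MISSING for a in alleles) / len(alleles)

def filter_samples(samples, max_missing=0.30):
    """Keep samples whose missing fraction is strictly below max_missing."""
    return {sample_id: alleles
            for sample_id, alleles in samples.items()
            if missing_fraction(alleles) < max_missing}

# Toy data: 13 loci x 2 alleles = 26 calls per sample.
samples = {
    "s1": [100] * 26,                  # no missing data, kept
    "s2": [MISSING] * 8 + [100] * 18,  # 8/26 is about 31% missing, removed
    "s3": [MISSING] * 6 + [100] * 20,  # 6/26 is about 23% missing, kept
}
kept = filter_samples(samples)
print(sorted(kept))
```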

Slide 8

Slide 8 text

Methods
Data analysis (Population structure)
● GenAlEx version 6.503: allele frequencies and principal coordinates analysis
● Arlequin version 3.5.2.2: F_ST calculation and tests for Hardy-Weinberg equilibrium
Population genetic assignment
● GeneClass2 version 2.0: traditional population assignment using frequency, genetic distance, and Bayesian-based methods

Slide 9

Slide 9 text

Methods
Machine learning
● Prior to testing the machine learning algorithms, missing alleles were imputed using several methods
  ○ Manual imputation using MicroDrop version 1.1
  ○ K nearest neighbors (kNN)
  ○ Missing forest
● To check whether reducing the number of loci affects assignment accuracy, 2 loci (Hsc11 and Hsc31) were removed from the analysis
● We also reduced the number of samples to see whether this affects accuracy
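As an illustration of the kNN imputation option listed above, the sketch below fills each missing allele with the most common value among the k most similar samples. The slides do not describe the actual implementation, so the distance measure (fraction of mismatching alleles over positions observed in both samples), the use of the mode, and the names `distance` and `knn_impute` are assumptions.

```python
from collections import Counter

def distance(a, b):
    """Fraction of mismatching alleles over positions observed in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sum(x != y for x, y in shared) / len(shared)

def knn_impute(samples, k=3):
    """Fill missing alleles (None) with the mode of the k nearest donors."""
    imputed = []
    for i, row in enumerate(samples):
        new_row = list(row)
        # Rank all other samples from most to least similar.
        neighbours = sorted((j for j in range(len(samples)) if j != i),
                            key=lambda j: distance(row, samples[j]))
        for pos, value in enumerate(row):
            if value is None:
                # Take the k nearest samples that have a call at this position.
                donors = [samples[j][pos] for j in neighbours
                          if samples[j][pos] is not None][:k]
                if donors:
                    new_row[pos] = Counter(donors).most_common(1)[0][0]
        imputed.append(new_row)
    return imputed

# Toy data: allele sizes at two loci (two columns per locus).
samples = [
    [100, 102, 200, 200],
    [100, 102, 200, None],  # one missing allele
    [100, 102, 200, 200],
    [104, 104, 206, 206],
    [104, 104, 206, 206],
]
filled = knn_impute(samples, k=2)
print(filled[1])
```

Taking the mode keeps imputed values on the set of observed allele sizes. An imputer that averages neighbour values, such as scikit-learn's `KNNImputer`, can return sizes that match no real allele.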

Slide 10

Slide 10 text

Results and discussion
Genetic structure
● Calculated global F_ST is 0.026
● Populations may be grouped into 5 metapopulations based on population structure

Group | Population(s)
Group 1 | STA, SOR, GUI
Group 2 | MAS, ROM, CON, TIG, CEB, BOH, DUM
Group 3 | SAM, GEN, TWI
Group 4 | COR
Group 5 | ELN
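For readers less familiar with the statistic, one common textbook definition of F_ST is given here as a reading aid. The slides do not state which estimator Arlequin applied, so this may differ from the computation behind the reported value. The symbols H_T (expected heterozygosity of the pooled total population) and H_S (mean expected heterozygosity within subpopulations) are introduced for this illustration.

```latex
F_{ST} = \frac{H_T - H_S}{H_T}
```

Under this definition, a global F_ST of 0.026 means that roughly 2.6% of the total expected heterozygosity is attributable to differences among populations, which leaves little signal for telling source populations apart.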

Slide 11

Slide 11 text

Results and discussion
Assignment accuracies of traditional methods

Populations | Frequency | Genetic distance (Nei’s DA) | Bayesian (unsupervised)
15 populations | 0.3598 | 0.3617 | 0.3068
5 populations | 0.5800 | 0.6023 | 0.6190

Slide 12

Slide 12 text

Results and discussion
Populations = 15

Dataset | Logistic Regression | Decision Tree | Support Vector Classifier | Naïve Bayes | kNN | Multilayer Perceptron | Random Forest with GridSearchCV | TPOT
Original dataset | 0.1604 | 0.1604 | 0.2547 | 0.0849 | 0.1792 | 0.1415 | 0.2830 | 0.2830
With manual imputation | 0.1604 | 0.1981 | 0.1604 | 0.1038 | 0.0943 | 0.1981 | 0.3302 | 0.3113
With kNN imputation | 0.1792 | 0.1698 | 0.1981 | 0.0849 | 0.0943 | 0.1887 | 0.3019 | 0.2736
With missing forest imputation | 0.1792 | 0.1887 | 0.1698 | 0.0849 | 0.1038 | 0.2358 | 0.2736 | 0.2170
Removed Hsc11 and Hsc31 | 0.1509 | 0.1226 | 0.1981 | 0.0755 | 0.1792 | 0.1226 | 0.2358 | 0.1604*
Reduced markers; threshold = 20 | 0.1250 | 0.0577 | 0.1827 | 0.1154 | 0.1250 | 0.0763 | 0.2019 | 0.1827*
Reduced markers; threshold = 22 | 0.1250 | 0.1354 | 0.1458 | 0.0938 | 0.1458 | 0.0833 | 0.2604 | 0.2083*
Reduced markers; threshold = 24 | 0.1029 | 0.1176 | 0.1176 | 0.1176 | 0.2059 | 0.0441 | 0.1765 | 0.1912*

Slide 13

Slide 13 text

Results and discussion
Populations = 5

Dataset | Logistic regression | Decision tree | Support vector classifier | Naïve Bayes | kNN | Multilayer perceptron | Random forest with GridSearchCV | TPOT
Original dataset | 0.3962 | 0.3868 | 0.4434 | 0.1604 | 0.3868 | 0.3962 | 0.5189 | 0.5943
With manual imputation | 0.3868 | 0.3868 | 0.4245 | 0.3208 | 0.4443 | 0.4340 | 0.5472 | 0.5849
With kNN imputation | 0.4151 | 0.3491 | 0.4245 | 0.3302 | 0.3962 | 0.4906 | 0.5755 | 0.6226
With missing forest imputation | 0.4245 | 0.4151 | 0.4245 | 0.3396 | 0.4057 | 0.4623 | 0.5566 | 0.5472
Removed Hsc11 and Hsc31 | 0.3774 | 0.3962 | 0.4340 | 0.2642 | 0.3879 | 0.3491 | 0.5660 | 0.5660
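To make the "random forest with GridSearchCV" column concrete, here is a hedged sketch of how such a model might be tuned and scored with scikit-learn. The genotypes are synthetic, and the feature encoding (two allele-size columns per locus), the parameter grid, the 80/20 split, and the 3-fold cross-validation are all assumptions; the slides do not give the authors' actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
n_samples, n_loci, n_pops = 300, 13, 5

# Synthetic labels and genotypes: allele sizes are random, shifted slightly
# by population so that there is some signal to learn.
y = rng.integers(0, n_pops, size=n_samples)
X = rng.integers(100, 110, size=(n_samples, 2 * n_loci)) + y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Illustrative hyperparameter grid (assumption, not the authors' grid).
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Accuracy of the best model on the held-out samples.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 4))
```

Scoring on samples held out from the grid search keeps the reported accuracy from being inflated by hyperparameter tuning.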

Slide 14

Slide 14 text

Results and discussion
● Assignment using traditional statistical methods did not perform significantly better than in other studies
  ○ Larrain et al. (2014): 50% accuracy with F_ST = 0.042
● Larrain et al. also highlighted that better results are found in farmed organisms, owing to reported differences in allele frequencies between farms that result from artificial selection and reduced gene flow
  ○ Since the sandfish in this study were sampled from natural populations, this may partially explain the low assignment accuracies

Slide 15

Slide 15 text

Results and discussion
● Assignment performance of machine learning algorithms was not better than that of traditional statistical methods at 15 populations and did not differ significantly at 5 populations
● Combining related groups improved the accuracy of both traditional and machine learning methods
  ○ Reduced complexity, increased sample size
  ○ Increased prior likelihood of assigning samples correctly (from 1/15 to 1/5)
● Excluding the traditional methods, the model with the highest accuracy on the original dataset was TPOT at 59.43% (62.26% with kNN imputation)
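The effect of merging sites can be sketched in a few lines. The group membership below is taken from the genetic structure slide; the names `GROUPS`, `SITE_TO_GROUP`, `relabel`, and `chance_accuracy` are hypothetical, and the chance level assumes uniform random guessing.

```python
# Group membership as listed on the genetic structure slide.
GROUPS = {
    "Group 1": ["STA", "SOR", "GUI"],
    "Group 2": ["MAS", "ROM", "CON", "TIG", "CEB", "BOH", "DUM"],
    "Group 3": ["SAM", "GEN", "TWI"],
    "Group 4": ["COR"],
    "Group 5": ["ELN"],
}
SITE_TO_GROUP = {site: group for group, sites in GROUPS.items() for site in sites}

def relabel(site_labels):
    """Replace each site code with its metapopulation label."""
    return [SITE_TO_GROUP[site] for site in site_labels]

def chance_accuracy(n_classes):
    """Expected accuracy of uniform random guessing among n_classes."""
    return 1 / n_classes

print(relabel(["STA", "COR", "BOH"]))
print(round(chance_accuracy(len(SITE_TO_GROUP)), 4), chance_accuracy(len(GROUPS)))
```

Because the merged groups are unequal in size (Group 2 alone contains 7 of the 15 sites), a classifier that always predicts the largest group could exceed the 1/5 uniform baseline, so accuracies at 5 populations are best read against that stricter reference.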

Slide 16

Slide 16 text

Conclusion
● Population genetic assignment is an important tool in conservation management and wildlife forensics
● Organisms with very low F_ST (less than 0.1) are more challenging to assign
● Holothuria scabra in the Philippines has a very low global F_ST (0.026), which resulted in poor assignment power using both traditional and machine learning models
● Reducing the complexity of the data resulted in better accuracies
● Reducing the number of samples and loci resulted in lower accuracies
● Assignment power may be improved by increasing the number of samples, the number of loci, or both