Data Gutenkunst (2009) PLoS Genet Jouganous (2017) Genetics

[...] tracts. The results, plotted in Figure 6A, show no significant difference between the average recombination rate within long IBS tracts and that within short ones. If recombination hotspots significantly reduced the frequency of long IBS tracts compared to what we would expect under the assumption of a constant recombination rate, then the longest observed IBS tracts should span regions of lower-than-average recombination rate; conversely, if recombination hotspots significantly increased the frequency of short IBS tracts, we would expect to see short tracts concentrated in regions of higher-than-average recombination rate. We observed neither of these patterns and therefore made no special effort to correct for recombination rate variation. Li and Durbin made a similar decision with regard to the PSMC, which can accurately infer past population sizes from data with simulated recombination hotspots.

To judge whether non-uniformity of the mutation rate was biasing the IBS tract spectrum, we computed the frequency of human/chimp fixed differences within IBS tracts of length L. We observed that short IBS tracts of <100 bp are concentrated in regions with elevated rates of human-chimp substitution, suggesting that mutation rate variation has a significant impact on this part of the IBS tract spectrum. IBS tracts shorter than 5 base pairs are dispersed fairly evenly throughout the genome, but human-chimp fixed differences cover more than 10% of the sites they span (see Figure 6B), as opposed to 1% of the genome overall. In their study of cryptic human mutation rate variation, Hodgkinson et al. estimated that the rate of coincidence between human and chimp polymorphisms could be explained by 0.1% of sites having a mutation rate 33 times that of other sites [52].
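To make the scale of this heterogeneity concrete, the following minimal Python sketch works through the two-class mixture implied by the figures quoted from Hodgkinson et al.: 0.1% of sites mutating 33 times faster than the rest. The function name `mixture_summary` and the choice to express rates relative to the baseline are assumptions made for illustration; they are not part of the original analysis.

```python
from typing import Tuple


def mixture_summary(hot_fraction: float, hot_multiplier: float) -> Tuple[float, float]:
    """Summarize a two-class mutation rate model (illustrative helper).

    hot_fraction   -- fraction of sites in the fast-mutating class
    hot_multiplier -- rate of the fast class relative to the baseline class

    Returns (mean rate relative to baseline, share of mutations at fast sites).
    """
    cold_fraction = 1.0 - hot_fraction
    # Average per-site rate, in units of the baseline rate.
    mean_relative_rate = cold_fraction * 1.0 + hot_fraction * hot_multiplier
    # Probability that a randomly chosen new mutation falls on a fast site.
    hot_share = hot_fraction * hot_multiplier / mean_relative_rate
    return mean_relative_rate, hot_share


# Values quoted in the text: 0.1% of sites at 33 times the baseline rate.
mean_rate, hot_share = mixture_summary(hot_fraction=0.001, hot_multiplier=33.0)
print(f"mean rate relative to baseline: {mean_rate:.3f}")
print(f"share of mutations at fast sites: {hot_share:.3%}")
```

Under these assumptions the genome-wide average rate rises by only about 3.2%, while about 3.2% of new mutations land on the 0.1% of fast sites.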
To reflect the heterogeneity estimated by Hodgkinson et al., we modified our method when analyzing real human data, assuming that a uniformly distributed 0.1% of sites have a scaled mutation rate of θ′ = 0.033, elevated above a baseline value of θ = 0.001. We also excluded IBS tracts shorter than 100 base pairs from all computed likelihood functions (see Methods for more detail).

Human demography and the migration out of Africa

[...] conflicting models of human evolution that have been proposed in recent years. Two of these models were obtained from SFS data using the method ∂a∂i of Gutenkunst et al.; these models are identically parameterized but differ in specific parameter estimates, which were inferred from different datasets. One model was fit to [...]

Table 1. Inferring the parameters of a simple admixture scenario.

                       t_a (gens)   t_s (gens)   f              N
Gene flow 400 generations ago
  True value:          400          2,000        0.05           10,000
  Mean:                431          1,990        0.0505         9,806
  Std dev:             51           41           0.00652        27
  Bias:                31           −10          0.0005         −194
  Mean squared error:  3280         1781         4.27 × 10^−5   3.84 × 10^4
Gene flow 200 generations ago
  True value:          200          2,000        0.05           10,000
  Mean:                220          1,983        0.0499         10,003
  Std dev:             28           39           0.00328        287
  Bias:                20           −17          −0.0001        −3
  Mean squared error:  1184         1810         1.08 × 10^−5   8.23 × 10^4

Using MS, we simulated 200 replicates of the admixture scenario depicted in Figure 2B. In 100 replicates, the gene flow occurred 400 generations ago, while in the other 100 replicates it occurred 200 generations ago. Our estimates of the four parameters t_a, t_s, f, N are consistently close to the true values, showing that we are able to distinguish the two histories by numerically optimizing the likelihood function.
doi:10.1371/journal.pgen.1003521.t001

Figure 4. Frequencies of IBS tracts shared between the 1000 Genomes trio parental haplotypes. Each plot records the number of L-base IBS tracts observed per base pair of sequence alignment. The red spectrum records tract frequencies compiled from the entire alignment, while the blue spectra result from 100 repetitions of block bootstrap resampling.
A slight upward concavity around 10^4 base pairs is the signature of the out of Africa bottleneck in Europeans.
doi:10.1371/journal.pgen.1003521.g004

Inferring Demography from Shared Haplotype Lengths
Harris (2013) PLoS Genet

Figure 2. We offer a fast algorithm for sorting m [...] function: https://github.com/flag0010/pop_gen_cnn/blob/m [...] rep.tricks.py).

Introgression detection

To detect introgression, we simulated train [...] (https://github.com/geneva/msmove) from the sa [...] (2018) used to train the FILET classifier for detec [...] and D. sechellia. In total we produced 237,500 co [...] with no migration between species (No Intro [...] simulans into D. sechellia (sim→sech), and 12,500 wit [...] (sech→sim). We used fewer sech→sim examples becau [...] that the network could detect this class fairly ac [...] sampling of the other two more challenging classes [...] training and validation sets so that the training set i [...]

Figure 2: Example population genetic alignments [...] unsorted alignment matrix (left) and this same matrix s [...] (right) are shown. Each row represents one of twenty [...] represents one of forty segregating sites. Derived and a [...] respectively. (Axes: Chromosomes; Segregating [...])

bioRxiv preprint first posted online May 31, 2018.
Flagel (2018) bioRxiv

Comparison with ABCtoolbox

Although ABC is not well suited for our scenario of interest and deep learning is a complementary method, we wanted to find a scenario where we could compare the performance of these two methods. To this end, we restricted the analysis to estimating (continuous) demographic parameters only. We used the popular ABCtoolbox [54], with the same training and testing datasets as for deep learning.
For ABC, the training data represent the data simulated under the prior distributions (uniform in our case), and each test dataset was compared with the training data separately. We retained 5% of the training datasets, and used half of these retained datasets for posterior density estimation. Overall, we used 75% of the datasets for training and 25% for testing.

We tested two scenarios, one with the full set of summary statistics (345 total), and the other with a reduced set of summary statistics (100 total). For the reduced set of summary statistics, we chose statistics which seemed to be informative: the number of segregating sites, Tajima's D, the first 15 entries of the site frequency spectrum, H1, and the distribution of distan [...]

Fig 5. A Venn diagram of the most informative statistics for each output variable (N1, N2, N3, and selection). For each variable, the top 25 statistics were chosen using permutation testing. The Venn diagram captures statistics common to each subset of output variables, with notably less informative statistics shown in the lower right. Close, mid, and far represent the genomic region where the statistic was calculated. The numbers after each colon refer to the position of the statistic within its distribution or order. For the SFS statistics, it is the number of minor alleles. For each region, there are 50 SFS statistics, 16 B statistics (distribution between segregating sites), 30 IBS statistics, and 16 LD statistics.
doi:10.1371/journal.pcbi.1004845.g005

Deep Learning for Population Genetic Inference
PLOS Computational Biology | DOI:10.1371/journal.pcbi.1004845 March 28, 2016
Sheehan (2016) PLoS Comput Biol
Schiffels (2014) Nature Genet
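The retention step in the ABCtoolbox comparison above, in which the 5% of simulated training datasets closest to a test dataset are kept, can be illustrated with a generic rejection-ABC sketch. This is not the ABCtoolbox implementation: the function `abc_rejection`, the Euclidean distance on standardized statistics, and the toy simulator are all assumptions chosen for illustration, and the sketch omits the posterior density estimation that the text applies to half of the retained datasets.

```python
import math
import random
import statistics


def abc_rejection(observed, sim_params, sim_stats, keep_fraction=0.05):
    """Plain rejection ABC: return the parameters of the simulations whose
    summary statistics lie closest to the observed ones."""
    n_stats = len(observed)
    # Standardize each statistic by its spread across simulations so that
    # no single statistic dominates the distance (an assumed design choice).
    scales = []
    for j in range(n_stats):
        sd = statistics.pstdev([row[j] for row in sim_stats])
        scales.append(sd if sd > 0 else 1.0)
    distances = [
        math.sqrt(sum(((row[j] - observed[j]) / scales[j]) ** 2
                      for j in range(n_stats)))
        for row in sim_stats
    ]
    n_keep = max(1, round(keep_fraction * len(sim_params)))
    order = sorted(range(len(sim_params)), key=distances.__getitem__)
    return [sim_params[i] for i in order[:n_keep]]


def toy_simulator(param, rng):
    """Hypothetical stand-in for a coalescent simulator: two noisy
    summary statistics that depend linearly on a single parameter."""
    return [param + rng.gauss(0.0, 0.5), 2.0 * param + rng.gauss(0.0, 0.5)]


rng = random.Random(0)
# Draw parameters from a uniform prior, as in the text, and simulate.
params = [rng.uniform(0.0, 10.0) for _ in range(2000)]
stats = [toy_simulator(p, rng) for p in params]

observed = [4.0, 8.0]  # noise-free statistics for a true parameter of 4.0
accepted = abc_rejection(observed, params, stats, keep_fraction=0.05)
posterior_mean = statistics.fmean(accepted)
print(len(accepted), round(posterior_mean, 2))
```

With 2,000 simulations and a 5% retention rate, 100 parameter values are kept, and their mean serves as a crude point estimate of the parameter.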