(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .
Allele surfing causes maladaptation in a Pacific salmon of conservation concern [1]
['Quentin Rougemont', 'Centre D Ecologie Fonctionnelle Et Evolutive', 'Univ Montpellier', 'Cnrs', 'Ephe', 'Ird', 'Montpellier', 'Thibault Leroy', 'Genphyse', 'Inrae']
Date: 2023-10
Abstract How various factors, including demography, recombination or genome duplication, may impact the efficacy of natural selection and the burden of deleterious mutations, is a central question in evolutionary biology and genetics. In this study, we show that key evolutionary processes, including variations in i) effective population size (N e ) ii) recombination rates and iii) chromosome inheritance, have influenced the genetic load and efficacy of selection in Coho salmon (Oncorhynchus kisutch), a widely distributed salmonid species on the west coast of North America. Using whole genome resequencing data from 14 populations at different migratory distances from their southern glacial refugium, we found evidence supporting gene surfing, wherein reduced N e at the postglacial recolonization front, leads to a decrease in the efficacy of selection and a surf of deleterious alleles in the northernmost populations. Furthermore, our results indicate that recombination rates play a prime role in shaping the load along the genome. Additionally, we identified variation in polyploidy as a contributing factor to within-genome variation of the load. Overall, our results align remarkably well with expectations under the nearly neutral theory of molecular evolution. We discuss the fundamental and applied implications of these findings for evolutionary and conservation genomics.
Author summary Understanding how historical processes, such as past glaciations, may have impacted variations in population size and genetic diversity along the genome is a fundamental question in evolution. In this study, we investigated how recent postglacial demographic expansion has affected the distribution of deleterious genetic variants and the resulting deleterious mutation load in Coho salmon (Oncorhynchus kisutch), throughout its native range in North America. By sequencing the entire genome of 71 Coho salmon, we reveal that postglacial expansion has led to allele surfing, a process where alleles increase in frequency in populations that are expanding or colonizing new environments. Here, allele surfing resulted in an increased deleterious mutation load at the colonization front. Furthermore, we demonstrated that the efficacy of natural selection scales with variation in effective population size among populations. We showed that the specific genomic features of Coho salmon, namely variation in local recombination rate and variation in chromosomal inheritance, strongly impacted the segregation of deleterious mutations.
Citation: Rougemont Q, Leroy T, Rondeau EB, Koop B, Bernatchez L (2023) Allele surfing causes maladaptation in a Pacific salmon of conservation concern. PLoS Genet 19(9): e1010918.
https://doi.org/10.1371/journal.pgen.1010918 Editor: Nicolas Bierne, CNRS UMR5554, FRANCE Received: November 11, 2022; Accepted: August 11, 2023; Published: September 8, 2023 Copyright: © 2023 Rougemont et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: New sequencing data are currently being deposited on NCBI SRA under project PRJNA982773. The coho vcf is available online at 10.5281/zenodo.8027705. All scripts to reproduce our analyses are currently being deposited online at
https://github.com/QuentinRougemont/selection_efficacy. Funding: This research was supported by a Genome Canada strategic grant entitle EPIC4–Enhanced Production in Coho: Culture, Community Catch (grant ID: 229COH, Genome Canada) attributed to L.B. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.
Introduction Demographic events, including population contractions, expansions and secondary contacts have profound impacts on the spatial distribution of genetic diversity within species [1]. Range expansions can be accompanied by multiple founder events when few individuals colonize new areas. At the wave front, this process results in increased genetic drift and a reduced effective population size [2, 3]. This has multiple consequences genome-wide including (i) extended linkage disequilibrium [4], (ii) a loss of genetic diversity and (iii) increased levels of genetic differentiation [2, 3, 5]. At the wavefront, increased genetic drift favors allele surfing, a process that will increase the relative proportion of neutral, deleterious or advantageous mutations that will fix in the population [5–7]. Such allele surfing is expected to have two main negative consequences at the wave front: i) an increase of the deleterious load (called expansion load [8, 9]) and ii) a loss of adaptive potential [10]. First, allele surfing through increased genetic drift can overwhelm the effect of weak purifying selection, resulting in an increase of the deleterious load [11]. Due to the lower efficacy of natural selection in low N e populations, a greater fraction of slightly deleterious mutations is expected to be present in populations at the wavefront (i.e., elevated ratios of non-synonymous to synonymous nucleotide diversity (π N /π S ) [8, 9]). Moreover, under the nearly neutral theory of molecular evolution, the reduced efficacy of purifying selection in small populations should increase the fixation rate of slightly deleterious mutations [12]. On the other hand, given that most deleterious mutations are expected to be recessive, increased homozygosity of deleterious mutations should enable their removal more efficiently through genetic purging, leading to a reduction of the recessive and additive load [13]. To take this into account, a possible strategy is to investigate both the additive (i.e,. total) and the recessive (i.e., fixed) deleterious mutation load. [8]. Evidence for an increased recessive genetic load has been reported in humans which has been associated with the Out-of-Africa bottleneck [14, 15]. Evidence has also been gathered in expanding populations in plants [16–20]. For instance in Arabidopsis lyrata, range expansion has been associated with an increased mutational load [16] and increased linkage disequilibrium [21]. Similar results were observed in Escherichia coli [22]. In contrast, a recent study in A. lyrata found no evidence of increased load following range expansion [23]. Second, allele surfing is expected to lead to a loss of adaptive potential due to a reduction in the rate of adaptive substitutions given that the supply of new mutations is proportional to N e [24]. Unless compensatory mutations (i.e., favourable mutations whose probability of occurrence increases to compensate for the fixation of slightly detrimental mutations–themselves with a higher probability of fixation in small populations–), counteract this effect in populations with a low N e [25], one expects a lower proportion of substitutions driven by positive selection [26]. Support for a decreased rate of adaptation associated with range expansion was documented, particularly in plants [17, 20]. For instance, such a pattern was reported in the North American populations of A. lyrata[17] [27], but not in the European ones despite a slight increase in their genomic burden [23]. Similarly an increased load is observed in human populations, a result that matches theoretical expectations under the out-of-Africa scenario [28] but with limited evidence of a decrease in the efficacy of natural selection [29]. Beyond application to plants, humans and bacteria, there is a lack of empirical studies documenting the consequences of range expansion on the deleterious mutation load and the efficacy of natural selection. Geographic variation in the overall mutation load is not the only interesting pattern to investigate; the genomic scale is also important. Both deleterious load and efficacy of selection are expected to vary along the genome depending on the local recombination rate [30, 31]. Non-recombining regions are expected to more freely accumulate deleterious mutations, a process called Muller’s Ratchet [32]. Moreover, Hill-Robertson interference, a process whereby competing alleles interfere with each other to become fixed, increases the fixation rate of deleterious variants linked to positively selected sites. Such a process will be exacerbated in the absence of recombination [30, 33]. In species that have undergone a whole-genome duplication (WGD) with associated residual polyploidy, variation in chromosomal inheritance and genomic architecture may affect the distribution of deleterious mutations along the genome [34]. By doubling the chromosome number, WGD has major consequences. In the short term, WGD alters gene expression and favours polyploid dysfunction (reviewed in [35]). In the middle to long-term, the progressive return to a diploid state will result in regions of residual tetraploidy. Such regions should display an increased effective population size as compared to the rest of the genome (4 N e instead of 2 N e ) and therefore a higher efficacy of selection and a reduced genetic load. Similarly, the population scale recombination rate, which also depends on N e , should favour a higher efficacy of selection and a reduced genetic load. We thus predict that these regions should display a lower genetic load compared to diploid regions of the genome [36]. WGD has been associated with a reduced efficacy of purifying selection which may favour the accumulation of transposable elements. Such WGD may in turn have contrasted effects, as observed in plants [37, 38]. For instance, a LTR insertion favouring early flowering time (an adaptive trait in harsh environments), was shown to be present in tetraploid populations, but not in diploid populations of A. arenosa. In contrast, WGS was shown to be associated with small but significant differences in the load and distribution of fitness effect in the same species. The Coho salmon (Oncorhynchus kisutch) is a Pacific salmon species of major cultural and economical importance, which has severely declined over the last decades (reviewed in [39]). It has undergone further recent declines in multiple parts of its native range [40]. In addition to human-induced perturbations, the species has undergone a series of postglacial expansions from its main glacial refugium (Cascadia/California) with multiple founder events along its route toward Alaska, leading to a pattern of isolation by distance, a gradient of population structure and a decrease in genetic diversity [41–43]. Accordingly, these range expansions may result in an increased expansion load, questioning whether such load may have consequences on the fitness of populations at the expansion front. The Coho salmon offers a unique opportunity to test the role of demography and recombination on the deleterious load and on the efficacy of selection. The Coho salmon belongs to the Salmonidae family, a particularly relevant group of species to investigate the effect of WGD. The common ancestor of all present-day salmonids has undergone a whole-genome duplication event ~80-100MyA [44]. The genome of salmonid species is still on its path to rediploidization. For instance, approximately 8% of the Coho salmon genome displays residual tetraploidy [41]. Residual tetraploidy offers a unique opportunity to investigate the direct effect of the genomic variation of N e on the efficacy of selection. To address these questions, we aimed at testing the following general hypotheses: i) whether demographic expansion and bottlenecks have led to an increased load and decreased efficacy of selection, ii) whether heterogeneous recombination levels shape the within-genome variation of the load as a result of Hill-Robertson effects and iii) whether residual tetraploidy results in an increased efficacy of selection and reduced deleterious load through increased recombination and N e .
Conclusion The role of genetic drift, recombination, selection and variation in inheritance in affecting the deleterious load and efficacy of natural selection is a central question with fundamental and applied consequences for biodiversity. Using population genomics analyses, we investigated the evolutionary consequences of recent demographic events in Coho salmon. Our results supported gene surfing across North America [42]. Results also supported lower N e at the northern range expansion front, which induced two main evolutionary consequences: a surf of slightly deleterious mutations and a putative reduction of the adaptive potential, as expected under the nearly neutral hypothesis of molecular evolution [12]. We further demonstrated that one population, the Salmon River (Thompson R. watershed) displayed an increased fixed load and decreased selection efficacy compared to southern populations. Previous studies showed that this population is genetically isolated from others [42, 43, 63] and displayed genomic footprints of a bottlenecked population such as long runs of homozygosity [41]. This population has been subjected to extensive hatchery enhancement (from a single population) to circumvent the decline of Coho salmon in the Thompson drainage [64]. From a theoretical standpoint, if a few captive individuals were used as parents for subsequent releases, then an increase in inbreeding and a reduction in effective population size are expected, a phenomenon called the Ryman-Laikre effect [57] [65]. Consequently it is possible that the increased π N /π S , ω NA , and fixed deleterious load could be explained by both long-term evolution at low N e and by the Ryman-Laikre effect associated with hatchery enhancements. Regardless, this suggests that careful enhancement needs to be performed with a diversity of parents to maximize genetic diversity of supplemented populations. This result has implications for fisheries enhancement, as well as genetic rescue programs aiming at reducing the inbreeding load of declining populations and restoring the fitness of these populations. It is noteworthy that the southernmost population (Klamath R, California) was not characterized by the highest α value. Several non-exclusive hypotheses may explain this pattern: first, some Californian populations are known to have undergone strong recent (human-driven) decrease in abundance due to both habitat degradation and climate change, which may leave a detectable footprint in the genome [66]. The recent demography could have an impact on the accuracy of the fit of the synonymous and nonsynonymous SFS under Grapes. Second, the Klamath R. population is located in the far upstream part of that river, and may have undergone a founder event when reaching this upper part, a hypothesis that is supported by the correlations we observed between several metrics and distance to the river mouth. Finally, we did not include the southernmost populations from California, which are more likely to be more ancestral, as we previously reported [42]. Since the use of π N /π S ratio can be criticized [14, 67, 68], especially when used to quantify the load within species, we computed additional metrics [56] to more directly estimate mutation load, as advocated by others [14, 69]. In this case, we found small differences in the additive load among Coho salmon populations, but a linear increase in the recessive load as a function of the distance to the southernmost sites and as a function of the tree branch length, both are used as proxies of the expansion route. We demonstrated that the π N /π S ratio was negatively correlated with the GC content at third codon position, which represents a good proxy of the local recombination rate (see S2 Note). The negative correlation between π N /π S and GC3 was repeatedly observed across different salmonid species using both a sequence based estimate of π N /π S and an estimate based on GC-conservative site (non-affected by gBGC) [60]. These results indicated that recombination plays a key role in explaining the variation in the mutation load along the genome in salmonids. Our study empirically supports theoretical work about the accumulation of slightly deleterious mutations in non-recombining regions [30, 32]. Similarly, increased prevalence of deleterious mutations in low recombining regions have been reported in plants [70] and in human populations [71–73]. In particular, recent work in human populations have shown that both variation in demographic history (i.e. change in effective population size) and recombination rates are affecting allele-specific expression of harmful mutations [71]. The authors showed that allele specific expression causes underexpression of harmful mutations more efficiently in normally and highly recombining regions compared to low recombining regions. They further documented variation of this process among populations with varying demographic histories. In line with the key role of recombination rate, we observed that only regions of residual tetraploidy with extremely low recombination rate (lower GC content) displayed an increased load in Coho salmon, accumulating more transposable elements, suggesting efficient selection in more “normally” recombining region of residual tetraploidy. In particular, our results indicate that i) the regions of residual tetraploidy (4N e ) with a normal recombination rate do not display an increased load, which may be expected under the nearly neutral theory if higher N e is associated with the efficacy of selection; ii) only region of extremely low recombination rate (lower GC content) displays the highest load, highlighting once again the primary role of recombination in shaping the load, with a higher contribution than N e . A third factor is the variation in dominance of deleterious mutation. Indeed, a recent simulation study found that in the case of hard sweeps, dominance of recessive mutation was a central aspect determining the signal of sweeps in polyploid genomes [39]. A detailed investigation of this aspect in regions of residual tetraploidy was beyond the scope of our study but would be worthy of further investigation. The detailed consequences of ongoing rediploidization have been extensively studied at the regulatory levels elsewhere, with support for an increased load (higher d N /d S and TE load) in duplicated genes undergoing lower expression [74]. Recent studies in polyploid plant species have also documented various evolutionary consequences of such duplication on local variation in the load and efficacy of selection [35]. For instance in A. arenosa, a reduced efficacy of purifying selection was suggested in tetraploids (4X) genome compared to diploids (2X), and the authors suggest that this is because deleterious alleles are better masked in autotetraploids [38]. Similarly, a recent work in the allopolyploid cotton (Gossypium) demonstrated that this species accumulates more deleterious mutations than the diploid species [75]. This observation supports a theory proposed by Haldane that recessive deleterious mutations accumulate faster in allopolyploids because of the masking effect of duplicated genes [76]. To sum up, our results concur with predictions from the nearly neutral theory of molecular evolution [12], in which slightly deleterious mutations are effectively neutral and purged effectively in regions of higher recombination, except perhaps in populations at the extreme of the expansion front. Interestingly, these results indicate that regions of low-recombination re-diploidize later than other genomic regions. We are not aware of any paper having documented such patterns so far. Further studies of this process across more species would be welcome to validate the generality of this observation. Finally, our results have implications for conservation practices. We showed that the additive load is approximately constant, indicative of efficient purging across populations for both missense and LoF, similar to some recent studies on several plant and animal models [77–80]. However, results indicate that population at higher latitude from the Yukon watershed (Mile Slough R., Porcupine R.) or from the bottlenecked Salmon R. have not entirely purged the most deleterious mutations, including missense and LoF mutations, which may impose a fitness cost to these populations. Further empirical evidence for a causal link between the putative fitness cost of LoF mutations and adaptive phenotypic variation will be necessary to validate our observations. With this caveat in mind, our results for the Salmon R could guide practices in supplementation programs. For instance, choosing a diversity of parents from moderately differentiated populations of modest size may help increase the levels of heterozygosity and mask the expression of recessive deleterious mutations [55]. This strategy could reduce the occurrence of deleterious alleles and counteract the Ryman-Laikre effect described above. Moreover, populations (e.g. most upstream Alaskan/Yukon populations) for which our results suggest a reduced adaptive potential are also the most strongly exposed to rapid climate change, as the rate of temperature increase is most rapid at higher latitudes [81]. In all cases, maximizing the connectivity among populations and limiting habitat degradation appears as fundamental strategies to maintain high effective population size and increase the adaptive potential of Coho salmon to the multiple ongoing anthropogenic pressures [82]. More generally, how best to manage declining populations and guide conservation policies in these conditions to minimize the load and/or maximize genetic diversity is another debated issue [55, 83–85]. In the meantime, while conservation genomics undoubtedly has a major role for the short-term preservation of endangered species, this should not override the crucial need for reducing human impacts on natural ecosystems to preserve biodiversity over long time scales [86].
Methods Sampling design for Coho salmon We sampled and sequenced 71 individuals representing 14 populations distributed from California to Alaska (S1 Table). A set of 55 individuals was sequenced on an Illumina HiSeq2500 platform [38] and the other 16 individuals were sequenced on a NovaSeq6000 S4 platform using paired-end 150 bp reads. Reads were processed using fastp for trimming [87], and mapped to the most recent Coho reference genome (
https://www.ncbi.nlm.nih.gov/assembly/GCF_002021735.2/) using bwa mem v2 [88]. Reads with a minimum quality below 20 were discarded with samtools v1.7. Duplicates were removed with picard (
http://broadinstitute.github.io/picard/). SNP calling was performed using GATK v4.2 [89] using our pipeline available at github.com/quentinrougemont/gatk_haplotype/. We generated a Haplotype gVCF for each sample individually, combined all gVCF and then performed a joint genotyping. We checked the variants quality score of our data and filtered our genotypes according to their quality following GATK best practices and based on quantiles distributions of quality metrics. We excluded all sites that did not match the following criterion: MQ < 30, QD < 2, FS > 60, MQRankSum < -20, ReadPosRankSum < 10, ReadPosRankSum > 10. We also excluded multiallelic SNPs, as well as indels. Genotypes with a depth lower than 6 or higher than 100 reads were also excluded to remove low confidence genotypes potentially associated with paralogs. Finally, we also generated a separate vcf file using the—all-sites option to obtain a file with invariant position to reconstruct sequence data (see the Genetic load estimation in Coho salmon and related species section). A total of 14,701,439 SNPs were identified without missing data. Population structure was evaluated using a principal component analysis (PCA) performed using Ade4 [90] on a set of LD-pruned SNPs without missing data (1,739,037 SNPs) identified stringently with plink1.9 [91] (command indep-pairwise 100 50 0.1). Outgroup dataset. In order to test the generality of the relationship observed between π N /π S and GC3 we took advantage of the newly assembled reference genomes and resequencing data from other closely related Pacific salmonid species with similar demographic histories [92–94]. Sockeye salmon published by [92] were retrieved from NCBI PRJNA530256. Samples with a high number of individuals per ecotype (Sockeye and Kokanee) were chosen from the NCBI table. We retained a total of 5 Kokanee populations and 5 Sockeye populations (from Fraser & Columbia watershed described in S1 and S3 Tables). Three Chinook salmon samples were provided by B. Koop (also available at NCBI PRJNA694998). Additionally, 11 samples of “even” and 10 samples of “odd” pink salmon (O. gorbuscha) were downloaded from PRJNA 556728 and included in the analysis. Here “even” and “odd” refers to Salmon returning to their natal rivers in “even” and “odd” years to spawn, leading to a temporal isolation of these ecotypes [93]. These were indeed clearly separated based on a PCA. Finally, a number of rainbow trout available from NCBI PRJNA386519 were used (n = 19 from 3 random populations showing genomic differentiation based on a PCA). Each sample was downloaded and mapped onto its species’ reference genome downloaded from NCBI and using the exact same procedure as described above relying on fastp, bwa-mem2, picard and GATK 4.2.5.0 to generate a final vcf filter based on usual GATK quality criteria and variance in sequencing depth. For each species, we then quantified the load using the π N /π s ratio with the procedure described below for Coho salmon. For the Sockeye/Kokanee ecotypes, the Sockeye is a fully migratory ecotype, whereas the Kokanee is a resident (non-migratory) ecotype that typically comprises more isolated populations. The two alternative ecotypes are sometimes found in similar locations, with three rivers from our sampling design including both ecotypes (S3 Table). For all other species we tested the relationship between π N /π S and GC3 (see below). Ancestral and derived alleles identification To accurately recover the ancestral allelic states, we used three outgroup species; including the chinook salmon and rainbow trout sample (see above, n = 5 for rainbow trout) plus data from Atlantic salmon (n = 5, SRP059652). Each individual was aligned against the Coho salmon V2 genome (GCF_002021745.2) using GATK with the same procedure as above and calling every SNP using the–all-site mode to obtain invariant positions. We then determined the ancestral state of each SNP if 1) the SNP was homozygous in at least 90% of the individuals from our three outgroups, and 2) matched one of the two alleles identified in Coho salmon. Otherwise, the site was inferred as missing and was not used in subsequent analyses of the load. In addition, we reconstructed a consensus ancestral fasta sequence using the doFasta option from angsd [95]. This was used for demographic reconstruction detailed in S1 Note. Demographic reconstruction We first tested our prediction that genetic diversity (observed heterozygosity H o ) decreases towards the North following our «out of Cascadia» model previously inferred [42, 43]. Conversely, we verified that the β ST coefficient, a measure of both genetic differentiation and ancestrality [46] increases northwards the North, as expected due to isolation by distance. β ST and observed heterozygosity (H o ) were measured using the hierfstat R package [96] Details about β ST computation are provided in Weir & Goudet [46] and H o was computed following Nei (1987)[97] as 1 - ∑ k ∑ i Pkii/np with Pkii the proportion of homozygotes i in sample k and np the number of samples. Oceanic coastal distances were computed using the marmap package [98] and waterway distance was computed using ArcGIS. Then we summed the 2 distances to obtain the distance to the most likely refugia that we identified in our previous studies [42, 43]. A shapefile of the rivers used here is available at
https://github.com/QuentinRougemont/selection_efficacy/tree/main/00.raw_data. Under postglacial expansion from a single refugium, the general hypothesis would be that all sampled populations should follow a common temporal trajectory of a population decline (bottleneck due to founder events by few individuals) followed by a (strong) increase in N e corresponding to the expansion phase. To test this hypothesis, we inferred temporal changes in N e using SMC++ [50]. SMC++ works similarly to the PSMC model but takes into account the Site Frequency Spectrum (SFS) and is better suited to large sample sizes. Estimates of changes in population size were performed for all populations. To validate the fact that expansions are indeed postglacial, splitting time was estimated between all pairs of samples from different geographic areas based on the joint SFS (n = 75 pairwise comparisons). A generation time of 3.5 years, well documented for Coho, and a mutation rate of 8e-9 mutation/bp/generation were applied (corresponding to the median substitution rate inferred in Atlantic salmon, J. Wang, Personal communication). We also compared the results to the mean substitution rate of 1.25e-8 mutation/bp/generation also inferred by Wang. In addition, pairwise linkage disequilibrium provides valuable information regarding population size and inbreeding. We computed the squared correlation coefficient between genotypes in each sample and all populations separately using popLDdecay [99]. We used a MAF of 5% in each population, keeping between 3.7 and 6.4 million SNPs with populations having undergone stronger bottleneck/founding events displaying the lowest amount of variation. We estimated LD decay by plotting LD (R2) against physical distance measured in base pairs in ggplot2 [100] package in R. We observed slight discrepancies between our SMC++ estimates of divergence time and our previous work based on the site frequency spectrum [42]. To investigate this, we performed a new set of inference based on the unfolded joint site frequency spectrum (jSFS) using ∂a∂i [101] and a new set of refined models as detailed in supp. S1 Note and S8 and S9 Tables. To reconstruct the population expansion routes we computed Weir and Cockerham [102] F ST values among populations using Vcftools [103] and constructed a phylogenetic population tree using the R package ape [104]. We projected the tree on the map using the functions contMap and phylo.to.map from the phytools R package [105]. We then computed the tree branch lengths to the root using the function distroot from adephylo [106]. We also explored broad relationships among populations with treemix [49]. We fitted a model with an increasing number of migration edges. We chose a model with K = 3 migration edges as all edges displayed significant p-value and captured a high proportion of explained variance. Fitting more edges decreased the p-value without really improving the fit to our data. Similarly, we computed the tree branch length to the root using the function adephylo and compared our results to those from F ST -based values. The resulting tree branch length values were used as a proxy for the expansion routes, and we tested their correlation using linear models in R, with all our metrics of deleterious load (π N /π S , total number of putative homozygous derived deleterious alleles (recessive load) and total number of deleterious alleles (both in homozygous and heterozygous states, additive load) for derived missense and LoF mutations) and metrics of selection efficacy (ω NA , ω A , α detailed below). Similar inference was performed, but using the reconstructed distances to the southernmost site included in our previous studies [42, 43]. This distance was computed by summing the oceanographic distance and river distances described above. Population-scaled recombination rate Statistical phasing of the Coho whole genome sequences was performed using the Shapeit software [107], considering all individuals at once. We then estimated effective recombination rates (ρ = 4.N e .r where r represents the recombination rate per generation and N e is the effective population size) along the genome using LDHat [108]. Phased genotypes were converted into LDHat format using vcftools after excluding SNPs with MAF < 10% since rare variants are not informative for such inferences. Following the guidelines of the LDHat manual, the genome was split into fragments of 2,000 SNPs with overlapping windows of 500 SNPs. We measured recombination rates independently for each population. Differences in the distribution of population-scaled recombination was visualized using violin plot (S13 Fig) and statistically tested using Tuckey HSD tests in R (S5 Table). Genetic load estimation π N /π S estimates. The approach developed in [59] was used to reconstruct fasta sequences for each individual. For each species (Coho and the other salmonids), coding sequences (CDS) from each reconstructed sequence were extracted using the gff files available with the reference genome to estimate the nucleotide diversity (π). We also concatenated the CDS sequences into different classes according to their length and computed N and π S over 4-Mb concatenated gene space windows. Such large windows reduce the stochasticity due to the low π S values in Coho salmon. Identifying potential deleterious non-synonymous alleles We tested the difference in count of non-synonymous mutations in each local population of Coho salmon across non-synonymous missense mutations (putatively deleterious) and Loss of Function (LoF) mutations (likely to be strongly deleterious) identified with SNPeff. We analyzed data in two ways: first, we counted the total number of putative homozygous derived deleterious alleles (recessive load) per individual as well as the total number of deleterious alleles (both in homozygous and heterozygous states, additive load) using: Ntotal = 2 Χ N homo + N hetero [28]. These individual values were then averaged per population. We tested for differences in the distribution of these counts among populations using an ANOVA followed by a TukeyHSD test. The p-values were corrected using a Bonferroni correction. We also computed mean derived allele frequencies (DAF) in all sampling locations and across mutation categories (synonymous, non-synonymous missense and LoF). We also applied the commonly used Provean [109] software based on a random set of non-synonymous mutations (but see S3 Note for a brief discussion regarding the limitations).These results were then compared with results from non-synonymous mutations (S15 Fig). Correlation between π N /π S and recombination We computed the GC content at third-codon positions of protein coding genes (GC3) which has been shown to be an accurate proxy of local recombination rate in other species and these positions are generally silent [85, 86]. To compute π N /π S values we sorted genes by ascending GC3 values, which enabled us to obtain a ratio based on genes with similar GC3 values. Moreover, we also used the site frequency-based approach proposed by Rousselle et al. [60] to estimate the π N /π S ratios. This approach enabled us to compute SFS separately for GC conservative sites (A<->T and C<->G mutations), that is, not affected by GC-biased Gene Conversion (gBGC). Finally, we measured the correlation between GC3 and π N /π S using linear models. We replicated these analyses considering only genes (n = 3,500) and SNPs in regions of residual tetraploidy (8% of the genome). DFE estimation and rate of adaptation in Coho salmon We estimated the rate of non-adaptive and adaptive synonymous substitutions (ω NA and ω A , respectively; with ω A = d N /d S —ω NA ) and the proportion of amino-acid substitution potentially resulting from positive selection (α = ω A /(d N /d S ), with d N being the rate of non-synonymous substitutions, d S being the rate of synonymous substitutions). To do so, we used the method implemented in Grapes v1.0 which builds upon the approach of [110]. Grapes models the effect of favorable mutations on the non-synonymous site frequency spectrum (SFS) while accounting for various confounding factors distorting the SFS (demographic change, linked selection, genotyping errors, SNP misorientation). We used the following parameters in Grapes v1.0: we assumed a negative Gamma distribution to the synonymous and non-synoymous SFS (parameters GammaExpo in Grapes and parameter “unfolded” site frequency spectrum). To obtain suitable data, we converted the quality filtered whole genome vcf file into a fasta file for each population containing sequence information for each individual in the population (using vcf2fasta available at
https://github.com/QuentinRougemont/selection_efficacy/tree/main/08.piNpiS/). This file was then converted into a site-frequency spectrum using bppml [111]. We required at least 10 sites without gap to keep a given site (-gapN_site), with a maximum proportion of gap of 0.5 (-gapN_seq), a minimum number of complete codon of 6 (-min_nb_codon), we did not allow frameshift (remove_frameshift parameter), we also required a sample size of 8 (-sample_size), the number of initial/terminal position in which Stop codons or FrameShift codons are tolerated was set to 20 (-tolerance_zone), we did not allow Stop or FrameShift codons between the two tolerance zones (-allow_internal_bc parameter). The kappa value (i.e Ts/Tv ratio) was set to 1.6 and we used an unfolded SFS using the three species as an outgroup. Fitted parameters of the DFE were used to compute the expected d N /d S under near neutrality, which was compared to the observed d N /d S to estimate α, ω NA, ωa. Differences in load for region of residual tetraploidy in Coho salmon π N /π S comparison. In the Coho reference genome, Chromosomes 1 to 30 are considered to represent generally diploid chromosomes, whereas chromosomes 31 to 38 represent those with a clear signal of residual tetraploid [41]. We took advantage of this specificity regarding chromosome evolution to contrast the load for the 3700 genes in regions of residual tetraploidy (averaged across all genes for each population). For diploid regions, we generated 200 datasets of 4,000 genes randomly sampled and then estimated the load for each of these datasets. TE annotations We used the TEs annotation file from repeatmasker [112] (made available on NCBI for the reference genome [41]) and tested for difference in the length of TEs between diploid and tetraploid regions, after correcting for the difference in chromosome length.
Acknowledgments We are grateful to Alysse Perreault-Payette and Bérénice Bougas for help in DNA extraction of additional individuals for whole genome sequencing. We thank Benoit Nabholz for the initial development of the π N /π S C++ app we used here. We are grateful to the genomic platform at Compute Canada (Canada), IBIS (Canada), and MBB (Montpellier, France) for providing computational capacity as well as Bird (Nantes, France) for data storage. This research was carried out in conjunction with EPIC4 (Enhanced Production in Coho: Culture, Community, Catch), a project supported in part by University Laval, Ressources Aquatiques Québec (RAQ), and the Government of Canada through Genome Canada, Genome British Columbia, and Genome Québec.
[END]
---
[1] Url:
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010918
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/