(C) PLOS One
This story was originally published by PLOS One and is unaltered.
. . . . . . . . . .



Major sex differences in allele frequencies for X chromosomal variants in both the 1000 Genomes Project and gnomAD [1]

['Zhong Wang', 'Department Of Statistics', 'Data Science', 'Faculty Of Science', 'National University Of Singapore', 'Lei Sun', 'Department Of Statistic Sciences', 'Faculty Of Arts', 'Science', 'University Of Toronto']

Date: 2022-07

An unexpectedly high proportion of SNPs on the X chromosome in the 1000 Genomes Project phase 3 data were identified with significant sex differences in minor allele frequencies (sdMAF). sdMAF persisted for many of these SNPs in the recently released high coverage whole genome sequence of the 1000 Genomes Project that was aligned to GRCh38, and it was consistent between the five super-populations. Among the 245,825 common (MAF>5%) biallelic X-chromosomal SNPs in the phase 3 data presumed to be of high quality, 2,039 have genome-wide significant sdMAF (p-value <5e-8). sdMAF varied by location: non-pseudo-autosomal region (NPR) = 0.83%, pseudo-autosomal regions (PAR1) = 0.29%, PAR2 = 13.1%, and X-transposed region (XTR)/PAR3 = 0.85% of SNPs had sdMAF, and they were clustered at the NPR-PAR boundaries, among others. sdMAF at the NPR-PAR boundaries are biologically expected due to sex-linkage, but have generally been ignored in association studies. For comparison, similar analyses found only 6, 1 and 0 SNPs with significant sdMAF on chromosomes 1, 7 and 22, respectively. Similar sdMAF results for the X chromosome were obtained from the high coverage whole genome sequence data from gnomAD V 3.1.2 for both the non-Finnish European and African/African American samples. Future X chromosome analyses need to take sdMAF into account.

The human X chromosome contains over 800 genes and is the 8 th largest human chromosome. Genome-wide associations studies have generally failed to examine variants on the X chromosome for association with diseases and traits, partly due to complexities of the data analysis, and challenges with genotype imputation. We examined X chromosomal variants from the 1000 Genomes Project for sex differences in allele frequency and found that many variants showed significant differences. These variants cluster at the centromeric parts of the pseudoautosomal regions 1 and 2, as well as the putative pseudo-autosomal region 3 (also termed X-transposed region). This pattern was observed in high coverage whole genome sequence data from the same subjects that was aligned to GRCh38, suggesting that is not an artefact of low coverage sequencing or problems specific to GRCh37. In addition, we replicated this phenomenon in high coverage whole genome sequence aligned to GRCh38 from the gnomAD database in both the non-Finnish European and African/African American samples. These findings have implications for the analysis of X chromosomal variants for disease and trait associations.

Funding: LS received the awards: This research is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN-04934) and the Canadian Institutes of Health Research (CIHR, PJT-180460). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2022 Wang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

We hypothesize that sex difference in MAF exists across the X chromosome in human populations, including NPR and at the boundaries between NPR and PAR1, PAR2 and PAR3, and it is more prevalent in low-coverage whole genome sequence data. We use publicly available phase 3 data from the 1000 Genomes Project to test for sdMAF across and within each of the five super-populations. We compare the results with the recently released high coverage whole genome sequence to determine whether genotyping error is contributory. Finally, we examined genotype data from the high coverage whole genome sequence gnomAD v3.1.2 resource to further evaluate sdMAF on the X chromosome in two defined populations with the largest sample sizes.

Recently, sex-differences in allele frequency in PAR1 and PAR2 were described using the African super-population from the phase 3 data of the 1000 Genomes Project [ 23 ], but the rest of the X chromosome and other four super-populations of the 1000 Genomes Project were not examined. Evolutionary dynamics, including recombination within the PAR regions, were reasoned as a major contributing factor to sdMAF, but genotyping errors and the agreement with the high coverage data were not examined.

In addition to higher missing rates for variants on the X chromosome, the two pseudo-autosomal regions (PARs), PAR1 and PAR2, create further challenges for the analysis at the boundaries between PARs and NPR [ 14 ]. PAR1 is 2.75 Mb at the end of the short arms of the X and Y chromosomes (Xp22.33 and Yp11.32-p11.2) containing 16 genes, while PAR2 is at the tip of the long arms (Xq28 and Yq12) and is 320 kb, containing 4 genes [ 15 – 17 ]. Because there is no recombination of NPR in males, but recombination occurs in PAR1 and PAR2 in both sexes, variants in PARs close to the PAR-NPR boundaries are linked to variants in NPR of the X and Y chromosomes in a sex-specific fashion. Obligatory recombination occurs in PAR1 in males, making it the region with the highest recombination rate per physical distance in the human genome, while only ~2% of meiosis feature recombination at PAR2. Although the effects of sex-specific recombination rates in PAR1 and PAR2 on linkage have been examined for non-parametric linkage analysis of affected sibpairs [ 18 ], the implications for X chromosomal data collected for association studies have not been well explored. The XTR/PAR3 in Xq21.3/Yp11.1 adds further complexity as this 3.91 Mb region on the X chromosome (3.38 Mb on the Y chromosome) is embedded in the non-pseudoautosomal region and has 98.8% sequence homology between X and Y [ 19 – 22 ].

Most genotype calling, imputation and sequence analyses of X chromosomal variants apply methods and tools that were developed for the autosomes [ 1 , 2 ]. However, there are reports that genotype missing rate is higher for SNPs on the X chromosome than autosomes [ 1 , 2 ]. In the non-pseudoautosomal region (NPR) of the X chromosome males are hemizygous, meaning that the intensity of allele signals from genotyping arrays, or the number of reads from sequencing, is half that of homozygous females. This may result in variant positions having higher missing rates for males than females.

After the striking observation that the X chromosome was excluded from most genome-wide association studies (GWAS) [ 1 ], there has been a slow increase in the incorporation of analysis of the X chromosome [ 2 – 5 ]. Several association methods have recently been developed for the X chromosome [ 2 – 12 ], focussing on the phenomenon of X-inactivation [ 13 ], also known as dosage compensation. However, these X chromosome specific downstream methods, similar to those developed for the autosomes, typically presume high quality data and implicitly assume that there is no sex difference in minor allele frequency (sdMAF).

To determine whether sdMAF also is present for rarer variants, we used the non-Finnish Europeans from gnomAD v3.1.2. to examine 155,828 X chromosomal variants with MAF 0.1% to 5%. These would typically have a minimum minor allele count of 50. Significant sdMAF was observed at the PAR1 and PAR2 boundaries with NPR, as well as those in PAR3 ( S28 Fig ). In addition, there were other regions including around 142 Mb, where there were clusters of rarer SNPs with significant sdMAF where female MAFs were greater than male. This later region was not observed strongly in the variants with MAF>5% in the same data ( Fig 8 ). Online locuszoom of these results is available: https://my.locuszoom.org/gwas/473034/?token=78b9004c71f04892adf5f514be014044

To illustrate the sdMAF results for specific SNPs, we selected eight SNPs with the most significant sdMAF from the gnomAD non-Finnish European and African/African American populations, one each from the four regions ( Table 1 ). There is remarkably consistency of sdMAF direction and magnitude across these two populations. Of note, the SNP genotype call rate in males are generally lower than females (paired non-parametric test p value = 0.0026) for these SNPs, which may provide insight into some mechanisms resulting in sdMAF.

A: sdMAF p-values for bi-allelic SNPs with MAF ≥5% and the total number of alleles >30,000 in the African/African American population, presumed to be of high quality. SNPs in the PAR1, PAR2 and PAR3 regions are plotted in grey, with PAR3 located around 90 Mb. Y-axis is -log10(sdMAF p-values) and p-values >0.1 are plotted as 0.1 (1 on -log10 scale) for better visualization. The dashed red line represents 5e-8 (7.3 on the -log10 scale). B: Female—Male sdMAF for the same SNPs in part A.

A: sdMAF p-values for bi-allelic SNPs with MAF ≥5% and the total number of alleles >50,000 in the Non-Finnish European population, presumed to be of high quality. SNPs in the PAR1, PAR2 and PAR3 regions are plotted in grey, with PAR3 located around 90 Mb. Y-axis is -log10(sdMAF p-values) and p-values >0.1 are plotted as 0.1 (1 on -log10 scale) for better visualization. The dashed red line represents 5e-8 (7.3 on the -log10 scale). B: Female—Male sdMAF for the same SNPs in part A.

To be consistent with the 1000 Genomes Project analysis, the primary analyses were restricted to SNPs with MAF>5% and where the majority of samples had genotypes. This resulted in 53,002 SNPs in non-Finnish Europeans and 97,438 in African/African Americans ( S26 and S27 Figs, respectively). These analyses confirm the existence of many regions with sdMAF that we documented in the 1000 Genomes Project data, including the PAR1-NPR boundary, the PAR2-NPR boundary and the centromeric boundary of PAR3 (Figs 8 and 9 , respectively). Online locuszoom plot [ 30 ] for the non-Finnish European data is available: https://my.locuszoom.org/gwas/717341/?token=b784386eb4574ef7ba46c117ed711ccf

To examine consistency of sdMAF in other high coverage whole genome sequence data, we used the genotype and allele counts from gnomAD separately from the two largest populations: the non-Finnish Europeans and Africans/African Americans to examine sdMAF on the X chromosome.

SNPs with significant sdMAF also remain at the centromeric NPR-PAR3 boundary ( S23 Fig ). The persistent presence of small sdMAF p-values, including in the NPR region, is also evident from the QQ plots ( S24 Fig ) and histograms of p values ( S25 Fig ). This strongly suggests that sex-linkage is a major driver for sdMAF at both PAR1 and PAR2, and at the centromeric boundary of PAR3. Association studies of the X chromosome thus must consider sdMAF.

The sdMAF results for PAR1 (Figs 4 and S21 ) and PAR2 (Figs 5 and S22 ) are practically the same between the two phases: 0.30% of SNPs in PAR1 and 12.2% in PAR2 have sdMAF p-values <5e-8 in the high coverage data, as compared to 0.29% in PAR1 and 13.1% in PAR2 in phase 3. The sdMAF SNPs in the high coverage data still occur at the NPR-PAR1 and NPR-PAR2 boundaries, which are evident from the zoomed-in Manhattan plots ( S21 and S22 Figs).

The Manhattan plot of sdMAF p-values in S20 Fig shows that, as compared to Fig 2 for the phase 3 data, the prevalence of genome-wide significant sdMAF is reduced in the high coverage data for NPR and PAR3, about a 10-fold deduction: 0.11% of SNPs in NPR and 0.07% in PAR3 have sdMAF p-values <5e-8 in the high coverage data, as compared to 0.83% in NPR and 0.85% in PAR3 in phase 3. This suggests that genotyping error is a contributing factor to some of the significant sdMAF observed in the phase 3 data. The causes of the remaining sdMAF in NPR and PAR3 in the high coverage data require further examination.

Results of X chromosome-wide sdMAF analysis of the high coverage data, without the liftover restriction, are reported in S3 Table and S19 – S25 Figs. S19 Fig shows the analytical pipeline of selecting bi-allelic SNPs with population-pooled MAF ≥5% using the high coverage data, and S3 Table contains the counts of variants by region, MAF threshold and those excluded.

For the 10 SNPs in PAR1, the high coverage data did not resolve the genome-wide significant sdMAF observed in the phase 3 data (Tables A-AD (pages 2–11) in S2 Note ). In addition, they showed good genotype agreement with the phase 3 data; similar results were observed for 10 SNPs in PAR2 (Tables BR-CU (pages 25–34) in S2 Note ). For the 4 SNPs in NPR (Tables AE-AM (pages 12–14) and Tables BO-BQ, and (page 24) in S2 Note ), sdMAFs are no longer genome-wide significant in the high coverage data, suggesting genotyping error in phase 3. For the 9 SNPs in PAR3 (Tables AN-BN (pages 15–23) in S2 Note ), two persisted with genome-wide sdMAF in the high coverage data, located at the centromeric boundary between PAR3 and NPR, while the remaining seven in PAR3 were no longer significant.

sdMAFs and sex-specific genotype agreements between the two phases were attempted for 130 SNPs, selected based on the smallest sdMAF p-values in the phase 3 data (GRCh37) from the four regions (NPR, PAR1, PAR2, and PAR3). We required success of liftover onto GRCh38, that they were biallelic in both datasets, and with no genotype missingness. The liftover failure rates differed by region: NPR = 92%, PAR1 = 0%, PAR2 = 55%, PAR3 = 82%. Using all criteria left 33 SNPs: 4 in NPR, 10 in PAR1, 10 in PAR2, and 9 in PAR3 ( S2 Note with SNPs ordered by the GRCh37 positions).

In contrast, on chromosome 7 rs78984847 (GRCh37 position = 72053830; sdMAF p-value = 3.5e-18) has higher MAFs in females than males in all five super-populations. The population-stratified sdMAF p-values are consistently small, 7.66e-5, 3.06e-4, 1.76e-5, 7.52e-3, and 1.79e-6 respectively in EAS, EUR, AFR, AMR and SAS, but not genome-wide significant as a result of reduced sample sizes. In addition, HWE p-values are much smaller in females than males for all five super-populations, with excess of heterozygous females; the HWE p-values in females are <5e-8 in EUR, AFR and SAS, and 2.92e-7 in EAS and 1.48e-5 in SAS ( S1 Data ). There is also evidence for departure from HWE in males; the HWE p-values in males are 2.02e-2, 1.67e-3, 1.21e-4, 2.57e-4 and 1.57e-2 respectively in EAS, EUR, AFR, AMR, and SAS, with excess of CC males. BLAST of a 100 nucleotide sequence centred on this SNP identified multiple close matches to other chromosomes, including the X chromosome.

To obtain better insight into the nature and source the sdMAF on the autosomes, we then further examined six SNPs, two from each of the three autosomes with the smallest sdMAF p-values ( S1 Data ). For example, on chromosome 1 rs10803097 (GRCh37 position = 243050350; sdMAF p-value = 2.15e-25) has higher MAFs in males than females in all five super-populations, but the population-specific sdMAFs are only genome-wide significant in EAS and AFR ( S1 Data ) while sdMAF p-value = 0.12 in SAS, suggesting heterogeneity in sdMAF between the super-populations at this autosomal SNP. The HWE testing are genome-wide significant in males in all super-populations (except SAS) with excess of heterozygous males, but not in females. A BLAST [ 28 ] search of a 100 nucleotide sequence flanking this SNP identified perfect match to sequence on the NPR region of the Y chromosome (GRCh38 position:11786038), with the Y chromosome having the alternate allele at the SNP, suggesting that it is a Paralogous Sequence Variant [ 29 ].

Chromosomes 1, 7 and 22 were selected as the longest, most similar in length to the X chromosome, and one of the shortest. There were 530,434, 406,057 and 97,216 biallelic SNPs with global MAF ≥5% on these chromosomes, respectively. Manhattan plots of sdMAF p-values ( S15 – S17 Figs), as well as the histograms and QQ plots ( S18 Fig ), show that there are very few SNPs on these autosomes with genome-wide significant sdMAF. Specifically, only 6, 1 and 0 SNPs on chromosomes 1, 7 and 22 had sdMAF p-values <5e-8, respectively. These numbers are significantly lower than that found for the X chromosome.

We performed a sliding window analysis, using a window size of 50 SNPs, sliding 25 SNPs at a time using the mean of the single SNP–log(10) p values. This identified ten regions across the X chromosome, of which the three at the boundaries of the PARs with NPR remain ( S14 Fig ).

Additionally, without requiring the sdMAF testing to be genome-wide significant, S13 Fig shows Bland-Altman plots, stratified by the four regions, for X chromosome SNPs with different minor alleles between males and females. Among this set of SNPs, only one SNP, in PAR2 (POS = 154941870), has close to sex-specific genotypes: A1A1 = 0, A1A2 = 20, A2A2 = 1251 in females, and A1A1 = 9, A1A2 = 1219, A2A2 = 5 in males, and has genotype-wide significant sdMAF ( S2 Data ).

For the PAR1 SNP in the 2,697,599 position where HWE testing could be performed in both females and males, HWE testing in males are all genome-wide significant, but in females the HWE p-values >0.8; consistent results were observed for the other three SNPs in PAR1 and PAR2 ( S1 Data ). In addition, a SNP can be monomorphic in females (HWE p-value = NA in that case), while in the male sample the heterozygous genotype may be present for some of the five super-populations.

The population-specific HWE testing results [ 25 – 27 ], however, vary across the eight SNPs or across the five super-populations. For example, for rs6634333 in NPR, the HWE testing p-values in females are genome-wide significant in all but SAS: 1.71e-30, 3.21e-34, 1.09e-18, 1.97–22, and 2.41e-4 respectively in EAS, EUR, AFR, AMR and SAS. The corresponding population-specific HWD delta estimates are all negative, with an excess of heterozygous females. For rs201194898 in NPR and the two PAR3 SNPs, population-stratified HWE testing in females are genome-wide significant, with excesses of heterozygous females in all five super-populations.

For the PAR1 SNP at 2,697,599, the super-population sdMAF p-values are all <1e-200 (except 2.69e-80 in AMR); the corresponding female minus male sdMAFs are all more extreme than −0.46 (−0.34 in AMR). Similar results were observed for the PAR2 SNP at 154,934,295, for which sdMAF p-values <1e-200 in all five super-populations: MAFs are significantly smaller in females than males. In addition, all females in EAS, EUR and AMR are homozygous TT, while males in EAS, AFR and AMR are all AT heterozygotes. In other words, genotype is fixed by sex in EAS and AMR.

For the eight selected SNPs, two from each of the four regions (NPR, PAR1, PAR2, and PAR3) with the smallest sdMAF p-values in the combined sample, the population-specific sdMAF p-values remain genome-wide significant ( S1 Data ). Moreover, the directions of the sdMAF are consistent across the five super-populations. For example, for rs201194898 in NPR (GRCh37 position = 9,377,082), the sdMAF p-values are <1e-200, 8.82e-149, 1.12e-49, 2.50e-89, 2.91e-45, and 1.95e-101, respectively in the ALL (combined) and the EAS, EUR, AFR, AMR, and SAS super-populations; the corresponding female minus male sdMAFs are all >0.27. That is, the MAFs of rs201194898 are significantly larger in females than in males, across all five super-populations. Note that the minor allele, defined based on sex- and population-combined sample, may not have MAF less than 0.5 in a sex- and/or population-stratified sample, and for a SNP in NPR and PAR3 each male only contributes a single allele to the MAF calculation. Similar results are reported in S1 Data for rs6634333 in NPR, as well as the two PAR3 SNPs in the 88,460,295 and 88,462,611 positions (GRCh37).

Regions are plotted separately A: NPR; B: PAR1, C: PAR2; D: PAR3. For each of the four regions, the histogram at the top of the Bland-Altman plot shows the distribution of the sex-combined MAF for bi-allelic SNPs with global MAF ≥5% presumed to be of high quality. The histogram to the right of the plot shows the distribution of the Female—Male sdMAF.

We then examined how the sdMAF relates to the sex-combined MAF. Fig 6 provides Bland-Altman plots [ 24 ] separately for each of the four regions. Of note, for NPR and PAR3 ( Fig 7A and 7D ), SNPs with significant sdMAF tend to have higher MAFs in females and predominantly had sex-combined MAFs in the range 25%-40%. In contrast, for PAR1 ( Fig 7B ), SNPs with significant sdMAF tend to have higher MAFs in males. Finally, for PAR2 ( Fig 7C ), there are sets of clustered SNPs with significant sdMAF in either direction ( Fig 7C ).

A: sdMAF p-values for bi-allelic SNPs with global MAF ≥5% presumed to be of high quality. Y-axis is −log10(sdMAF p-values) and p-values >0.1 are plotted as 0.1 (1 on −log10 scale) for better visualization. The dashed red line represents 5e-8 (7.3 on the −log10 scale). B: Female—Male sdMAF for the same SNPs in part A, clearly showing PAR3 SNPs with significant sdMAF tend to cluster at one of the NPR-PAR3 boundaries around 88.5 Mb.

A: sdMAF p-values for bi-allelic SNPs with global MAF ≥5% presumed to be of high quality. Y-axis is −log10(sdMAF p-values) and p-values >0.1 are plotted as 0.1 (1 on −log10 scale) for better visualization. The dashed red line represents 5e-8 (7.3 on the −log10 scale). B: Female—Male sdMAF for the same SNPs in part A, clearly showing PAR2 SNPs with significant sdMAF tend to cluster at the NPR-PAR2 boundary around 88.5 Mb.

A: sdMAF p-values for bi-allelic SNPs with global MAF ≥5% presumed to be of high quality. Y-axis is −log10(sdMAF p-values) and p-values >0.1 are plotted as 0.1 (1 on −log10 scale) for better visualization. The dashed red line represents 5e-8 (7.3 on the −log10 scale). B: Female—Male sdMAF for the same SNPs in part A, clearly showing PAR1 SNPs with significant sdMAF tend to cluster at the NPR-PAR1 boundary around 2.6 Mb.

Fig 2B shows that the direction and magnitude of the sdMAF varies by genomic location. Among the analysed SNPs, 0.39%, 16.88%, 1.54% and 1.18% of the SNPs have absolute sdMAF ≥0.05, respectively in PAR1, PAR2, PAR3, and NPR. For these SNPs, the medians of absolute(sdMAF) are respectively 0.095, 0.121, 0.074, and 0.077, and the [Q1,Q3] are [0.060,0.274], [0.064,0.286], [0.060,0.099], and [0.059,0.111], respectively, in PAR1, PAR2, PAR3, and NPR regions. Of the 2,039 SNPs with significant sdMAF we observed females generally having higher MAF than males: NPR = 93%, PAR2 = 59% and PAR3 = 86% among the SNPs with genome-wide significant sdMAF, except for PAR1 = 31% (Figs 4 – 6 ). Specifically, females have higher MAF at around 30 Mb (GRCh37) and at the q-arm of the centromere ( Fig 2B ), as well as at the centromeric boundary of PAR3 ( Fig 6 ). In contrast, at the region of PAR1 close to the NPR boundary, males tend to have higher MAF among the SNPs with significant sdMAF ( Fig 4 ). Finally, there are sets of PAR2 SNPs close to the NPR boundary that have higher MAF in females, while other sets of SNPs have higher MAF in males ( Fig 5 ).

The 8 SNPs were selected (two each from the 4 regions: PAR1, PAR2, PAR3 and NPR) with the smallest sdMAF p values (see text). Analysis was performed separately for the 26 populations and sdMAF (female-male, blue circles) as well as their corresponding -log 10 (p value) (black triangles) are plotted on the left and right Y axes, respectively. Populations are labelled on the X axis using the conventional 1000 Genomes Project codes, and grouped by super-population. Physical position is GRCh37 and rs provided where available. The red dashed horizontal line is a p = 5x10 -8 . SNPs in each part are: A position = 2697599 (PAR1); B position 2698923 (PAR1) and C position 154934295 (PAR2); D position 154936183 (PAR2); E position = 9377082 rs201194898 (NPR); F position = 88460295 (PAR3); G position = 88462611 (PAR3); H position = 140993859 rs6634333 (NPR).

Fig 3 shows that, as expected, the sdMAF results for these SNPs are consistent across the 26 populations. For example, for rs6634333 (POS = 140993859) from the NPR region ( Fig 3H ), the sdMAF estimate in the overall sample is 0.338 (p-value = 3.78E-151). While the 26 population-specific estimates are not identical, all 26 estimates are positive and range from 0.206 (p-value = 0.011) in STU to 0.443 (p-value = 2.90E-9) in ESN ( S5 Data ). Remarkable consistencies across populations are also observed for the other seven SNPs ( Fig 3 ).

Although unlikely, it is possible that the above continent-stratified analysis may fail to identify sdMAF SNPs with heterogeneity (in terms of signs of the sdMAF estimates) in sdMAF between populations. Thus, for the eight X chromosomal SNPs with the most salient results (as initially discussed in S1 Data ) from the 1000 Genomes phase 3 data, we further conducted a population-stratified sdMAF analysis, separately for each of the 26 populations.

In contrast, although the individual super-population (continent-stratified) analysis identified some SNPs with significant sdMAF that would be missed by the ALL analysis ( S2 Table ), the sdMAF p values are comparable as shown in S9 Fig . There is also a remarkable consistency in sdMAF estimates between the five super populations as shown in S10 Fig . We also performed a meta-analysis with sample-size based weighting of all five super-populations and compared the results to the primary ALL analysis (above). S11 Fig (Miami) and S12 Fig (pp plot) show that the results are consistent between the two analyses.

The top row of S9 Fig shows that the ALL analysis is overall much more powerful than the super-population-specific analysis, identifying many SNPs with significant sdMAF that would be missed by the individual super-population analysis, and often the sdMAF p value is several magnitudes smaller.

Our primary analysis (above) combined data from all five super-populations. We also performed analysis separately for each super-population to determine whether the effect magnitude and direction were consistent across super-populations. S4 – S8 Figs present the results for each super-population and show that, based on the comparisons of the p values, there are generally similar sdMAF patterns across the five super-populations. The consistency is also present for the magnitude and direction of sdMAF ( S9 – S10 Figs).

A: sdMAF p-values for bi-allelic SNPs with global MAF ≥5% presumed to be of high quality. SNPs in the PAR1, PAR2 and PAR3 regions are plotted in grey, with PAR3 located around 90 Mb. Y-axis is −log10(sdMAF p-values) and p-values >0.1 are plotted as 0.1 (1 on −log10 scale) for better visualization. The dashed red line represents 5e-8 (7.3 on the −log10 scale). B: Female—Male sdMAF for the same SNPs in part A. For Zoomed-in plots for the PAR1, PAR2 and PAR3 regions see Figs 4 , 5 and 6 , respectively.

Variants were placed into the NPR, PAR1, PAR2, and PAR3 regions based on positions available from The Genome Reference Consortium and [ 19 ]. For detailed counts of variant types and global MAF by regions, see S1 Table .

For biallelic SNPs with global MAF≥5% ( Fig 1 and S1 Table ), the Manhattan plot for sdMAF p-values ( Fig 2A ) shows that a non-negligible proportion of X chromosomal SNPs have significant sex difference in MAF even at the genome-wide significance threshold of p-value <5e-8: 0.83% of SNPs in NPR, 0.29% in PAR1, 13.1% in PAR2, and 0.85% in PAR3. The excesses of small p-values for sdMAF testing in all four regions of the X chromosome are also evident from both the QQ plots ( S2 Fig ) and histograms of the sdMAF p-values ( S3 Fig ). SNPs with significant sdMAF are located across the X chromosome but tend to cluster in specific regions ( Fig 2A ).

Discussion

Our initial sdMAF analysis focused on the 1000 Genomes Project phase 3 data since it has been examined extensively for association analysis, and is one of the most commonly used imputation panels for GWAS [31]. With the recent release of the high coverage data, we first compared results between the two phases for specific SNPs. In addition, we also performed a separate X chromosome-wide analysis of the 1000 Genomes Project high coverage data aligned to GRCh38 as well as mostly high-coverage whole genome sequence aligned to GRCh38 from two populations from gnomAD.

In the 1000 Genomes Project phase 3 data, focused on eight selected SNPs, we showed that the sdMAF was robust to population stratification either coming from super-population or population levels. Yet, we cannot exclude the possibility that there could be additional SNPs with sdMAF that we have missed. Similar concerns relate to the analysis of the gnomAD data, where our analysis was performed separately for the two major super-populations (likely defined by autosomal population structure), the results could be confounded by population structure and/or admixture within them. However, S10 Fig shows that sdMAF estimates are consistent between the five super populations across the whole X chromosome in the 1000 Genomes phase 3 data. Additionally, Fig 3 shows that the sdMAF estimates are extremely consistent between the 26 populations for the eight selected SNPs with the smallest sdMAF p value.

We identified two likely sources of sdMAF: genotyping error and sex-linkage. Genotyping error (which may in part be due to differences between GRCh37 and 38) accounts for many NPR and PAR3 sdMAF in phase 3 since they were mostly resolved in the high coverage data. However, sdMAF for some NPR and PAR3 SNPs remain in the high coverage data and are also present in gnomAD. Thus, despite the recent advances in how to analyze NPR SNPs [6], our findings here show that robust X chromosomal association methods must consider sdMAF caused by genotyping error and/or sex-specific selection. For example, for sex-dimorphic traits, others have suggested sex-specific analysis genome-wide [32]. Updated reference sequence along with alternate haplotypes for regions of the X and Y chromosomes may improve the mapping of sequence reads and genotype quality [33,34].

In contrast, the impact of sex-linkage in PAR1 and PAR2 results in most sdMAF identified in phase 3 persisting in the high coverage data. Sex-linkage at PARs has previously been discussed for linkage analyses using affected sibpairs [18], but it has not been examined in the context of association studies [16,17]. Multiple authors have stated that association methods routinely used for autosomes can be applied to the PARs [3–5]. However, our results indicate that association analyses at the PAR-NPR boundaries (and at the centromeric boundary of PAR3) should consider sdMAF caused by sex-linkage. The optimal region-specific solution is an open research question.

We searched the NHGRI-EBI GWAS catalog [35] and identified multiple signals at the PAR boundaries. These include loci: for BMI, multiple lipids, red cell traits, non-syndromic metopic craniosynostosis at the PAR2 boundary [36–39]; ANCA-associated vasculitis, Alzheimer’s disease, non-syndromic metopic craniosynostosis, susceptibility to TB, 3-hydroxy-1-methylpropylmercapturic acid levels and adenosine diphosphate at PAR3 [36,40–45]; age of onset of myopia, mean corpuscular volume, eosinophil count, and asthma [43][39,46,47] close to PAR1/NPR boundary. Apart from two studies [40,45], none of others provided sex-specific results, making it difficult to determine the effect of sdMAF on these associations.

The PAR1-NPR boundary is in intron 3 of XG. The genetic basis of the Xga blood group has recently been studied [48–50] with groups independently identifying rs311103 in PAR1 as the potential causal variant. Other work has also suggested that a large deletion that spans PAR1-NPR could also be a separate causal variant for Xga [51]. Polymorphic duplications of the Y chromosome neighbouring PAR1, as well as a deletion of the overlapping region on the X chromosome have been characterized [52]. Since these CNVs are relatively rare, e.g. as described in gnomAD v 2.1 (Web Resources) [53], they are unlikely to result in the major sdMAF seen at the PAR1-NPR boundary.

A small proportion (3%) of sdMAF in phase 3 were in the 19 X/Y homologs. Further examination of allele frequencies of the Y chromosome is interesting but beyond the scope of this work.

It is unlikely that strand flips are a major cause of the difference in genotype by sex, since variant calling was performed blind to sex and typically joint-called, especially for the high coverage 1000 Genomes Project data [54] and the gnomAD data [55]. Further, inspection of Integrative Genomics Viewer (IGV) plots in gnomAD v3.1.2 shows that there are few discrepancies of sequencing reads around these SNPs, making it unlikely that they are strand flips. Sex-specific quality metrics for variants on the X chromosome could help identify variants with technical interference.

Of note, for many variants with sdMAF, there was also deviation from Hardy-Weinberg equilibrium [27] in each of the super-populations in females (NPR and PAR3) and males (PAR1 and PAR2) (S1 Data). HWD testing of variants in the 1000 Genomes phase 3 data has been examined separately in the JPT and YRI populations [56,57]. SNPs with missing rate <5% and possessing an rs identifier were used. The earlier work [56] found lower rates of deviation from HWE on the X chromosome (after exclusion of PAR1 and PAR2) than on any of the autosomes, but the sample size for the X chromosome is ~3/4 of that for the autosomes. Results from our sdMAF analysis also calls for new X chromosome-aware HWD methods that consider sdMAF.

Earlier work has examined HWD of X chromosomal variants (59), using a 2 degrees of freedom (df) Pearson Chisq-based test that jointly analyzes both females and males. The test includes, in addition to the female data, the deviation of male genotype counts from the expected, based on pooled allele frequency estimate using both male and female data. However, it has been shown that this 2 df HWD test is equivalent to testing for HWD in females alone and simultaneously testing for sdMAF between the sexes [58]. Because of this confounding between HWD and sdMAF, for variants in the PAR1 and PAR2 regions, we performed the HWD analysis stratified by the sex. And for NPR and PAR3 variants, we performed the HWD analysis in females only; the males have two hemizygous genotypes, leaving no df to perform the HWD test after using one df to estimate the MAF in males.

In addition to the 2 df Chisq test, the work of [57] also proposed and recommended an Exact test, jointly analyzing both females and males. Although the Exact test does not specify the degrees of freedom, our preliminary work suggests that it is similar to the 2 df Chisq test in spirit, because the Exact test includes males and is derived conditional on the “number of A alleles (nA) and the observed male and female frequencies.” This could lead to increased type I error HWD test in the presence of true sdMAF. For example, consider a SNP from the NPR region with significant sdMAF but no deviation from HWE in females: female MAF = 0.2 with genotype counts of (640, 320, 40) and male MAF = 0.5 with genotype counts of (500, NA, 500). In that case, the sex-combined EXACT test p = 6.41E-61 (using the R program HardyWeinberg v1.7.4 [59]), but this highly significant Exact HWD testing result is due to the significant sdMAF alone. When the HWD analysis is performed using only females and in the presence of small sample, the Exact test can be advantageous [56,57].

Beyond the X chromosome, joint and separate analyses of sdMAF and HWD have also been performed for autosomal variants [60]. Interestingly, an application to autosomal SNPs using the 104 individuals from the JPT sample of the 1000 Genomes Project showed that, in the presence of sdMAF, true HWD may go unnoticed if applying the standard HWD testing using both males and females.

How to solve the visualization of Manhattan/Miami plots for the X chromosome stratified by sex, while also highlighting the four regions, is a worthy challenge for bioinformaticians. Additionally, as both the association and sdMAF analyses differ between regions, it is crucial to assign variants at the NPR-PAR boundaries to the correct region.

Since the 2013 call for X chromosome-inclusive GWAS [1] several methods have been developed, all focused on robustifying the association analysis against the well-known phenomenon of X-inactivation uncertainty [2,3,5–12]. However, the sex differences in minor allele frequencies phenomenon documented here could affect the validity of existing X-inactivation-aware association methods against spurious sdMAF due to, for example, genotyping error. As sex is typically included in X chromosomal association analysis, it is reasonable to assume that sdMAF is accounted for through the inclusion of sex as covariate. However, further research is needed particularly for studies of traits displaying significant sexual dimorphism. For those traits, there could be a sex-ratio difference between the cases and controls, and this combined with sdMAF could have implications for association studies of the X chromosome. Finally, sdMAF at the NPR-PAR boundaries is likely a biological phenomenon. Thus, how to leverage true sdMAF to increase association power is also an open research question.

[END]
---
[1] Url: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010231

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.

via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/