A field-wide assessment of differential expression profiling by high-throughput sequencing reveals widespread bias [1]

Taavi Päll (Institute of Biomedicine and Translational Medicine, University of Tartu, Tartu), Hannes Luidalepp (Quretec), Tanel Tenson (Institute of Technology), Ülo Maiväli

Date: 2023-03

We assess inferential quality in the field of differential expression profiling by high-throughput sequencing (HT-seq), based on analysis of datasets submitted from 2008 to 2020 to the NCBI GEO data repository. We take advantage of the parallel differential expression testing over thousands of genes, whereby each experiment leads to a large set of p-values, the distribution of which can indicate the validity of assumptions behind the test. From a well-behaved p-value set, π0, the fraction of genes that are not differentially expressed, can be estimated. We found that only 25% of experiments resulted in theoretically expected p-value histogram shapes, although there is a marked improvement over time. Uniform p-value histogram shapes, indicative of <100 actual effects, were extremely few. Furthermore, although many HT-seq workflows assume that most genes are not differentially expressed, 37% of experiments have π0-s of less than 0.5, as if most genes changed their expression level. Most HT-seq experiments have very small sample sizes and are expected to be underpowered. Nevertheless, the estimated π0-s do not have the expected association with N, suggesting widespread problems with controlling the false discovery rate (FDR). Both the fractions of different p-value histogram types and the π0 values are strongly associated with the differential expression analysis program used by the original authors. While we could double the proportion of theoretically expected p-value distributions by removing low-count features from the analysis, this treatment did not remove the association with the analysis program. Taken together, our results indicate widespread bias in the differential expression profiling field and the unreliability of the statistical methods used to analyze HT-seq data.

Funding: This work was supported by the European Regional Development Fund through the Centre of Excellence in Molecular Cell Engineering (2014-2020.4.01.15-0013 for ÜM and TT) and by grants from the Estonian Research Council (PRG335 for ÜM and TT; PUT1580 for TP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. The code to produce the raw dataset is available in the rstats-tartu/geo-htseq GitHub repo ( https://github.com/rstats-tartu/geo-htseq ). The raw dataset produced by the workflow is deposited in Zenodo https://zenodo.org with doi: 10.5281/zenodo.7529832 ( https://doi.org/10.5281/zenodo.7529832 ). The code to produce the article’s figures and models is deposited in the rstats-tartu/geo-htseq-paper GitHub repo ( https://github.com/rstats-tartu/geo-htseq-paper ). Individual model objects are deposited in G-Node with doi: 10.12751/g-node.p34qyd ( https://doi.org/10.12751/g-node.p34qyd ). Code and workflow used to run and analyze RNA-seq simulations are deposited in Zenodo with doi: 10.5281/zenodo.4463804 ( https://doi.org/10.5281/zenodo.4463804 ). Processed data, raw data, and the workflow of the RNA-seq simulations, with the input fasta file, are deposited in Zenodo with doi: 10.5281/zenodo.4463803 ( https://doi.org/10.5281/zenodo.4463803 ).

We concentrate on the field of differential expression profiling by high-throughput sequencing (HT-seq DE, mostly RNA-seq) for 2 reasons. First, HT-seq has become the gold standard for whole transcriptome gene expression quantification in research and clinical applications [25]. Second, due to the massively parallel testing in individual studies of tens of thousands of features per experiment, we can access study-wide lists of p-values. From the shapes of histograms of p-values, we can identify the experiments where p-values were apparently calculated correctly, and from these studies, we can estimate the study-wise relative frequencies of true null hypotheses (the π0-s). Also, we believe that the very nature of the HT-seq DE field, where each experiment compares the expression levels of about 20,000 different features (e.g., RNAs) on average, dictates that the quality of data analysis, and specifically of statistical inference based on p-values, must play a decisive part in scientific inference. Simply put, one cannot analyze an HT-seq DE experiment intuitively, without resorting to formal statistical inference. Therefore, quality problems in statistical analysis would very likely directly and substantially impact the quality of the science. Thus, we use the quality of statistical analysis as a proxy for the quality of science, with the understanding that this proxy may work better for modern data-intensive fields, where a scientist’s intuition has a comparatively minor role to play.

While our work uses published p-values as evidence for field-wide quality and presupposes access to unbiased complete sets of unadjusted p-values, it does not pool the p-values across studies, nor does it assume that they were correctly calculated. In fact, we assume the opposite and do a study-by-study analysis of the quality of calculating p-values. This makes the quality of the p-value a proxy for the quality of the experiment and the scientific inferences based on these p-values. We do not see our estimate of the fraction of poorly calculated p-values as a formal quality metric but merely hope that by this measure, we can shed some light on the overall quality of a field.

Another proxy for quality can be found in published p-values, especially in their distribution [21]. In a pioneering work, Jager and Leek extracted ca. 5,000 statistically significant p-values from abstracts of leading medical journals and pooled them to formally estimate, from the shape of the ensuing p-value distribution, the science-wide false discovery rate (SWFDR) at 14% [22]. However, as this estimate rather implausibly presupposes that the original p-values were calculated uniformly correctly and that unbiased sets of significant p-values were obtained from the abstracts, they subsequently revised their estimate of the SWFDR upwards, to “likely not >50%” [23]. For observational medical studies, a different method put a plausible estimate of the field-wide false discovery rate (FDR) somewhere between 55% and 85%, depending on the study type [24].

Another way to assess the large-scale quality of a science is to employ surrogate measures of quality that are more easily obtained than a full replication of a study. The most commonly used measure is technical reproducibility, which involves checking for code availability and running the original analysis code on the original data. Although the evidence base for reproducibility is still sketchy, it seems to be well below 50% in several fields of biomedicine [18]. However, as there are many reasons why a successful reproduction might not indicate good quality of the original study, or why an unsuccessful reproduction might not indicate bad quality, the reproducibility criterion is insufficient.

Over the past decade, a feeling that there is a crisis in experimental science has increasingly permeated the thinking of methodologists, captains of industry, working scientists, and even the lay public [1–6]. This manifests in poor statistical power to find true effects [7, 8], in poor reproducibility (defined as getting identical results when reanalyzing the original data by the original analytic workflow), and in poor replicability (defined as getting similar results after repeating the entire experiment) of the results [9]. The proposed reasons behind the crisis include sloppy experimentation, selective publishing, perverse incentives, difficult-to-run experimental systems, insufficient sample sizes, overreliance on null hypothesis testing, and much-too-flexible analytic designs combined with hypothesis-free study of massively parallel measurements [10–15]. Although there have been attempts at assessing experimental quality through replication of experiments, prohibitive costs and theoretical shortcomings in analyzing concordance of experimental results have encumbered this approach in biomedicine [16–19]. However, results from a recent replication of 50 experiments from 23 high-profile preclinical cancer studies indicate that ca. 90% of effect sizes were overestimated and that over half of the published effects were either in the wrong direction or were wrongly assigned to the non-null category [20].

Fig 7. (A–F) Sankey charts of the transformation of p-value histogram shapes upon removal of low-count features. Ribbon size is linearly proportional to the number of p-value sets that change their distributional class. Only the 3,426 experiments that could be subjected to this treatment are depicted. (A) Complete data, N = 3,426. (B) The subset where the p-values were calculated with cuffdiff, N = 1,116. (C) The subset where the p-values were calculated with DESeq, N = 252. (D) The subset where the p-values were calculated with DESeq2, N = 1,114. (E) The subset where the p-values were calculated with edgeR, N = 515. (F) The subset where the p-values were calculated with limma, N = 73. (G) Posterior summaries of anti-conservative p-value histogram proportions in raw and filtered p-value sets. Filtered p-value data are from a Bernoulli model [anticons ~ de_tool], N = 3,426. The data files are in S13 Data and S14 Data (for raw data). (H) Effect sizes, in percentage points, of low-count feature filtering on the proportion of anti-conservative p-value histograms. The data files are in S13 Data and S14 Data (for raw data). (I) Posterior summaries of π0 values of p-value histograms in raw and filtered p-value sets. Filtered p-value data are from a beta model [pi0 ~ de_tool], N = 2,042. The data files are in S15 Data and S16 Data (for raw data). (J) Effect sizes, in π0 units (percentage points), of low-count feature filtering on π0. The data files are in S15 Data and S16 Data (for raw data). The model object related to filtered p-value sets in panel G can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons_detool_filtered.rds . The model object related to filtered p-value sets in panel I can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool_full_data_filtered.rds . See S16 Fig for all platforms.

We found that overall we could increase the proportion of anti-conservative p-value histograms 2.4-fold, from 844 (24.6%) to 2,022 (59.0%), and the number of uniform histograms 2.5-fold, from 8 (0.23%) to 20 (0.6%) (Fig 7A). For all analysis platforms, most rescued p-value distributions came from the classes “bimodal” and “other,” while almost no rescue was detected from conservative histograms (Fig 7B–7F). After removing low-count features, the proportion of anti-conservative p-value histograms increased for all analysis platforms, with the largest effects observed for cuffdiff and DESeq, which presented the lowest pre-rescue fractions of anti-conservative p-value histograms (Fig 7G and 7H). Nevertheless, substantial differences between analysis platforms remain, indicating that the removal of low-count features, while generally beneficial, was insufficient to entirely remove the sources of bias originating from the analysis platform. Also, the π0-s calculated from the rescued anti-conservative p-value sets have very similar distributions to the π0-s from the pre-rescue anti-conservative p-value sets and, concomitantly, a very similar dependence on the analysis platform (Fig 7I and 7J).

We observed a slight reduction in the mean number of p-values per GEO experiment for anti-conservative and uniform histograms compared to the other p-value histogram classes, suggesting that p-value sets with anti-conservative and uniform shapes are more likely to have been pre-filtered, or to have been pre-filtered more extensively (S1 Fig). Accordingly, we speculated that by further filtering out features with low counts, we could convert some of the untoward p-value histograms into anti-conservative or uniform types. Our goal was not to provide optimal interventions for individual datasets, which would require tailoring the filtering algorithm to the requirements of a particular experiment. We merely aim to provide proof of principle that a simple filtering approach can increase the proportion of anti-conservative p-value sets and/or reduce the dependence of results on the analysis platform. Therefore, we applied filtering to the 3,426 p-value sets for which we could identify gene expression values (see Methods for details on selecting filtering thresholds).
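To make the filtering step concrete, here is a minimal sketch in Python (our illustration, not the authors’ pipeline; the cutoff of 10 and all names are ours, whereas the paper selected thresholds in a data-driven way, see Methods) of dropping low-count features before re-examining a p-value set:

    import numpy as np

    def refilter_pvalues(counts, pvalues, min_mean_count=10):
        # counts: features x samples matrix of raw counts;
        # pvalues: one unadjusted p-value per feature.
        # Keep only features whose mean raw count reaches the cutoff;
        # the surviving p-values can then be re-binned and re-classified.
        counts = np.asarray(counts, dtype=float)
        keep = counts.mean(axis=1) >= min_mean_count
        return np.asarray(pvalues)[keep]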

Fig 6. (A) Calculated π0-s (on the y-axis) from simulated data vs. the given “true” proportions of DE features (on the x-axis). Sample sizes are indicated by the color code. The dotted line shows perfect correspondence between the given and the estimated π0. The data file is in S11 Data. (B) Dependence of the mean π0 on binned sample size, with 95% CI. Robust linear model [pi0 ~ N], Student’s likelihood. Points denote best estimates for the mean, and thick and thin lines denote 66% and 95% credible intervals, respectively. The data file is in S12 Data. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0%20~%20N.rds .

According to statistical theory, the power of an experiment depends linearly on the square root of the sample size. An underpowered experiment yields a smaller number of near-zero p-values, leading to overestimation of π0 at small N-s, especially at N = 2, as confirmed by simulation (Fig 6A). Thus, reducing the power will ultimately convert an anti-conservative p-value distribution into a uniform one. However, we not only have very few uniform p-value distributions in our empirical data but also few near-one estimated π0-s that could be construed as manifestations of low power (Fig 3A). We looked into this further by plotting the π0 values against sample size (Fig 6B). Instead of the expected decrease of the mean π0 with increasing sample size, we see a slight increase of π0-s at sample sizes >5, while the overall correlation is negligible (r = 0.06; 95% CI [0.007, 0.12]). Thus, there does not appear to be a sample size-dependent bias in the π0-s, as estimated from the GEO-submitted p-value distributions. As the true power of most small-sample DE RNA-seq experiments is expected to be low, this result leads us to further question the statistical adequacy of the underlying p-values, of which the π0-s are but single-number summaries. These results show that the statistical methods used in HT-seq DE analysis tend to produce highly suspect output even when the p-value distribution is anti-conservative.
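The expected small-N bias is easy to reproduce in a toy simulation (our sketch on idealized normal data, not the paper’s RNA-seq simulation; the true π0 of 0.7 and the effect size are arbitrary choices). At small N, the test has little power, the non-null p-values stay nearly uniform, and the estimated π0 is pushed towards 1:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def pi0_hat(p, lam=0.5):
        # Storey-type estimate: true-null p-values are uniform, so the
        # density above `lam` estimates the null fraction.
        return min(1.0, np.mean(p > lam) / (1.0 - lam))

    def simulate(n, m=20000, pi0_true=0.7, effect=1.5):
        # Two groups of n samples per gene; a (1 - pi0_true) fraction of
        # genes gets a mean shift of `effect` standard deviations.
        shift = np.where(rng.uniform(size=m) < pi0_true, 0.0, effect)
        x = rng.normal(0.0, 1.0, size=(m, n))
        y = rng.normal(shift[:, None], 1.0, size=(m, n))
        p = stats.ttest_ind(x, y, axis=1, equal_var=False).pvalue
        return pi0_hat(p)

    for n in (2, 3, 5, 10):
        print(n, round(simulate(n), 2))  # pi0_hat falls towards 0.7 as n grows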

These results indicate a clear causal link between sample size and technical quality of most current workflows of DE analysis. Thus, a low power not only leads to missed discoveries and overestimated DE-s, as predicted by theory [ 35 ], but it also seems to be destructive towards the algorithms that calculate these DE-s.

Fig 5. (A) Increasing sample sizes are associated with an increased fraction of anti-conservative p-value distributions. Logistic model [anticons ~ N], Bernoulli likelihood, N = 2,392. The data file is in S9 Data. (B) Non-anti-conservative p-value distributional classes are more likely to have extremely low sample sizes of 1 or 2 and less likely to have sample sizes of 4 or more. The y-axis denotes the proportion of experiments with a given sample size, calculated from the total number of experiments in a given p-value distributional class. Logistic model [n | trials(total_in_class) ~ Class + Nb + Class:Nb], binomial likelihood, N = 2,392. Points denote best estimates for the mean, and thick and thin lines denote 66% and 95% credible intervals, respectively. The data file is in S10 Data. The model object related to panel A can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons__N.rds . The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/n%20%7c%20trials%28nn%29__Class%20+%20Nb%20+%20Class:Nb.rds .

Experiments that did not result in anti-conservative p-value distributions were less likely to have sample sizes larger than 3 and more likely to have sample sizes of 1 or 2 than the experiments that did (Fig 5B). This effect was especially pronounced for experiments with conservative p-value distributions, of which 52% had sample sizes of 1 or 2, as compared with 26% for anti-conservative p-value distributions. Overall, the experiments resulting in anti-conservative p-value distributions had a slightly larger mean sample size (3.5 versus 2.7; p = 10^-9).

There is a clear trend, shown in Fig 5A, whereby very small samples of 5 or fewer lead to progressively smaller fractions of anti-conservative p-value distributions. The probability of obtaining an anti-conservative p-value distribution is only around 0.1 when the sample size is 1; it increases to 0.3 at sample size 3 and further to about 0.5 at sample size 5. We could detect no further increase in this probability as the sample size grew (but note that the vast majority of experiments have sample sizes of 4 or less). As our simulations indicate that reasonably powered (power around 0.7 to 0.8) experiments start at sample sizes of 4 to 5 for multicellular systems, this result suggests that existing DE analysis algorithms may work poorly on data from underpowered experiments.

Are the observed sample sizes sufficient to convey reasonable power to DE RNA-seq experiments? Simulations on idealized data with natural variation and effect sizes give N = 2 experiments 30% to 60% power and N = 3 experiments 50% to 70% power for human/mouse experiments, where higher π0-s are associated with lower power (Fig 4B). The power of real-life experiments is likely to be even lower, owing to the imperfect match between the data and the statistical model used to analyze it [34]. Thus, assuming that realistic levels of biological variation are captured in the samples, most DE RNA-seq experiments conducted in eukaryotic systems must be underpowered. This conclusion is consistent with the methodological literature (see Discussion).
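To illustrate where such power figures come from, the following crude simulation sketch (ours; the 2-fold change and the biological coefficient of variation of 0.4 are assumed values, not estimates from GEO data) runs per-gene two-group tests on log-scale expression with lognormal biological variation:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    def power_sim(n, m=5000, fold=2.0, bio_cv=0.4, alpha=0.05):
        # Every simulated gene carries a true `fold` change; power is the
        # fraction detected at nominal alpha (no multiplicity correction).
        sigma = np.sqrt(np.log1p(bio_cv ** 2))  # lognormal sigma from the CV
        base = rng.normal(0.0, sigma, size=(m, n))
        treated = rng.normal(np.log(fold), sigma, size=(m, n))
        p = stats.ttest_ind(base, treated, axis=1, equal_var=False).pvalue
        return np.mean(p < alpha)

    for n in (2, 3, 5, 10):
        print(n, round(power_sim(n), 2))  # power grows steeply with n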

We assigned sample sizes to 2,393 GEO submissions (see Materials for details). From these, 91% had sample sizes of 4 or less, 25% had an N of just 2, and 12% used a single sample from which to calculate p-values, thereby entirely ignoring biological variation (Fig 4A). Only 1% of experiments had sample sizes over 10.

In addition, controlling for time, taxon, or sequencing platform did not substantially change the association of DE analysis tools with the π0-s (S6C–S6F Fig). Recalculating the π0-s with a global FDR algorithm [32] did not lead to a substantially different π0 distribution, although there is a slight directional shift: the mean π0 moves from 0.58 with the local FDR method to 0.54 with the global FDR method (S7 Fig). As there is a strong association of both π0 and the proportion of anti-conservative p-value histograms with the DE analysis tool (Figs 2 and 3), we further checked for, and failed to see, similar associations with variables from raw sequence metadata, such as the sequencing platform, library preparation strategies, library sequencing strategies, library selection strategies, and library layout (single or paired) (S8–S15 Fig). These results support the conjecture that the associations are specific to DE analysis tools.

In terms of experimental design, to get a low π0 an experiment would have to substantially change the expression of most genes under study. A significant source of such experiments would be comparisons of different cancer cell lines/tissues, where π0 = 0.4 could be considered a reasonable outcome [31]. We therefore compared the π0-s coming from GEO HT-seq submissions related to the search terms “neoplasms” or “cancer” (for the exact query string, please see Methods) with all other, non-cancer submissions. There is very little difference in the means and standard deviations of π0-s for cancer and non-cancer experiments (0.58 (0.22) and 0.59 (0.24), respectively) (Fig 3C). Filtering by studies mentioning transcription factors (TFs) led to very similar results (Fig 3D), suggesting that the wide dispersion of π0-s is not caused by intentional experimental designs. Also, studies involving cancer and TFs resulted in anti-conservative or uniform p-value distributions with probabilities similar to non-cancer/non-TF studies (risk ratios with 95% CI are 0.95 (0.83; 1.07) and 0.99 (0.74; 1.28), respectively).

Fig 3. (A) Histogram of π0 values estimated from anti-conservative and uniform p-value sets. N = 1,188. The data file is in S3 Data. (B) Robust linear model [pi0 ~ de_tool, beta likelihood] indicates an association of π0 with the DE analysis tool. Points denote best estimates for the mean π0, and thick and thin lines denote 66% and 95% credible intervals, respectively. N = 1,188. The data file is in S4 Data. (C) Histogram of π0 values in GEO cancer studies compared to non-cancer studies. The data file is in S5 Data. (D) Histogram of π0 values in GEO transcription factor studies compared to non-TF studies. The data file is in S6 Data. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool_sample.rds .

As non-anti-conservative sets of p-values (excepting the “uniform”) already indicate low quality of both the π0 estimate and the ensuing FDR control [30], we calculated the π0 only for datasets with anti-conservative or uniform p-value distributions (N = 1,188). Nevertheless, the π0-s show an extremely wide distribution, ranging from 0.02 to 0.999. Remarkably, 37% of the π0 values are smaller than 0.5, meaning that, according to the calculated p-values, in those experiments over half of the features are estimated to change their expression levels upon experimental treatment (Fig 3A). Conversely, only 23% of π0-s exceed 0.8, and 9.9% exceed 0.9. The peak of the π0 distribution is not near 1, as might be expected from experimental design considerations; instead, there is a broad peak between 0.5 and 0.8 (the median and mean π0 are both 0.59). Depending on the DE analysis tool, the mean π0-s range over 20 percentage points, from about 0.45 to 0.65 (Fig 3B; see S6A Fig for all DE analysis tools). Using the whole dataset confirms the robustness of this analysis (S6B Fig).

To further enquire into DE analysis tool-driven bias, we estimated from user-submitted p-values the fraction of true null effects (the π0) for each HT-seq experiment. The π0 is a statistic calculated solely from the p-values, and it is routinely used as an intermediate quantity needed to fix the FDR at a desired level [29]. The quality of FDR control depends on the quality of the π0 estimate, which in turn depends on the quality of the underlying p-values. Thus, the π0 can be seen as an estimate of the true proportion of null effects, or as a single-number summary of the p-value distribution; in the latter capacity, it can be used as a quality check of the p-values used for FDR control of the experiment.
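For intuition, here is a minimal sketch of how a π0 estimate plugs into FDR control (our simplification of the Storey q-value approach with a fixed lambda; the paper relied on established local/global FDR implementations, so this is a stand-in, not their code):

    import numpy as np

    def storey_pi0(pvalues, lam=0.5):
        # True-null p-values are uniform on [0, 1], so above `lam` the
        # histogram should contain (almost) only nulls; the density there
        # estimates pi0, the fraction of true null hypotheses.
        p = np.asarray(pvalues)
        return min(1.0, np.mean(p > lam) / (1.0 - lam))

    def qvalues(pvalues, pi0):
        # Benjamini-Hochberg-style step-up, rescaled by the pi0 estimate.
        p = np.asarray(pvalues)
        m = p.size
        order = np.argsort(p)
        scaled = p[order] * pi0 * m / np.arange(1, m + 1)
        # Enforce monotonicity from the largest p-value downwards.
        q = np.minimum.accumulate(scaled[::-1])[::-1]
        out = np.empty(m)
        out[order] = np.minimum(q, 1.0)
        return out

A biased π0 propagates directly into the q-values, which is why a p-value set whose shape already invalidates the π0 estimate cannot support trustworthy FDR control.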

Fig 2. (A) Time courses of the proportions of different p-value histogram classes for the 5 most frequently used DE analysis platforms. Lines denote the best fit of the model [class ~ year + (year | de_tool), categorical likelihood]. Shaded areas denote 95% credible regions. N = 4,616. The data file is in S1 Data. (B) Association of p-value histogram type with DE analysis tool; data are restricted to 2018–2020 GEO submissions. Points denote the best fit of the model [n | trials(total in de_tool) ~ class + de_tool + class:de_tool, binomial likelihood]. Thick and thin lines denote 66% and 95% credible intervals, respectively. N = 2,930. The model object related to panel A can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/Class_year__year_detool_year.rds . The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.2/models/n__trials%28total_in_de_tool%29__Class_de_tool_Class:de_tool_2018up.rds . The data file is in S2 Data. See S3 Fig for the full set of identified DE analysis tools.

This positive temporal trend in anti-conservative p-value histograms suggests improving quality in the HT-seq DE field. Somewhat surprisingly, Fig 2A also indicates that different DE analysis tools are associated with very different proportions of p-value histogram classes, suggesting that the quality of p-value calculation, and therefore the quality of scientific inferences based on the p-values, depends on the DE analysis tool. We further tested this conjecture in a simplified model, restricting our analysis to 2018 to 2020, the final years in our dataset (Fig 2B). As no single DE analysis tool dominates the field (the top 5 are DESeq2 28%, cuffdiff 27%, edgeR 14%, DESeq 8%, limma 2%; see S4 Fig for temporal trends), a situation where the proportions of different p-value histogram classes do not substantially differ between analysis tools would indicate a lack of tool-generated bias. However, we found that all p-value histogram classes, except “uniform,” which is largely unpopulated, depend strongly on the DE analysis tool (Figs 2B, S3, and S5). This effect is quite extreme, extending from a nearly zero fraction for cuffdiff to about 0.75 for Sleuth. Using the whole dataset of 14,813 p-value histograms (as a check for the robustness of the results), or adjusting the analysis for GEO publication year, taxon (human, mouse, and pooled other), RNA source, or sequencing platform (as a check for possible confounding), does not change this conclusion (S5B–S5F Fig). The lack of confounding in our results allows a causal interpretation, indicating that DE analysis tools bias the analysis of HT-seq experiments, while the large platform-dependent differences suggest that this bias is very substantial [28].

We found that overall, 25% of the histograms fell into the anti-conservative class, 9.5% were conservative, 26% bimodal, and 39% fell into the class “other” (Fig 1B). Only 17 of the 4,616 histograms were classified as “uniform.” The median number of features in our sample was 20,954. Interestingly, there are fewer features in anti-conservative p-value histograms than in histograms of all other shapes, suggesting different data preprocessing for the datasets resulting in anti-conservative histograms (S1 Fig). Logistic regression reveals a clear trend of an increasing proportion of anti-conservative histograms, starting from <10% in 2010 and surpassing 30% in 2020 (S2 Fig). Multinomial hierarchical logistic regression indicates that most differential expression (DE) analysis tools exhibit temporal increases in anti-conservative p-value histograms, except for cuffdiff, which shows the opposite trend, and clc genomics and DESeq, for which a clear trend could not be identified (Figs 2A and S3). The increase in the fraction of anti-conservative histograms is accompanied by decreases mainly in the classes “other” and “bimodal,” depending on the DE analysis tool.

We algorithmically classified the p-value histograms into 5 classes (see Methods for details and Fig 1A for representative examples) [27]. The “Uniform” class contains flat p-value histograms indicating no detectable true effects. The “Anti-Conservative” class contains otherwise flat histograms with a spike near zero. The “Conservative” class contains histograms with a distinct spike close to one. The “Bimodal” histograms have 2 peaks, one at either end. Finally, the class “Other” contains a panoply of malformed histogram shapes (humps in the middle, gradual increases towards one, spiky histograms, etc.). The “Uniform” and “Anti-Conservative” histograms are the theoretically expected shapes of p-value histograms.
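A rough rule-based classifier over a 40-bin histogram might look as follows (our sketch for intuition only; the actual classification criteria follow reference [27] and are described in Methods):

    import numpy as np

    def classify_pvalue_hist(pvalues, bins=40, z=3.0):
        counts, _ = np.histogram(np.asarray(pvalues), bins=bins,
                                 range=(0.0, 1.0))
        expected = counts.sum() / bins
        # Upper bound for any single bin under uniformity (normal
        # approximation; z = 3 to limit false flags across 40 bins).
        upper = expected + z * np.sqrt(expected * (1.0 - 1.0 / bins))
        spike0 = counts[0] > upper    # spike near zero
        spike1 = counts[-1] > upper   # spike near one
        middle_flat = (counts[1:-1] <= upper).all()
        if spike0 and spike1:
            return "bimodal"
        if spike0 and middle_flat:
            return "anti-conservative"
        if spike1 and middle_flat:
            return "conservative"
        if middle_flat and not spike0 and not spike1:
            return "uniform"
        return "other"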

According to GEO submission requirements, the processed data files may contain raw counts of sequencing reads and/or normalized abundance measurements. Therefore, a valid processed data submission need not contain lists of p-values. Nevertheless, we identified p-values in 4,616 GEO series, from which we extracted 14,813 unique sets of unadjusted p-values. The mean number of p-value sets per submission, each set corresponding to a separate experiment, was 3.21 (max 276); 46% of submissions contained a single p-value set, and 76% contained 1–3 p-value sets. For further analysis, we randomly selected 1 p-value set per GEO series.

According to our analysis, the 43,610 GEO series contained 84,036 supplementary data files, including RAW.tar archives. After unpacking the RAW.tar files, we programmatically attempted to import 647,092 files, resulting in 336,602 (52%) successfully imported files, whereas 252,685 (39%) files were not imported because they were in SAM, BAM, BED, or other formats most probably not containing p-values. We failed to import 57,805 (8.9%) files for various reasons, primarily text encoding issues and failure to identify column delimiters.

First, we filtered the GEO series containing supplementary processed data files. NCBI GEO database submissions follow MINSEQE guidelines [26]. Processed data are a required part of GEO submissions, defined as the data on which the conclusions in the related manuscript are based. The format of processed data files submitted to GEO is not standardized, but in the case of expression profiling, such files include, but are not limited to, quantitative data for features of interest, e.g., mRNA, in tabular format. Sequence read alignment files and coordinates (SAM, BAM, and BED) are not considered processed data by GEO.

We queried the NCBI GEO database for “expression profiling by high-throughput sequencing” (for the exact query string, see Methods), retrieving 43,610 datasets (GEO series) from 2006, when the first HT-seq dataset was submitted to GEO, to December 31, 2020. The yearly new HT-seq submissions increased from 1 in 2006 to 11,604 by 2020, making up 26.6% of all GEO submissions in 2020.

Discussion

In this work, we assess the quality of the differential expression analysis by HT-seq based on a large, unbiased NCBI GEO dataset. Our goal was to study real-world statistical inferences made by working scientists. Thus, we study how experimental design choices and analytic decisions of scientists affect the quality of their statistical inferences, with the understanding that in a field where each experiment encompasses ca. 20,000 parallel measurements of DE on average, the quality of statistical inference is highly relevant to the quality of the scientific inference. To the best of our knowledge, this is the first large-scale study to offer quantitative insight into the general quality of experimentation and data analysis of a large field of biomedical science.

We show that

- Overall, three-quarters of HT-seq DE experiments result in p-value distributions indicating that the assumptions behind the DE tests have not been met. However, for many experiments, a simple exclusion of low-count features rescues the p-value distribution.
- Very few experiments result in uniform p-value distributions, which would indicate <100 DE features.
- The sample sizes of the vast majority of HT-seq DE experiments are inconsistent with reasonably high statistical power to detect true effects.
- Nevertheless, the distribution of the π0-s (the fraction of true null hypotheses in an experiment, estimated solely from the p-value distributions of experiments presenting anti-conservative p-value distributions) peaks at around 0.5, as if in many experiments most genes were DE.
- Furthermore, the estimated π0-s do not correlate with the sample sizes of the experiments, indicating that the underlying p-value sets are problematic for controlling the FDR.
- The proportion of anti-conservative p-value distributions and the values of the π0-s are strongly associated with the DE analysis platform.

These results show not only that most p-value sets are statistically suspect (as shown by the preponderance of untoward p-value distributions), but also that even the well-behaved anti-conservative p-value sets tend to lead to statistical inferences that are incompatible with the observed low sample sizes and, therefore, low power. Furthermore, the observed association of both the p-value distributional classes and the π0-s of anti-conservative p-value sets with the data analysis platform strongly indicates that DE analysis results in the literature are substantially biased by the analysis platform used.

From our meta-science study, we can conclude that the actual use of statistical tools in the HT-seq DE field is inconsistent with a high quality of statistical inferences. While our results clearly show the pervasiveness of the problem, they by themselves cannot answer the question: how much does it matter downstream, for the quality of scientific inferences? However, recent methodological work on individual workflows allows us to provide an answer: it matters a lot. Namely, from simulation studies of existing workflows, it is now becoming clear that all popular p-value-based methods that can be used with small samples often fail at the one thing they were created for: FDR control [30,34,36]. Our study sees the results of this widespread failure in the high prevalence of low π0 estimates, whose lack of dependence on sample size defies the logic of statistical inference, thereby confirming that the poor FDR control observed in simulations is massively carried over into the results of actual HT-seq DE data analysis. Collectively, our study and the aforementioned simulation studies indicate that the field is largely built on analyzing low-power experiments, which are unlikely to identify actual effects but still tend to overestimate the effect sizes of any statistically significant results, both true and false [35], while presenting an unknowable number of false discoveries as statistically significant. It is hard to imagine a worse state of affairs, or that this would not substantially affect the quality of the relevant scientific conclusions.

In the following, we will further discuss specific aspects of our work.

The meaning of the p-value distribution

The first thing to notice about the p-value distributional classes is that there are extremely few uniform p-value distributions, suggesting relatively few actual effects. This unexpected result is made even more surprising by the low statistical power (<40%) of most real-world RNA-seq DE experiments, which should increase the likelihood of encountering uniform p-value distributions [37]. As a technical comment, it should be noted that the class we assign to a p-value histogram depends on an arbitrarily set bin size. Our use of 40 bins leads to histograms where an experiment with up to around 100 true effects out of 20,000 features could well produce a uniform histogram, because the lowermost bin (p < 0.025) is swamped with p-values emanating from true null effects (S17 Fig). If the p-values had been calculated correctly, then the lack of uniform p-value distributions would indicate (i) that there are almost no experiments submitted to GEO where the experimental treatment led to only a few DE genes and (ii) that the actual power of GEO-submitted experiments is not low at all. We find both possibilities hard to believe.

While there is a positive temporal trend towards an increasing fraction of anti-conservative p-value sets, overall, a substantial majority of p-value sets have shapes indicating that the assumptions behind the statistical tests that produced them have not been met. Expressly, p-value distributions that are not anti-conservative or uniform relinquish statistical control over type I errors, on the presence of which later p-value adjustments are predicated [21]. Consequently, such p-value sets have effectively relinquished their statistical meaning and are thus problematic for scientific interpretation and further analysis, including FDR/q-value calculations.

Specifically, conservative p-value histograms are thought to be caused by small samples, indicative of reduced power and overly conservative control of the FDR [30]. The conjecture of low power is supported by our data, in that experiments with conservative p-value distributions are the most likely to have sample sizes of 1 or 2 (Fig 5B). In contrast, anti-conservative p-value distributions that have more small p-values than expected, and thus led to reduced π0 estimates in our analysis, tend to lead to loss of FDR control [30]. In general, even a small deviation from uniformity in the p-value distribution of true nulls is expected to throw off the overall FDR control, which, alas, is the main reason for using these methods for the statistical analysis of DE [30]. In fact, without FDR control, any statistical method that converts continuous effects into statistical significance through binary decisions can do more harm than good [11].

Currently, most DE analysis platforms use parametric tests accompanied by GLM regression. However, it has recently been shown for the often-used parametric tests that their FDR control can be exceedingly poor, while the nonparametric Wilcoxon rank-sum test achieves better results [34]. Furthermore, it is becoming clear that data normalization can introduce bias into HT-seq DE analysis that cannot be corrected by many of the currently widely used methods [38,39]. Indeed, widely used DE analysis tools use RNA-seq data preprocessing strategies that are vulnerable to situations where a significant fraction of features change expression and different samples exhibit differing total RNA concentrations [25,40].
In general, the analysis tools used for differential expression testing allow for a plethora of choices, including distributional models, data transformations, and basic analytic strategies, which can lead to different results through different trade-offs [41–43]. Our finding of a high proportion of unruly p-value distributions suggests that the wide variety of individual data preprocessing/DE testing workflows, in their actual use by working scientists, results in an overall poor outcome. Although this situation has been steadily improving over the last decade, at least for the limma, edgeR, and DESeq2 users, there remains substantial room for further improvements.
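How easily a violated test assumption corrupts the null p-value distribution, and with it π0 and FDR estimation, can be shown with a textbook toy case (our sketch on normal data, using unequal variances with unequal group sizes under a pooled t-test, rather than any RNA-seq-specific failure mode):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    m = 20000  # genes, all true nulls (no expression difference anywhere)
    x = rng.normal(0.0, 2.0, size=(m, 3))  # smaller group, larger variance
    y = rng.normal(0.0, 1.0, size=(m, 9))  # larger group, smaller variance
    p = stats.ttest_ind(x, y, axis=1, equal_var=True).pvalue  # pooled t-test

    # A calibrated test puts ~5% of null p-values below 0.05; here the
    # pooled-variance assumption fails, the spike near zero is inflated,
    # and a pi0/FDR procedure would misread pure noise as discoveries.
    print(np.mean(p < 0.05))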

Possible causes for analysis platform-driven bias

What could cause the systematic differences between analysis platforms that we see in the p-value distributions and π0 values? The significant points of divergence in the analytic pathway include raw sequence preprocessing, aligning sequences to a reference, count normalization, and differential expression testing [42,44]. Our inability to abolish analysis platform-driven bias by removing low-count features suggests that pre-filtering of the data is not a major source of this bias. Although in principle any tool can be used in many workflows, different tools tend to offer their users different suggested workflows and different amounts of automation for data preprocessing and analysis. For example, Cuffdiff almost fully automates the analysis, including the removal of biases [45]; the DESeq2 workflow suggests some preprocessing steps to speed up computations but uses automation to remove biases from the input data [46]; whereas edgeR [47] and limma-voom require more interactive execution of separate preprocessing and analysis steps [48,49]. We speculate that the popularity of Cuffdiff and DESeq2 partly lies in their automation, as the user is largely relieved of decision-making. However, we found that cuffdiff is associated with the smallest proportion of anti-conservative p-value histograms, whereas limma and edgeR, with their more hands-on approach, are associated with the highest proportions of anti-conservative histograms. Interestingly, limma and edgeR use very different distributional models for DE testing, supporting the notion that it might be data preprocessing, rather than the statistical test, that has the most impact on the success of DE analysis [25]. However, limma and edgeR were not the top performers on the π0 metric, where the highest performance is associated with Cuffdiff, whose ability to provide anti-conservative p-value distributions was the least impressive (note that the π0-s were calculated only from the experiments with well-distributed p-values).

Power considerations

The prevalent sample sizes of 2 to 3 in published HT-seq DE experiments are, according to both our simulations and the literature, too small to provide enough power to reliably find most DE genes. For example, with the currently favored sample size of 3, for most genes in isogenic yeast, effect sizes of at least 4-fold seem to be required for successful analysis, and the overall minimal acceptable sample size appears to be 4 to 6 [37,50]. In animals, a reasonable sample size is expected to be substantially larger, over 10 for most genes that are not highly expressed, and for cancer samples it seems to be well over 20 [37,43,51–53]. This raises the possibility that nearly all experiments that should have resulted in uniform distributions were somehow shifted into other distributional classes. The low power to detect DE of most genes also means that excluding most low-count genes (which means most genes) should be encouraged, because it would increase the power to detect the highly expressed genes, while DE of the lower-expressed genes would have been missed anyway. Of course, when the scientific goal is to study the DE of genes that are not highly expressed, there is no real alternative to substantially increasing the sample size.

Estimated π0-s suggest experimental design problems

The DE analysis tool-specific means of the π0 values range from 0.5 (tool “unknown”) to 0.7 (tool “cuffdiff”), showing that, by this criterion, in an average RNA-seq experiment about half of the cellular RNAs are expected to change their expression levels (Fig 3B). In principle, a low π0 could reflect true differential expression as a causal response to experimental treatment, or it could be an artefact of suboptimal data analysis. To investigate further, we separately analyzed 2 sets of studies, related either to cancer or to transcription factors, assuming that the available p-value sets reflect situations where DE is profiled in cancer cells/tissues or after transcription factor interrogation, respectively. Due to large-scale rearrangements in cancer genomes and the ability of many TFs to change the expression of many genes, these studies are expected to lead to lower π0-s, perhaps in the 0.4 range [31]. However, in both cases, we saw essentially unchanged π0 distributions, which suggests that the low π0-s in our dataset are indicative of problematic analytic workflows. This conclusion is further strengthened by the observed lack of a negative correlation between π0 and sample size, which is a proxy for statistical power. It should be added that, as most data normalization workflows assume that most genes are not DE (and that total RNA concentrations are stable across experimental conditions), any study that results in a low π0 should strive to explicitly address this fact in the experimental setup, data analysis, and interpretation of results. It has been argued that most genome-wide DE studies ever conducted, including by HT-seq, have used experimental designs that make it impossible to untangle such global effects, at least in a quantitatively accurate way [54,55]. The issue lies in the use of internal standards in normalizing the counts for different RNAs, which leads to wrong interpretations if most genes change expression in one direction. To overcome this problem, one could use spike-in RNA standards and compositional normalization [40]. However, spike-in normalization requires great care to work correctly in such extreme cases [39,56], and it is rarely used outside single-cell RNA-seq [40]. Thus, it seems likely that many or most of the low-π0 experiments represent technical failures, most likely during data normalization.

---
[1] URL: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002007

Content is licensed under Creative Commons Attribution 4.0 (CC BY 4.0).