Is N-Hacking Ever OK? The consequences of collecting more data in pursuit of statistical significance
Pamela Reinagel, Department of Neurobiology, School of Biological Sciences, University of California San Diego, La Jolla, California, United States of America
Date: 2023-11
Abstract

Upon completion of an experiment, if a trend is observed that is “not quite significant,” it can be tempting to collect more data in an effort to achieve statistical significance. Such sample augmentation, or “N-hacking,” is condemned because it can lead to an excess of false positives, which can reduce the reproducibility of results. However, the scenarios used to prove this rule tend to be unrealistic, assuming the addition of unlimited extra samples to achieve statistical significance, or doing so when results are not even close to significant, situations that are unlikely in most experiments involving patient samples, cultured cells, or live animals. If we were to examine some more realistic scenarios, could there be any situations where N-hacking might be an acceptable practice? This Essay aims to address this question, using simulations to demonstrate how N-hacking causes false positives and to investigate whether this increase is still relevant when using parameters based on real-life experimental settings.
Citation: Reinagel P (2023) Is N-Hacking Ever OK? The consequences of collecting more data in pursuit of statistical significance. PLoS Biol 21(11): e3002345.
https://doi.org/10.1371/journal.pbio.3002345
Published: November 1, 2023
Copyright: © 2023 Pamela Reinagel. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The author received no specific funding for this work.
Competing interests: The author has declared that no competing interests exist.
Introduction

There has been much concern in recent years about the lack of reproducibility of results in some scientific fields, leading to calls for improved statistical practices [1–5]. The recognition of a need for better education in statistics and greater transparency in reporting is justified and welcome, but rules and procedures should not be applied by rote without comprehension. Experiments often require substantial funding, scientific talent, and finite and precious resources; there is therefore an ethical imperative to use them efficiently. Thus, to ensure both the reproducibility and efficiency of research, experimentalists need to understand the statistical principles underlying the rules.

One rule of null hypothesis significance testing is that if a sample size N is chosen in advance, it may not be changed (augmented) after seeing the results [1,6–9]. In my experience, this rule is not well known among biologists and is commonly violated. Many researchers engage in “N-hacking”: incrementally adding more observations to an experiment when a preliminary result is “almost significant.” Indeed, it is not uncommon for reviewers of manuscripts to require that authors collect more data to support a claim if the presented data do not reach significance. Prohibitions against collecting additional data are therefore met with considerable resistance and confusion by the research community.

So, what is the problem with N-hacking? What effects does it have on the reliability of a study’s results, and are there any scenarios where its use might be acceptable? In this Essay, I aim to address these questions using simulations representing different experimental scenarios (Box 1) and discuss the implications of the results for experimental biologists. I am not claiming or attempting to overturn any established statistical principles; although there is nothing theoretically new here, the numerical results may be surprising, even for those familiar with the theoretical principles at play.

Box 1. Simulation details

The specific sampling heuristic simulated in this Essay is meant to be descriptive of practice and differs in its details from established formal adaptive sampling methods [6,10–12]. The simulations can be taken to represent a large number of independent studies, each collecting separate samples to test a different hypothesis. All simulations were performed in MATLAB 2018a. Definitions of all terms and symbols are summarized in S1 Appendix. The MATLAB code for all of these simulations and more can be found in [13], along with the complete numeric results of all computationally intensive simulations.
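To make the kind of heuristic under discussion concrete, the following is a minimal sketch of this type of simulation under a true null hypothesis. It is not the published code of [13]; the two-sample t test, the parameter values, and the cap on the final sample size are illustrative assumptions.

```matlab
% Minimal sketch (not the published code of [13]): estimate the realized
% false positive rate of an N-hacking heuristic when the null hypothesis
% is true. Requires the Statistics and Machine Learning Toolbox for ttest2.
rng(1);                 % for reproducibility
nSims = 1e5;            % number of simulated independent experiments
Ninit = 12;             % initial sample size per group (assumed)
Nincr = 6;              % samples added per group at each augmentation (assumed)
Nmax  = 36;             % stop augmenting once N per group reaches this cap (assumed)
alpha = 0.05;           % significance criterion
w     = 2;              % eligibility window: augment only if alpha < p < w*alpha

falsePos = 0;
for s = 1:nSims
    x = randn(Ninit,1);                % group 1; both groups are drawn from the
    y = randn(Ninit,1);                % same distribution, so any "effect" is spurious
    [~, p] = ttest2(x, y);
    % Add samples only while the result is "almost significant".
    while p > alpha && p < w*alpha && numel(x) < Nmax
        x = [x; randn(Nincr,1)];
        y = [y; randn(Nincr,1)];
        [~, p] = ttest2(x, y);
    end
    falsePos = falsePos + (p <= alpha);
end
fprintf('Realized false positive rate: %.4f (nominal alpha = %.2f)\n', ...
    falsePos/nSims, alpha);
```

This mirrors the Essay's point: with a narrow eligibility window and a bounded N, the inflation above the nominal α is modest, whereas a wide window and an uncapped N can inflate it much further.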
Implications of the simulation results

Many researchers are unaware that it matters when or how they decide how much data to collect when testing for an effect. The first take-home message from this Essay is that if you are reporting p values, it does matter. Increasing the sample size after obtaining a nonsignificant p value will, on average, lead to a higher rate of false positives if the null hypothesis is true. This has been said many times before, but most authors warn that this practice will lead to extremely high false positive rates [6–9]. This certainly can occur if a researcher were to increment their sample size no matter how far from α the p value was and continue to collect data until N was quite large (Fig 1). But I have personally never met an experimental biologist who would do that. If extra data were collected only when the p value was quite close to α, the effects on the false positive rate would be modest and bounded. The magnitude of the increase in the false positive rate depends quantitatively on the initial sample size (N_init), the significance criterion (α), the promising zone or eligibility window (w), and the increment size (N_incr). In the previous section, I provided an intuitive explanation and empirical validation for an upper bound on the false positive rate.

Moreover, sample augmentation strictly increases the positive predictive value (PPV) achievable for any given statistical power compared to studies that strictly adhere to the initially planned N, an outcome that held in both underpowered and well-powered regimes (an illustrative calculation is given below). To my knowledge, this particular sampling procedure has not been considered before, but the basic principles underlying the benefits of adaptive sampling have long been known in the field of statistics [15]. In the literature, optional stopping of an experiment, or N-hacking, has often been flagged as an important cause of irreproducible results. But in some regimes, uncorrected data-dependent sample augmentation could increase both statistical power and PPV relative to a fixed-N procedure with the same nominal α. Therefore, in research fields that operate in that restricted regime, it is simply not true that N-hacking would lead to an increased risk of unreproducible results. A verdict of “statistical significance” reached in this manner is, if anything, more likely to be reproducible than results reached by fixed-N experiments with the same sample size, even if no correction is applied for sequential sampling or multiple comparisons. Therefore, if any research field operating in that parameter regime has a high rate of false claims, other factors are likely to be responsible.

Some caveats

I have asserted that certain practices are common based on my experience, but I have not done an empirical study to support this claim. Moreover, I have simulated only one “questionable” practice: post hoc sample augmentation based on an interim p value. I have seen this done to rescue a nonsignificant result, as simulated here, but I have also seen it done to verify a barely significant one (a practice that results in FP_0 < α, i.e., a false positive rate below α). In other contexts, I suspect researchers flexibly decide when to stop collecting data on the basis of directly observed results or visual inspection of plots, without interim statistical tests. Such decisions may take into account additional factors, such as the absolute effect size, a heuristic that could have even more favorable performance characteristics [16].
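To illustrate the PPV argument above with a toy calculation: PPV depends on the prior probability that a tested effect is real, the achieved statistical power, and the realized false positive rate. The numbers below are assumptions chosen for illustration, not values from the Essay's simulations.

```matlab
% Illustrative only (all numbers assumed, not taken from the Essay's
% simulations): positive predictive value (PPV) computed from the prior
% probability that a tested effect is real, the achieved statistical power,
% and the realized false positive rate (FPR).
prior = 0.2;                    % assumed fraction of tested hypotheses that are true
% Rows: [power, FPR] for a fixed-N design and for a design with modest
% data-dependent augmentation (assumed values in which augmentation raises
% power proportionally more than it raises the FPR):
designs = [0.50 0.050;          % fixed-N
           0.60 0.058];         % augmented
ppv = (designs(:,1) * prior) ./ ...
      (designs(:,1) * prior + designs(:,2) * (1 - prior));
fprintf('Fixed-N PPV = %.3f, augmented PPV = %.3f\n', ppv(1), ppv(2));
```

Whenever augmentation raises power proportionally more than it raises the false positive rate, the PPV increases; this is the arithmetic behind the claim that modest sample augmentation can improve, rather than degrade, reproducibility.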
From a metascience perspective, a comprehensive study of how researchers make sampling decisions in different disciplines (biological or otherwise), coupled with an analysis of how the observed operating heuristics would impact reproducibility, would be quite interesting [17].

In this Essay, I have discussed the effect of N-hacking on type I errors (false positives) and type II errors (false negatives). Statistical procedures may also be evaluated for errors in effect size estimation: type M (magnitude) and type S (sign) errors [18]. Even in a fixed-N experiment, effect sizes estimated from “significant” results are systematically overestimated, and this bias can be quite large when N is small (a small simulation illustrating the bias is sketched below). This concern also applies to the low-N experiments described here, but sample augmentation does not increase either the type M or the type S error compared to fixed-N experiments [13].

What about batch effects? It is often necessary to collect data in batches and/or over long time periods for pragmatic reasons. Differences between batches or over time can be substantial sources of variability, even when using a fixed-N procedure. Therefore, one should check for batch or time-varying effects and account for them in the analysis if necessary. This is not unique to N-hacking, but with incremental sample augmentation this concern always applies. Likewise, if the experimental design is hierarchical, a hierarchical model is needed, regardless of the sampling procedure [19]. I have simulated a balanced experimental design, with the same N in both groups in the initial batch and with the sample sizes of both groups augmented equally at each augmentation step. This is recommended, especially in multi-factorial designs with many groups, as it minimizes the risk of confounding batch effects with the effects under study. Moreover, selectively augmenting the sample size in some groups but not others can introduce other confounds and interpretation complexities [20].
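As an illustration of the overestimation bias noted above, the following minimal sketch (with assumed parameters, not the Essay's simulations) conditions effect size estimates on reaching "significance" in a low-powered fixed-N design.

```matlab
% Minimal sketch of a type M (magnitude) error check (assumed parameters):
% in a low-powered fixed-N experiment, the average effect size among
% "significant" results systematically exceeds the true effect.
rng(2);
nSims = 1e5;        % simulated experiments
N = 8;              % per-group sample size (assumed)
trueEffect = 0.5;   % true mean difference, in units of the standard deviation (assumed)
alpha = 0.05;
estIfSig = nan(nSims,1);
for s = 1:nSims
    x = randn(N,1);                    % control group
    y = randn(N,1) + trueEffect;       % treated group, true shift of 0.5 SD
    h = ttest2(y, x, 'Alpha', alpha);
    if h                               % keep the estimate only if "significant"
        estIfSig(s) = mean(y) - mean(x);
    end
end
fprintf('True effect: %.2f | mean estimate among significant results: %.2f\n', ...
    trueEffect, mean(estIfSig, 'omitnan'));
```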
So, is N-hacking ever OK?

Researchers today are being told that if they have obtained a nonsignificant finding with a p value just above α, it would be a “questionable research practice” or even a breach of scientific ethics to add more observations to their data set to improve statistical power. Nor may they describe the result as “almost” or “bordering on” significant. They must either run a completely independent larger-N replication or fail to reject the null hypothesis. Unfortunately, in the current publishing climate, this generally means relegation to the file drawer. Depending on the context, there may be better options.

In the following discussion, I use the term “confirmatory” to mean a study designed for a null hypothesis significance test, intended to detect effects supported by p values or “statistical significance.” I use the term “non-confirmatory” as an umbrella term to refer to all other kinds of empirical research. While some have used the term “exploratory” for this meaning [21–23], their definitions vary, and the word “exploratory” already has other specific meanings in this context [24,25], making this terminology more confusing than helpful [26,27].

An ideal confirmatory study would completely prespecify the sample size or sampling plan and every other aspect of the study design, and furthermore, establish that all null model assumptions are exactly true and all potential confounds are avoided or accounted for. This ideal is unattainable in practice. Therefore, real confirmatory studies fall along a continuum from very closely approaching this ideal, to looser approximations. A very high bar is appropriate when a confirmatory experiment is intended to be the sole or primary basis of a high-stakes decision, such as a clinical trial to determine if a drug should be approved. At this end of the continuum, the confirmatory study should be as close to the ideal as humanly possible, and public preregistration is reasonably required.

The “p value” obtained after unplanned incremental sampling is not a valid p value, because without a prespecified sampling plan, you can never truly know or prove what you would have done if the data had been otherwise, so there is no way to know how often a false positive would have been found by chance. N-hacking forfeits control of the type I error rate, whether the false positive rate is increased or decreased thereby. Therefore, in a strictly confirmatory study, N-hacking is not OK.

That being said, planned incremental sampling is not N-hacking. There are many established adaptive sampling procedures that allow flexibility in when to stop collecting data, while still producing rigorous p values. These methods are widely used in clinical trials, where costs, as well as stakes, are very high. It is beyond the present scope to review these methods, but see [6,10–12] for more information. Simpler, or more convenient, prespecified adaptive sampling schemes are also valid, even if they are not optimal [8]. In this spirit, the sampling heuristic I simulated could be followed as a formal procedure (S2 Appendix).

A less-perfect confirmatory study is often sufficient in lower-stakes conditions, such as when results are intended only to inform decisions about subsequent experiments, and where claims are understood as contributing to a larger body of evidence for a conclusion. In this research context, transparent N-hacking in a mostly prespecified study might be OK.
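As a concrete aside to the point above that the simulated heuristic could be followed as a formal procedure: one way to do so is to calibrate its per-test criterion in advance by simulation, so that the whole adaptive procedure has the desired overall false positive rate. The sketch below illustrates the idea under assumed parameters; it is not the procedure described in S2 Appendix.

```matlab
% Sketch of one possible calibration of a prespecified incremental-sampling
% rule (assumed parameters; not the procedure of S2 Appendix): find the
% per-test criterion alphaAdj at which the full adaptive procedure has a
% realized false positive rate no greater than 0.05 under the null.
rng(3);
target = 0.05;  Ninit = 12;  Nincr = 6;  Nmax = 36;  w = 2;  nSims = 1e4;
candidates = 0.030:0.002:0.050;          % candidate per-test criteria
realizedFPR = zeros(size(candidates));
for c = 1:numel(candidates)
    a = candidates(c);
    fp = 0;
    for s = 1:nSims
        x = randn(Ninit,1);  y = randn(Ninit,1);     % null is true
        [~, p] = ttest2(x, y);
        while p > a && p < w*a && numel(x) < Nmax    % window defined relative to a (assumed)
            x = [x; randn(Nincr,1)];  y = [y; randn(Nincr,1)];
            [~, p] = ttest2(x, y);
        end
        fp = fp + (p <= a);
    end
    realizedFPR(c) = fp / nSims;
end
alphaAdj = max(candidates(realizedFPR <= target));   % largest criterion meeting the target
fprintf('Prespecified per-test criterion: %.3f (target overall FPR %.2f)\n', ...
    alphaAdj, target);
```

Because the criterion, window, increment, and cap are all fixed before any data are collected, sampling in this way is planned adaptive sampling rather than N-hacking.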
Although data-dependent sample augmentation will prevent determination of an exact p value, the researchers may still be able to estimate or bound the p value (see S2 Appendix; a rough Monte Carlo sketch of one such conservative estimate is given at the end of this section). When such a correction is small and well justified, this imperfection might be on a par with others we routinely accept, such as assumptions of the statistical test that cannot be confirmed or are only approximately true. In my opinion, it is acceptable to report a p value in this situation, as long as there is full disclosure. The report should state that unplanned sample augmentation occurred, report the interim N and p values, describe the basis of the decision as honestly as possible, and provide and justify the authors’ best or most conservative estimate of the p value. With complete transparency (including publication of the raw data), readers of the study can decide what interpretation of the data is most appropriate for their purposes, including relying only on the initial, strictly confirmatory p value, if that standard is most appropriate for the decision they need to make.

However, many high-quality research studies are mostly or entirely non-confirmatory, even if they follow a tightly focused trajectory or are hypothesis (theory) driven. For example, “exploratory experimentation” aims to describe empirical regularities prior to the formulation of any theory [25]. Development of a mechanistic or causal model may proceed through a large number of small (low-power) experiments [28,29], often entailing many “micro-replications” [30]. In this type of research, putative effects are routinely re-tested in follow-up experiments or confirmed by independent means [31–34]. Flexibility may be essential to efficient discovery in such research, but the interim decisions about data collection or other aspects of experimental design may be too numerous, qualitative, or implicit to model. In this kind of research, the use of p values is entirely inappropriate; however, this does not mean abandoning statistical analysis or quantitative rigor. Non-confirmatory studies can use other statistical tools, including exploratory data analysis [24] and Bayesian statistics [35]. Unplanned sample augmentation is specifically problematic for p values; other statistical measures do not have the same problem (for an example, compare Fig 1 to S1 Fig) [36,37]. Therefore, in transparently non-confirmatory research, unplanned sample augmentation is not even N-hacking. If a sampling decision heuristic of the sort simulated here were employed, researchers would not need to worry about producing an avalanche of false findings in the literature.

A common problem in biology is that many non-confirmatory studies report performative p values and make “statistical significance” claims, without the authors realizing that this implies and requires a prospective study design. It is always improper to present a study as prospectively designed when it was not. To improve transparency, authors should label non-confirmatory research as such, and should be able to do so with no stigma attached. Journals and referees should not demand reporting of p values or “statistical significance” in such studies, and authors should refuse to provide them. Where to draw the boundary between approximately confirmatory and non-confirmatory research remains blurry.
My own opinion is that it is better to err on the side of classifying research as non-confirmatory, and to reserve null hypothesis significance tests and p values for cases where there is a specific reason that a confirmatory test is required.
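Returning to the earlier point about estimating or bounding a p value after unplanned augmentation: one conservative option is to reconstruct the decision rule as generously as is plausible and ask, by simulation, how often a null experiment run under that rule would end with a p value at least as small as the one observed. The sketch below illustrates this idea under assumed numbers; it is not the method of S2 Appendix.

```matlab
% Sketch of a conservative Monte Carlo correction after unplanned
% augmentation (assumed scenario; not the method of S2 Appendix). Given the
% final observed p value pObs, estimate how often a null experiment run under
% a deliberately generous reconstruction of the decision rule (wide window,
% large cap on N) would end with a p value at least as small.
rng(4);
pObs  = 0.032;                       % final p value actually observed (assumed)
Ninit = 12;  Nincr = 6;              % sample sizes matching the real experiment (assumed)
Nmax  = 48;  alpha = 0.05;  w = 3;   % generous reconstruction of the rule (assumed)
nSims = 2e4;  count = 0;
for s = 1:nSims
    x = randn(Ninit,1);  y = randn(Ninit,1);         % null is true
    [~, p] = ttest2(x, y);
    while p > alpha && p < w*alpha && numel(x) < Nmax
        x = [x; randn(Nincr,1)];  y = [y; randn(Nincr,1)];
        [~, p] = ttest2(x, y);
    end
    count = count + (p <= pObs);
end
fprintf('Conservative corrected p value: about %.3f (observed p = %.3f)\n', ...
    count/nSims, pObs);
```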
Conclusions

In this Essay, I used simulations to demonstrate how N-hacking can cause false positives and showed that, in a parameter regime relevant for many experiments, the increase in false positives is actually quite modest. Moreover, results obtained using such moderate sample augmentation have a higher PPV than non-incremented experiments of the same sample size and statistical power. In other words, adding a few more observations to shore up a nearly significant result can increase the reproducibility of results. For strictly confirmatory experiments, N-hacking is not acceptable, but many experiments are non-confirmatory, and for these, unplanned sample augmentation with reasonable decision rules would be unlikely to cause rampant irreproducibility.

In the pursuit of improving the reliability of science, we should question “questionable” research practices, rather than merely denounce them [38–47]. We should also distinguish practices that are inevitably severely misleading [48–50] from ones that are only a problem under specific conditions, or that have only minor ill effects. A quantitative, contextual exploration of the consequences of a research practice is more instructive for researchers than a blanket injunction. Such thoughtful engagement can lead to more useful suggestions for improving the practice of science, or may reveal that the goals and constraints of the research are different from what was assumed.
Acknowledgments

I am grateful to Hal Pashler, Casper Albers, Daniel Lakens, and Steve Goodman for valuable discussions and helpful comments on earlier drafts of the manuscript.