(C) PLOS One
This story was originally published by PLOS One and is unaltered.
Statistical simulations show that scientists need not increase overall sample size by default when including both sexes in in vivo studies [1]
Authors: Benjamin Phillips (Data Sciences, Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, United Kingdom); Timo N. Haschler (Bioscience Renal)
Date: 2023-06
Abstract
In recent years, there has been a strong drive to improve the inclusion of animals of both sexes in the design of in vivo research studies, driven by a need to increase sex representation in fundamental biology and drug development. This has resulted in inclusion mandates by funding bodies and journals, alongside numerous published manuscripts highlighting the issue and providing guidance to scientists. However, progress is slow and barriers to the routine use of both sexes remain. A frequent, major concern is the perceived need for a higher overall sample size to achieve an equivalent level of statistical power, which would result in an increased ethical and resource burden. This perception arises from either the belief that sex inclusion will increase variability in the data (either through a baseline difference or a treatment effect that depends on sex), thus reducing the sensitivity of statistical tests, or from misapprehensions about the correct way to analyse the data, including disaggregation or pooling by sex. Here, we conduct an in-depth examination of the consequences of including both sexes on statistical power. We performed simulations by constructing artificial datasets that encompass a range of outcomes that may occur in studies examining a treatment effect in the context of both sexes. This includes both baseline sex differences and situations in which the size of the treatment effect depends on sex in both the same and opposite directions. The data were then analysed using either a factorial analysis approach, which is appropriate for the design, or a t test approach following pooling or disaggregation of the data, which are common but erroneous strategies. The results demonstrate that there is no loss of power to detect treatment effects when splitting the sample size across sexes in most scenarios, provided that the data are analysed using an appropriate factorial analysis method (e.g., two-way ANOVA).
In the rare situations where power is lost, the benefit of understanding the role of sex outweighs the power considerations. Additionally, use of inappropriate analysis pipelines results in a loss of statistical power. Therefore, we recommend analysing data collected from both sexes using factorial analysis and splitting the sample size across male and female mice as a standard strategy.
Citation: Phillips B, Haschler TN, Karp NA (2023) Statistical simulations show that scientists need not increase overall sample size by default when including both sexes in in vivo studies. PLoS Biol 21(6): e3002129. https://doi.org/10.1371/journal.pbio.3002129
Academic Editor: Marcus Munafò, University of Bristol, UNITED KINGDOM
Received: January 30, 2023; Accepted: April 18, 2023; Published: June 8, 2023
Copyright: © 2023 Phillips et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The R scripts used to generate the data and figures and analyse the in vivo data have been made available as a Zenodo repository (https://doi.org/10.5281/zenodo.7806724).
Funding: The authors received no specific funding for this work.
Competing interests: I have read the journal’s policy and the authors of this manuscript have the following competing interests: BP, TH, and NAK have shareholdings in AstraZeneca.
Introduction
There has been a bias towards using a single sex in in vivo research. Though there is variation between subdisciplines, this strategy has tended to result in a heavy bias towards male animals. For example, in 2009, only 26% of studies used both sexes and among the remainder there was a male bias in 80% of studies [1]. The negative consequences of these shortcomings for the scientific enterprise are beginning to be better understood as evidence emerges that our current fundamental biological knowledge base may be biased. For example, a recent report concluded that the fundamental molecular basis of pain is highly sexually dimorphic, yet much of our knowledge in this area is derived from studies solely using male organisms [2]. This situation risks generating a knowledge imbalance that persists through the research pipeline, ultimately manifesting in the clinic.
To improve the translation of results from animals to humans, there has been a push to include both male and female animals in in vivo studies. For example, numerous funding bodies, including the NIH in the United States and the MRC in the United Kingdom, now have inclusion mandates. These policies do not require scientists to study differences between males and females per se, but instead aim to improve the generalisability of studies by calculating an average effect estimated from both sexes. If, however, there is a large, meaningful sex difference in the treatment effect, studies should be designed in such a way that the visualisation and analysis detect it [3,4]. The NIH policy introduced the term Sex as a Biological Variable (SABV), and, here, we use the term to represent a sex inclusive research philosophy that emphasises the importance of automatic inclusion, with a focus on treatment effect estimates. Any of a wide range of factors including animal strain, age, health status, or others could also be the focus of a campaign to improve research generalisability.
However, sex is a particularly pressing and timely direction for improved representation since females account for such a large proportion of the population of interest but are currently largely overlooked.
Over time, the proportion of studies including both sexes has improved [5,6], with one study estimating an increase between 2009 and 2019 from 26% to 48% of studies [6]. Scientists tend to be supportive of efforts to improve sex representation in in vivo research [7]. Unfortunately, in studies where both sexes are tested, a large proportion commit errors at the statistical analysis stage [8]. Thus, despite an overall increase in inclusion, the proportion of studies appropriately interpreting the influence of sex is still low [5]. Overall, the pace of change is slow, owing to a persistent and broad range of perceived statistical and practical barriers. Consequently, scientists believe that including both sexes will introduce a significant ethical, practical, and financial burden [9]. The barriers include now debunked beliefs that female animals produce more variable data [10–12], institution-level ingrained cultural beliefs about the value of studying 2 sexes [13,14], and a skill-gap in handling data collected under factorial designs [8]. There is also a general belief that it is necessary to greatly increase the experimental sample size (N) when investigating treatment effects in 2 sexes [14,15]. For example, a recent report found that 27% of published papers justified a single sex approach due to concerns around experimental variability [6]. Though this misconception has been addressed previously [11,16], and guidance on appropriate analysis exists [17,18], it remains widespread, and there is a need for a deeper exploration of the impact of including both sexes on statistical power.
Revisiting this is critical to enable the community to address this significant barrier to sex-inclusive research, which stems from the misguided belief that there is a trade-off between pursuing the 3Rs (replacement, reduction, refinement) by reducing animal usage on the one hand and designing more generalisable studies on the other [14].
When considering sex-inclusive research, the following misconceptions and data analysis errors have been reported:
Misconception 1: Designs that include both sexes will require a doubling of sample size to achieve the same power [9,14,19].
Misconception 2: The possibility of sex effects (either a baseline difference or a treatment effect that depends on sex) will increase variability and consequently require an increased N to maintain power [11,12].
Error 1: Inappropriate pooling of male and female data for a treatment group (i.e., combining the data from both sexes and ignoring sex as a factor in the analysis) [8].
Error 2: Disaggregation of the data by sex and independent statistical comparison between the control and treated groups, followed by comparison of the p-values from the independent tests [8,20].
Error 3: Incorrect groups in statistical comparison: comparison of treated males and treated females [6].
Of these, the misconceptions (1 and 2) contain empirical claims, and for these we have constructed a range of simulated datasets to extensively test statistical power for a range of plausible biological scenarios where both sexes are tested and subsequently analysed by an appropriate factorial pipeline. The consequences of pooling male and female data and disaggregating the data by sex (error numbers 1 and 2) are additionally demonstrated as part of these simulations. A case study example analysis (S1) explores error 3.
Through simulations, we have conducted an in-depth examination of the impact of sex inclusion on statistical power for a variety of commonly implemented analysis strategies. Our methodology has been designed to demonstrate the problems that result from using the wrong analysis strategies and address common misconceptions around power when using factorial designs and analysis. Our results demonstrate that including both male and female animals does not reduce statistical power across a wide range of outcomes when investigating a treatment effect. We demonstrate that power loss is (a) rare and (b) indicative of a sex dimorphism that it would be important to be aware of. Our comparison of statistical analysis strategies demonstrates that inappropriate methods, including both pooling the data from males and females and disaggregating the data by sex, result in a loss of statistical power. Therefore, adopting a factorial analysis method is central to appropriately analysing data from studies testing a treatment effect in both sexes. To support the adoption of appropriate analysis, we provide a case study (S1 Case Study) demonstrating an example pipeline for analysing data collected under such designs, intended as a practical guide for scientists.
A compendium of common statistical terms used within this manuscript is also included as a guide to readers (see Box 1 Glossary).
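Error 2 can be illustrated with a small hand-built example. The sketch below is not taken from the authors' R scripts: it uses Python's standard library, invented data values (every group has the same spread), and tabulated two-sided 5% critical values of the t distribution (2.306 for 8 degrees of freedom, 2.120 for 16). It compares the per-sex t tests with the interaction test of a balanced 2 x 2 factorial analysis.

```python
import math
import statistics as st

# Hypothetical data: 5 animals per sex per treatment group, same spread in every group.
spread = [-2, -1, 0, 1, 2]
cells = {
    ("male", "control"): [10.0 + s for s in spread],
    ("male", "treated"): [12.5 + s for s in spread],
    ("female", "control"): [10.0 + s for s in spread],
    ("female", "treated"): [11.5 + s for s in spread],
}
n = len(spread)

T_CRIT_DF8 = 2.306   # two-sided 5% critical value, t distribution, 8 df
T_CRIT_DF16 = 2.120  # two-sided 5% critical value, t distribution, 16 df


def disaggregated_t(sex):
    """Student's t statistic for treated vs control within one sex (Error 2)."""
    ctl, trt = cells[sex, "control"], cells[sex, "treated"]
    pooled_var = (st.variance(ctl) + st.variance(trt)) / 2  # equal group sizes
    return (st.mean(trt) - st.mean(ctl)) / math.sqrt(pooled_var * 2 / n)


def interaction_t():
    """t statistic for the treatment by sex interaction in the 2 x 2 factorial model."""
    mse = st.mean([st.variance(v) for v in cells.values()])  # within-group variance, 16 df
    diff_m = st.mean(cells["male", "treated"]) - st.mean(cells["male", "control"])
    diff_f = st.mean(cells["female", "treated"]) - st.mean(cells["female", "control"])
    # The difference of differences combines four group means, each with variance mse / n.
    return (diff_m - diff_f) / math.sqrt(4 * mse / n)


t_male = disaggregated_t("male")
t_female = disaggregated_t("female")
t_int = interaction_t()

print("males significant on their own:  ", abs(t_male) > T_CRIT_DF8)
print("females significant on their own:", abs(t_female) > T_CRIT_DF8)
print("interaction significant:         ", abs(t_int) > T_CRIT_DF16)
```

In this invented dataset the male test crosses the threshold and the female test does not, yet the interaction test gives no evidence that the treatment effect differs between the sexes. Comparing the two p-values would therefore suggest a sex-specific effect that the factorial analysis does not support.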
Conclusions
The SABV philosophy is to include both sexes to allow a generalisable estimate with the potential to detect if there is a major difference in treatment response between sexes. This inclusive design, when combined with a factorial analysis, allows us to test statistically whether the treatment effect depends on sex. This is driven by a desire to increase the translational confidence of the results and does not require experiments to be powered to detect treatment by sex effects. Our simulations show that for most biologically expected situations (where treatment effects are similar across the sexes or there is a baseline sex difference), there is no need to increase the N; rather, the intended N can be shared across the 2 sexes for a treatment. This strategy implies that power calculations for the treatment effect can be simplified to a 2-group comparison to estimate the total N needed for a treatment, which is then shared between the 2 sexes. Alternatively, power calculations for factorial designs can be performed using other methods (e.g., the Superpower package [25]). When the treatment effect differs only slightly between the sexes (a small treatment by sex interaction), estimating the average effect is ideal for translational understanding as this is a generalisable conclusion. The simulations did demonstrate a loss of power when there is a large treatment by sex interaction (e.g., the effects are in opposite directions or the effect occurs in only 1 sex). In this uncommon scenario and where the N has been split across the 2 sexes, we may fail to detect a significant treatment effect (due to the lower power) but would gain important knowledge suggestive of a large sex dimorphism. Taken together, these conclusions support the recommendation to split the intended N across the 2 sexes.
Our simulations additionally reveal the negative consequences of erroneously pooling or disaggregating the data by sex for analysis.
For pooling, our results show that there is a loss of power to detect a treatment effect across most scenarios, including where there is a baseline sex difference and when there is a treatment by sex interaction. Pooling the data across sexes also necessarily precludes identification of sex-specific effects, thus important biological knowledge would be lost in these scenarios. For disaggregation, when both sexes display a treatment effect, there is less power to detect it in each sex independently than via the main treatment effect term of a factorial analysis. Moreover, disaggregating precludes the detection of interaction effects, thus losing the ability to statistically assert a differential treatment by sex effect.
There are limitations of the simulations that we have carried out, and consequently, the conclusions we have reached in this manuscript. First, the simulations have been conducted on typical research scenarios aimed at determining a change in means between 2 groups and model continuous data with a normal distribution, equal variance, independent observations, and a balanced design. Where the goal of the study is not to test a mean difference in this context, our conclusions may not apply. The conclusions may not extend to a situation where the n is extremely low, and this may also preclude a halving strategy (e.g., halving treatment groups of 3 is infeasible). Our investigations are also conducted in the context of a null hypothesis significance testing scenario where a decision is based on evaluating the p-value against a threshold (most typically p < 0.05). This is an area of significant ongoing discussion and debate [26]. However, given the current prevalence of p-values and their necessity for standard power calculations, we would argue that this limitation has minimal impact.
Critically, the scope of this manuscript is limited to studies where the intention is to estimate a generalisable treatment effect, rather than exploring the dependence of a treatment effect on sex. To facilitate progress in sex inclusion in in vivo research, it is crucial to provide scientists with both evidence and practical resources that challenge the barriers that currently stand in the way of change. This manuscript provides an in-depth analysis to explore the topic of power when both sexes are included to address the barrier that is the belief that inclusion requires an increase in the sample size. In this analysis, we have performed extensive statistical simulations to evaluate power under a range of common biological scenarios when splitting the N across 2 sexes. Critically, we did not identify any common scenarios that result in a loss of power to detect treatment effects. Rarely, large interactions in the data may produce an appreciable decrease in treatment effect power. In these scenarios, we would argue the knowledge gained that the treatment has a differential impact between sexes outweighs the statistical loss of power. Furthermore, if a disease affects both sexes but the effect in the research model is observed in only one, this may bring into question the validity of the model or the generalisability of the treatment. The simulations also demonstrate the pitfalls of some frequent analysis mistakes, including the inappropriate pooling and disaggregation of data collected from 2 sexes, which result in a loss of power to detect a treatment effect compared to a factorial analysis applied to the same data. Additionally, we provide an example pipeline for analysing data collected from both sexes as a practical guide for scientists (S1 Case Study). 
The approaches above heavily depend upon the appropriate application of the correct factorial analysis methods, and it is therefore critical that laboratory scientists receive focused support in developing their statistical capabilities.
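The Conclusions suggest simplifying the power calculation to a 2-group comparison and then sharing the resulting N between the sexes. A minimal sketch of that arithmetic follows, assuming the common normal-approximation formula n = 2((z_{1-alpha/2} + z_{power}) / d)^2 per group for a standardised effect size d. The function names and the example effect size are hypothetical, and an exact t-based calculation (or a dedicated tool such as the Superpower package mentioned above) would typically give a slightly larger n.

```python
import math
from statistics import NormalDist


def n_per_treatment_group(effect_size, alpha=0.05, power=0.8):
    """Approximate animals per treatment group for a two-sided 2-group comparison.

    Normal approximation; effect_size is the mean difference in SD units.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def split_across_sexes(n_group):
    """Share one treatment group's N between the sexes, rounding up to stay balanced."""
    per_sex = math.ceil(n_group / 2)
    return {"male": per_sex, "female": per_sex}


# Illustrative value only: a treatment effect of 1 SD.
n_group = n_per_treatment_group(effect_size=1.0)
allocation = split_across_sexes(n_group)
print(n_group, allocation)
```

When the calculated group size is odd, rounding up adds one animal per treatment group; as the limitations above note, the splitting strategy may not be feasible at all when n is extremely low.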
Methods
Ethics statement
All animal experiments were conducted in accordance with the United Kingdom Animal (Scientific Procedures) Act 1986 and associated guidelines, approved by institutional ethical review committees (Alderley Park Animal Welfare and Ethical Review Board; Babraham Institute Animal Welfare and Ethical Review Board) and conducted under the authority of the Home Office Project Licences (PF344F0A0). All animal facilities have been approved by the United Kingdom Home Office Licensing Authority and meet all current regulations and standards of the United Kingdom.
Statistical simulations: Dataset construction
To explore the impact of sex on power, either as a main effect (baseline sex differences) or when it interacts with the treatment, simulation studies were conducted. In the simulations, representative datasets with 5 animals per treatment group per sex were constructed by randomly sampling from a normal distribution after defining the mean and standard deviation of each treatment group. This process was repeated multiple times for each scenario of interest (N = 1,000), which enabled the subsequent evaluation of statistical power for each analysis pipeline of interest. We moved stepwise through 4 scenarios encompassing possible outcomes from studies testing a treatment effect in both sexes. Thus, the simulations differed by altering the specified means in each group (e.g., baseline sex difference, treatment by sex interaction) (see Table 1).
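The dataset construction described above can be sketched in a few lines. This is an illustration in Python's standard library rather than the authors' R code; the function name, the seed, and the group means are hypothetical and are not the values from Table 1.

```python
import random


def simulate_dataset(means, sd=1.0, n_per_cell=5, rng=None):
    """Draw one artificial dataset: n_per_cell animals per treatment group per sex.

    `means` maps (sex, treatment) to the mean of that group; all groups share `sd`.
    """
    rng = rng or random.Random()
    return {
        cell: [rng.gauss(mu, sd) for _ in range(n_per_cell)]
        for cell, mu in means.items()
    }


# Hypothetical scenario: the treatment raises the mean by 1 SD in both sexes,
# and females have a baseline 0.5 SD above males.
scenario = {
    ("male", "control"): 0.0,
    ("male", "treated"): 1.0,
    ("female", "control"): 0.5,
    ("female", "treated"): 1.5,
}

rng = random.Random(2023)  # fixed seed so the run is repeatable
datasets = [simulate_dataset(scenario, rng=rng) for _ in range(1000)]
print(len(datasets), {cell: len(values) for cell, values in datasets[0].items()})
```

Changing only the `scenario` means is enough to move between the kinds of situations the text lists, such as a baseline sex difference or a treatment by sex interaction.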
Table 1. Details of how the simulation datasets were constructed to represent various biological situations.
https://doi.org/10.1371/journal.pbio.3002129.t001
Statistical simulations: Statistical analysis
The constructed datasets were statistically analysed either using a factorial pipeline or a pooled pipeline. In the factorial pipeline, a regression analysis in R equivalent to a two-way ANOVA provided an assessment of the main effect of treatment, sex, and the interaction of treatment by sex. This was followed by a set of uncorrected post hoc pairwise tests between untreated and treated data in each sex using the R package emmeans. Sex is an unusual biological factor, as it depends on Mendelian randomisation rather than experimenter randomisation in the allocation. The scenario we are considering, where both sex and treatment are included in the experimental design, may also be referred to as a stratified design, particularly in the clinical literature [27]. Here, we use the term factorial throughout as this is the typical terminology used by the biological community. In the pooled pipeline, a Student’s t test was conducted after combining the data across the sexes for each treatment level.
Statistical simulations: Assessment of statistical power
The resulting datasets were analysed using both the factorial and pooled pipelines, and statistical power was defined as the proportion of simulated datasets in which a statistically significant effect was called for the model term of interest, at a significance threshold p < 0.05.
Box 1. Glossary
Common statistical terms used within this manuscript, adapted from [28] and placed in the context of in vivo research, are detailed below:
Effect size: Quantitative measure of differences between groups or strength of relationships between variables.
Factor: Factors are independent categorical variables that the experimenter controls during an experiment in order to determine their effect on the outcome variable. Example factors include sex or treatment.
Factorial design: An experimental design that is used to study 2 or more factors, each with multiple discrete possible values or levels.
Independent variable: A variable that either the experimenter controls (e.g., treatment given or dose) or is a property of the sample (sex) or a technical feature (e.g., batch or cage) that can potentially affect the outcome variable.
Interaction effect: When the effect of one independent variable (factor) depends on the level of another. For example, the observed treatment effect depends on the sex of the animals.
Levels: The values that a factor can take. For example, for the factor sex the levels are male and female.
Main effect: A main effect is the overall effect of one independent variable on the outcome variable averaging across the levels of the other independent variable.
Outcome variable: A variable captured during a study to assess the effects of a treatment. Also known as dependent variable or response variable.
Power: For a predefined, biologically meaningful effect size, the probability that the statistical test will detect the effect if it exists (i.e., the null hypothesis is rejected correctly). Can also be called sensitivity.
Treatment: A process or action that is the focus of the experiment. For example, a drug treatment or a genetic modification.
Sex as a biological variable (SABV): The research philosophy that emphasises the importance of including both sexes in in vivo studies in such a way that a generalisable treatment effect is detectable. Critically, sex should be treated as a variable of primary biological interest. There is no requirement to prospectively power a study to detect a baseline difference between the sexes or treatment by sex interaction, but studies will detect large differences where they exist.
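The factorial and pooled pipelines and the power definition given in the Methods can be sketched as follows. This is an illustration in Python's standard library, not the authors' R and emmeans code. It relies on two stated assumptions: in a balanced 2 x 2 design the two-way ANOVA test of the treatment main effect is equivalent to a t test on the averaged treatment contrast with 16 degrees of freedom, and significance is judged against tabulated two-sided 5% critical values (2.120 for 16 df, 2.101 for 18 df) instead of computed p-values. The scenario means are hypothetical.

```python
import math
import random
import statistics as st

N_CELL = 5                # animals per treatment group per sex, as in the Methods
T_CRIT_FACTORIAL = 2.120  # two-sided 5% critical t, 16 df (4 groups x (5 - 1))
T_CRIT_POOLED = 2.101     # two-sided 5% critical t, 18 df (10 + 10 - 2)


def draw(means, sd, rng):
    """One simulated dataset: normal samples around each (sex, treatment) mean."""
    return {c: [rng.gauss(mu, sd) for _ in range(N_CELL)] for c, mu in means.items()}


def factorial_treatment_t(d):
    """t statistic for the treatment main effect in the balanced 2 x 2 model."""
    m = {c: st.mean(v) for c, v in d.items()}
    mse = st.mean([st.variance(v) for v in d.values()])  # within-group variance
    effect = (m["M", "trt"] + m["F", "trt"] - m["M", "ctl"] - m["F", "ctl"]) / 2
    return effect / math.sqrt(mse / N_CELL)


def pooled_t(d):
    """Student's t statistic after combining the sexes within each treatment level."""
    ctl = d["M", "ctl"] + d["F", "ctl"]
    trt = d["M", "trt"] + d["F", "trt"]
    pooled_var = (st.variance(ctl) + st.variance(trt)) / 2  # equal group sizes
    return (st.mean(trt) - st.mean(ctl)) / math.sqrt(pooled_var * 2 / len(ctl))


def power(means, sd=1.0, n_sim=1000, seed=1):
    """Proportion of simulated datasets in which each pipeline calls the treatment effect."""
    rng = random.Random(seed)
    hits_factorial = hits_pooled = 0
    for _ in range(n_sim):
        d = draw(means, sd, rng)  # both pipelines analyse the same dataset
        hits_factorial += abs(factorial_treatment_t(d)) > T_CRIT_FACTORIAL
        hits_pooled += abs(pooled_t(d)) > T_CRIT_POOLED
    return hits_factorial / n_sim, hits_pooled / n_sim


# Hypothetical scenario: 1.5 SD treatment effect in both sexes, 2 SD baseline sex difference.
baseline_difference = {
    ("M", "ctl"): 0.0, ("M", "trt"): 1.5,
    ("F", "ctl"): 2.0, ("F", "trt"): 3.5,
}
factorial_power, pooled_power = power(baseline_difference)
print(f"factorial: {factorial_power:.2f}  pooled: {pooled_power:.2f}")
```

The pooled t test folds the baseline sex difference into its variance estimate, which is why its power should come out lower here than that of the factorial analysis of the same data.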
Supporting information
S1 Case Study. Example application of factorial analysis for data collected from 2 sexes. This supplementary case study is intended as an example pipeline for analysing data from in vivo experiments collected from both sexes. It is not intended as an exhaustive tutorial, and there are other appropriate methods for analysing the type of data that we are presenting.
https://doi.org/10.1371/journal.pbio.3002129.s001 (DOCX)
Acknowledgments
We are grateful to Lorraine Miller for technical assistance with the experimental work and Chris Heath and Esther Pearl for helpful comments on earlier drafts of the manuscript.
[END]
---
[1] Url:
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002129
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.