An empirical appraisal of eLife’s assessment vocabulary
Tom E. Hardwicke (Melbourne School of Psychological Sciences, University of Melbourne, Melbourne), Sarah R. Schiavone, Beth Clarke, Simine Vazire
Date: 2024-08
Abstract Research articles published by the journal eLife are accompanied by short evaluation statements that use phrases from a prescribed vocabulary to evaluate research on 2 dimensions: importance and strength of support. Intuitively, the prescribed phrases appear to be highly synonymous (e.g., important/valuable, compelling/convincing) and the vocabulary’s ordinal structure may not be obvious to readers. We conducted an online repeated-measures experiment to gauge whether the phrases were interpreted as intended. We also tested an alternative vocabulary with (in our view) a less ambiguous structure. A total of 301 participants with a doctoral or graduate degree used a 0% to 100% scale to rate the importance and strength of support of hypothetical studies described using phrases from both vocabularies. For the eLife vocabulary, most participants’ implied ranking did not match the intended ranking on both the importance (n = 59, 20% matched, 95% confidence interval [15% to 24%]) and strength of support dimensions (n = 45, 15% matched [11% to 20%]). By contrast, for the alternative vocabulary, most participants’ implied ranking did match the intended ranking on both the importance (n = 188, 62% matched [57% to 68%]) and strength of support dimensions (n = 201, 67% matched [62% to 72%]). eLife’s vocabulary tended to produce less consistent between-person interpretations, though the alternative vocabulary still elicited some overlapping interpretations away from the middle of the scale. We speculate that explicit presentation of a vocabulary’s intended ordinal structure could improve interpretation. Overall, these findings suggest that more structured and less ambiguous language can improve communication of research evaluations.
Citation: Hardwicke TE, Schiavone SR, Clarke B, Vazire S (2024) An empirical appraisal of eLife’s assessment vocabulary. PLoS Biol 22(8): e3002645.
https://doi.org/10.1371/journal.pbio.3002645 Academic Editor: Christopher D. Chambers, Cardiff University, United Kingdom. Received: April 7, 2024; Accepted: July 9, 2024; Published: August 22, 2024
Note: As this is a Preregistered Research Article, the study design and methods were peer-reviewed before data collection. The time to acceptance includes the experimental time taken to perform the study.
Copyright: © 2024 Hardwicke et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data, materials, and analysis scripts are publicly available on the Open Science Framework (https://osf.io/mw2q4/files/osfstorage/). A reproducible version of the manuscript and associated computational environment is available in a Code Ocean container (https://doi.org/10.24433/CO.4128032.v1).
Funding: This study was supported by funding awarded to SV and TEH from the Melbourne School of Psychological Sciences, University of Melbourne. The funders did not play any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: SV is a member of the board of directors of The Public Library of Science (PLOS). This role has in no way influenced the outcome or development of this work or the peer-review process, nor does it alter our adherence to PLOS Biology policies on sharing data and materials. All other authors declare they have no conflicts of interest.
Introduction Peer review is usually a black box—readers only know that a research paper eventually surpassed some ill-defined threshold for publication and rarely see the more nuanced evaluations of the reviewers and editor [1]. A minority of journals challenge this convention by making peer review reports publicly available [2]. One such journal, eLife, also accompanies articles with short evaluation statements (“eLife assessments”) representing the consensus opinions of editors and peer reviewers [3]. In 2022, eLife stated that these assessments would use phrases drawn from a common vocabulary (Table 1) to convey their judgements on 2 evaluative dimensions: (1) “significance”; and (2) “strength of support” (for details see [4]). For example, a study may be described as having “landmark” significance and offering “exceptional” strength of support (for a complete example, see Box 1). The phrases are drawn from “widely used expressions” in prior eLife assessments and the stated goal is to “help convey the views of the editor and the reviewers in a clear and consistent manner” [4]. Here, we report a study which assessed whether the language used in eLife assessments is perceived clearly and consistently by potential readers. We also assessed alternative language that may improve communication.
Table 1. Phrases and their definitions (italicised) from the eLife vocabulary representing 2 evaluative dimensions: significance and strength of support. The significance dimension is represented by 5 phrases and the strength of support dimension is represented by 6 phrases. In a particular eLife assessment, readers only see 1 phrase from each of the evaluative dimensions. Phrases are accompanied by eLife definitions, but these are not shown in eLife assessments (though some words from the definitions may be used).
https://doi.org/10.1371/journal.pbio.3002645.t001
Our understanding (based on [4]) is that eLife intends the common vocabulary to represent different degrees of each evaluative dimension on an ordinal scale (e.g., “landmark” findings are more significant than “fundamental” findings and so forth); however, in our view the intended ordering is sometimes ambiguous or counterintuitive. For example, it does not seem obvious to us that an “important” study is necessarily more significant than a “valuable” study, nor does a “compelling” study seem necessarily stronger than a “convincing” study. Additionally, several phrases, such as “solid” and “useful,” could be broadly interpreted, leading to a mismatch between intended meaning and perceived meaning. The phrases also do not cover the full continuum of measurement and are unbalanced in terms of positive and negative phrases. For example, the “significance” dimension has no negative phrases—the scale endpoints are “landmark” and “useful.” We also note that the definitions provided by eLife do not always map onto gradations of the same construct. For example, the eLife definitions of phrases on the significance dimension suggest that the difference between “useful,” “valuable,” and “important” is a matter of breadth/scope (whether the findings have implications beyond a specific subfield), whereas the difference between “fundamental” and “landmark” is a matter of degree. In short, we are concerned that several aspects of the eLife vocabulary may undermine communication of research evaluations to readers. In Table 2, we outline an alternative vocabulary that is intended to overcome these potential issues with the eLife vocabulary. Phrases in the alternative vocabulary explicitly state the relevant evaluative dimension (e.g., “support”) along with a modifying adjective that unambiguously represents degree (e.g., “very low”). The alternative vocabulary is intended to cover the full continuum of measurement and be balanced in terms of positive and negative phrases. We have also renamed “significance” to “importance” to avoid any confusion with statistical significance. We hope that these features will facilitate alignment of readers’ interpretations with the intended interpretations, improving the efficiency and accuracy of communication.
Box 1. A complete example of an eLife assessment. This particular example uses the phrase “important” to convey the study’s significance and the phrase “compelling” to convey the study’s strength of support: “The overarching question of the manuscript is important and the findings inform the patterns and mechanisms of phage-mediated bacterial competition, with implications for microbial evolution and antimicrobial resistance. The strength of the evidence in the manuscript is compelling, with a huge amount of data and very interesting observations. The conclusions are well supported by the data. This manuscript provides a new co-evolutionary perspective on competition between lysogenic and phage-susceptible bacteria that will inform new studies and sharpen our understanding of phage-mediated bacterial co-evolution.” [5].
Table 2. Phrases from the alternative vocabulary representing 2 evaluative dimensions: importance and strength of support. Each dimension is represented by 5 phrases.
https://doi.org/10.1371/journal.pbio.3002645.t002
The utility of eLife assessments will depend (in part) on whether readers interpret the common vocabulary in the manner that eLife intends. Mismatches between eLife’s intentions and readers’ perceptions could lead to inefficient or inaccurate communication. In this study, we empirically evaluated how the eLife vocabulary (Table 1) is interpreted and assessed whether an alternative vocabulary (Table 2) elicited more desirable interpretations. Our goal was not to disparage eLife’s progressive efforts, but to make a constructive contribution towards a more transparent and informative peer review process. We hope that a vocabulary with good empirical performance will be more attractive and useful to other journals considering adopting eLife’s approach. Our study is modelled on prior studies that report considerable individual differences in people’s interpretation of probabilistic phrases [6–12]. In a prototypical study of this kind, participants are shown a probabilistic statement like “It will probably rain tomorrow” and asked to indicate the likelihood of rain on a scale from 0% to 100%. Analogously, in our study, participants read statements describing hypothetical scientific studies using phrases drawn from the eLife vocabulary or the alternative vocabulary and were asked to rate the study’s significance/importance or strength of support on a scale from 0 to 100. We used these responses to gauge the extent to which people’s interpretations of the vocabulary were consistent with each other and consistent with the intended rank order (a minimal illustration of this ranking comparison is sketched after the research aims below).
Research aims Our overarching goal was to identify clear language for conveying evaluations of scientific papers. We hope that this will make it easier for other journals/platforms to follow in eLife’s footsteps and move towards more transparent and informative peer review. With this overall goal in mind, we had 3 specific research aims:
Aim One. To what extent do people share similar interpretations of phrases used to describe scientific research?
Aim Two. To what extent do people’s (implied) rankings of phrases used to describe scientific research align with (a) each other; and (b) the intended ranking?
Aim Three. To what extent do different phrases used to describe scientific research elicit overlapping interpretations and do those interpretations imply broad coverage of the underlying measurement scale?
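To make the ranking comparison in Aim Two concrete, the minimal sketch below (an illustration only, not the authors’ analysis code; the study’s scripts are available on the OSF) shows one way a single participant’s 0 to 100 ratings could be converted into an implied ranking, checked against the intended ordering of the eLife significance phrases, and summarised as a count of discordant phrase pairs. The example ratings are hypothetical, and the tie handling and mismatch metric are simplifying assumptions.

```python
# Minimal illustrative sketch (not the authors' analysis script; the study's
# scripts are available on the OSF). Derives one participant's implied ranking
# from hypothetical 0-100 ratings, checks it against the intended ordering of
# the eLife significance phrases, and counts discordant phrase pairs.

from itertools import combinations

# Intended ordering as we read it, from most to least significant.
INTENDED_ORDER = ["landmark", "fundamental", "important", "valuable", "useful"]

# Hypothetical ratings from a single participant (0-100 scale).
ratings = {"landmark": 95, "fundamental": 80, "important": 70,
           "valuable": 75, "useful": 50}

def implied_ranking(ratings):
    """Order phrases from highest to lowest rating (ties broken arbitrarily;
    tie handling is a simplifying assumption of this sketch)."""
    return sorted(ratings, key=ratings.get, reverse=True)

def discordant_pairs(implied, intended):
    """Count phrase pairs whose relative order in the implied ranking
    contradicts the intended ranking (a Kendall-tau-style count)."""
    intended_pos = {p: i for i, p in enumerate(intended)}
    implied_pos = {p: i for i, p in enumerate(implied)}
    return sum(
        1 for a, b in combinations(intended, 2)
        if (intended_pos[a] - intended_pos[b]) * (implied_pos[a] - implied_pos[b]) < 0
    )

implied = implied_ranking(ratings)
print("Implied ranking:", implied)
print("Matches intended ranking:", implied == INTENDED_ORDER)          # False in this example
print("Discordant pairs:", discordant_pairs(implied, INTENDED_ORDER))  # 1
```

In the study itself, an implied ranking counted as matched only if it reproduced the intended order exactly; the discordant-pair count above is one plausible way (among others) to express the degree of mismatch discussed below in relation to Fig 5.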
Discussion Research articles published in eLife are accompanied by evaluation statements that use phrases from a prescribed vocabulary (Table 1) to describe a study’s importance (e.g., “landmark”) and strength of support (e.g., “compelling”). If readers, reviewers, and editors interpret the prescribed vocabulary differently to the intended meaning, or inconsistently with each other, it could lead to miscommunication of research evaluations. In this study, we assessed the extent to which people’s interpretations of the eLife vocabulary are consistent with each other and consistent with the intended ordinal structure. We also examined whether an alternative vocabulary (Table 2) improved consistency of interpretation.
Overall, the empirical data supported our initial intuitions: while some phrases in the eLife vocabulary were interpreted relatively consistently (e.g., “exceptional” and “landmark”), several phrases elicited broad interpretations that overlapped a great deal with other phrases’ interpretations (particularly “fundamental,” “important,” and “valuable” on the significance/importance dimension (Fig 2) and “compelling,” “convincing,” and “solid” on the strength of support dimension (Fig 3)). This suggests these phrases are not ideal for discriminating between studies with different degrees of importance and strength of support. If the same phrases often mean different things to different people, there is a danger of miscommunication between the journal and its readers. Responses on the significance/importance dimension were largely confined to the upper half of the scale, which is unsurprising given the absence of negative phrases. It is unclear if the exclusion of negative phrases was a deliberate choice on the part of eLife’s leadership (because articles with little importance would not be expected to make it through editorial triage) or an oversight. Most participants’ implied rankings of the phrases were misaligned with the ranking intended by eLife—20% of participants had aligned rankings on the significance/importance dimension and 15% had aligned rankings on the strength of support dimension (Fig 4). The degree of mismatch was typically in the range of 1 or 2 discordant ranks (Fig 5). Heat maps (Fig 6) highlighted that phrases in the middle of the scale (e.g., “solid,” “convincing”) were most likely to have discordant ranks.
By contrast, phrases in the alternative vocabulary tended to elicit more consistent interpretations across participants and interpretations that had less overlap with other phrases (Figs 2 and 3 and Tables 3 and 4). The alternative vocabulary was more likely to elicit implied rankings that matched the intended ranking—62% of participants had aligned rankings on the significance/importance dimension and 67% had aligned rankings on the strength of support dimension (Fig 4). Mismatched rankings were usually misaligned by one rank (Fig 5). Although the alternative vocabulary performed better than the eLife vocabulary, it was nevertheless imperfect. Specifically, interpretations of phrases away from the middle of the scale on both dimensions (e.g., “low importance” and “very low importance”) tended to have some moderate overlap (Figs 2, 3, and 6). We do not know what caused this overlap, but, as discussed in the next paragraph, one possibility is that it is overly optimistic to expect people’s intuitions to align when they judge phrases in isolation, without any knowledge of the underlying scale.
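The headline proportions above can be recomputed from the participant counts reported in the abstract. The short sketch below does so with a 95% Wilson score interval; the interval method actually used in the paper is not stated in this excerpt, so the Wilson interval (and the resulting bounds) should be read as an illustrative assumption rather than a reproduction of the published values.

```python
# Illustrative only: recompute "percentage matched" with a 95% Wilson score
# interval from the participant counts reported in the abstract (N = 301).
# The interval method used in the paper is not stated in this excerpt, so the
# Wilson interval is an assumption and the bounds may differ slightly from the
# published confidence intervals.

from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin

N = 301
counts = [("eLife vocabulary, importance", 59),
          ("eLife vocabulary, strength of support", 45),
          ("alternative vocabulary, importance", 188),
          ("alternative vocabulary, strength of support", 201)]

for label, matched in counts:
    lo, hi = wilson_ci(matched, N)
    print(f"{label}: {matched / N:.0%} matched, 95% CI [{lo:.0%}, {hi:.0%}]")
```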
Rather than presenting evaluative phrases in isolation (as occurs for eLife readers and occurred for participants in our study), informing people of the underlying ordinal scale may help to improve communication of evaluative judgements. eLife could refer readers to an external explanation of the vocabulary; however, prior research on interpretation of probabilistic phrases suggests this may be insufficient as most people neglect to look up the information [6,19]. A more effective option might be to explicitly present the phrases in their intended ordinal structure [19]. For example, the full importance scale could be attached to each evaluation statement with the relevant phrase selected by reviewers/editors highlighted (Fig 7A). Additionally, phrases could be accompanied by mutually exclusive numerical ranges (Fig 7B); prior research suggests that this can improve consistency of interpretation for probabilistic phrases [19]. It is true that the limits of such ranges are arbitrary, and editors may be concerned that using numbers masks vague subjective evaluations in a veil of objectivity and precision. To some extent we share these concerns; however, the goal here is not to develop an “objective” measurement of research quality, but to have practical guidelines that improve accuracy of communication. Specifying a numerical range may help to calibrate the interpretations of evaluators and readers so that the uncertainty can be accurately conveyed. Future research could also explore the relationship between the number of items included in the vocabulary and the level of precision that reviewers/editors wish to communicate.
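As a concrete illustration of the presentation idea behind Fig 7A and 7B, the sketch below renders an evaluation with the full significance scale visible and the phrase selected by reviewers/editors highlighted, optionally annotated with mutually exclusive numerical ranges. The ordering reflects our reading of the eLife significance vocabulary; the specific numerical ranges are arbitrary placeholders for illustration, not values proposed by eLife or endorsed here.

```python
# Sketch of the presentation idea in Fig 7A/7B: show the full ordinal scale with
# the selected phrase highlighted, optionally with mutually exclusive numerical
# ranges. Phrase order reflects our reading of the eLife significance vocabulary;
# the numerical ranges are arbitrary placeholders, not values proposed by eLife.

SIGNIFICANCE_SCALE = ["useful", "valuable", "important", "fundamental", "landmark"]  # low -> high
PLACEHOLDER_RANGES = {"useful": "0-20", "valuable": "21-40", "important": "41-60",
                      "fundamental": "61-80", "landmark": "81-100"}

def render_scale(selected, show_ranges=False):
    """Render the whole scale with the reviewers' selected phrase highlighted."""
    parts = []
    for phrase in SIGNIFICANCE_SCALE:
        label = f"{phrase} ({PLACEHOLDER_RANGES[phrase]})" if show_ranges else phrase
        parts.append(f"[{label.upper()}]" if phrase == selected else label)
    return " < ".join(parts)

print(render_scale("important"))
# useful < valuable < [IMPORTANT] < fundamental < landmark
print(render_scale("important", show_ranges=True))
# useful (0-20) < valuable (21-40) < [IMPORTANT (41-60)] < fundamental (61-80) < landmark (81-100)
```

A multi-reviewer display (Fig 7C) could be rendered analogously by highlighting one phrase per reviewer instead of a single consensus phrase.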
Fig 7. Explicit presentation of the intended ordinal structure (A–C), potentially with numerical ranges (B), could improve consistency of interpretation. Judgements by multiple reviewers could also be represented (by different arrows) without forcing consensus (C).
https://doi.org/10.1371/journal.pbio.3002645.g007
Our study has several important limitations. First, we did not address whether editor/reviewer opinions provide valid assessments of studies or whether the vocabularies provide valid measurements of those opinions. We also note that eLife assessments are formed via consensus, rather than representing the opinions of individuals, which raises questions about how social dynamics may affect the evaluation outcomes. It may be more informative to solicit and report individual assessments from each peer reviewer and editor, rather than force a consensus (e.g., see Fig 7C). Although these are important issues, they are beyond the scope of this study, which is focused on clarity of communication.
Second, we are particularly interested in how the readership of eLife interpret the vocabularies, but because we do not have any demographic information about the readership, we do not know the extent to which our sample is similar to that population. We anticipated that the most relevant demographic characteristics were education status (because the content is technical), knowledge of subject area (because eLife publishes biomedical and life sciences research), and language (because the content is in English). All of our participants reported speaking fluent English, the vast majority had doctoral degrees, and about one third had a degree in the Biomedical and Life Sciences. Relative to this sample, we expect the eLife readership probably consists of more professional scientists, but otherwise we think the sample is likely to be a good match to the target population. Also note that eLife explicitly states that eLife assessments are intended to be accessible to non-expert readers [4]; therefore, our sample is still a relevant audience, even if it might contain fewer professional scientists than eLife’s readership.
Third, to maintain experimental control, we presented participants with very short statements that differed only in terms of the phrases we wished to evaluate. In practice, however, these phrases will be embedded in a paragraph of text (e.g., Box 1) which may also contain “aspects” of the vocabulary definitions (Table 1) “when appropriate” [4]. It is unclear whether the inclusion of text from the intended phrase definitions will help to disambiguate the phrases, and future research could explore this.
Fourth, participants were asked to respond to phrases with a point estimate; however, it is likely that a range of plausible values would more accurately reflect their interpretations [9,11]. Because asking participants to respond with a range (rather than a point estimate) creates technical and practical challenges in data collection and analysis, we opted to obtain point estimates only.
Conclusion Overall, our study suggests that using more structured and less ambiguous language can improve communication of research evaluations. Relative to the eLife vocabulary, participants’ interpretations of our alternative vocabulary were more likely to align with each other, and with the intended interpretation. Nevertheless, some phrases in the alternative vocabulary were not always interpreted as we intended, possibly because participants were not completely aware of the vocabulary’s underlying ordinal scale. Future research, in addition to finding optimal words to evaluate research, could attempt to improve interpretation by finding optimal ways to present them.
Acknowledgments The Stage 1 version of this preregistered research article has been peer-reviewed and recommended by Peer Community in Registered Reports (https://rr.peercommunityin.org/articles/rec?id=488).
Research transparency statement The research question, methods, and analysis plan were peer reviewed and preregistered as a Stage 1 Registered Report via the Peer Community in Registered Reports platform (https://doi.org/10.17605/OSF.IO/MKBTP). There was only 1 minor deviation from the preregistered protocol (target sample size = 300; actual sample size = 301). A Peer Community in Registered Reports design table is available in S4 Text. All data, materials, and analysis scripts are publicly available on the Open Science Framework (https://osf.io/mw2q4/files/osfstorage/). A reproducible version of the manuscript and associated computational environment is available in a Code Ocean container (https://doi.org/10.24433/CO.4128032.v1).