


Teaching students to R3eason, not merely to solve problem sets: The role of philosophy and visual data communication in accessible data science education [1]

Ilinca I. Ciubotariu; Department of Biological Sciences, Purdue University, West Lafayette, Indiana, United States of America; Department of Molecular Microbiology and Immunology, Johns Hopkins Bloomberg School of Public Health; Center for Innovation in Science Education

Date: 2023-07

Much guidance on statistical training in STEM fields has focused largely on the undergraduate cohort, with graduate education often absent from the equation. Training in quantitative methods and reasoning is critical for graduate students in biomedical and science programs to foster reproducible and responsible research practices. We argue that graduate education should center more on fundamental reasoning and integration skills than on presenting one statistical test after another without conveying the bigger picture or the critical argumentation skills that enable students to improve research integrity through rigorous practice. Herein, we describe the approach we take in a quantitative reasoning course in the R3 program at the Johns Hopkins Bloomberg School of Public Health: an error-focused lens built on visualization and communication competencies. Specifically, we take this perspective, which stems from the documented causes of irreproducibility, and apply it to the many aspects of good statistical practice in science, ranging from experimental design to data collection and analysis to the conclusions drawn from the data. We also provide tips and guidelines for implementing and adapting our course material in various graduate biomedical and STEM science programs.

Funding: In this work, G.B. was supported in part by the National Institute of Allergy and Infectious Diseases (award number R25AI159447). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2023 Ciubotariu, Bosch. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

In the past decades, there has been a growing acknowledgement of the need to improve the practice of science through better education in research integrity: Responsibility, Rigor, and Reproducibility, the 3 "R" core norms of good science [1–5]. For instance, at the Johns Hopkins Bloomberg School of Public Health, in the R3 Center for Innovation in Science Education (R3ISE), we develop graduate programs and curricula focusing on those core principles by including formal training in critical thinking across research disciplines, ethics in science and society, the nature and sources of mistakes, and responsible science communication [5–7]. Among the tenets of good research conduct falls good statistical practice, which encompasses the full data pipeline, from experimental design and conduct to data collection and curation, data analyses, interpretation of results in context, full reporting, and appropriate quantitative and logical understanding [8,9].

Previously published work has emphasized the separation between statistics and biology coursework at the undergraduate level [10] and has noted that biology students often lack statistical training [11]; this can lead to statistical errors made by biologists at the professional stage, including statistical errors in high-impact journals [12,13]. There have been many calls to change statistical training broadly, with approaches like including problem solving [14], incorporating statistics into biology curricula [12,15], and developing data science initiatives for life science education [16]; however, this effort has been concentrated at the college level.

At the graduate level, in our experience [5] and that of others [17,18], there are learning gaps in statistical training in the sciences, and graduate school is often the first time these students receive advanced training in statistics [19]. For instance, in numerous seminars, committee meetings, and oral exams that we and colleagues attended (Personal Communications, Gundula Bosch), we frequently observed students uncritically operating statistical software, unable to provide sound rationales for the application of a particular method or even to make sense of statistical data. Thus, our motivation is to train students who need to learn the basics of data science in their field to properly reason, not merely to solve problem sets. This led us to develop a course for biomedical graduate learners in which we adapted our pedagogical approach to quantitative reasoning to include philosophical elements in statistics through exercises in reasoning and error reduction in statistical logic. In principle, this course could also be of particular utility to advanced undergraduate students interested in STEM careers, although background knowledge of public health and the biomedical sciences is highly encouraged because many of the manuscripts we use as examples and resources in the course require some topic understanding. Nonetheless, the foundational course material could be adapted and taught to other cohorts of interest.

This commentary describes our general approach and offers tips for other educators on how to adapt it to develop their own course modules in a similar fashion (Table 1). Our overall goal is to help improve research integrity through teaching good research practices, using material centered around errors and fallacies that teaches not only technical skills but also aspects of applied logic and ethics, as well as fundamentals of responsible communication and data visualization.

Curriculum overview

Data science education from an error point of view

The lack of reproducibility of key scientific findings has been a growing concern across many disciplines, in tandem with rising numbers of retractions [20,21]. The reasons behind the reproducibility debate are multifaceted, ranging from insufficient training in rigorous research methods, to sloppy literature outputs, to outright misconduct [22–25]. Additionally, logical fallacies, statistical mistakes during data analysis and interpretation, inadequate reporting of statistical techniques, and erroneous communication are important contributing factors. For instance, in fields like neuroscience, cell biology, and medicine (but certainly not limited to these disciplines), numerous papers were found to have used inappropriate statistical techniques to compare experimental effects [26], to contain invalid statistical analyses [27], or to contain multiple statistical flaws [28,29]. Hence, the rationale for approaching data science education from an error point of view stems from this literature-supported experience that mathematical and statistical mistakes make up a significant portion of mishaps in scientific practice [30]. Across all of the courses in the R3 program, we refer to "errors" in science broadly as mistakes that stem from conceptual misunderstandings; lack of good practice skills in hypothesis testing, data analysis, and data management; and miscommunication [1]. This approach is broader than the statistical understanding of Type 1 or Type 2 errors; although it allows us to cover those 2 specific errors in our quantitative reasoning course, we also cover other mistakes in science that may occur in the statistical realm. The literature makes clear that a multitude of errors can occur throughout the entire statistical process, from the design of a study, to data analysis, to the documentation of the statistical methods applied, to the presentation of study data, and finally to the interpretation of study findings. In other words, these range from failure to use and report randomization, to not reporting test assumptions, to presenting p-values alone as results and as the basis of interpretation [29]. Incorporating this lens in our course allows us to focus on the correction of science and to create discussion for scientific advancement [31].

Teaching approach

The format of the course is balanced between short-style lectures, fireside chats with experts in the field to bring some of these statistical concepts to life, and various discussions and activities (see Table 1 for more details). The activities apply the lecture content in some form, but each also has a few specific learning objectives, depending on the topic at hand in a given week. The students discuss the use of statistical tests and learn to understand them, although we do not place great emphasis on actual calculations. They also become versed in reading the scientific literature, with examples of errors and of statistical tests in practice and application. Further, we encourage the use of software like R, Tableau, or Excel for preparing the data visualization exercises (since this program is housed in the School of Public Health, we usually have students across departments and at different levels of expertise with programming languages and skills, so we provide some examples through the course material).
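As an illustration of the kind of visualization exercise this enables, the following minimal sketch (simulated data, written in R with the ggplot2 package; it is not taken from the actual course materials) contrasts a summary-only "dynamite plot" with a display that shows every observation:

```r
## Illustrative sketch with simulated data (not course material): the same
## measurements shown as a bar of group means versus as the individual
## observations that the bar hides.
library(ggplot2)

set.seed(42)
dat <- data.frame(
  group = rep(c("Control", "Treated"), each = 20),
  value = c(rnorm(20, mean = 10, sd = 3), rnorm(20, mean = 12, sd = 3))
)

## Less informative: bar of group means with standard-error bars
p_bar <- ggplot(dat, aes(x = group, y = value)) +
  stat_summary(fun = mean, geom = "bar", fill = "grey80") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2) +
  labs(title = "Group means only", y = "Measurement")

## More informative: every observation plus the group means
p_points <- ggplot(dat, aes(x = group, y = value)) +
  geom_jitter(width = 0.1, alpha = 0.6) +
  stat_summary(fun = mean, geom = "point", size = 3, colour = "red") +
  labs(title = "All observations with group means", y = "Measurement")

print(p_bar)
print(p_points)
```

Placing the two displays side by side gives students something concrete to critique when deciding which one communicates the data more responsibly.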
Below, we describe our material in detail and present the activities and discussions we incorporate into our lessons.

Hypothesis testing and p-values

First, we begin with the integration of data science into the scientific life cycle [32] and connect this to effective data visualization and storytelling through graphical representations. We travel through time from John Snow's maps of the 1854 Broad Street Pump cholera outbreak [33] to the present day, with examples of SARS-CoV-2 graphical summaries of research that were not the most effective in conveying important data and, in turn, public health action [34–37].

We then introduce a hypothesis testing framework, which serves as a guide for all steps in research, from establishing a research question and identifying appropriate statistical tests to interpreting a test statistic outcome and communicating one's findings. Hypothesis testing as conceived by Neyman and Pearson, in which a researcher considers null and alternative hypotheses to answer a scientific question, can lead to Type 1 and Type 2 errors, in which a true null hypothesis is rejected or a false null hypothesis is not rejected, respectively [38]. In our discussion of hypothesis testing, we emphasize the importance of planning an experiment and teach practical applications of common tests like t tests, ANOVA, and post hoc tests, focusing on the reasoning behind the applications. We also provide examples from the literature of potential errors that might occur in the context of hypothesis testing (such as using unpaired tests for paired data, or using unequal sample sizes for paired t tests; see the sketch below), thereby emphasizing the "error lens." Moreover, we highlight the need for clear and sufficient description of methods in publications [39], as this information is key for the transparent reporting element of responsible science communication.
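As a minimal sketch of the paired-versus-unpaired error mentioned above (simulated, hypothetical before/after data in base R; not taken from the course materials), the following contrasts the two analyses on the same measurements:

```r
## Illustrative sketch with simulated paired data (hypothetical example):
## the same measurements analyzed with an unpaired and with a paired t test.
set.seed(1)
before <- rnorm(15, mean = 100, sd = 10)        # baseline measurement per subject
after  <- before - rnorm(15, mean = 3, sd = 4)  # second measurement on the same subjects

## Error-lens example: treating the two columns as independent groups
t.test(before, after, paired = FALSE)

## Appropriate analysis: accounts for the within-subject pairing and
## typically yields a very different p-value on the same data
t.test(before, after, paired = TRUE)
```

Because the two measurements are highly correlated by construction, ignoring the pairing inflates the apparent variability and can change the conclusion drawn from the same data.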
Interwoven into the theory of hypothesis testing, taught from a problem-centric point of view, is the p-value debate: the p-value provides a measure of the strength of evidence against a null hypothesis and has stirred much controversy due to its improper use [40–42].

"The p-value was never intended to be a substitute for scientific reasoning." –Ron Wasserstein [43]

We take a philosophical approach when teaching p-values, starting from Fisher's initial work, in which he described their use as an index for measuring the strength of evidence against the null hypothesis [44]. Using an error lens, literature examples help us illustrate instances of less rigorous or improper research practice and misuse of p-values and statistical inference [8,45–47]. We describe the recent state of the field as presented by statisticians across the scientific literature [8,46,48–52], along with the argument that general misuse of statistical methods and p-values has played a large part in science's reproducibility crisis [8,47,53]. Further, we discuss the various perspectives [54] in the p-value debate, including the opinion that p-values may not be a useful metric even when used "correctly," the view that p-values should be used in tandem with other metrics such as confidence intervals, and the suggestion that the "gold-standard" p-value threshold should be redefined as 0.005 [8,50,55–57]. The p-value has been reinterpreted as an "observed error rate" and as a measure of evidence; it is not used today as it was initially intended [42,58,59], and it should not represent the be-all and end-all of a study, although in many cases it does.

"Over time it appears the p-value has become a gatekeeper for whether work is publishable, at least in some fields." –Jessica Utts [43]

Null results, publication bias, HARKing, p-hacking, and more

Related to hypothesis testing, we highlight the concept of "null results," or experiments that produce results that do not support a hypothesis made prior to conducting the study [60]. Often, null findings do not end up published in the scientific literature; this is publication bias, the tendency of journals to prioritize positive findings for publication, and one of the 4 horsemen of the reproducibility apocalypse [61–63]. For instance, a study of social science experiments found that the majority of null results, in contrast to "strong" results, which are usually published in top journals, remain unwritten [60,64]. Also called the "file drawer phenomenon" [65], this can be detrimental to the scientific community, as null results are not necessarily empty of biological meaning and can be as informative as "statistically significant" results that lack replication [66]. This also accentuates the "pressure to publish" that affects the quality of rigorous research and can lead to instances of irreproducibility through practices like p-hacking, selective reporting, or data dredging [67], that is, performing statistical analyses repeatedly until nonsignificant results become significant and reporting only the significant ones [68,69]. Similarly, HARKing, or hypothesizing after the results are known [70], is another questionable research practice that we emphasize in our teaching through fireside chats and case studies we have developed with short examples related to the biomedical sciences. We also include other issues such as the multiple testing problem, which arises when a set of statistical inferences is considered simultaneously (see the simulation sketch at the end of this subsection). In our course, we present existing proposals and available solutions for addressing these issues, such as the recommendation of "results-blind evaluation" of manuscripts [71]. Recently, some journals have moved toward welcoming the publication of negative findings and recognizing their value in addressing important questions while being methodologically robust [61]. In another effort to combat this problem, researchers can preregister their studies or write registered reports [72], which allow for in-principle acceptance after an initial peer review conducted prior to data collection, thus emphasizing study design and sound methodology rather than "significant" results. Not surprisingly, this has so far been shown to reduce publication bias and increase replication studies [73,74]. We teach that p-values cannot be interpreted in isolation, and statisticians have offered suggestions in this direction for many years; for instance, instead of thresholding, interpretation needs context from other robust measures such as sample size, effect sizes, and confidence intervals [51,52,75–77].
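To make the multiple testing problem tangible, a short simulation like the following can be shown (our own illustrative sketch in base R, not taken from the course materials); it runs many tests on pure noise and then applies a standard multiplicity correction:

```r
## Illustrative simulation: with no real effect anywhere, a 0.05 threshold
## still flags roughly 5% of tests as "significant," and the problem grows
## with the number of tests performed.
set.seed(7)
n_tests <- 1000
p_values <- replicate(n_tests, t.test(rnorm(20), rnorm(20))$p.value)

mean(p_values < 0.05)                  # ~0.05: false positives under a true null

## Adjusting for multiple comparisons (e.g., Benjamini-Hochberg) reins this in
p_adjusted <- p.adjust(p_values, method = "BH")
mean(p_adjusted < 0.05)                # essentially 0 after correction here
```

Even with no real effects anywhere, roughly 1 test in 20 crosses the 0.05 threshold, which is exactly the Type 1 error rate discussed above.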
Another example of a session in which we relate course content to the potential consequences of error or misuse is a discussion of the chi-square test for independence, Pearson correlation, and the logical fallacy of false causality. Cum hoc ergo propter hoc ("with this, therefore because of this") reminds us that tests meant to quantify relationships between variables, such as correlations, should not be used to establish cause-and-effect relationships, and we ask students to provide examples from the scientific literature; a minimal sketch of such an exercise follows at the end of this subsection. The overarching debate over the p-value and related issues clearly shows that institutional changes are needed, including better training and educational resources that highlight reasoning and error reduction, and this is what our course aims to accomplish [9,53,78].
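A minimal sketch of what such an exercise might look like in R (simulated data; not taken from the course materials) pairs the two tests with the interpretive caveat:

```r
## Illustrative sketch with simulated data: a chi-square test of independence
## and a Pearson correlation. A small p-value in either case describes an
## association, not a cause-and-effect relationship.
set.seed(3)

## Chi-square test of independence on a 2x2 contingency table
exposure <- factor(sample(c("exposed", "unexposed"), 200, replace = TRUE))
outcome  <- factor(ifelse(runif(200) < ifelse(exposure == "exposed", 0.6, 0.4),
                          "case", "control"))
chisq.test(table(exposure, outcome))

## Pearson correlation between two related continuous variables
x <- rnorm(100)
y <- 0.5 * x + rnorm(100, sd = 0.8)   # x and y are associated by construction
cor.test(x, y, method = "pearson")    # a significant correlation is not causation
```

Both tests can return very small p-values, yet neither, on its own, licenses a causal claim; that step requires design-based reasoning of the kind the course emphasizes.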

Responsible communication: Essential to scientific reasoning

As part of the R3 program, we have previously highlighted the importance of rigorous research and presented our framework for responsible scientific communication training as a means by which research integrity can be improved [7]. Responsible science communication, which is built on established ethical principles such as honesty, stewardship, objectivity, openness, accountability, and fairness, encompasses processes from the lab bench to the dissemination of scientific work [7]. In including the philosophical elements of reasoning and error reduction in statistical training, we also realized that, now more than ever, visualization is a form of responsible scientific communication; hence, we included it as a connecting thread through our lectures. Throughout the SARS-CoV-2 pandemic, it has become much clearer that responsible scientific communication is necessary to translate research into public health practice, and this includes the accurate representation and display of data. Data display is a form of visual communication, and it is a useful tool for reaching the public and conveying research findings in a way that brings data to life. There have unfortunately been many examples of poor graphical representation of data, ranging from inappropriate graph choices to wrongly scaled axes to nondigestible information (a sketch of the axis problem appears at the end of this section). We reflect on these in the course and ask students to reconstruct the data displays in a new form using guidelines we cover in lectures. To this end, one important exercise we highlight throughout the course (see Table 1) is responsibly communicating data both to research colleagues with STEM-focused training and to non-scientists. We ask our students to pick specific examples of both effective and ineffective presentations of data from the literature and to briefly explain the main message of the selected plots. This exercise, while challenging, is rewarding because it allows the students to reflect on how to improve both their data representation and their communication, both of which are essential for furthering science and its results. The themes of responsible science communication and data visualization are 2 examples of continuity elements that we revisit in multiple course sessions, and we believe that revisiting an exercise throughout the course in various forms and contexts prepares students to more profoundly grasp a concept such as hypothesis testing. Another example we review is that of p-values, through both older studies that have misused this metric and the current biomedical literature (see Table 1 for details).
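As a minimal sketch of the axis-scaling problem flagged above (made-up numbers, R with ggplot2; not taken from the course materials), the reconstruction exercise can be as simple as re-plotting the same two values with and without a truncated y-axis:

```r
## Illustrative sketch with made-up numbers: the same values plotted with a
## truncated y-axis, which exaggerates a small difference, and with a full
## axis starting at zero.
library(ggplot2)

dat <- data.frame(group = c("A", "B"), rate = c(88, 91))

p_truncated <- ggplot(dat, aes(x = group, y = rate)) +
  geom_col(fill = "steelblue") +
  coord_cartesian(ylim = c(87, 92)) +   # zoomed-in view exaggerates the gap
  labs(title = "Truncated y-axis", y = "Rate (%)")

p_full <- ggplot(dat, aes(x = group, y = rate)) +
  geom_col(fill = "steelblue") +
  coord_cartesian(ylim = c(0, 100)) +   # full axis keeps the difference in context
  labs(title = "Full y-axis starting at zero", y = "Rate (%)")

print(p_truncated)
print(p_full)
```

The zoomed-in panel makes a 3-percentage-point difference look dramatic, whereas the full axis keeps it in context; students then debate which framing is appropriate for the message at hand.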

[END]
---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011160

