(C) PLOS Biology [1]. This unaltered content originally appeared in journals.plos.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:https://journals.plos.org/plosone/s/licenses-and-copyright

------------



Educating the future generation of researchers: A cross-disciplinary survey of trends in analysis methods

Authors: Taylor Bolt, Jason S. Nomi, Danilo Bzdok (Department of Psychology, University of Miami, Coral Gables, Florida, United States of America; Department of Biomedical Engineering, McConnell Brain Imaging Centre)

Date: {year}-{month}

Methods for data analysis in the biomedical, life, and social (BLS) sciences are developing at a rapid pace. At the same time, there is increasing concern that education in quantitative methods is failing to adequately prepare students for contemporary research. These trends have led to calls for educational reform to undergraduate and graduate quantitative research method curricula. We argue that such reform should be based on data-driven insights into within- and cross-disciplinary use of analytic methods. Our survey of peer-reviewed literature analyzed approximately 1.3 million openly available research articles to monitor the cross-disciplinary mentions of analytic methods in the past decade. We applied data-driven text mining analyses to the “Methods” and “Results” sections of a large subset of this corpus to identify trends in analytic method mentions shared across disciplines, as well as those unique to each discipline. We found that the t test, analysis of variance (ANOVA), linear regression, chi-squared test, and other classical statistical methods have been and remain the most mentioned analytic methods in biomedical, life science, and social science research articles. However, mentions of these methods have declined as a percentage of the published literature between 2009 and 2020. On the other hand, multivariate statistical and machine learning approaches, such as artificial neural networks (ANNs), have seen a significant increase in the total share of scientific publications. We also found unique groupings of analytic methods associated with each BLS science discipline, such as the use of structural equation modeling (SEM) in psychology, survival models in oncology, and manifold learning in ecology. We discuss the implications of these findings for education in statistics and research methods, as well as within- and cross-disciplinary collaboration.

Funding: This work was supported by grants from the Canadian Institute for Advanced Research (DB), a Gabelli Senior Scholar Award from the University of Miami (LQU), a grant from the Social Science Research Council (LQU), grant R01MH107549 from the National Institute of Mental Health (NIMH) (LQU), grant R03MH121668 from the National Institute of Mental Health (NIMH) (JSN), and a NARSAD Young Investigator Award (JSN). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability: Two sources of peer-reviewed literature were used for this analysis: the Pubmed Central Open Access Subset (PMC OAS) (N = 2,869,889 articles at time of study); and the Pubmed Central Author Manuscript (PMC AM) collection (N = 659,133 articles). The PMC OAS provides access to full-texts from a total of 14,722 open access peer-reviewed journals (at time of study). The PMC AM collection provides access to full texts of manuscripts made available in PMC by authors in compliance with the NIH Public Access Policy. Both sources form part of PMC’s open access collection (37) ( https://www.ncbi.nlm.nih.gov/pmc/tools/textmining/ ). Bulk downloads of the full OAS and AM collection articles were conducted using the PMC FTP service.

Our survey found that analytic methods commonly taught in introductory research methods and statistics courses (e.g., t test, ANOVA) remain the most commonly mentioned methods in BLS research articles over the past decade. However, these methods have largely declined in prominence, or remained stable, over the past decade. On the other hand, multivariate statistics and machine learning methods have exhibited a consistent, sometimes exponential, increase in mentions from 2009 to 2020. Further, we found that analytic methods are not equally distributed across BLS disciplines, but tend to cluster into certain disciplines over others.

In this study, we conducted a systematic charting of analytic method usage across BLS disciplines over time. We applied natural language processing tools to a large corpus of open-access peer-reviewed literature. Our study aimed to map out the methodological landscape of the BLS disciplines and identify changing trends over the past decade (2009 to 2020). Here, we use the term “analytic methods” to broadly denote any quantitative or qualitative method for data analysis, including any algorithms, statistics, or models used to describe, summarize, or interpret a sample of data. This definition is meant to exclude those elements of research methodology involved in data collection or experimental or study design. “Study” is also broadly defined as a peer-reviewed quantitative- or qualitative-based assessment of measured data points, including experimental, observational, or meta-analytic research. We retraced trends in analytic methods across (1) time; and (2) research disciplines. From a temporal perspective, we identified analytic methods that have increased or decreased in prominence across BLS disciplines over the past decade (2009 to 2020). From a cross-disciplinary perspective, we identified analytic methods that are uniquely prominent within each BLS discipline and the similarity or dissimilarity of BLS disciplines, in terms of their usage of analytic methods.

Increasingly, analytic methods developed in one discipline find fruitful application in another. For example, deep learning, a machine learning technique developed by the artificial intelligence community, has successfully been used by biologists to predict three-dimensional protein structure [ 14 ]. The explosive adoption of neural networks across biology and multiple other fields illustrates the need for educational training to note these trends and keep pace with the demand for expertise in these emerging advanced analytic approaches.

The methodological landscape of the biomedical, life, and social (BLS) sciences is becoming increasingly complex. This increasing complexity is driven by the advent of open-source science [ 1 ], the availability of large, complex datasets [ 2 – 4 ], and increasing computational resources [ 5 – 7 ]. The classic statistical tools (e.g., t test, analysis of variance (ANOVA), and linear regression) taught in introductory statistics courses are, at times, perceived as insufficient to prepare researchers for the age of big data, machine learning, and open-source software. Concerned that educational training in the BLS sciences is struggling to keep up with these trends, many researchers and statisticians have advocated for reform of introductory research methods and statistics courses [ 8 – 13 ]. We argue that a crucial step in this direction is a more complete understanding of actual trends in analytic method usage across the BLS sciences. Such an understanding will offer valuable insights into the methodological skills and knowledge needed to train early career scientists for future success in their disciplines and their interdisciplinary collaborations.

We subsequently focused attention on the discipline-specific method groupings: method groupings used almost exclusively in 1 or a small subset of BLS disciplines. One example of a discipline-specific method grouping was component 11, a component represented almost exclusively in the ecology and evolution and animal and plant disciplines. This set of methods included MDS, ANOVA-based methods, and distance matrix analyses (e.g., Mantel test). As noted above, manifold learning and distance matrix methods are uniquely suited to analyses of species composition and other types of data regularly collected in these disciplines. The appearance of ANOVA-based methods in this seemingly unrelated group of methods may seem surprising, but owes to the fact that variance partitioning of distance matrices is a historically common practice in ecological disciplines [ 29 ]. Another discipline-specific method grouping is component 16, represented most prominently in the computer science and engineering/biotechnology disciplines. The analytic method with the strongest weight for this component is ANNs. So-called "deep learning" (ANNs with many layers of nodes) has seen an explosion of interest in recent years due to the increase in computational power and big data sources. Consistent with these findings, the computer science, bioinformatics, and engineering/biotechnology disciplines have been at the forefront of methodological development in this area [ 30 ]. Another discipline-specific method grouping is component 5, consisting of latent variable models, including structural equation modeling (SEM) and confirmatory factor analysis (a subset of SEM). This component is most prominent in the discipline of psychology. Latent variable methods have found particular use in the field of psychometrics, where these models have been used to measure theoretical constructs, such as intelligence, personality, and attitudes [ 31 ].

The method groupings revealed by the tensor decomposition can be roughly classified into cross-discipline and within-discipline method groupings (i.e., components). Cross-discipline method groupings include components with a broad representation across the BLS disciplines (a relatively even distribution of discipline weights). For example, components 2, 6, 15, 19, and 20 had nonzero weights for the majority of BLS disciplines. Component 2 included methods broadly related to Bayesian statistics and concepts, such as credible intervals and posterior/prior probability distributions (e.g., beta distribution). This component is prominent across all research disciplines, particularly computer science, ecology, and clinical research. Component 6 included bioinformatic algorithms for DNA sequence alignment, phylogenetic tree construction, and bootstrap resampling. This component is more prominent in biological science disciplines, such as ecology and evolution, animal and plant sciences, immunology, and biochemistry and molecular biology. Component 15 included a variety of machine learning algorithms and metrics, including random forest algorithms, support vector machines, ANNs, and classification accuracy metrics (e.g., ROC and area under the curve). This component is prominent in all research disciplines, particularly chemistry and material sciences and computer science.

To understand what analytic methods are frequently used together in the same study, we conducted a tensor decomposition of an analytic method co-occurrence by discipline tensor. The tensor decomposition analysis simultaneously models the co-occurrence between analytic methods, as well as their frequency of mentions in each discipline. This figure displays the discipline and analytic method weights from the tensor decomposition analysis. Components from the tensor decomposition are referred to as “method groupings” or groups of analytic methods that frequently occur together in study method and results sections. The top left panel provides a visual illustration of the tensor decomposition (nonnegative CANDECOMP decomposition) of the analytic method co-occurrence by discipline tensor. The first 2 dimensions of the tensor represent the logged sum of co-occurrences between each pair of analytic methods. The third dimension splits out the analytic method co-occurrences by discipline (i.e., the analytic method co-occurrences of articles within each discipline). For each component or “method grouping,” a stem plot illustrates the weights for each discipline, as well as the top 10 analytic methods, in terms of their weights (sized by their weight). For each component, the discipline weights represent the frequency of usage of that component across each discipline. Some sets of analytic methods are represented across all BLS disciplines (e.g., component 19), while others are concentrated within 1 or 2 disciplines (e.g., component 17). Discipline and analysis method weights for all 20 components are provided in S4 Data . ANOVA, analysis of variance; BLS, biomedical, life, and social; MANOVA, multivariate analysis of variance; MDS, multidimensional scaling; PCA, principal component analysis; ROC, receiver operating characteristic.

Analytic methods in the BLS sciences are rarely used in isolation. Rather, sets of methods are applied jointly, or in sequence, to understand a dataset. We term these frequently co-occurring analytic methods "method groupings." To directly extract coherent constellations of methods and understand how they vary across BLS disciplines, we applied a tensor decomposition approach to an analytic method (N = 250) co-occurrence by discipline (N = 15) tensor ( Fig 6 ; top left panel). For illustration, we displayed 11 components (method groupings) of a 20-component solution (i.e., 20 rank-one tensors) ( Fig 6 ). Visualization of all component weights is provided in S4 Fig . Each component is associated with separate weights for analytic methods and disciplines, indicating the analytic methods and disciplines most associated with the component, respectively.
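
The authors' full decomposition code lives in the repository linked elsewhere in the text (https://github.com/tsb46/stats_history). As a rough illustration only, the sketch below runs a nonnegative CANDECOMP/PARAFAC decomposition of a small synthetic method-by-method-by-discipline tensor with the tensorly library; the tensor size, rank, and all variable names are illustrative assumptions rather than the study's actual settings.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

rng = np.random.default_rng(0)

# Illustrative sizes; the study used a 250 x 250 method co-occurrence tensor
# split across 15 disciplines and a rank-20 solution.
n_methods, n_disciplines, rank = 40, 5, 4

# Synthetic co-occurrence counts (methods x methods x disciplines), log-scaled
# as described for the study's tensor.
counts = rng.poisson(2.0, size=(n_methods, n_methods, n_disciplines)).astype(float)
tensor = tl.tensor(np.log1p(counts))

# Nonnegative CP (CANDECOMP/PARAFAC) decomposition into `rank` rank-one tensors,
# each interpreted as a "method grouping."
cp = non_negative_parafac(tensor, rank=rank, n_iter_max=200, init="random", random_state=0)
method_factor, _, discipline_factor = cp.factors

# Inspect each component: its most heavily weighted methods and disciplines.
for k in range(rank):
    top_methods = np.argsort(method_factor[:, k])[::-1][:10]
    top_disciplines = np.argsort(discipline_factor[:, k])[::-1][:3]
    print(f"component {k}: methods {top_methods.tolist()}, disciplines {top_disciplines.tolist()}")
```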

As can be observed from Fig 5 , each discipline is associated with a distinct set of analytic methods. Some analytic methods appear across more than 1 discipline, e.g., Fourier analysis in chemistry/material sciences and engineering/biotechnology. Others are unique to a given discipline. For example, mentions of independent component analysis (ICA) appear much more often in neuroscience research articles than in those of other disciplines. ICA is a common method for decomposing multivariate signals (typically time series) into an additive mixture of statistically independent latent sources; it has found common use in neuroimaging (e.g., functional magnetic resonance imaging and electroencephalography) for source separation, artifact removal, and detection of brain networks [ 25 , 26 ]. Another example is survival analysis methods (e.g., Cox regression and log-rank test) in oncology. Survival analysis methods, such as Cox regression or Kaplan–Meier analysis, aim to model the time to an event of interest. For clinical trials in the discipline of oncology, survival analysis methods have proven useful for modeling the effect of treatment on time to death [ 27 ]. Another example is the uniquely predominant use of partial least squares (PLS) regression and PLS discriminant analysis (PLS/DA) in chemistry or chemometrics. PLS, a multivariate technique that predicts a set of response variables (Y) from a set of predictor variables (X), is often used in chemometrics to relate properties of chemical samples (e.g., spectral properties) to their chemical composition (e.g., sample concentrations) [ 28 ].
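
As a toy illustration of the source-separation idea behind ICA (not the neuroimaging pipelines cited above), the sketch below unmixes two synthetic signals with scikit-learn's FastICA; the signals and mixing matrix are invented for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two statistically independent latent sources: a sinusoid and a square wave,
# each with a little additive noise.
sources = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
sources += 0.05 * rng.standard_normal(sources.shape)

# The observed channels are an unknown linear mixture of the sources.
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.2]])
observed = sources @ mixing.T

# FastICA recovers the independent sources up to permutation, sign, and scale.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)
print(recovered.shape)  # (2000, 2): the estimated independent components
```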

Top 10 standardized chi-squared residuals for each discipline from the contingency table analysis, ranked from top to bottom. The font size of the analytic method string is sized by its (logged) standardized chi-squared residual. The greater the chi-squared residual, the greater the difference between the observed and expected number of analytic method entities within that discipline. Disciplines have a unique set of analytic methods frequently mentioned in their subject matter, e.g., neural networks in computer science, ICA in neuroscience, and Manhattan plots in population/behavioral genetics. Standardized chi-squared residuals for all analysis methods by domain are provided in S3 Data . ANOVA, analysis of variance; ARMA, autoregressive moving average model; ICA, independent component analysis; MDS, multidimensional scaling; ROC, receiver operating characteristic; PCA, principal component analysis.

To examine analytic methods associated with each BLS discipline, we used a contingency table approach. Specifically, we modeled the difference in observed versus expected article counts based on the marginal article counts for each discipline and each analytic method across the corpus. We used the standardized Pearson chi-squared residuals as an effect size measure of the degree to which an analytic method is more prominently associated with a given discipline relative to other disciplines. In Fig 5 , we display the top 10 analytic methods per discipline as measured by the chi-squared residual value. The raw probability for each analytic method by discipline is provided in the Supporting information ( S3 Data ).
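
A minimal sketch of this contingency-table calculation on a small hypothetical count table (the real discipline-by-method counts are in S3 Data): scipy supplies the expected counts, and the standardized Pearson residuals are computed from the usual formula.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical discipline-by-method article counts (rows: disciplines, columns: methods).
counts = pd.DataFrame(
    {"t test": [120, 80, 40], "ICA": [5, 90, 10], "Cox regression": [15, 8, 110]},
    index=["PSYCH", "NEURO", "ONCO"],
)

observed = counts.to_numpy(dtype=float)
chi2, p, dof, expected = chi2_contingency(observed)

# Standardized Pearson residuals: (O - E) / sqrt(E * (1 - row proportion) * (1 - column proportion)).
n = observed.sum()
row_prop = observed.sum(axis=1, keepdims=True) / n
col_prop = observed.sum(axis=0, keepdims=True) / n
std_resid = (observed - expected) / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))

residuals = pd.DataFrame(std_resid, index=counts.index, columns=counts.columns)
# Methods with the largest residuals in a row are disproportionately mentioned in that discipline.
print(residuals.apply(lambda row: row.nlargest(2).index.tolist(), axis=1))
```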

As illustrated in Fig 4 , not all research disciplines follow the overall trend in analytic method mentions over the past decade. Some research disciplines exhibit a trend opposite to the overall trend. For example, mentions of null hypothesis testing concepts in population/behavioral genetics (POPGENE) research articles declined from 2009 to 2020, while the overall trend remained fairly constant during that time span. Other research disciplines exhibit a steeper trend in mentions than the overall trend. For example, mentions of machine learning classifier methods in engineering/biotechnology (ENG) research articles exhibit a much steeper exponential increase from 2009 to 2020 compared with the overall trend.

In contrast, analytic methods covered in more advanced statistics and computer science courses have exhibited a marked increase in the proportion of mentions from 2009 to 2020. These analytic methods include dimension reduction/clustering analysis, machine learning classifiers (e.g., random forest classifiers, support vector machines, and artificial neural networks [ANNs]), nonparametric tests, partial least squares/discriminant analysis (PLS/DA), and regression subset selection methods (e.g., Lasso regression). Generalized linear models (GLIMs) (e.g., logistic regression, probit regression, and Poisson regression) have remained relatively constant over the past decade. However, analysis of the individual GLIM models belonging to this category shows that logistic regression, perhaps the most common GLIM model, has declined in mentions over the past decade ( S2 Data ).

While still dominant, there has been a marked decline in the proportion of articles mentioning parametric mean comparison tests (t test/ANOVA) over the course of 2009 to 2020. The same decline is observed for analysis of 2-way contingency or cross-tabulation tables, such as the chi-squared test and Fisher exact test. Other “introductory” statistical methods and concepts, such as linear regression and null hypothesis, have remained fairly constant in their proportion of mentions across the past decade. Interestingly, interval estimation methods, such as confidence intervals, have exhibited an increase in mentions over the past decade, perhaps reflecting the increased pressure from institutions and researchers [ 22 – 24 ] to report confidence intervals along with statistical significance tests and p-values in peer-reviewed research. As opposed to p-values, confidence intervals have the advantage of providing information regarding both the size and uncertainty of a point estimate.

Time series of 12 analytic method categories from 2009 to 2020 (annual frequency). For each analytic method category, the time series represents the proportion of articles that contained a mention of that category in their "Methods/Materials" or "Results" section per year. The time series of each analytic method category is displayed in its own plot with a different y-axis scale. Note that because each plot differs in y-axis scale, caution should be observed when comparing trends across categories. To help readers compare y-axis scales across categories, we have provided a y-axis scale bar in the bottom right of each figure. An illustration of the y-axis scale bar is presented at the top of the figure. The height of the bar corresponds to the distance between the minimum possible proportion (0) and the maximum possible proportion (0.65) of all articles per year across all categories. The highlighted region of the bar (in blue) corresponds to that category's range of proportion values (across years) from 0 to 0.65. Random sampling variability for each proportion estimate was visualized using bootstrapped SEs from 100 bootstrapped samples of articles at each time point (dark shaded region: ± 1 SE, light shaded region: ± 2 SE). Overall, analytic methods taught in introductory research methods and statistics courses (e.g., t test/ANOVA, 2-way contingency tables, and linear regression) have shown differing rates of decline or remained stable in mentions over the past decade, with the exception of interval estimation approaches (e.g., confidence intervals). On the other hand, advanced analytic methods (e.g., machine learning classifier, regression subset selection methods, and clustering/dimension reduction) have shown a consistent increase in mentions over the past decade. Proportion of article counts by year for all analysis methods are provided in S2 Data . Python code for modeling trends of analysis methods is provided at https://github.com/tsb46/stats_history/blob/master/demo.ipynb . ANOVA, analysis of variance; CCA, canonical correlation analysis; MANOVA, multivariate analysis of variance; PLS/DA, partial least squares/discriminant analysis; SE, standard error.

The trends in analytic method category mentions are plotted individually (blue) in Fig 4 . Note that y-axis scales differ across analytic method categories, and caution should be taken in comparing trends across categories. The statistical significance (p-value) of the linear trend for each analytic method category is displayed next to each plot title. Those disciplines that exhibit a statistically significant interaction (p < 0.05; corrected for multiple comparisons with the Holm–Bonferroni method), i.e., a deviation from the overall trend, are displayed along with the overall trend.

While baseline differences in the proportion of articles that mention an analytic method category are important to consider, of primary interest for this survey are the relative trends in mentions across the time span of the study (2009 to 2020). To assess the statistical significance of the linear trends (i.e., decline or increase) in analytic method categories across the study time span, we used a logistic regression model. Specifically, for each analytic method, we modeled the log odds of an article mentioning that method conditional on an article’s publication year (2009 to 2020) and the article’s scientific discipline. To assess whether any research discipline exhibits a statistically significant deviation from the overall linear trend of a given analytic method category, interactions between the linear trend and discipline were added to the model. To account for the correlated/nonindependent structure of articles published within journals, the logistic regression model was estimated using a generalized estimating equation (GEE) approach. Full details of the model are provided in the Methods and materials section.
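
A condensed sketch of this kind of model, using statsmodels' GEE with a binomial family and an exchangeable working correlation over journals; the simulated article-level DataFrame and its column names are placeholders for the study's actual corpus, not the authors' code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Hypothetical article-level data: publication year, discipline, journal, and
# whether the article mentions a given analytic method category.
df = pd.DataFrame({
    "year": rng.integers(2009, 2021, n),
    "discipline": rng.choice(["PSYCH", "ONCO", "NEURO"], n),
    "journal": rng.choice([f"journal_{i}" for i in range(50)], n),
})
logit = -0.5 + 0.1 * (df["year"] - 2009) + 0.4 * (df["discipline"] == "NEURO")
df["mention"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Log odds of a mention as a function of publication year, discipline, and their
# interaction (the discipline-specific deviations from the overall trend);
# articles are treated as clustered within journals via GEE.
model = smf.gee(
    "mention ~ year * discipline",
    groups="journal",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())
```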

As can be observed from Fig 3 , introductory statistical methods (e.g., null hypothesis testing, t tests/ANOVA, linear regression, and confidence intervals) have remained the dominant analytic methods mentioned in BLS methodology and results sections over the past decade. Null hypothesis testing concepts and statistics (e.g., p-values, null hypothesis, and alternative hypothesis) are by far the most commonly mentioned analytic "methods" in the BLS sciences: approximately 35% of articles per year mention 1 or more analytic methods in this category. The next most prominently mentioned analytic method category is ANOVA/t tests, statistics for comparing differences between means collected from independent groups or repeated observations. Interval estimation approaches (e.g., confidence interval and credible interval) for calculating the possible values of a population parameter are the third most prominent analytic method category.

We make a distinction between those analytic method categories that are commonly taught in introductory research methods and statistics courses (null hypothesis testing, t tests/ANOVA, linear regression, 2-way contingency tables such as the chi-squared test, and interval estimation such as confidence intervals [ 20 – 22 ]) and those taught in advanced statistics or computer science courses (dimension reduction and clustering techniques such as PCA, nonnegative matrix factorization (NMF), and K-means clustering; machine learning classifiers such as support vector machines and random forest classifiers; and regression subset selection methods such as Lasso regression). We nominally refer to these 2 groups as "introductory" and "advanced" analytic methods, respectively. Note that this distinction is not meant to imply that "introductory" methods are less sophisticated or less appropriate for data analysis than "advanced" analytic methods; it is merely meant to separate analytic methods typically taught in introductory data analysis courses from those taught in more advanced undergraduate and graduate courses.

Time series of 12 analytic method categories from 2009 to 2020 (annual frequency) displayed in a single line plot. For each analytic method category, the time series represents the proportion of articles that contained a mention of that category in their “Methods/Materials” or “Results” section per year. As can be observed from the plot, baseline differences in the proportion of mentions across analytic method categories are very prominent. Overall, analytic methods taught in introductory research methods and statistics courses (e.g., null hypothesis testing, t test/ANOVA, 2-way contingency tables, linear regression, and interval estimation) have been and still are the most mentioned category of analytic methods at the end of the decade (2020). Analysis of individual trends of each analytic method category ( Fig 4 ) reveals that while dominant, “introductory” analytic methods have exhibited a decline in mentions over the past decade, while “advanced” analytic methods have shown a consistent increase in mentions. Proportion of article counts by year for all analysis methods are provided in S2 Data . ANOVA, analysis of variance; MANOVA, multivariate analysis of variance.

A primary goal of our survey was to examine trends in analytic method mentions in research articles over the past decade (2009 to 2020). We first manually categorized analytic methods into larger superordinate categories of conceptually similar methods (analytic method categories—e.g., t test/ANOVA, generalized linear models [GLIMs], and survival analysis) (N = 34). Because the total number of articles in the corpus increased significantly year over year, the raw frequency counts for all analytic method categories exhibited a consistent increase in total counts over the time span of the corpus. In order to track what analytic method categories have increased or decreased in mentions relative to the total number of articles per year, we calculated the proportion of articles mentioning each category per year (2009 to 2020). We display yearly trends for 12 of the 34 analytic method categories in Fig 3 . Trends for all analytic method categories are provided in S1 and S2 Figs . A data-driven analysis of all analytic method trends without grouping into superordinate categories is provided in S3 Fig . Raw counts and proportions for all analytic methods without grouping into superordinate categories are provided in S1 and S2 Data , respectively.
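
A small sketch of this bookkeeping, assuming an article-level pandas DataFrame with a publication year and a boolean flag for whether a given category is mentioned (both the data and the column names are invented); it also shows bootstrap standard errors of the kind used for the shaded bands in Fig 4.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical article-level data: publication year and whether the article
# mentions a given analytic method category.
articles = pd.DataFrame({
    "year": rng.integers(2009, 2021, 5000),
    "mentions_ttest_anova": rng.random(5000) < 0.3,
})

# Proportion of articles per year that mention the category.
proportion = articles.groupby("year")["mentions_ttest_anova"].mean()

# Bootstrap standard errors: resample articles within each year with replacement.
def bootstrap_se(flags: pd.Series, n_boot: int = 100, seed: int = 0) -> float:
    boot_rng = np.random.default_rng(seed)
    values = flags.to_numpy()
    boots = [boot_rng.choice(values, size=values.size, replace=True).mean() for _ in range(n_boot)]
    return float(np.std(boots))

se = articles.groupby("year")["mentions_ttest_anova"].apply(bootstrap_se)
print(pd.DataFrame({"proportion": proportion, "bootstrap_se": se}))
```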

To provide a visual illustration of the similarity between disciplines in their overall analytic method counts, we deployed classical multidimensional scaling (MDS). MDS is a simple manifold learning technique that expresses each discipline’s total analytic method counts in a parsimonious two-dimensional space ( Fig 2B ). The distances between the disciplines in the resulting plot reflect the dissimilarity/similarity in total analytic method counts. This approach made apparent that 2 disciplines stood out as relative outliers in analytic method mentions: Evolution/Ecology and Chemistry/Material Sciences. As depicted in Fig 5 , these select disciplines revealed a unique profile of analytic method mentions. For illustration, we consider the discipline of Ecology/Evolutionary Sciences. Compared with other BLS disciplines, distance matrix and manifold learning methods (e.g., MDS) are more widely used in the analysis of ecological data [ 16 – 18 ]. Such methods have been found to be uniquely suited for the analysis of species composition and abundance data [ 19 ]. For example, distance matrices constructed through metric/nonmetric dissimilarity metrics (e.g., Bray–Curtis dissimilarity) are used to represent a species-by-sample/site matrix. Manifold learning methods are routinely used to analyze the resulting distance matrices [ 19 ]. Manifold learning methods are often referred to as “ordination” in ecology.
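
For concreteness, here is a compact sketch of classical (Torgerson) MDS applied to a made-up discipline-by-method count matrix; the double-centering and eigendecomposition follow the textbook recipe, and the counts and discipline labels are purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical discipline-by-method total mention counts (rows: disciplines).
disciplines = ["PSYCH", "NEURO", "ONCO", "ECO", "CHEM"]
rng = np.random.default_rng(0)
counts = rng.poisson(50, size=(len(disciplines), 30)).astype(float)

# Pairwise dissimilarities between the disciplines' method-count profiles.
D = squareform(pdist(counts, metric="euclidean"))

# Classical (Torgerson) MDS: double-center the squared distance matrix, then
# take the top-2 eigenvectors scaled by the square roots of their eigenvalues.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1][:2]
coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

# Disciplines close together in this 2D space have similar method-count profiles.
for name, (x, y) in zip(disciplines, coords):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```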

(A) A horizontal stacked bar plot displaying the number of articles for the top 20 journals in the corpus (defined in terms of article count). The percentage of articles per domain within a journal is proportionally shaded within each bar (IJERPH). The research disciplines with the highest article counts were primarily biomedical and clinical disciplines. (B) MDS plot displaying the similarity between research disciplines, in terms of total analytic method counts (summed across all articles in the discipline), on a two-dimensional space. The x- and y-axes correspond to the 2 latent dimensions estimated from the MDS solution. The distance between 2 disciplines in this two-dimensional space communicates the dissimilarity in total analytic method counts between the 2 disciplines. (C) Top 50 analytic method entities, ranked row-wise by the number of mentions across the corpus. The size of each entity string is proportional to the logged (log 10 ) article count. The most frequently mentioned analytic methods were null hypothesis testing, correlation, confidence intervals, and linear regression. Data for all figures are provided in S1 Data . ANCOVA, analysis of covariance; ANIMAL, Animal/Insect/Plant Sciences; ANOVA, analysis of variance; BIOCHEM, Biochemistry/Cellular Biology/Molecular Genetics; CHEM, Chemistry/Material Science; CLINIC, Clinical/Hospital Research; CS, Computer Science/Informatics; ECO, Evolution/Ecology; ENG, Engineering/Biotechnology; ENVIRON, Environmental/Earth Science; EPIDEM, Public Health/Epidemiology; IJERPH, International Journal of Environmental Research and Public Health; IMMUN, Immunology; MDS, multidimensional scaling; NEURO, Neuroscience; ONCO, Oncology; PCA, principal component analysis; PHYSIO, Human Physiology/Surgery; POPGENE, Population Genetics; PSYCH, Psychology; ROC, receiver operating characteristic.

The corpus of open-access peer-reviewed literature predominantly consisted of general science journals, such as PLOS ONE, Scientific Reports, and Nature Communications ( Fig 2A ). This observation highlights one advantage of the machine learning classification of journal articles into scientific disciplines: the common practice of classifying an article by the journal it was published in would fail to capture the mixture of scientific disciplines contained within these general science journals. Discipline-specific journals with high article counts included Oncotarget (ONCO), BMJ Open (CLINIC, EPIDEM), BMC Genomics (BIOCHEM), Sensors (ENG), BMC Public Health (EPIDEM), and Frontiers in Psychology (PSYCH). These discipline-specific journals publish peer-reviewed articles in a specific area of study and have a more focused readership. The disciplines with the highest article counts were primarily biomedical and clinical disciplines: CLINIC (N = 333,547), EPIDEM (N = 172,949), BIOCHEM (N = 160,016), and ONCO (N = 111,818) ( Fig 2B ). The top 10 journals by discipline and the article counts for each discipline are provided in the Supporting information ( S1 Data ).

The analytic method entities extracted from all articles were used as input to 3 analytic pipelines: (1) analytic method trend analysis to observe temporal trends in analytic method usage over the time window of 2009 to 2020 (at an annual frequency); (2) discipline by analytic method probability analysis to understand what analytic methods are unique to each BLS discipline; and (3) analysis of analytic method groupings to discover data-driven clusters of analytic methods that frequently co-occur within and across BLS disciplines. To promote reproducibility and reuse, the full code for all preprocessing and analytic pipelines is provided on the following web page: https://github.com/tsb46/stats_history .

The primary goal of this study is to describe and understand usage shifts in analytic methods across BLS disciplines over time. We analyzed approximately 1.3 million articles published over a decade of research to accomplish this goal. We extracted mentions/adoptions of analytic methods from the "Methods and materials" and "Results" sections of a large corpus of peer-reviewed articles (PubMed Central Open Access Subset, PMC OAS [ 15 ]). We used a named entity recognition (NER) algorithm trained specifically for this purpose. We refer to these extracted mentions from the text as analytic method entities: unique strings of alphanumeric characters that refer to a distinct method for data analysis. The extracted entities then underwent a sequence of preprocessing steps, including removal of unwanted characters and lemmatization (i.e., removing inflectional endings). The preprocessing workflow included a manual entity disambiguation step that assigned entities referring to equivalent analytic methods to the same category, e.g., "Cox regression" and "Cox PH regression" were both classified as "Cox proportional hazards regression." The final number of unique analytic method entities after these preprocessing steps was N = 250. In addition to preprocessing of analytic method entities, articles were classified into a set of 15 research disciplines ( Fig 1 ) using a supervised machine learning framework pooling information from article titles, abstracts, and journal names. The 15 disciplines were chosen by the authors from a survey of the corpus to balance breadth and specificity of the BLS literature. The disciplines and their abbreviations are as follows: animal/insect/plant biology (ANIMAL), biochemistry and molecular biology (BIOCHEM), clinical research (CLINIC), computer science and informatics (CS), ecology and evolutionary science (ECO), oncology (ONCO), environmental science (ENVIRON), psychology (PSYCH), population and behavioral genetics (POPGENE), neuroscience (NEURO), chemistry and material science (CHEM), engineering and biotechnology (ENG), human physiology (PHYSIO), immunology (IMMUN), and epidemiology and public health (EPIDEM). The preprocessing pipeline is illustrated in Fig 1 . Details of the preprocessing pipeline are included in the "Methods and materials" section.
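
The trained NER model and the full entity-cleaning tables are in the linked repository; the snippet below is only a simplified sketch of the post-extraction normalization and disambiguation steps. The example strings, the naive de-pluralization, and the mapping table are illustrative stand-ins for the study's lemmatization and hand-curated mapping.

```python
import re

# Hypothetical raw entity strings returned by the NER step.
raw_entities = ["Cox regression", "Cox PH regressions", "t-tests", "ANOVAs", "Student's t test"]

# Manual disambiguation table mapping surface forms to canonical method names
# (an illustrative subset; the study hand-curated a mapping onto 250 methods).
canonical = {
    "cox regression": "cox proportional hazards regression",
    "cox ph regression": "cox proportional hazards regression",
    "t-test": "t test",
    "student's t test": "t test",
    "anova": "analysis of variance",
}

def normalize(entity: str) -> str:
    """Lowercase, strip unwanted characters, crudely de-pluralize, then map to a canonical name."""
    text = entity.lower().strip()
    text = re.sub(r"[^a-z0-9\s\-']", " ", text)   # remove unwanted characters
    text = re.sub(r"\s+", " ", text).strip()
    if text.endswith("s") and not text.endswith("ss"):
        text = text[:-1]                          # naive stand-in for lemmatization
    return canonical.get(text, text)

print([normalize(e) for e in raw_entities])
# ['cox proportional hazards regression', 'cox proportional hazards regression',
#  't test', 'analysis of variance', 't test']
```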

Discussion

The data analytic landscape of the BLS sciences is subject to change. The democratization and commoditization of tools for quantitative analysis have accelerated over the 21st century, and the pace has only quickened in the past decade. This tectonic shift is due to the increased accessibility of computational resources, open-source software, and the abundance of big data in more areas of human activity. When learning to conduct data analysis, the scientist in training is faced with a steep hill to climb. To make this climb easier, graduate and undergraduate education must reflect current practices and trends in data analysis. We offer an automated 12-year survey of approximately 1.3 million open research papers to characterize the data analytic landscape of the BLS sciences. This study aimed to provide a snapshot of the ongoing methodological shifts across a variety of scientific communities.

We find that the analytic methods commonly taught in introductory research methods and statistics courses (e.g., t test and ANOVA) remain the most commonly mentioned methods in “Methods” and “Results” sections of research articles. However, while dominant, these methods have largely declined in prominence, or remained stable, over the time span of the study (2009 to 2020). On the other hand, multivariate statistics and machine learning methods have exhibited a consistent, sometimes exponential, increase in mentions over the time span of the study. Further, we find that certain analytic methods are not equally distributed across BLS disciplines, but have unique prominence in certain disciplines over others. We believe our results provide valuable insights into how university curricula should be designed to meet the urgent need for training a new generation of quantitatively literate scientists.

"Multivariate statistics and machine learning methods" is a broad label, referring to a wide variety of analytic methods. These methods are often taught in advanced statistics and computer science courses and include PCA, regression subset selection, PLS, support vector machines, random forest algorithms, and ANNs. While some of these methods are quite old (e.g., PCA was first developed in the early 20th century), others are relatively new and still developing (e.g., ANNs have only seen broad use in the past decade). This study made no attempt to examine the potential causal factors behind the observed trends. We speculate that they could be due to several reasons: (1) the collection of larger and more complex datasets; (2) the recent popularity of data science as a tool in academia and industry; or (3) an increasing realization among researchers that manuscripts containing advanced analytics are more likely to impress reviewers and editors. However, advanced analytic methods still represent only a minority of the mentions we observed across BLS research articles. Although declining at different rates or remaining stable, the statistical methods taught in introductory courses, such as mean comparison tests (e.g., t test/ANOVA), cross-tabulation/contingency table analysis (e.g., chi-squared test), and null hypothesis testing, represent a much more sizable percentage of mentions across BLS research articles.

Projecting these trends into the future, we would expect that multivariate statistics and machine learning methods will enjoy increasing usage relative to more traditional statistical testing frameworks. The traditional statistical tool stack (t test, ANOVA, z-test, etc.) taught in undergraduate and graduate statistics courses was largely developed in the first half of the 20th century [32]. Since then, advances in computation and statistical computing software have revolutionized the analytic tools available to researchers. The application of advanced multivariate and machine learning analysis methods requires proficiency in statistical software and/or open-source programming languages that are largely absent from most traditional research methods and statistics curricula. Overall, these trends suggest that introductory research methods and statistics courses may benefit from incorporating a "data science" focus in their curricula [8,32].

Our survey demonstrates that methods for data analysis can vary widely across BLS disciplines. Several explanations can be offered for the distinct usage of data analysis methods between BLS disciplines. Perhaps the primary driver of a discipline's adoption of research methods is the simple observation that the subject matter lends itself to the assumptions and goals of the selected analytic methods. For example, consider the observed disproportionate use of SEM in the discipline of psychology (Figs 5 and 6, component 5). Psychological research routinely relates observable behavior, such as task performance or questionnaire responses, to unobserved or latent variables. The desire to explore causal structure among these latent variables has led to the systematic adoption of SEM, a technique for specifying and testing causal structures among latent and observable variables [33]. Similar explanations can be offered for other method–discipline pairs, such as survival models and oncology. Other differences may arise from historical contingency, with no necessary connection between an analysis method and the subject matter it is applied to. For example, consider the predominant use of the Fisher exact test in immunology versus the chi-squared test in clinical research (Fig 5). Both are statistical significance tests of the association between 2 categorical variables. The appropriate context for each test is controversial among statisticians, but the Fisher exact test is commonly recommended over the chi-squared test for small sample sizes [34,35]. Despite this controversy, our analysis indicates that the Fisher exact test is generally preferred over the chi-squared test in the field of immunology and vice versa in clinical research. Thus, one might hypothesize either that (1) the discipline of immunology works with smaller sample sizes on average than the discipline of clinical research; or that (2) these differences arose from sociological or historical factors.
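
To make the contrast concrete, the snippet below runs both tests on a small hypothetical 2 x 2 table with scipy; with cell counts this small the two p-values can differ noticeably, which is the usual reason the exact test is recommended.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2 x 2 table: response (rows) by group (columns), with small counts.
table = [[8, 2],
         [1, 5]]

odds_ratio, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, expected = chi2_contingency(table)

# With small expected cell counts the chi-squared approximation becomes questionable,
# which is why the Fisher exact test is commonly recommended for tables like this.
print(f"Fisher exact test: p = {p_fisher:.4f}")
print(f"Chi-squared test:  p = {p_chi2:.4f} (smallest expected count = {expected.min():.2f})")
```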

Differences in analytic method usage have concrete implications for the direction of research in each BLS discipline. The choice of experimental or observational design often entails the subsequent analytic method used to analyze the data, but a reverse influence occurs as well: The researcher’s knowledge of available analytic methods informs their experimental or observational design. For example, ANOVA models for analysis of group means have a historically close relationship with experimental design in social and life science research [36]. In other words, the influence between the choice of data analysis method and how data is collected operates in both directions. This observation underlies the potential for cross-fertilization and mutual inspiration between BLS disciplines by the discovery of new methods for data analysis, as well as novel ideas around data collection. While many advocates of cross-disciplinary collaboration have emphasized the joining together of different theoretical and subject matter expertise [37], our findings emphasize a further methodological benefit of collaboration, which affords practitioners access to novel methods of data analysis not yet widely known in their own disciplines.

It should be noted that the corpus and methods used in this analysis are limited in many respects. First and foremost, our study relies on mentions of an analytic method in a research article as an indicator of the usage of that analytic method in the article. However, the mention of an analytic method is not an unequivocal indication of its usage for data analysis in a research article. Therefore, we restricted our analysis of research articles to their "Methods" and "Results" sections, where mentions of an analytic method should be more closely associated with its usage. However, because spot-checking a large corpus (approximately 1.3 million articles) manually is infeasible, there may be cases of analytic method mentions that did not imply their usage in our sample. Further, the entity recognition approach in our study requires that analytic methods are explicitly reported in methodology and results sections. However, inadequate and/or inaccurate reporting of protocols and statistical analyses is a known problem in the BLS sciences [38–40]. Thus, there are likely a number of research articles in our corpus where the full set of analytic methods employed was not captured by our analysis. Second, the usage of analytic methods does not imply that the method was applied appropriately or correctly. In fact, the inappropriate application of statistical methods may be a contributor to the replication crisis in the BLS sciences [23,41,42]. Thus, the increase in usage of multivariate and machine learning methods should not be considered prima facie evidence that these methods are being used appropriately. Third, our corpus only contains open-access articles made available by an open-access journal or an NIH-funded author. Thus, a sizable collection of peer-reviewed research in the past decade is systematically missing from this analysis. However, we assume that the type of publisher, open access or subscription based, is not a significant determinant of the methods used within a discipline. Fourth, some scientific disciplines may be less well represented in this survey, including experimental and theoretical physics, anthropology, astronomy, cosmology, economics, sociology, and geology. Future studies with a more comprehensive corpus of scientific publications will provide deeper insight into the historical and cross-disciplinary trends in scientific data analysis. Fifth, our classification of scientific disciplines requires that an article is assigned to one, and only one, discipline. While the majority of research articles in our corpus may fit into one scientific discipline over others, some will be multidisciplinary, particularly for those disciplines with similar research agendas and regular collaboration (e.g., oncology, clinical research, and immunology). We have provided the original Python code of all preprocessing and analytic pipelines for those who wish to improve or redesign this study's algorithms for future use (https://github.com/tsb46/stats_history).

Our data-driven survey of peer-reviewed articles reveals that the analytic landscape of the BLS sciences has been transformed over the past decade. A comparable rate of change will be required in the education of budding scientists in statistics and research methodology. Equally important is the observed analytic diversity of the BLS sciences over the past 10 years. The diverse analytic tool sets across BLS disciplines promise large payoffs for cross-disciplinary collaboration. In this vein, the recent advent of big data and open-source science is at least as much an opportunity as a challenge for adequately training the next generation of researchers.

[END]

[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001313


