(C) PLOS One
This story was originally published by PLOS One and is unaltered.
A phylogeny-informed characterisation of global tetrapod traits addresses data gaps and biases [1]
['Mario R. Moura', 'Departamento De Biologia Animal', 'Universidade Estadual De Campinas', 'Campinas', 'São Paulo', 'Departamento De Biociências', 'Universidade Federal Da Paraíba', 'Areia', 'Paraíba', 'Department Of Ecology']
Date: 2024-07
Tetrapods (amphibians, reptiles, birds, and mammals) are model systems for global biodiversity science, but continuing data gaps, limited data standardisation, and ongoing flux in taxonomic nomenclature constrain integrative research on this group and potentially cause biased inference. We combined and harmonised taxonomic, spatial, phylogenetic, and attribute data with phylogeny-based multiple imputation to provide a comprehensive data resource (TetrapodTraits 1.0.0) that includes values, predictions, and sources for body size, activity time, micro- and macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences, and human influence, for all 33,281 tetrapod species covered in recent fully sampled phylogenies. We assess gaps and biases across taxa and space, finding that shared data missing in attribute values increased with taxon-level completeness and richness across clades. Prediction of missing attribute values using multiple imputation revealed substantial changes in estimated macroecological patterns. These results highlight biases incurred by nonrandom missingness and strategies to best address them. While there is an obvious need for further data collection and updates, our phylogeny-informed database of tetrapod traits can support a more comprehensive representation of tetrapod species and their attributes in ecology, evolution, and conservation research.
Funding: We gratefully acknowledge São Paulo Research Foundation (FAPESP) for grants supporting MRM (#2021/11840-6 and #2022/12231-6), LFT (#2016/25358-3), KC (#2020/12558-0), and RZC (#2022/15247-0); Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for the fellowship to JJMG; Conselho Nacional de Desenvolvimento Científico - CNPq for research grants in support of FPW (#311504/2020-5) and LFT (#302834/2020-6); U.S. National Science Foundation (NSF) for grants supporting RCKB (DEB-1441652), RAP (DEB-1441719), and WJ (DEB-1441737 and DEB-1441719). WJ also acknowledges support from NASA grants 80NSSC17K0282 and 80NSSC18K0435. This work was partially supported by E.O. Wilson Biodiversity Foundation in furtherance of the Half-Earth Project. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright: © 2024 Moura et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Over the past two decades, biodiversity science has seen a dramatic growth in large-scale research in ecology, evolution, and conservation biology, enabled by near-global coverage for study systems such as terrestrial vertebrates, or Tetrapoda (amphibians, reptiles, birds, and mammals). These efforts usually rely on datasets spanning wide temporal, spatial, and taxonomic scales [ 1 ] that ideally are fully harmonised and well curated. Despite terrestrial vertebrates being a relatively well-known animal group when compared to invertebrates and plants [ 2 , 3 ], notable gaps persist across various attributes, including fundamental aspects of species natural history [ 4 , 5 ]. Consequently, trait-based research on biodiversity is often hampered by spatially and phylogenetically incomplete datasets [ 6 – 8 ].
Understanding causes of missingness is instrumental to advancing biodiversity data coverage. For any species attribute, there are observed and unobserved entries, each with a probability of being missing [ 9 ]. When all entries, whether observed or unobserved, share the same likelihood of being missing, data are said to be missing completely at random (MCAR). If missingness depends only on observed entries, the data are termed missing at random (MAR). For example, a depleted digital scale battery makes weighing subsequent specimens impossible in the field, resulting in MAR data. Data are considered missing not at random (MNAR) when missingness is tied to unobserved entries, indicating a link to the missing values themselves. To illustrate, species exclusive to relatively inaccessible habitats, such as the forest canopy, may be systematically overlooked in field surveys, with their data missingness linked to the occupied microhabitat. These three missingness mechanisms (MCAR, MAR, and MNAR) can lead to different configurations of the invisible fraction of the trait space [ 9 – 11 ].
While some missingness mechanisms (e.g., MCAR, MAR) primarily affect a single attribute, the underlying cause behind an attribute being MNAR can influence multiple variables, resulting in co-missingness or shared gaps. For example, a species might lack information on multiple ecological aspects due to being known from only a few specimens with no details on where, when, and how they were found [ 12 ]. Similar circumstances apply to rare species or those collected solely through passive sampling techniques (e.g., pitfall traps), leaving ecological data unobserved. Indeed, bias is recognised in the availability of trait data for certain taxa, regions, and traits [ 4 ], and missing values for a given variable may be associated with incompleteness in others. Such congruent or aggregated (as opposed to segregated) patterns in trait missingness can arise from societal and research preferences for charismatic species [ 3 ], easily sampled taxonomic groups, or accessible geographical regions [ 6 , 13 ]. Conversely, segregated patterns may reveal traits and taxa that are challenging to sample or underrepresented.
Despite the continued limitation of sampled data for many attributes in tetrapods, new methods can help to minimise these gaps and improve our understanding of biodiversity. Past practices have included the removal of species with missing data or the replacement of missing values by observed averages, but these strategies may ultimately reduce statistical power and increase bias [ 7 , 10 , 14 ]. More recently, growth in large-scale phylogenetic analyses has boosted the development of methods to increase the accuracy of imputing missing values [ 15 – 18 ]. Among tetrapods, recent large-scale applications of imputation methods include the use of phylogenetic regression methods and machine-learning techniques to predict missing values in trait data for amphibians [ 19 ], reptiles [ 19 , 20 ], birds [ 19 , 21 ], and mammals [ 19 , 21 , 22 ], as well as to inform threat statuses for data deficient and non-assessed species [ 20 , 23 , 24 ].
We leveraged a fast and automated multiple imputation technique with additional data mobilisation to provide a comprehensive database and assessment of key ecological attributes of all extant 33,281 tetrapod species covered in recent fully sampled phylogenies, including 7,238 amphibians [ 25 ], 384 chelonians and crocodilians [ 20 ], 9,755 squamates and tuatara [ 26 ], 9,993 birds [ 27 ], and 5,911 mammals [ 28 ]. Our assessment covered standardised species-level attributes for taxonomy, body size, activity time, microhabitat, macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences, and human influence. Since not all species have genetic data (representing an important source of the remaining uncertainty about their placements in available phylogenies), we also evaluated completeness in genetic sequences [ 29 ]. We pinpointed taxa exhibiting pronounced shared missingness in natural history data to inform new strategies for data acquisition and mitigate biases in trait databases. To enhance database consistency, we taxonomically harmonised data sources and filled gaps using a phylogeny-based multiple imputation method [ 30 – 32 ] for which we verified the performance and associated uncertainty.
We use this gap-filled database to assess the geographic, taxonomic, and trait-related biases and evaluate how their model-based closure supports improved information and biological inference. Due to the biodiversity knowledge paradox (high biodiversity in the tropics [ 33 ] but better taxonomic sampling in temperate regions [ 4 , 5 , 34 ]), we expect larger unsampled fractions in the tetrapod trait space for tropical species. Similarly, the high research capacity (i.e., infrastructure and expertise availability) dedicated to birds and mammals relative to amphibians and reptiles [ 2 , 3 ] contributes to the uneven sampling of trait space [ 4 ], likely producing larger biases among historically undersampled taxa. Finally, species biology and sampling methodologies are known to affect detection and collection rates in the field [ 35 – 37 ]. For example, detectability is typically lower for small- than large-bodied species, and similarly so for nocturnal relative to diurnal taxa [ 38 – 40 ], whereas sampling methodologies often favour the collection and research of species living on the surface compared to fossorial or arboreal groups [ 34 , 41 – 44 ]. We thus anticipate convergent missingness across trait space in tetrapods and expect undersampled species to typically be small, nocturnal, and fossorial or arboreal.
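The three missingness mechanisms (MCAR, MAR, MNAR) can be illustrated with a small simulation. The following is a minimal Python sketch under stated assumptions, not the authors' code (their analyses were run in R): the function name `apply_missingness`, the simulated log10 body masses, the nocturnality flag, and all probabilities are hypothetical values chosen only for illustration.

```python
import random
import statistics

def apply_missingness(log_mass, nocturnal, mechanism, rng):
    """Return log_mass with None in place of entries that went missing."""
    result = []
    for value, is_nocturnal in zip(log_mass, nocturnal):
        if mechanism == "MCAR":
            p_missing = 0.3                           # same chance for every entry
        elif mechanism == "MAR":
            p_missing = 0.5 if is_nocturnal else 0.1  # depends on an observed attribute
        elif mechanism == "MNAR":
            p_missing = 0.5 if value < 2.0 else 0.1   # depends on the value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        result.append(None if rng.random() < p_missing else value)
    return result

rng = random.Random(42)
# Hypothetical data: log10 body mass and an independent nocturnality flag.
log_mass = [rng.gauss(2.0, 1.0) for _ in range(5000)]
nocturnal = [rng.random() < 0.5 for _ in range(5000)]
true_mean = statistics.fmean(log_mass)

observed_mean = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    kept = [v for v in apply_missingness(log_mass, nocturnal, mechanism, rng)
            if v is not None]
    observed_mean[mechanism] = statistics.fmean(kept)
```

In this toy setting, only the MNAR case shifts the observed mean upward, because small-bodied species are preferentially lost. The MAR case leaves the mean roughly unchanged here only because the simulated nocturnality flag is independent of body mass.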
Methods
We curated and assembled available databases for global tetrapod groups and used the latest phylogeny-based methods to create the most comprehensive tetrapod attribute dataset to date. While the TetrapodTraits database also covers a wide range of attributes derived from species range maps (see S1 Table, Supporting information), our focus regarding the imputations primarily centred on natural history traits, specifically: body length, body mass, activity time, and microhabitat. Our procedures can be summarised in five general steps: (i) data acquisition; (ii) taxonomic harmonisation; (iii) outlier verification; (iv) taxonomic imputation; and (v) phylogenetic multiple imputation.
Data acquisition
We compiled species-level attributes regarding taxonomy, body size, activity time, microhabitat, macrohabitat, ecosystem, threat status, biogeography, insularity, environmental preferences, and associated data sources for each tetrapod species (S1 Table). We gathered information from several global, continental, and regional databases, and complemented the existing data with published (articles, book chapters, and field guides) and grey literature (e.g., technical reports, government documents, monographs, theses). We also incorporated unpublished data gathered during fieldwork performed by some of us. To minimise the uneven representation of ecological attributes across clades, we initially identified genera and families whose species did not have available data on body size, activity time, or microhabitat. We then used species belonging to these genera and families to carry out additional online searches on academic platforms (Google Scholar and Web of Science) and included complementary attribute data whenever possible. To improve the chances of finding relevant natural history data [45,46], we conducted these searches using natural history terms in English (e.g., activity time, microhabitat, body size, length, mass, weight), Portuguese (e.g., tempo de atividade, micro-habitat, tamanho de corpo, comprimento, massa, peso), and Spanish (e.g., tiempo de actividad, microhabitat, tamaño del cuerpo, longitud, masa, peso) along with the respective species scientific name or unique synonyms (see Taxonomic harmonisation section). When sources were available in other languages, we employed translation tools for inspection (e.g., Google Translate). In our examination of the data sources, we did not use trait values provided solely at the genus level (e.g., mean value per genus).
Briefly, taxonomic data were represented by higher-level taxonomic ranks (Class, Order, Family), scientific name (same spelling as used in recent fully sampled phylogenies), authority name, and year of description. Three broad natural history traits (body size, activity time, and microhabitat) were compiled and harmonised across different tetrapod groups. Body size data consisted of information on body length (mm) and body mass (g). Activity time encompassed whether the species was diurnal and/or nocturnal. Cathemeral or crepuscular species were considered as both diurnal and nocturnal. Microhabitat included 5 categories of habitat use commonly reported in field guides and related literature: fossorial, terrestrial, aquatic, arboreal, and aerial. Microhabitat categories are not mutually exclusive, meaning that a species can be present in more than one category to represent intermediate microhabitats, such as semifossorial (which involves both fossorial and terrestrial categories) or semiarboreal (which combines terrestrial and arboreal). For birds only, we adapted microhabitat data from the EltonTraits database [47], which describes the estimated relative usage for seven types of foraging stratum. To make our definition of microhabitat similar across tetrapod groups, we reduced these seven categories to four by: summing the relative usage of species foraging below the water surface or on the water surface as aquatic; summing the relative usage of species foraging on the ground and below 2 m in the understorey as terrestrial; and summing the relative usage of species foraging from 2 m upward in the canopy and just above the canopy as arboreal. Species with aerial microhabitat were kept as defined in the EltonTraits database [47], and no fossorial bird was reported in the latter source.
We then binarised the relative usage of aquatic, terrestrial, arboreal, and aerial microhabitat using a threshold of 30% to consider a species as typical of a given microhabitat type. In a departure from previous mammal databases that treated fossorial and terrestrial species collectively as “terrestrial,” we reviewed microhabitat data to consider the fossorial lifestyle separately from the terrestrial one [22,47]. Macrohabitat data followed the IUCN Habitat Classification scheme v. 3.1 [48]. This scheme describes 17 major habitat categories in which species can occur and an 18th category for species with unknown major habitat (not included here). We also gathered data on species’ major ecosystem (terrestrial, freshwater, and marine). For both macrohabitat and ecosystem, we initially used the rredlist package [49] to obtain macrohabitat for 31,740 species and ecosystem for 32,442 species. Our macrohabitat variables correspond only to the first level of the IUCN Habitat Classification scheme. For an additional 769 species, we extracted macrohabitat data from relevant literature, bringing the coverage to 32,509 species (97.7% of all species considered). Ecosystem data were extracted from the literature for another 228 species, bringing the coverage to 32,670 species (98.2% of all species considered). We used the rredlist package [49] to obtain non-DD assessed status for 29,237 tetrapod species based on the IUCN Red List v. 2023–1 [50]. For 490 species not available via rredlist, we used non-DD assessed statuses matching those described in previous IUCN assessments and included in works using the same taxonomy of fully sampled trees [20,24,26,51]. We also used data from recently published assessments of amphibians [52] and chelonians [53] to inform assessed status for an additional 137 species. Across all sources consulted, data deficient species totalled 2,936 species. We did not find an assessed status for 508 species.
To enhance the usability of TetrapodTraits, we also provide the respective IUCN binomials for 32,098 species based on IUCN 2023–1 [50]. To compute spatially based attributes in TetrapodTraits, we derived expert-based range maps for amphibians [24,48,54], reptiles [20,33,48], mammals [48,54–56], and birds [27,54]. We matched the authoritative expert range maps for each of the tetrapod groups with the corresponding phylogenies and edited species ranges to ensure that they represented the species concept adopted in the corresponding phylogeny. Overall, our verification procedure of the species range maps can be summarised under 10 scenarios: (i) no changes, where species range maps matched directly with binomials in the phylogeny; (ii) synonyms, where species range maps were direct synonyms to binomials in the phylogeny, thus requiring only an updated name; (iii) splits, where species range maps needed to be clipped from a parent species, or where parent species needed to have part of their range removed; (iv) lumps, where species range maps needed to be combined with those of other species; (v) new species-1, where no range map was previously available, so we derived ranges based on recent literature; (vi) new species-2, where, in the absence of any published map, we drew a 10 km radius buffer around point occurrence data (including the species type locality); (vii) new species-3, where, in the absence of point occurrence data, we drew a polygon around nearby geographical features reported in the literature (e.g., boundaries of a municipality or protected area). We used two additional scenarios for extinct species [56] by referencing the natural ranges of either (viii) extant or (ix) extinct species that coexisted with fossil records of the target extinct species. The last scenario refers to (x) domesticated species, which were represented by their natural ranges before domestication. Homo sapiens had its range map represented by the overlapping of all range maps.
We did not derive range maps for 11 species (1 amphibian, 2 bats, and 8 squamates) because information on their occurrence was either vaguely defined (e.g., continental land mass or very large administrative unit) or completely absent. Species expert-based maps were used to compute different attributes related to species range and biogeography. We extracted the latitude and longitude centroids of each range map. Range size was measured as the number of 110 × 110 km equal-area grid cells intersected by each species, a spatial resolution that minimises the presence of errors related to the use of expert range maps [57–59]. We recorded the presence of a species in a grid cell if any part of the species distribution polygon overlapped with the grid cell. We then computed the proportion of the species range overlapped by each biogeographical realm. To define whether a species was an insular endemic, we used the available literature [60,61]. We further completed insularity data by registering species whose range maps intersected with minor (<2 km²) and major islands worldwide. Island vector data were sourced from Natural Earth (www.naturalearthdata.com, [62]) databases, v. 4.1.0 and v. 5.1.1 for minor and major islands, respectively. Species missing range maps were assumed to be non-insular based on the collection of type specimens within major continental land masses (e.g., South Asia, South America, West Africa). Finally, to inform spatially based attributes, we initially extracted the median value of environmental [63,64] and human influence variables [65,66] per grid cell. We then calculated their weighted average within each species range, using the species range occupancy per cell as weights. This approach aimed to reduce the impact of marginally occupied cells on the attributes derived from within-range measurements [67]. All within-range attributes are individually described in the Results and discussion section (see also S1 Table).
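The occupancy-weighted average described above can be sketched as follows. This is a minimal Python illustration rather than the authors' R code; the function name `range_weighted_mean`, the cell identifiers, and the temperature values are hypothetical.

```python
def range_weighted_mean(cell_values, occupancy):
    """Average per-cell values, weighting each cell by the fraction of the
    cell occupied by the species range (0 < fraction <= 1)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for cell_id, fraction in occupancy.items():
        if cell_id not in cell_values:
            continue  # no environmental value available for this cell
        weighted_sum += cell_values[cell_id] * fraction
        total_weight += fraction
    if total_weight == 0:
        return None  # species has no usable cells
    return weighted_sum / total_weight

# Hypothetical example: a fully occupied cell and a marginally occupied one.
temperature = {"c1": 10.0, "c2": 20.0}
occupancy = {"c1": 1.0, "c2": 0.25}
weighted = range_weighted_mean(temperature, occupancy)
# (10 * 1.0 + 20 * 0.25) / (1.0 + 0.25) = 12.0
```

An unweighted mean of the same two cells would be 15.0, so the marginally occupied cell pulls the estimate less when occupancy is used as the weight.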
Taxonomic harmonisation
The taxonomy of the TetrapodTraits database follows the respective taxonomies of the recent, fully sampled phylogenies for each of the major tetrapod groups [20,25–28]. The amphibian phylogeny taxonomy [25] follows the 19 February 2014 edition of AmphibiaWeb (http://amphibiaweb.org), with 7,238 species. The phylogeny for chelonians [20] follows the Turtles of the World (8th ed.) checklist [68], with the addition of Aldabrachelys abrupta and Al. grandidieri, and the synonymisation of Amyda ornata with Am. cartilaginea, adding to 357 chelonian species. For crocodilians [20], the taxonomy followed [69], complemented by the revalidation of Crocodylus suchus [70] and Mecistops leptorhynchus [71], and the recognition of three Osteolaemus species [72,73], resulting in 27 species. The taxonomy of the squamate and tuatara phylogeny [26] follows the Reptile Database update of March 2015 (http://www.reptile-database.org), with 9,755 species. For brevity, we refer hereafter to species in the latter phylogeny as squamates, although we recognise that the tuatara, Sphenodon punctatus, is not a squamate. The taxonomy of the bird phylogeny [27] followed the Handbook of the Birds of the World [74], including 9,993 species. Finally, the taxonomy of the mammal phylogeny [28] follows the IUCN [75] with modifications resulting in a net addition of 398 species, bringing the total to 5,911 species.
To maximise data usage from previous compilation efforts, and to ensure coherence among species names and the multiple data sources, we built lists of synonyms and valid names based on multiple taxonomic databases [48,76,77], and extracted the unique synonyms in each taxonomic database. By unique synonym, we refer to a binomial (scientific name valid or not) applied to only one valid name. We then performed taxonomic reconciliation based on four steps:
1. Direct match with data sources: We directly paired the names of each of the 33,281 species in TetrapodTraits with the potential source of the data. Species-level attributes of closely related species could appear as identical values if the attributes had been extracted from sources in which different species were treated as synonyms. We minimised the inclusion of duplicated values by flagging each taxonomic match between TetrapodTraits and external data sources to ensure that each data entry was made only once.
2. Direct match with data source synonyms: For the species we were unable to directly match with the data source in step 1, we updated the taxonomy using the list of unique synonyms, and then performed a new matching operation which allowed us to extract and flag additional data whenever possible.
3. Direct match with TetrapodTraits synonymies: Some data sources may follow more recent taxonomies than those inherited from the fully sampled phylogenies [20,25–28]. For species without a direct match after step 2, we updated their taxonomy in the TetrapodTraits using the list of unique synonyms and repeated the data extraction and flagging procedures.
4. Manual verification: For species without a direct match after step 3, we manually searched the specialised taxonomic databases (amphibians [76], reptiles [77], birds [78], and mammals [79]) for potential spelling errors and/or additional synonyms not yet included among our synonym lists. Whenever possible, we updated the taxonomy applied to data sources and then repeated the data extraction and flagging procedure.
Species without data after the completion of step 4 were classified as missing data.
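The three automated matching steps can be sketched as a cascade in which each source entry is flagged once it has been used. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the function `reconcile`, the dictionary layout, and the species names are hypothetical, and a single synonym dictionary (unique synonym mapped to its valid name) stands in for the separate synonym lists.

```python
def reconcile(tt_names, source, synonyms):
    """Match database names to source entries in three automated steps.

    tt_names : names used in the trait database
    source   : dict mapping source name -> attribute value
    synonyms : dict mapping a unique synonym -> its single valid name
    Returns (matched, unmatched); matched maps name -> (value, step).
    """
    matched = {}
    used = set()  # source names already consumed, so no entry is used twice

    # Step 1: direct match between database names and source names.
    for name in tt_names:
        if name in source:
            matched[name] = (source[name], 1)
            used.add(name)

    # Step 2: update the source taxonomy with unique synonyms, then rematch.
    updated = {synonyms[s]: s for s in source if s not in used and s in synonyms}
    for name in tt_names:
        if name not in matched and name in updated:
            src = updated[name]
            matched[name] = (source[src], 2)
            used.add(src)

    # Step 3: update the database taxonomy with unique synonyms, then rematch.
    for name in tt_names:
        newer = synonyms.get(name)
        if name not in matched and newer in source and newer not in used:
            matched[name] = (source[newer], 3)
            used.add(newer)

    # Names left over would go to manual verification (step 4).
    unmatched = [n for n in tt_names if n not in matched]
    return matched, unmatched

# Hypothetical names, for illustration only.
tt_names = ["Hyla a", "Hyla b", "Hyla c", "Hyla d"]
source = {"Hyla a": 31.0, "Rana b": 45.0, "Boana c": 52.0}
synonyms = {"Rana b": "Hyla b", "Hyla c": "Boana c"}
matched, unmatched = reconcile(tt_names, source, synonyms)
```

Here "Hyla a" matches directly, "Hyla b" is recovered after updating the source name, "Hyla c" after updating the database name, and "Hyla d" is left for manual checking.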
Outlier verification
We implemented two approaches to detect potential inconsistencies in the continuous attribute data (body length and body mass) before applying phylogeny-based methods to impute missing values.
1. Interquartile range criterion: Species body length and body mass were log10-transformed and then flagged as outliers if their values fell outside the interval $[q_{0.25} - 1.5 \times \mathrm{IQR},\ q_{0.75} + 1.5 \times \mathrm{IQR}]$, where $q_{0.25}$ and $q_{0.75}$ are the first and third quartiles, respectively, and $\mathrm{IQR} = q_{0.75} - q_{0.25}$ is the interquartile range.
2. Deviation from allometric relationship: Although allometric escape is a phenomenon observed in nature, we used interactive scatterplots to flag species with unusual deviations from the expected allometric relationship between body length and mass. We inspected allometric relationships separately for species within each Class, Order, and Suborder.
For species flagged in steps 1 or 2 above, we checked body length and/or body mass for validity and corrected these values where necessary. Data entries that could not be confirmed using a reliable source were purged from the database.
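The interquartile range criterion can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code; the function name `iqr_outliers` and the example masses are hypothetical, and the quartile interpolation method (`"inclusive"`) is an assumption, since the text does not state which quantile definition was used.

```python
import math
import statistics

def iqr_outliers(values):
    """Flag values whose log10 lies outside [q1 - 1.5*IQR, q3 + 1.5*IQR]."""
    logs = [math.log10(v) for v in values]
    # Quartiles of the log10-transformed data (interpolation method assumed).
    q1, _, q3 = statistics.quantiles(logs, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v, lg in zip(values, logs) if lg < lower or lg > upper]

# Hypothetical body masses (g) with one suspicious entry.
masses = [10, 12, 15, 11, 13, 14, 12, 100000]
flagged = iqr_outliers(masses)
```

Flagged entries are candidates for checking against a reliable source, not automatic deletions.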
Taxonomic imputation
The global scope of the present database inevitably includes some gaps that are hard to fill, e.g., natural history data for species known only from the holotype or a few specimens [12,80]. Previous studies have addressed this challenge and reduced data missingness by using values imputed at the level of genus or from close relatives [47]. Although these earlier strategies of “taxonomic imputation” might artificially reduce variability in attribute values, they are useful for filling gaps in highly conserved attributes, and can ultimately help increase the performance of phylogeny-based imputation methods applied in concert with correlated attribute data [81,82]. We used taxonomic imputations for two cases of missing data in microhabitat: chiropterans, which were considered “aerial” (112 species), and dolphins and whales, which were considered “aquatic” (5 species). For the remaining tetrapod species, we computed the per-genus proportion of species in each type of activity time (diurnal or nocturnal), microhabitat (fossorial, terrestrial, aquatic, arboreal, aerial), macrohabitat (17 binary variables informing the IUCN Habitat Classification scheme), and ecosystem (terrestrial, freshwater, marine). If a type of activity time, microhabitat, macrohabitat, or ecosystem appeared in at least 70% of species in the genus, we assumed this ecological attribute was also present among species with missing data in the respective genus. Our goal was to reduce missing values in activity time, microhabitat, and macrohabitat for groups with well-known ecologies (observed data available for at least 70% of species) before running phylogeny-based imputation methods. The number of tetrapod species receiving taxonomic imputations totalled 866 for activity time, 1,110 for microhabitat, 772 for macrohabitat, and 611 for ecosystem. We did not use taxonomic imputation for continuous attributes.
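The 70% genus-level rule can be sketched as below. This is a minimal Python illustration, not the authors' code; the function `taxonomic_impute`, the record layout, and the genus names are hypothetical. One assumption is made explicit: the proportion is computed over all species in the genus, including those with missing data, which is one reading of the rule stated above.

```python
def taxonomic_impute(records, trait, threshold=0.7):
    """Return {species: 1} for species with a missing binary trait whose
    genus shows the trait in at least `threshold` of its species.

    records: list of dicts with keys 'species', 'genus', and `trait`,
             where the trait is 1 (present), 0 (absent), or None (missing).
    """
    by_genus = {}
    for rec in records:
        by_genus.setdefault(rec["genus"], []).append(rec)

    imputed = {}
    for rows in by_genus.values():
        # Share of ALL species in the genus with the trait observed as present
        # (assumed denominator; missing entries count against the share).
        share = sum(1 for r in rows if r[trait] == 1) / len(rows)
        if share >= threshold:
            for r in rows:
                if r[trait] is None:
                    imputed[r["species"]] = 1
    return imputed

# Hypothetical genera: "Aus" is well known and mostly nocturnal, "Bus" is not.
records = (
    [{"species": f"Aus sp{i}", "genus": "Aus", "nocturnal": 1} for i in range(8)]
    + [{"species": f"Aus sp{i}", "genus": "Aus", "nocturnal": None} for i in (8, 9)]
    + [{"species": "Bus sp0", "genus": "Bus", "nocturnal": 1},
       {"species": "Bus sp1", "genus": "Bus", "nocturnal": 1},
       {"species": "Bus sp2", "genus": "Bus", "nocturnal": 0},
       {"species": "Bus sp3", "genus": "Bus", "nocturnal": None}]
)
imputed = taxonomic_impute(records, "nocturnal")
```

Only the two data-poor "Aus" species receive a value, because the trait reaches 80% in that genus but only 50% in "Bus".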
Phylogenetic multiple imputation
To minimise missing values and capture their uncertainty, we applied the mixgb method [30], a recently developed approach that combines the tree-based algorithm XGBoost [83] with predictive mean matching (PMM) [84], a multiple imputation technique. XGBoost captures interactions and nonlinear relations among variables, while PMM, alongside subsampling, addresses variability associated with missing data. PMM assigns imputed values to each missing entry based on a group of k donors whose predicted values are the most similar among the observed entries. One donor is then randomly selected, and its observed value is used for imputation [84,85]. The process is repeated m times to produce multiple imputations. When PMM uses a single donor without subsampling, imputations are expected to be identical. XGBoost does not directly incorporate a phylogenetic tree in its computations. To account for phylogenetic information, we used the phylogenetic covariance matrix of each fully sampled tree [20,25–28] to derive a set of phylogenetic filters (eigenvectors). We determined the number of phylogenetic filters to retain using the broken stick rule [86]. The selection of phylogenetic filters was performed separately for each tetrapod group and across the subset of 100 trees. To assess the reliability of imputed data, we initially filtered a subset of species with complete data within each tetrapod group. Then, we randomly partitioned these subsets into 10 folds for cross-validation. In each iteration, one fold was excluded from the training process and used as testing data in subsequent modelling. Continuous variables (body length and body mass) were log10-transformed to reduce skewness, while binary variables represented types of microhabitat (fossorial, terrestrial, aquatic, arboreal, aerial) and activity time (diurnal and nocturnal). Note that microhabitat and activity time types are not mutually exclusive.
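The donor-selection step of predictive mean matching described above can be sketched independently of the model that supplies the predictions. This is a minimal Python illustration, not the mixgb implementation: the function `pmm_impute` and all numbers are hypothetical, and the predicted values are simply given as inputs (in the study they come from XGBoost models).

```python
import random

def pmm_impute(pred_obs, y_obs, pred_mis, k, m, rng):
    """Predictive mean matching with k donors, repeated m times.

    pred_obs : model predictions for entries with observed values
    y_obs    : the corresponding observed values
    pred_mis : model predictions for entries with missing values
    Returns m lists, each holding one imputed value per missing entry.
    """
    imputations = []
    for _ in range(m):
        draw = []
        for p in pred_mis:
            # The k observed entries whose predictions are closest to p.
            donors = sorted(range(len(y_obs)),
                            key=lambda i: abs(pred_obs[i] - p))[:k]
            # Use the OBSERVED value of one randomly chosen donor.
            draw.append(y_obs[rng.choice(donors)])
        imputations.append(draw)
    return imputations

# Hypothetical predictions and observations (e.g., log10 body mass).
pred_obs = [1.0, 2.0, 3.0, 4.0]
y_obs = [1.1, 1.9, 3.2, 3.8]
pred_mis = [2.1, 3.9]

single_donor = pmm_impute(pred_obs, y_obs, pred_mis, k=1, m=5, rng=random.Random(0))
three_donors = pmm_impute(pred_obs, y_obs, pred_mis, k=3, m=5, rng=random.Random(0))
```

With a single donor, every repetition returns the same values, which matches the remark that imputations are then identical. Using several donors lets the imputed values vary across repetitions while remaining values that were actually observed.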
For birds only, we complemented the observed attribute data with 10 morphometric traits (log10-transformed) made recently available through the AVONET database [87]. XGBoost places a central emphasis on the tuning of hyperparameters, covering aspects such as learning rates, tree topology, subsampling, weighting, and regularisation [83,88]. Our tuning procedure began with an initial grid search, exploring 1,000 parameter combinations uniformly drawn from specified ranges for five key hyperparameters: learning rate (η) from 0.01 to 0.3, maximum tree depth from 3 to 12, subsample portion of training data from 0.7 to 1, minimum child weight from 0.5 to 1.5, and number of boosting iterations (nrounds) from 30 to 1,000. For continuous traits, our goal was to minimise the normalised root mean square error (NRMSE) in XGBoost regression models using the “reg:squarederror” objective, while for binary traits, we sought to reduce misclassification error in models using the “binary:logistic” objective. We refined the hyperparameter selection by reassessing model performance with an additional 1,000 parameter combinations uniformly drawn from the parameter ranges defined by the top 5% of models. The tuning procedure was performed separately for each combination of response variable and tetrapod group. Other XGBoost parameters were kept at their default values. Following hyperparameter tuning, we trained mixgb models under the 10-fold cross-validation approach. Each mixgb model used predictive mean matching with 10 donors and provided 10 imputations for each missing entry. The selection of donors, crucial for predictive mean matching in mixgb models, is based on exploring the multivariate predictor space (i.e., phylogenetic filters and natural history traits) during the tree building process in XGBoost models.
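The two-round search over hyperparameter ranges can be sketched as follows. This is a minimal Python illustration of the sampling logic only, not the authors' tuning code: the helper names and `toy_error` are hypothetical, and `toy_error` is a placeholder for the cross-validated NRMSE or misclassification error of a fitted model, which is not computed here.

```python
import random

# Ranges quoted in the text; max_depth and nrounds are treated as integers.
RANGES = {
    "eta": (0.01, 0.3),
    "max_depth": (3, 12),
    "subsample": (0.7, 1.0),
    "min_child_weight": (0.5, 1.5),
    "nrounds": (30, 1000),
}
INTEGER_PARAMS = {"max_depth", "nrounds"}

def sample_combinations(ranges, n, rng):
    """Draw n parameter combinations uniformly from the given ranges."""
    combos = []
    for _ in range(n):
        combo = {}
        for name, (low, high) in ranges.items():
            if name in INTEGER_PARAMS:
                combo[name] = rng.randint(round(low), round(high))
            else:
                combo[name] = rng.uniform(low, high)
        combos.append(combo)
    return combos

def narrowed_ranges(combos, errors, top=0.05):
    """Ranges spanned by the best `top` share of combinations (lowest error)."""
    n_top = max(1, int(len(combos) * top))
    order = sorted(range(len(combos)), key=lambda i: errors[i])[:n_top]
    best = [combos[i] for i in order]
    return {name: (min(c[name] for c in best), max(c[name] for c in best))
            for name in combos[0]}

def toy_error(combo):
    """Placeholder score; a real run would fit and cross-validate a model."""
    return (combo["eta"] - 0.1) ** 2 + (combo["max_depth"] - 6) ** 2 / 100

rng = random.Random(7)
first_round = sample_combinations(RANGES, 1000, rng)
refined = narrowed_ranges(first_round, [toy_error(c) for c in first_round])
second_round = sample_combinations(refined, 1000, rng)
```

The second round concentrates sampling in the region where the first round performed best, which is the refinement step described above.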
Our framework yielded a total of 10,000 imputed values for each missing entry across the subset of 100 phylogenetic trees (= 10 imputed values × 10 validation folds × 100 phylogenies). In each iteration, we assessed the reliability of imputations using four distinct validation metrics:
1. Pearson correlation: computed between imputed and observed values for continuous attributes (body length and body mass).
2. Regression slope: computed between imputed and observed values for continuous attributes (body length and body mass), where a slope >1 indicates overestimation and a slope <1 indicates underestimation.
3. Normalised root mean square error (NRMSE): computed for continuous attributes (body length and body mass), with lower values indicating higher accuracy [89].
4. Accuracy: the proportion of correctly classified entries, computed for binary categories representing microhabitat (fossorial, terrestrial, aquatic, arboreal, aerial) and activity time (diurnal and nocturnal).
Overall, the number of tetrapod species receiving phylogenetic multiple imputations totalled 8,123 for body length, 6,752 for body mass, 6,756 for activity time, and 445 for microhabitat. All computations were carried out in R version 4.2.3 using the mixgb v. 0.1.0 [30], ape [90], gmodels [91], hydroGOF [92], and stats [93] packages. Raw data, code, and the 10,000 imputed values per missing entry per species are reported under data availability [94,95].
In our approach, there are four sources of variability when producing multiple imputed values for each missing data entry. Firstly, the PMM with 10 donors per entry increases variability by avoiding a reduced number of donors. Secondly, the PMM technique was replicated 10 times, so that multiple values could potentially be selected for each entry. Thirdly, the 10-fold cross-validation trained 10 different mixgb models per target variable. Finally, we incorporated 100 fully sampled phylogenetic trees, enabling species with imputed evolutionary relationships to vary their position in the multivariate predictor space, guided by phylogenetic filters. Our goal is to illustrate the utility of multiple imputation to uncover directional bias in natural history data. However, we recognise that this section does not constitute a comprehensive investigation into the impacts of the proportion of missing data and the degree of shared missingness on model performance. Delving into this aspect is beyond the scope of the present study and is an area for future research. For further details on mechanisms of data missingness, see [96,97].
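The four validation metrics listed above can be written out explicitly. This is a minimal Python sketch rather than the authors' R code; the function names and example values are hypothetical. Two choices are assumptions, since the text does not specify them: the slope is taken from regressing imputed on observed values, and the RMSE is normalised by the standard deviation of the observed values.

```python
import math

def _mean(xs):
    return sum(xs) / len(xs)

def pearson(observed, imputed):
    """Pearson correlation between observed and imputed values."""
    mo, mi = _mean(observed), _mean(imputed)
    cov = sum((o - mo) * (i - mi) for o, i in zip(observed, imputed))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_i = sum((i - mi) ** 2 for i in imputed)
    return cov / math.sqrt(var_o * var_i)

def slope(observed, imputed):
    """Least-squares slope of imputed regressed on observed (assumed direction)."""
    mo, mi = _mean(observed), _mean(imputed)
    cov = sum((o - mo) * (i - mi) for o, i in zip(observed, imputed))
    return cov / sum((o - mo) ** 2 for o in observed)

def nrmse(observed, imputed):
    """RMSE divided by the sample SD of observed values (assumed normalisation)."""
    n = len(observed)
    rmse = math.sqrt(sum((i - o) ** 2 for o, i in zip(observed, imputed)) / n)
    mo = _mean(observed)
    sd = math.sqrt(sum((o - mo) ** 2 for o in observed) / (n - 1))
    return rmse / sd

def accuracy(observed, imputed):
    """Proportion of binary entries classified correctly."""
    return sum(1 for o, i in zip(observed, imputed) if o == i) / len(observed)

# Hypothetical held-out values: imputations are shifted upward by 0.5.
obs = [1.0, 2.0, 3.0, 4.0]
imp = [1.5, 2.5, 3.5, 4.5]
r, b, err = pearson(obs, imp), slope(obs, imp), nrmse(obs, imp)
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

The example shows why several metrics are useful together: a constant upward shift leaves the correlation and slope at 1 but is picked up by the NRMSE.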
Patterns of shared missing data
We assessed co-occurrence patterns in missing data across species using the “checkerboard score” (C-score; [98]), which is less prone to type II errors than other co-occurrence metrics computed under a null model approach [99]. The C-score was based on a binary presence–absence matrix (PAM) of species (columns) and missing attributes (rows). These attributes were represented by five binary variables informing the absence of observed values in body length, body mass, activity time, microhabitat, and threat status. Missing data in threat status were represented by data deficient (DD) or non-assessed species. We computed the C-score using individual PAMs for each genus and family with at least two species, and each pairwise combination of attribute variables, following recommendations by [100]. To verify whether an observed C-score differed from the value expected by chance, we built a null distribution of C-score values using a randomisation algorithm in which attribute completeness (row sums) remained fixed while the probability of showing missing attribute values was considered equal for all species. Null distributions were built for each individual PAM using 10,000 iterations with a burn-in of 500. We then computed the standardised effect size (SES) of the C-score and associated p-values and identified the pairwise attribute combinations with an aggregated (SES C-score <0 and p < 0.05) or segregated (SES C-score >0 and p < 0.05) co-occurrence pattern of missing data per taxon. Computations were performed using the EcoSimR [101] package in R.
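The pairwise C-score and its standardised effect size can be sketched as follows. This is a minimal Python illustration, not the EcoSimR implementation: the function names and the example rows are hypothetical. The null model shuffles each attribute row independently, which keeps row sums fixed and treats all species as equally likely to have missing values; the burn-in mentioned above is omitted because each shuffle here is an independent draw, and p-values are not computed.

```python
import random
import statistics

def c_score(row_a, row_b):
    """Checkerboard score for two binary rows: (r_a - S) * (r_b - S),
    where r is a row total and S the number of shared presences."""
    shared = sum(1 for a, b in zip(row_a, row_b) if a and b)
    return (sum(row_a) - shared) * (sum(row_b) - shared)

def ses_c_score(row_a, row_b, n_iter, rng):
    """Standardised effect size of the observed C-score against a null
    distribution built by shuffling each row across species."""
    observed = c_score(row_a, row_b)
    null = []
    for _ in range(n_iter):
        a, b = list(row_a), list(row_b)
        rng.shuffle(a)  # row sum fixed, species equiprobable
        rng.shuffle(b)
        null.append(c_score(a, b))
    sd = statistics.pstdev(null)
    if sd == 0:
        return float("nan")  # SES undefined without null variation
    return (observed - statistics.fmean(null)) / sd

# Hypothetical genus of 10 species: 1 = attribute value missing.
body_mass_missing = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
activity_missing = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
ses = ses_c_score(body_mass_missing, activity_missing, 500, random.Random(1))
```

Because the same five species lack both attributes, the observed C-score is 0 and the SES is negative, corresponding to an aggregated pattern of missing data.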
---
[1] Url:
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002658
Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.