TITLE: `BIOMASS::getWoodDensity()` description

TITLE: `BIOMASS::getWoodDensity()` description
DATE: 2022-11-23
AUTHOR: John L. Godlee
====================================================================

I looked through the code for the getWoodDensity() function from
the {BIOMASS} R package, to get a better idea of how it estimates
tree wood density.

[R package]:
https://cran.r-project.org/web/packages/BIOMASS/index.html

The function uses the wood density database from the Zanne et al.
(2009) global wood density database. The database contains 16467
records from 8388 taxa.

[Zanne et al. (2009) global wood density database]:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.234

You feed the function a set of species names, and optionally family
names, stand IDs (plots), extra wood density data supplied by the
user, and a geographic region with which to subset the wood density
database.

First, getWoodDensity() filters to wood density database to only
include the families, genera, or species present in the wood
density database. This means that an entry in the input data where
family = "Fabaceae", genus = "Brachystegia", and species =
"spiciformis", the filtered wood density data will include records
from the species Acacia acuminata, because this species is in the
Fabaceae family.

Then, getWoodDensity() calculates the mean wood density for each of
those species. These species-level estimates are used when there is
a match in the input data.

The function then calculates mean wood density for each genus by
taking the mean of the species-level means calculated in the
previous step. This means that in the case where species within a
genus have more than one record, the genus-level mean calculated by
the function will be slightly different to the grand mean
calculated from the raw wood density records. This is a good
example of Simpson's paradox. Additionally, the values of nInd
generated for these genus-level estimates are the number of
species-level means, not the number of raw records. These
genus-level means are used when there is a match in the input data
that hasn't yet been filled by a species-level estimate.

[Simpson's paradox]:
https://en.wikipedia.org/wiki/Simpson%27s_paradox

For genus-level indets (e.g. Fabaceae indet) the function
calculates mean wood density for the family, again by taking the
mean of genus-level estimates calculated in the previous step.
Again, these family-level means are used where there is a match in
the input data that doesn't yet have an estimate of wood density.

For family-level indets (i.e. no taxonomic information provided) it
calculates mean wood density for the plot, using the best wood
density estimates generated for each individual in previous steps.
This means that if only a single individual in the plot has a wood
density estimate, this value will be used as the wood density
estimate for all other individuals on the plot.

In the case where no wood density estimates have been generated for
that plot, i.e. no stems on the plot have taxonomic information,
the mean of all other best wood density estimates generated for
other plots is used.

In the case where no wood density estimates have been generated for
any plots, because none of the submitted taxa have records in the
database, the function fails.

The way the function is written it makes sense that if you are
generating wood density estimates for multiple sites, which might
differ in their species pool, you should run the function site by
site in a loop or a lapply(), so that the wood density estimates
generated at genus-level and above are calculated using species
which are likely to occur in that site.

In the event that the function fails due to lack of data, it might
still be possible to generate an estimate of wood density for the
site, either the regional mean, or even the global mean from the
wood density database.

I think the function might be doing the wrong thing by calculating
means of means, rather than the grand mean. For example, imagine a
genus with two species. Species "A" has 1000 wood density
measurements with a mean of 2.56. Species "B" has 5 measurements
with a mean of 1.5. The function is estimating the wood density of
an average species within the genus (2.56 + 1.5)/2. If we instead
take the grand mean ( (1000*2.56) + (5*1.5) ) / 1005 we are
estimating the wood density of an average individual. As the
function eventually applies these estimates to individuals from the
input data, it makes more sense to me to calculate the grand mean.