TITLE: Empirical Cumulative Density Function (ECDF) plots
DATE: 2021-04-15
AUTHOR: John L. Godlee
====================================================================


Twice recently I've wanted to check the proportional coverage of a
dataset. For example, I have a dataset where trees are nested
within plots. I wanted to check what proportion of trees within
each plot had measurements in the TRY global trait database for
leaf nitrogen per dry mass. Specifically, I wanted to check what
proportion of plots had at least a given proportion of trees or
more represented in TRY.

 [TRY global trait database]: https://www.try-db.org/

I could have visualised this with a histogram, but it's not very
visually appealing because it requires some mental gymnastics to
add up each bin greater than the minimum threshold:

 ![Histogram of TRY representation in
plots](https://johngodlee.xyz/img_full/ecdf/hist.png)

A better method I've found is an Empirical Cumulative Density
Function (ECDF) plot.

First I have the dataframe, where each row is a tree, with plot ID,
species, and mean leaf nitrogen per dry mass for that species. If a
species doesn't have a nitrogen measurement, it's recorded as NA:

 plot     species                 leaf_n
 -------- ----------------------- --------
 JLG_10   Combretum collinum      24.29
 JLG_10   Pericopsis angolensis   NA
 JLG_10   Pericopsis angolensis   NA
 JLG_10   Burkea africana         20.72
 JLG_10   Burkea africana         20.72
 JLG_10   Burkea africana         20.72

Then I calculate the proportion of trees with trait measurements:

   # Split stems data by plot
   dat_split <- split(dat, dat$plot)

   # For each plot, proportion of trees and ba represented for
each trait
   prop <- do.call(rbind, lapply(dat_split, function(x) {
     n_trees <- nrow(x)

     data.frame(
       plot = unique(x$plot),
       n_trees,
       prop_leaf_n = nrow(x[!is.na(x$leaf_n),]) / n_trees
     )
   }))

Then calculate what percentage of plots have above a minimum
threshold of trees with measurements:

   thresh <- 0.8

   prop_plots <- sum(prop$prop_leaf_n > thresh) / nrow(prop)

Then I can make an ECDF plot using {ggplot2}:

   ggplot() +
       geom_line(data = prop,
       aes(x = prop_leaf_n, y = 1 - ..y..),
       stat = "ecdf", pad = FALSE) +
       geom_vline(xintercept = thresh, linetype = 2, colour =
"red") +
       geom_hline(yintercept = prop_plots, linetype = 2, colour =
"red") +
       labs(x = "Proportion of trees", y = "Proportion of plots")
+
       theme_bw()

 ![ECDF plot, proportion of plots with proportion of trees with
leaf N measurements](https://johngodlee.xyz/img_full/ecdf/ecdf.png)

From the plot above it's easy to see that in 31% of the plots at
least 80% of trees have at least one leaf nitrogen measurement in
the TRY database. It's also easy to see the trend in the coverage.
Hardly any plots have all trees covered by leaf nitrogen
measurements, and it really drops off a cliff at about 98% tree
coverage.

The plot above is generated from a real dataset of ~1250
woodland/savanna plots in southern Africa. It's easy to see that
trait coverage is not great in southern African species, especially
given that leaf nitrogen is one of the most commonly collected
traits that can tell us a lot about life history strategy and place
am individual on the leaf economic spectrum. When you aggregate to
genus coverage gets a bit better, but I am sceptical about how much
I can trust a trait estimate that is an aggregate of species
potentially found across the world. Additionally, these graphs only
tell part of the story, because they only count whether a tree has
at least one trait measurement in TRY. To get decent estimates of
the mean of a trait like leaf nitrogen content you need LOTS of
replication, preferably >10,000 data points to account for all the
bias that can be introduced by site, season, leaf age, tree age,
leaf location on the tree, measurement error, etc.