TITLE: Processing data from the TRY traits database
DATE: 2021-12-20
AUTHOR: John L. Godlee
====================================================================


I've been working recently with data from the TRY global plant
traits database, to assess which dominant dry tropical tree species
have good trait data, and which are lacking it. One of the key
inputs to the process-based carbon cycle model used in the SECO
project is leaf mass per area (LMA), sometimes expressed as its
inverse, leaf area per mass, aka specific leaf area (SLA), so I'm
focussing on that. If we find gaps in the trait coverage of some
species, maybe we can address those gaps with data collection
during the project.

The data returned by TRY data requests come in a format that makes
them quite difficult to parse in R. Instead of a 2D table with one
row per observation, it's more like a 1D list, with metadata and
trait data on different rows, linked by an observation ID. In this
post I want to share the R code I use to create a neat dataframe
from this data.
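
To make that layout concrete, here is a small made-up example of
the long format. The values and the trait and data IDs 99 and 500
are invented for illustration, they are not real TRY records. Only
the column names and the data IDs 59 and 60 (latitude and longitude)
come from the code later in this post.

```r
# Toy data, invented for illustration; not real TRY records
toy_try <- data.frame(
  ObservationID = c(1, 1, 1, 2, 2),
  TraitID = c(99, NA, NA, 99, NA),   # 99 is an arbitrary placeholder
  DataID = c(500, 59, 60, 500, 59),  # 500 is an arbitrary placeholder
  StdValue = c(12.5, -15.2, 28.3, 9.8, -18.9)
)

# Rows with a TraitID hold trait measurements,
# rows without one hold metadata for the same observation
toy_trait_rows <- toy_try[!is.na(toy_try$TraitID), ]
toy_meta_rows <- toy_try[is.na(toy_try$TraitID), ]
```

Each observation contributes one or more trait rows plus a variable
number of metadata rows, which is why the data can't be used as a
normal table straight away.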

I use data.table::fread() to read in the data, because the files
can be quite large, 3.35 GB in my case:

   # Packages used in this post
   library(data.table)
   library(dplyr)
   library(parallel)
   try_dat <- fread("dat/18017.txt", header = TRUE, sep = "\t",
     dec = ".", quote = "", data.table = FALSE, encoding = "UTF-8")

Also note that the data are tab separated, and to fix encoding
issues it's a good idea to enforce UTF-8 encoding.

Then I rename some columns and keep the useful ones:

   try_clean <- try_dat %>%
     dplyr::select(
       obs_id = ObservationID,
       species_id = AccSpeciesID,
       species_name = AccSpeciesName,
       trait_id = TraitID,
       trait_name = TraitName,
       key_id = DataID,
       key_name = DataName,
       val_orig = OrigValueStr,
       val_std = StdValue,
       unit_std = UnitName,
       error_risk = ErrorRisk)

I create lookup tables to match the species IDs and trait IDs later
on:

   species_id_lookup <- try_clean %>%
     dplyr::select(species_id, species_name) %>%
     unique()

   trait_id_lookup <- try_clean %>%
     dplyr::select(trait_id, trait_name) %>%
     unique() %>%
     filter(!is.na(trait_id))

Then I split the data by observation ID:

   try_split <- split(try_clean, try_clean$obs_id)
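
As a quick illustration of what split() returns here, this sketch
uses a made-up dataframe (toy_obs is a hypothetical name, and the
values are invented):

```r
# Toy data, invented for illustration
toy_obs <- data.frame(
  obs_id = c(1, 1, 2, 3, 3, 3),
  val_std = c(12.5, -15.2, 9.8, 7.1, -18.9, 30.0)
)

# One list element per observation ID, named by that ID
toy_split <- split(toy_obs, toy_obs$obs_id)
names(toy_split)  # "1" "2" "3"

# Number of rows belonging to each observation
toy_rows <- sapply(toy_split, nrow)
```

Each list element is a small dataframe holding all the rows, trait
and metadata alike, that share one observation ID.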

Then I loop through each of those observations, extracting the
trait data and some useful metadata that is commonly attached to
each observation. But note that there are lots of metadata in TRY,
and not all observations share all metadata. A lot don't even have
latitude and longitude coordinates, limiting their usefulness.

   total <- length(try_split)
   try_df <- as.data.frame(do.call(rbind,
     mclapply(seq_along(try_split), function(x) {
     message(x, "/", total)
     x <- try_split[[x]]
     # Subset columns
     traits <- x[!is.na(x$trait_id),
       c("species_id", "trait_id", "val_orig", "val_std",
         "unit_std", "error_risk")]

     # Extract some common metadata
     meta_ext <- function(y, key_val) {
       ext <- y[y$key_id == key_val, "val_std"]
       ifelse(length(ext) == 0, NA, ext)
     }

     traits$elev <- meta_ext(x, 61)
     traits$longitude <- meta_ext(x, 60)
     traits$latitude <- meta_ext(x, 59)
     traits$map <- meta_ext(x, 80)
     traits$mat <- meta_ext(x, 62)
     traits$biome <- meta_ext(x, 193)
     traits$country <- meta_ext(x, 1412)

     return(traits)
   }, mc.cores = 3)))
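
To show what happens to a single observation inside that loop, here
is a minimal sketch run on one made-up observation, without the
parallel wrapper. The values, the trait ID 99 and the data ID 500
are invented. The metadata helper is rewritten with if/else for
clarity and renamed meta_ext_demo, but it is meant to behave the
same way as meta_ext() above.

```r
# One made-up observation in the renamed column layout (values invented)
obs <- data.frame(
  obs_id = 1,
  species_id = 10,
  trait_id = c(99, NA, NA),  # 99 is an arbitrary placeholder
  key_id = c(500, 59, 60),   # 500 is an arbitrary placeholder
  val_orig = c("12.5", "-15.2", "28.3"),
  val_std = c(12.5, -15.2, 28.3),
  unit_std = c("mm2 mg-1", "", ""),
  error_risk = c(1.2, NA, NA)
)

# Return the first standardised value stored under a data ID, or NA
meta_ext_demo <- function(y, key_val) {
  ext <- y[y$key_id == key_val, "val_std"]
  if (length(ext) == 0) NA else ext[1]
}

# Keep only the trait rows, then attach metadata as new columns
obs_traits <- obs[!is.na(obs$trait_id),
  c("species_id", "trait_id", "val_orig", "val_std",
    "unit_std", "error_risk")]
obs_traits$latitude <- meta_ext_demo(obs, 59)
obs_traits$longitude <- meta_ext_demo(obs, 60)
obs_traits$elev <- meta_ext_demo(obs, 61)  # no row with data ID 61, so NA
```

The three input rows collapse to a single trait row, with latitude
and longitude alongside it and NA for the missing elevation.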

Finally, I can add the trait and species names back in using the
lookup tables:

   # Add trait and species names to dataframe
   try_df$trait_name <- trait_id_lookup$trait_name[
     match(try_df$trait_id, trait_id_lookup$trait_id)]

   try_df$species_name <- species_id_lookup$species_name[
     match(try_df$species_id, species_id_lookup$species_id)]
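
The match() idiom used for both lookups can be seen in isolation in
this small sketch. The lookup table and IDs are made up, and
toy_lookup and toy_ids are hypothetical names:

```r
# Toy lookup table, invented for illustration
toy_lookup <- data.frame(
  species_id = c(10, 20),
  species_name = c("Species A", "Species B")
)

# IDs as they might appear in the main dataframe; 30 has no lookup entry
toy_ids <- c(20, 10, 20, 30)

# match() gives the lookup row for each ID, NA where there is none
toy_names <- toy_lookup$species_name[match(toy_ids, toy_lookup$species_id)]
toy_names  # "Species B" "Species A" "Species B" NA
```

IDs that are missing from the lookup table come back as NA rather
than raising an error, so it is worth checking for NA values in the
new columns afterwards.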