TITLE: An R function to split species names
DATE: 2020-06-05
AUTHOR: John L. Godlee
====================================================================


For my research assistant position I have been cleaning lots of
taxonomic data for tree species in southern Africa. On the surface
this seems simple: Brachystegia spiciformis gets split into
c("Brachystegia", "spiciformis"). However, what about when the
species is written as Brachystegia spiciformis var. kwangensis?
Here is a list of possible species name forms I found in my dataset:

-   Brachystegia spiciformis
-   Brachystegia cf. spiciformis
-   Acacia abyssinica subsp. calophylla
-   Acacia sieberiana var. woodii

And that isn't counting the species with multiple below-species
taxonomic ranks, like: Vachellia gerrardii subsp. gerrardii var.
latisiliqua.

Separating these out by hand would take a very long time, so I
wrote a function which does it for me.

First the function splits each string on spaces, or optionally on
dots with no surrounding spaces. Then it checks whether the species
is marked cf. (confer), meaning that the exact species isn't known
but a guess has been made, in which case species is replaced with
indet (indeterminate) and the guessed species is stored in the
confer column. A similar process then searches for varieties and
subspecies. If below-species ranks are to be returned, the
dataframe is returned as is. Otherwise the confer value replaces
indet in the species column and only genus and species are
returned.
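
To make those steps concrete, here is a minimal sketch that walks
one name from the list above through them, outside of the function.
The helper name rank_after() is made up for illustration and is not
part of splitSpecies().

```r
# Illustrative sketch only; rank_after() is a hypothetical helper.
# Return the token that follows the first rank marker matching
# `pattern` (e.g. "var." or "subsp."), or NA if there is none.
rank_after <- function(tokens, pattern) {
  i <- which(grepl(pattern, tokens))
  if (length(i) == 0 || i[1] == length(tokens)) {
    return(NA_character_)
  }
  tokens[i[1] + 1]
}

# Step 1: split the name on spaces
name <- "Vachellia gerrardii subsp. gerrardii var. latisiliqua"
tokens <- strsplit(name, " ")[[1]]

# Step 2: is the second token "cf" or "cf."?
is_cf <- grepl("^cf\\.?$", tokens[2])

# Step 3: pick out the below-species ranks
subspecies <- rank_after(tokens, "^subsp?\\.?$")
variety <- rank_after(tokens, "^var\\.?$")
```

Anchoring the patterns with ^ and $ means a rank marker has to be a
whole token, so an epithet that merely contains "var" or "cf" is not
mistaken for one.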

splitSpecies() doesn't catch names like Brachystegia sp.2, but I
have a separate function which replaces these with Brachystegia
indet based on a lookup table supplied by the user.
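
That second function isn't shown here. As a rough idea of the
approach, here is a minimal sketch, assuming the lookup table is a
dataframe with two character columns. The function name
replaceFromLookup() and the column names original and corrected are
hypothetical.

```r
# Hypothetical sketch: replace names found in a user-supplied lookup
# table. `lookup` is assumed to have the character columns
# `original` and `corrected`.
replaceFromLookup <- function(x, lookup) {
  idx <- match(x, lookup$original)
  # Keep the name unchanged where the lookup table has no entry
  ifelse(is.na(idx), x, lookup$corrected[idx])
}

lookup <- data.frame(
  original = c("Brachystegia sp.2"),
  corrected = c("Brachystegia indet"),
  stringsAsFactors = FALSE
)

cleaned <- replaceFromLookup(
  c("Brachystegia sp.2", "Brachystegia spiciformis"),
  lookup
)
```

Names that aren't in the lookup table pass through untouched, so
this could be run before the names are split.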

   #' Split full species name into genus, species, and optionally
   #' below-species taxonomic ranks
   #'
   #' @param x vector of genus and species names
   #' @param subsp logical, should lower taxonomic ranks be
   #'   returned?
   #'
   #' @return dataframe of character vectors with one column per
   #'   rank
   #'
   #' @export
   #'
   splitSpecies <- function(x, subsp = TRUE) {
     # Treat a dot directly between two lowercase letters as a space
     # (e.g. "var.woodii"), without swallowing the letters either side
     x <- gsub("(?<=[a-z])\\.(?=[a-z])", " ", x, perl = TRUE)
     x <- strsplit(x, " +")

     x <- lapply(x, function(y) {
       # genus
       genus <- y[1]

       # cf and species
       if (grepl("^cf\\.?$", y[2])) {
         species <- "indet"
         cf <- y[3]
         plus <- 1
       } else {
         species <- y[2]
         cf <- NA_character_
         plus <- 0
       }

       # tokens after the species, which may hold below-species ranks
       rest <- y[-seq_len(2 + plus)]

       # variety if present: the token after "var" or "var.", else NA
       variety <- rest[which(grepl("^var\\.?$", rest))[1] + 1]

       # subspecies if present: the token after "subs", "subsp",
       # "subs." or "subsp.", else NA
       subspecies <- rest[which(grepl("^subsp?\\.?$", rest))[1] + 1]

       c(genus, species, cf, subspecies, variety)
     })

     # stringsAsFactors = FALSE keeps columns as character in R < 4.0
     out <- as.data.frame(do.call(rbind, x), stringsAsFactors = FALSE)
     names(out) <- c("genus", "species", "confer", "subspecies",
       "variety")[seq_along(out)]

     # Replace indet with the cf. species if subsp == FALSE
     if (!subsp) {
       has_cf <- !is.na(out$confer)
       out$species[has_cf] <- out$confer[has_cf]
       out <- out[, c("genus", "species")]
     }
     return(out)
   }