TITLE: Flow diagram for data cleaning process
DATE: 2020-06-10
AUTHOR: John L. Godlee
====================================================================


I've been involved in improving the data cleaning process for the
SEOSAW database. I created an R package with a load of data
cleaning functions and designed a workflow for ingesting new data
into the database. Today I faffed around for a while making a flow
diagram to keep track of all the functions.

 [SEOSAW]: https://seosaw.github.io/

I opted for a nested function design. At the bottom end there are
many functions which do very simple things like check whether a
particular column contains the correct factor levels (e.g. diam()).
These functions are then nested within a function which checks all
of the column contents in a dataset which is to be ingested into
the database (e.g. colValCheck()). At the top there are a small
number of high-level functions which perform the checking,
formatting, and adding of new columns all in one (e.g. stemTableGen()).
I'm not sure whether this is overly complicated or not, but to me
it seems reasonably intuitive, made easier with some good
documentation. It also affords the user a lot of flexibility in how
they construct their workflow.
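
As a rough illustration of the nesting, here is a minimal sketch in R.
The function names, column names and checks below are all made up for
the example and are not the package's actual API. It assumes a data
frame with a numeric `diam` column and a character `species` column.

```r
# Hypothetical low-level check: diameters must be positive numbers
check_diam <- function(x) {
  is.numeric(x) & !is.na(x) & x > 0
}

# Hypothetical low-level check: species names must be non-empty strings
check_species <- function(x) {
  is.character(x) & !is.na(x) & nzchar(x)
}

# Hypothetical mid-level function: runs every column check and
# returns the row numbers which failed, one list entry per column
check_columns <- function(df) {
  checks <- list(diam = check_diam, species = check_species)
  bad <- lapply(names(checks), function(col) {
    which(!checks[[col]](df[[col]]))
  })
  names(bad) <- names(checks)
  bad
}

# Hypothetical high-level function: checks, formats and adds a
# derived column all in one
build_stem_table <- function(df) {
  bad <- check_columns(df)
  n_bad <- sum(lengths(bad))
  if (n_bad > 0) {
    stop(n_bad, " value(s) failed column checks")
  }
  df$species <- trimws(df$species)        # formatting
  df$basal_area <- pi * (df$diam / 2)^2   # new column
  df
}

# Made-up example data
stems <- data.frame(
  diam = c(12.5, 30.1),
  species = c("Burkea africana ", "Julbernardia paniculata"),
  stringsAsFactors = FALSE
)
clean <- build_stem_table(stems)
```

A user who only wants to know which rows are problematic can call the
mid-level function on its own, while somebody who just wants a clean
table can go straight to the top-level function.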

The idea is the diagram will be hosted on the SEOSAW website along
with a vignette and the package manual, as well as on the Bitbucket
repository for the SEOSAW dataset, so that users can use it as a
quick reference when cleaning their own data, either for inclusion
in the SEOSAW database, or for comparison with the SEOSAW
repository.

 ![Flow diagram for data cleaning](https://johngodlee.xyz/img_full/package_diagram/diagram.jpg)