TITLE: Data management during and after my PhD
DATE: 2021-09-20
AUTHOR: John L. Godlee
====================================================================


During my PhD I spent a lot of time collecting field data. Along
with colleagues from Angola I set up 15 one-hectare permanent
woodland survey plots in Bicuar National Park, Angola, where we
conducted a census of all woody stems >5 cm diameter, and made
additional measurements of grass biomass and tree canopy cover.
These plots will hopefully be re-censused in 2023. I also collected
terrestrial LiDAR data in 22 one-hectare woodland plots, the 15 in
Bicuar National Park plus an additional seven in southern Tanzania,
to quantify canopy complexity.

I think these two datasets form a key product of my PhD thesis. PhD
students often generate a lot of data, but only a minority develop
a long-term plan for data management and dissemination. I chose to
write an extra chapter in my thesis about the multiple uses of the
data I collected and its contribution to the greater good of the
field, but that contribution depends on the data being properly
archived, managed, and advertised; otherwise nobody else will want
to use it. Just as I hope the investigative chapters of the thesis
can be converted into manuscripts for peer review, extending their
lifespan and reaching a larger audience, I hope the data I
collected during my PhD will have a legacy beyond the thesis
itself.

 [generate a lot of data]: https://hal.univ-lille.fr/hal-01248979
 [data management]: https://www.researchgate.net/publication/305395587_Research_traditions_and_emerging_expectations_PhD_students_and_their_research_data_management

Many universities have a data management plan web page; these are
some of the first results from a Google search for "phd data
management":

-   University College London
-   University of York
-   University of Sheffield
-   University of Bath
-   University of Leeds
-   University of Liverpool
-   University of Birmingham
-   University of Bristol
-   University of Exeter
-   University of Southampton

 [University College London]: https://www.ucl.ac.uk/library/research-support/research-data-management/policies/writing-data-management-plan
 [University of York]: https://www.york.ac.uk/library/info-for/researchers/data/planning/
 [University of Sheffield]: https://www.sheffield.ac.uk/library/rdm/dmp
 [University of Bath]: https://library.bath.ac.uk/research-data/data-management-plans/university-dmp-templates
 [University of Leeds]: https://library.leeds.ac.uk/info/14062/research_data_management/62/data_management_planning
 [University of Liverpool]: https://libcal.liverpool.ac.uk/event/3671658
 [University of Birmingham]: https://intranet.birmingham.ac.uk/as/libraryservices/library/research/rdm/data-management-plans.aspx
 [University of Bristol]: http://www.bristol.ac.uk/staff/researchers/data/
 [University of Exeter]: https://www.exeter.ac.uk/research/researchdatamanagement/before/plans/
 [University of Southampton]: https://library.soton.ac.uk/researchdata/phd

The University of Edinburgh, where I did my PhD, also has one, but
I didn't see it until writing this blog post, a week after handing
in my thesis.

Writing a DMP | The University of Edinburgh

At the end of the first year of my PhD I wrote a "Confirmation
Report". In other institutions I've heard them referred to as
"Upgrades". It's sort of a friendly examination that makes sure you
have a developed plan for what to do during the PhD, before it's
too late. You write a report that's part literature review and part
methodology proposal, then have a mini viva with some other
academics. I always felt like my confirmation report should have
required a data management plan, similar to how it required an
ethics assessment and a timeframe, but it didn't. We did have a
short presentation on data management during the "Research Planning
and Management" course in the first year of the PhD, which
consisted mostly of information on how to store data on the
University network. I would have liked to see more guidance on how
to manage and archive large volumes of data (TBs), both during and
after the PhD, to ensure that data is usable by others, and by your
future self.

For the plot census data, which amounts to only a couple of GB, I
have stored the data in three places:

-   University datastore, accessible by ssh and backed up regularly
by the University
-   A hard drive stored at my parents' house
-   A hard drive stored at my house

This conforms to the 3-2-1 backup rule, which recommends keeping at
least 3 copies of the data, on at least 2 different media types
(hard drive, network share), with at least 1 copy off-site (I
actually have two off-site locations: the University and my
parents' house). I also have "cleaned" versions of the plot census
data hosted on the SEOSAW database, which makes the data accessible
to other researchers under agreement. A few other projects have
already requested to use the data, which is very nice to see.

 [3-2-1 backup rule]: https://en.wikipedia.org/wiki/Backup#3-2-1_rule
 [SEOSAW database]: https://seosaw.github.io/

One thing I didn't keep good track of for a while was treating only
one copy as the 'primary' copy and using the others purely as
backups. At one point I was writing new data to both of my personal
hard drives and got mixed up about which was current. Since then, I
have kept one of the hard drives in a cupboard out of sight, to
deter me from writing data to it unless I want to do a backup. As
an aside, I use rsync to make backups. It's quick and efficient and
very rarely fails. I have plans to buy a NAS (Network Attached
Storage), the Synology DS420+ looks nice, but for now loose hard
drives will have to suffice.

 [rsync]: https://www.google.com/search?hl=en&q=rsync
 [Synology DS420+]: https://www.synology.com/en-us/products/DS420+
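As a sketch, an archive-style mirror to a backup drive looks
something like this; the paths are illustrative, not my actual
directory layout:

```shell
# Mirror the primary copy of the data to a backup drive.
#   -a        archive mode (recursive; preserves permissions,
#             modification times and symlinks)
#   -v -h     verbose output with human-readable sizes
#   --delete  remove files from the backup that were deleted from
#             the source, keeping the backup an exact mirror
# The trailing slash on the source means "the contents of" the
# directory, rather than the directory itself.
src="$HOME/phd-data"
dest="/mnt/backup-drive/phd-data"

# Dry run first (-n) to review what would change:
rsync -avhn --delete "$src/" "$dest/"

# Then run it for real, dropping -n:
rsync -avh --delete "$src/" "$dest/"
```

The --delete flag is what makes the cupboard hard drive a true
mirror rather than an ever-growing accumulation of old files, but
it also means a mistaken deletion in the source propagates to the
backup, hence the dry run first.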

The LiDAR data consists of raw .zfs files exported directly from
the scanner, databases built by Cyclone (Leica's proprietary LiDAR
processing software), PTX files output by Cyclone, and LAZ files
created by me, which compress the huge plain-text PTX files into a
more manageable binary format.

The key items to keep, in my opinion, are the raw .zfs files and
the PTX files, as they constitute the raw untouched data in open
formats, but the LAZ files are the ones I'll probably use most on a
day-to-day basis, simply because they are small enough that drive
I/O isn't a bottleneck for processing time.
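Part of why PTX files are so big is that the format is plain ASCII:
each scan opens with a short header (scan dimensions, scanner
position and registration, a 4x4 transform) followed by one text
line per grid point. As a rough illustration, here is a minimal
stdlib-only Python sketch that reads the two dimension lines to
report how many points a scan should contain; the 10-line header
layout is an assumption based on typical Cyclone exports, and
`ptx_point_count` is just a name I've made up:

```python
def ptx_point_count(path):
    """Return the number of points a PTX scan file claims to
    contain, read from the two scan-dimension lines that open the
    (assumed Cyclone-style) 10-line per-scan header: two dimension
    lines, scanner position, three axis vectors, and the four rows
    of a 4x4 transformation matrix."""
    with open(path) as f:
        dim_a = int(f.readline())  # columns (or rows)
        dim_b = int(f.readline())  # rows (or columns)
    # PTX is a gridded format with one line per grid cell (empty
    # cells included), so the expected point count is simply the
    # product of the two dimensions, whichever order they appear in.
    return dim_a * dim_b
```

A single 1 ha plot scan can easily run to tens of millions of such
lines, which is why the compressed binary LAZ format is so much
more practical day to day.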

I've got the LAZ files backed up in the same places as the plot
census data, and also in a DataShare repository, which gives them a
permanent DOI and makes them available for others to use. The scan
databases I don't think I will back up, because all of the
information in them is represented in some other file. The only
convenience of keeping them is that I would be able to quickly boot
up Cyclone and use its very good 3D rendering, but Cloud Compare is
enough for me most of the time. The PTX files I have backed up both
on my personal hard drives and on tape at the University, a service
which I think costs about £50 per pair of tapes, which is very
reasonable. This isn't perfect, as the tape backup isn't that
accessible, but the PTX files are just so big that it's difficult
to keep them anywhere else. As long as I have two sets of hard
drives, each stored in a different place, they should be safe.

 [DataShare repository]: https://datashare.ed.ac.uk/handle/10283/3997
 [Cloud Compare]: https://www.danielgm.net/cc/