TITLE: Questions about operational IT for research
DATE: 2020-06-25
AUTHOR: John L. Godlee
====================================================================


I have a couple of open questions, similar to previous questions
posed in my lab group about how folk set up their R environments. I
think that discussions like this are a good way of developing a
sense of collegiality in academic groups. Discussion of our
specific research is often stymied by the feeling that it has to be
perfect before we talk to our colleagues about it, but discussing
operational topics like data management and data analysis is an
effective way of sharing experience and making all our lives
easier. Once the boring day-to-day topics are taken care of as
efficiently as possible, the hard work of research becomes slightly
more pleasurable.

First question: How does one manage storage for large files,
rasters and the like? I currently download large spatial data to my
local machine for analysis, but my laptop periodically runs out of
hard disk space and I have to delete various layers. Then,
inevitably, I need one of those files again and have to work out
where I got it from, or I go to re-run an old analysis and find
that I carelessly deleted an important raw data file.

I've tried keeping files on Google Drive, but this is a pain because
the large files choke up syncing on my domestic internet
connection. I've tried keeping files on my university datastore,
but the upload/download speed when not on the University network is
very frustrating. At the moment I keep large files on a networked
home server, but there are two major drawbacks to this approach:
firstly, if I ever decide to work away from home I will no longer
have access to those files; and secondly, I do not have enough hard
drive space for redundancy, so if my spinning disk hard drives
fail, that's it.

As a side note on the question above and the R environment
question, I have become concerned about how much of the burden of
IT infrastructure my university pushes onto employees and students.
The general consensus among my lab group on the R environment
question was that the University-managed R environment, as
installed through the 'Application Catalog', is unusable for real
research because of an issue with managing packages. One lab group
member said that when they raised this with IT, they were advised
simply not to use the University R environment. Surely this is a
service which should be provided to everyone at the University
without question?! Another story is from a lab group who decided
that it was easier to buy their own high-spec image rendering
desktop machine than to deal with the University's poorly managed
cluster computing setup. Finally, there are all the PhD students in
my office who choose to use their own laptops, keyboards and mice,
presumably paid for themselves, rather than the terrible
network-managed all-in-one desktop PCs and low-end chiclet
keyboards. My university desktop PC was pushed to the back of my
desk after about two weeks, in favour of my own laptop and an
external display.

Second question: How does one create a truly reproducible QGIS
workflow, one which keeps a record of the intermediary objects
created, the processes which created them, and the inputs provided?

I was recently clipping, diffing and dissolving a bunch of
different spatial vector files to create a data-informed study
region, which will define the spatial bounds of part of my research.
Normally I do these things in R, but this time I needed to do a
fair amount of manual clicking, so I opted for QGIS. Looking back,
if I had considered each operation more carefully I probably could
have got away with no manual clicking at all, but I was short on
time and willpower. What I would really like is to be able to
export a script, maybe in Python because I know QGIS already
interacts well with Python, which records exactly what I did, right
down to the manual clicks used to create free-hand polygons, so
that everything I made by hand could be reproduced.
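
As a stop-gap, the nearest I have got is rebuilding the workflow as
a script after the fact, using the QGIS processing framework. A
rough sketch of what I mean, assuming it is run from the QGIS
Python console (where the processing module is available); the file
paths are placeholders rather than my actual layers, and the three
native:* algorithms stand in for whatever operations were really
used:

    # Run from the QGIS Python console, where the processing framework is loaded
    # (the layers/ and out/ directories are placeholders and must already exist)
    import processing

    # Clip the input vectors to a broad area of interest
    clipped = processing.run("native:clip", {
        "INPUT": "layers/land_cover.gpkg",
        "OVERLAY": "layers/area_of_interest.gpkg",
        "OUTPUT": "out/land_cover_clip.gpkg",
    })["OUTPUT"]

    # Remove areas covered by an exclusion layer (the 'diffing' step)
    diffed = processing.run("native:difference", {
        "INPUT": clipped,
        "OVERLAY": "layers/exclusions.gpkg",
        "OUTPUT": "out/land_cover_diff.gpkg",
    })["OUTPUT"]

    # Dissolve what remains into a single study region polygon
    processing.run("native:dissolve", {
        "INPUT": diffed,
        "FIELD": [],
        "OUTPUT": "out/study_region.gpkg",
    })

Writing each intermediate to an explicit file at least keeps the
record of inputs, processes and intermediary objects that I am
after; the free-hand digitising is the part this still cannot
capture.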