TITLE: Comparing which simulated distribution is closest to the
truth
DATE: 2020-10-05
AUTHOR: John L. Godlee
====================================================================


I had an email from a colleague who had a load of univariate
distributions generated from different landscape evolution
simulations. They wanted to compare which distribution was closest
to the observed distribution in the real landscape.

To begin with, the colleague suggested using t-tests to see
whether two distributions were similar, but I think there is a
problem with using t-tests in that way. As I interpret it, a
t-test assesses whether two distribution means are appreciably
different, not whether they are similar. Anyway, this was my
discussion:

 So, you have many simulated elevation distributions, and you
want to know which distribution is closest to the true observed
distribution? In that case, I definitely take issue with using K-S
(Kolmogorov-Smirnov) tests or many one-sample t-tests, as they
just don’t fit the framing of your question. Both of these test
whether a simulated distribution is appreciably different from the
observed distribution in the original landscape (the K-S test
compares whole distributions, the t-test compares means), using a
p-value. Firstly, it’s not as simple as reversing the logic to
say: “if you have a non-significant p-value then they are
appreciably similar”. Secondly, you don’t want to know whether
each simulation is suitable (i.e. similar) or not, you want to
know which is most suitable. With the above tests all simulations
might come out as suitable (or, with a sample this large, even
trivial differences might come out as significant), but either way
you wouldn’t know which was the most suitable.
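 To see why, here’s a quick sketch (with made-up numbers, not the
real data): the same small difference in means flips from
non-significant to highly significant purely because the sample
gets bigger, so the p-value tracks sample size as much as
closeness:

   # Sketch: p-values track sample size as much as closeness
   set.seed(42)
   t.test(rnorm(100, 144.1, 5), mu = 144)$p.value  # likely non-significant
   t.test(rnorm(1e5, 144.1, 5), mu = 144)$p.value  # likely tiny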

 As an aside, before resigning yourself to non-parametric tests,
it might be worth seeing if a simple transformation of the data can
achieve normality, or at least normal-ish. Try log-transforming or
sqrt-ing. With a sample of 3000 points I don’t think having truly
normal data is such a big deal anyway, and there is a lot of
literature to support linear models being robust to violations of
normality.
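 For instance, a quick visual check (made-up right-skewed data) of
whether a log or square-root transform gets close enough to
normal:

   # Hypothetical right-skewed data standing in for raw elevations
   x <- rlnorm(3000, meanlog = 5, sdlog = 0.5)

   # Q-Q plots: the closer the points hug the line, the more normal
   par(mfrow = c(1, 3))
   qqnorm(x, main = "Raw"); qqline(x)
   qqnorm(log(x), main = "Log"); qqline(log(x))
   qqnorm(sqrt(x), main = "Square root"); qqline(sqrt(x))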

 Equivalence testing may be what you are looking for. Rather than
testing for a difference, it tests whether two distributions are
similar to within a margin you choose, but it won’t be able to
tell you which one is most similar to the truth.
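 As a minimal sketch of the idea (base R only, and the 1 m
equivalence margin is a made-up choice), the TOST (“two one-sided
tests”) procedure runs two one-sided t-tests against the edges of
the margin:

   # TOST equivalence sketch: the mean of x is "equivalent" to mu
   # if tests against both mu - delta and mu + delta reject
   tost_p <- function(x, mu, delta) {
     p_lower <- t.test(x, mu = mu - delta,
       alternative = "greater")$p.value
     p_upper <- t.test(x, mu = mu + delta,
       alternative = "less")$p.value
     max(p_lower, p_upper)  # overall p-value: both must reject
   }

   # e.g. is a simulated mean within 1 m of an observed mean of 144?
   tost_p(rnorm(3000, 144.2, 5), mu = 144, delta = 1)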

 My favourite approach however, having mulled this over a good
deal, is this:

 -   Transform distributions to achieve a normal-ish distribution.
 -   Do a linear regression of observed (original landscape,
dependent, y) against predicted (simulated, independent, x) values.
 -   Compare those regression models using AIC or some other
information criterion to find the regression which minimises
variance between predicted and observed values. Because the
observed values are the response in every model, their AIC values
are directly comparable. The simulated values used in the best
regression come from the “best” simulation. You could also report
R^2 values, which should be highest in the model with the lowest
AIC. Note it is generally accepted that if two models are within 2
AIC points of each other, you cannot say that one is better than
the other, so you may end up with several simulations that are
equally suitable.

 I’ve included an R script which does something similar with a
bunch of fake, normal-ish data.

 For data visualisation, I would:

 -   Transform distributions to achieve a normal-ish distribution.
 -   Make an interval plot or a boxplot showing the mean and
standard deviation of each simulated distribution compared to the
true mean.

 My attempt is also in the R script.

 Thanks for the interesting problem, let me know what you end up
using, John

Here is the R script I referenced:

   # Which elevation distribution is closest to the truth?
   # John Godlee ([email protected])
   # 2020-10-05

   set.seed(20201005)

   # Packages
   library(dplyr)
   library(tidyr)
   library(ggplot2)

   # Create data
   distrib_list <- list(
     obs = rnorm(1000, 144, 5),
     el1 = rnorm(1000, 110, 5),
     el2 = rnorm(1000, 120, 5),
     el3 = rnorm(1000, 130, 5),
     el4 = rnorm(1000, 140, 5),
     el5 = rnorm(1000, 150, 5),
     el6 = rnorm(1000, 160, 5)
   )

   ##' 6 simulated distributions and 1 observed distribution.
   ##' Each has:
   ##'   1000 points,
   ##'   a standard deviation of 5,
   ##'   a varying mean

   # Create density plot
   do.call(cbind, distrib_list) %>%
     as.data.frame(.) %>%
     gather(., key, value) %>%
     ggplot(.) +
       geom_line(stat = "density",
         aes(x = value, colour = key))

   ##' Expect that `el4` is most similar to `obs`

   # Run linear regressions of observed (y) against each simulation
   # (x), so the response is the same in every model
   mod_list <- lapply(distrib_list[-1], function(x) {
     lm(distrib_list[[1]] ~ x)
   })

   # Extract AIC, BIC, R^2 from each model
   mod_df <- data.frame(name = names(distrib_list[-1]))
   mod_df$aic <- sapply(mod_list, AIC)
   mod_df$bic <- sapply(mod_list, BIC)
   mod_df$rsq <- sapply(mod_list, function(x) {
     summary(x)$r.squared })

   ##' There are loads of other metrics to choose from

   # Look at the extracted metrics
   mod_df
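
   ##' As an illustrative extra (not in the original script), the
   ##' K-S statistic can double as a simple distance between each
   ##' simulation and the observed distribution:
   mod_df$ks <- sapply(distrib_list[-1], function(x) {
     unname(ks.test(x, distrib_list[[1]])$statistic) })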

   # Visualise data: boxplot of each distribution, with its mean
   # overlaid as a point
   do.call(cbind, distrib_list) %>%
     as.data.frame(.) %>%
     gather(., key, value) %>%
     mutate(sim = grepl("el", key)) %>%
     ggplot(., aes(x = key, y = value)) +
       geom_boxplot(aes(colour = sim), fill = NA) +
       stat_summary(aes(fill = sim),
         fun = mean, geom = "point", shape = 21, size = 5,
         colour = "black") +
       theme(legend.position = "none")