


Matilda v1.0: An R package for probabilistic climate projections using a reduced complexity climate model [1]

Authors: Joseph K. Brown, Leeya Pressburger, Abigail Snyder, Kalyn Dorheim, and Steven J. Smith (Joint Global Change Research Institute, Pacific Northwest National Laboratory, College Park, MD, United States of America)

Date: 2024-06

Here we introduce the software functions and basic workflow of the package (Fig 1). We use applied examples to show package functionality by assessing climate change projections for the SSP2-4.5 emissions scenario (i.e., the "middle of the road" SSP with a year-2100 radiative forcing level of 4.5 W/m²), providing step-by-step code and explaining the significance of each function.

Fig 1. Diagram detailing the Matilda workflow to compute probabilistic projections. Dotted lines indicate opportunities for the user to define their own program specification. The dashed line in step 3 indicates the ability of the user to evaluate ensemble members repeatedly with different scoring criteria.

Matilda functions are fully integrated with Hector's R interface, and therefore when Matilda is installed and loaded, the hector package ( https://jgcri.github.io/hector/ ) is also loaded. Matilda requires the use of Hector V3.0 or newer [22].

Full descriptions of package functions can be accessed in the package’s help documentation. Furthermore, detailed documentation and vignettes are available from our GitHub repository ( github.com/jgcri/matilda ).
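For readers following along, a minimal sketch of installation and loading is shown below. Installing from the GitHub repository via the remotes package is an assumption based on the repository named above, not an instruction from the package documentation.

# Install Matilda from its GitHub repository (install route assumed; remotes must be available)
# install.packages("remotes")
remotes::install_github("jgcri/matilda")

library(matilda)   # loading Matilda also loads the hector package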

An analysis in Matilda begins by setting up a Hector model instance, termed a “core”. A Hector core is an object that contains information about model inputs, current state, and outputs for a specific Hector run. The information contained in an initiated core comes from an INI file holding metadata, emissions scenarios, and model parameters needed to run Hector.
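The following sketch illustrates this step, assuming the SSP2-4.5 INI file is distributed with the hector package under its input/ directory (the file name hector_ssp245.ini is an assumption here):

# Locate the SSP2-4.5 INI file shipped with hector (path assumed)
ini_file <- system.file("input/hector_ssp245.ini", package = "hector")

# Initiate a Hector core from the INI file
core <- newcore(ini_file, name = "SSP2-4.5")
core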

The basis of running Hector in a probabilistic setup relies on establishing a set of parameter configurations that are used to run the model iteratively. Matilda uses parameter information gathered from the literature to inform prior distributions (Table 1). To build parameter sets, we draw parameter values from their prior distributions using Monte Carlo sampling. Each parameter is sampled from its marginal prior distribution independently of the rest of the parameters. The prior distribution for each parameter θ_i is defined using mean and standard deviation estimates as in:

θ_i ~ N(μ, σ)     (1)

where θ_i is a given Hector parameter and N(μ, σ) is the normal distribution of parameter θ_i using hyperparameters μ (mean) and σ (standard deviation). Some parameters have marginal distributions that are best represented by a lognormal distribution; in such cases, N(μ, σ) is replaced by a lognormal distribution with hyperparameters specified on the log scale, log(μ, σ) (Table 1). Using informed prior marginal distributions from the literature as a starting point for building perturbed parameter sets enables the exploration of a range of possible parameter values from the multi-dimensional parameter space judged viable on the basis of existing knowledge [13, 25]. Once the draws for each parameter are performed, the full parameter vectors are combined using a uniform multivariate distribution. This process assumes independence of parameters, meaning for example that any value from the univariate draws of parameter θ_1 is equally likely to be paired with any value from the univariate draws of parameter θ_2…n. This parameter estimation process ultimately establishes parameter sets that account for parameter uncertainty and can be used to build an ensemble of model runs that will be evaluated against observational evidence for some salient output of those runs. In other words, we use prior information about individual parameters to build parameter sets, remaining agnostic about which sets will result in the most skilled model results until confronting Hector's output with observed data.

Table 1. Hector parameters used to generate parameter sets in this work. The distributions are indicated as mean ± standard deviation. References from which the distributions are derived are included.

In Matilda, we build parameter sets by calling generate_params(). Parameter distributions are independent of the SSP scenario; however, to run this function the user must still provide an established Hector core. Additionally, the user must specify the number of parameter sets desired (draws). Using generate_params() will produce randomized draws each time it is run; therefore, the user should either save the resulting data frame or use set.seed() if replication of parameter sets is critical to the analysis. In this example we use our previously established core to produce a set of 25 parameter configurations and display a subset of samples from the result:
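A sketch of the call described above; the argument names core and draws are assumed from context, and set.seed() is included only to make the illustrative draws reproducible.

set.seed(123)                                   # optional: make the random draws reproducible
param_sets <- generate_params(core = core, draws = 25)
head(param_sets)                                # display a subset of the sampled parameter sets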

Parameters can be easily added or omitted from the new parameter set data frame. For example, to run the model with a subset of parameters, undesired columns can be omitted from the data frame. This will result in a data frame that only contains parameter distributions that the user wishes to perturb. Similarly, the user can characterize additional parameter distributions and add them as new columns to the data frame, as long as the parameter is described in Hector.
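A brief sketch of both operations; the column name NPP_FLUX0 and the added parameter (VOLCANIC_SCALE, drawn here from a normal distribution) are illustrative assumptions, not recommendations.

# Omit a parameter the user does not wish to perturb (column name assumed)
param_sets$NPP_FLUX0 <- NULL

# Characterize and add a new parameter distribution as a column
# (parameter choice and hyperparameters are illustrative only)
param_sets$VOLCANIC_SCALE <- rnorm(nrow(param_sets), mean = 1, sd = 0.25)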

Once established, the parameter sets are used as inputs for independent Hector model runs. Thus, each model run represents a multivariate parameter combination as follows:

m_i = Hector(θ_1,i, θ_2,i, …, θ_n,i)     (2)

where m_i is an individual ensemble member and θ_1,i, …, θ_n,i are the parameters sampled to build an independent configuration. Using different parameter sets to run Hector allows us to build PPEs and determine how different parameter combinations from a presumably viable parameter space interact to affect climate variable projections. This method effectively propagates parameter uncertainty to model ensemble uncertainty, a process described as forward uncertainty propagation [32].

We run Hector for each of the parameter sets by calling iterate_model(), which runs the model once per parameter set and combines the results into a data frame object representing the new PPE. To run iterate_model(), we use the same core object as in previous steps and also supply the object where the parameter sets are stored:
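A sketch of the basic call, assuming iterate_model() accepts the core and the parameter data frame as arguments named core and params:

# Run Hector once per parameter set and combine the results into a PPE
results <- iterate_model(core = core, params = param_sets)
head(results)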

The resulting data frame contains 25 separate runs, as indicated by the run_number column; in this case, the total number of rows is 55,600 (25 runs × 4 output variables × 556 years). Each run includes values for the major climate variables of Hector's default output (CO2 concentration, total radiative forcing, CO2 forcing, and global mean air temperature) for the years 1745–2300 (the time range defined by the SSP INI file we chose above).

While a core object and a data frame of parameter sets are the only required arguments to run iterate_model(), additional arguments can be supplied to reduce the variables and year range returned for each run using save_vars and save_years, respectively. This reduces the size of the data stored in memory, which may be important when running the model to build large ensembles (e.g., 15,000 members as in [13]). Any output variable from Hector can be returned using save_vars, for any year range subset from 1745–2300. In the following example, we supply these arguments to return values only for CO2 concentration and global mean surface temperature anomaly for the year range 1745–2100:
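A sketch under the assumptions that save_vars accepts Hector variable name strings (supplied here through hector's CONCENTRATIONS_CO2() and GLOBAL_TAS() identifier functions) and save_years accepts a numeric vector of years:

# Return only CO2 concentration and global mean surface temperature for 1745-2100
results_subset <- iterate_model(core = core,
                                params = param_sets,
                                save_vars = c(CONCENTRATIONS_CO2(), GLOBAL_TAS()),
                                save_years = 1745:2100)
head(results_subset)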

2.6 Model evaluation approach and scoring model runs

Evaluating ensemble members is important in climate model assessment because we do not know a priori which parameter sets yield realistic simulations. Evaluation procedures allow us to gain insight into the uncertainty of model outputs, thereby enhancing the fidelity of results. The concept of weighting ensemble members is intuitive; members that are more skilled (i.e., agree better with the historical record) should receive a higher weight than members that are less skilled (i.e., present larger deviations from the historical record). Ensemble members closely aligned with historical climate data will contribute more information to our probabilistic projections than members with outputs deviating significantly from the historical record. Thus, the uncertainty from varying the model parameters a priori is "filtered" through an evaluation of the parameters' performance against observations. Model weights are analogous to the likelihood factor in Bayes' theorem (defined in Sec. 2.6.1) and could potentially be used to update prior parameter distributions to approximate the posterior parameter distributions. While this is a capability of Matilda, we do not discuss it at length in this paper. It is also important to note the limitations of this approach. While the weighting approaches presented here can be effective for updating marginal distributions of parameters independently, they may not fully capture the complex dependencies present in multidimensional distributions. This limitation becomes noticeable as the dimensionality of the parameter space increases. In these cases, formal Bayesian approaches (e.g., Markov chain Monte Carlo) are better suited for exploring a multidimensional parameter space.

Scoring PPE members in Matilda is conducted using score_runs() which requires (1) a results data frame, (2) a scoring criterion, and (3) a scoring function/algorithm. The results data frame typically comes from calling iterate_model(), as above.

Scoring criteria define information used to compare ensemble members against observational data. A scoring criterion can be built by the user by calling new_criterion() and simply requires the climate variable to be used in the comparison, the years of comparison, and observed data values for the years specified. For example, a new criterion can be created based on global mean surface temperature from a dataset containing observed warming values from 1990–2023:


temp_data <- read.csv("example_temperature_data.csv")
head(temp_data)
##   year anomaly_C
## 1 1990 0.3605625
## 2 1991 0.3388965
## 3 1992 0.1248968
## 4 1993 0.1656585
## 5 1994 0.2335498
## 6 1995 0.3768662

user_temp_criterion <- new_criterion(var = "gmst",
                                     years = temp_data$year,
                                     obs_values = temp_data$anomaly_C)
print(user_temp_criterion)
## Criterion for screening Hector: gmst 1990 to 2023

This defines a custom criterion: a time series of 34 annual values (1990–2023) that will be compared against Hector's "gmst" (global mean surface temperature anomaly) output variable.

The Matilda package has internally available scoring criteria for easy use, including criterion_co2_obs() and criterion_gmst_obs(). Data contained in criterion_co2_obs() is pulled from the Mauna Loa record of observed annual mean atmospheric CO2 concentration [33], while criterion_gmst_obs() uses observed annual mean global surface temperature anomaly data retrieved from the HadCRUT5 data set [34].

Scoring functions in Matilda apply different mathematical algorithms to compute model weights based on the results and scoring criterion. We provide multiple mechanisms to weight model outputs against observations, and users can define their own custom functions as well. There are currently two internally available scoring functions, score_bayesian() and score_ramp(), which differ in functionality and computational complexity.
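As a minimal sketch, a PPE produced by iterate_model() can be weighted against the built-in GMST criterion with score_ramp(); passing the w1/w2 ramp thresholds through score_runs(), and the threshold values themselves, are assumptions for illustration only.

# Weight each ensemble member against observed GMST with the ramp scoring function
# (w1/w2 are the ramp's lower/upper difference thresholds; values here are illustrative)
scores <- score_runs(results, criterion_gmst_obs(), score_ramp, w1 = 0.1, w2 = 0.5)
head(scores)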

[END]
---
[1] Url: https://journals.plos.org/climate/article?id=10.1371/journal.pclm.0000295

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
