(C) PLOS One
This story was originally published by PLOS One and is unaltered.



Geometric analysis enables biological insight from complex non-identifiable models using simple surrogates [1]

Alexander P. Browning (School of Mathematical Sciences, Queensland University of Technology, Brisbane; QUT Centre for Data Science; Mathematical Institute, University of Oxford, Oxford, United Kingdom), Matthew J. Simpson

Date: 2023-04

Abstract An enduring challenge in computational biology is to balance data quality and quantity with model complexity. Tools such as identifiability analysis and information criteria have been developed to harmonise this juxtaposition, yet cannot always resolve the mismatch between available data and the granularity required in mathematical models to answer important biological questions. Often, it is only simple phenomenological models, such as the logistic and Gompertz growth models, that are identifiable from standard experimental measurements. To draw insights from complex, non-identifiable models that incorporate key biological mechanisms of interest, we study the geometry of a map in parameter space from the complex model to a simple, identifiable, surrogate model. By studying how non-identifiable parameters in the complex model quantitatively relate to identifiable parameters in the surrogate, we introduce and exploit a layer of interpretation between the set of non-identifiable parameters and the goodness-of-fit metric or likelihood studied in typical identifiability analysis. We demonstrate our approach by analysing a hierarchy of mathematical models for multicellular tumour spheroid growth experiments. Typical data from tumour spheroid experiments are limited and noisy, and corresponding mathematical models are very often made arbitrarily complex. Our geometric approach is able to predict non-identifiabilities, classify non-identifiable parameter spaces into identifiable parameter combinations that relate to features in the data characterised by parameters in a surrogate model, and overall provide additional biological insight from complex non-identifiable models.

Author summary Mathematical models play important roles in the interpretation of biological data. These models can be made arbitrarily complex, meaning issues related to parameter identifiability are relatively common. However, complex models with non-identifiable parameters can be useful to provide insight into the biological questions of interest, since they contain parameters of direct biological interest. In contrast, simpler identifiable models lack biological granularity and comprise parameters that relate indirectly to the underlying biology through data features. In this work, we study the interrelationship between the non-identifiable parameters in a complex model and the identifiable parameters in a simple surrogate model. We aim to resolve the mismatch between model and data complexity by utilising the simple surrogate model to provide insight in cases where the parameters of interest cannot be determined from the available data. We demonstrate our approach by analysing mathematical models of multicellular tumour spheroid growth, an experimental model of cancerous tumour growth. Using the most fundamental and commonly reported measurements, we predict non-identifiabilities arising from different data collection regimes, and draw additional insight from complex models with non-identifiable parameters.

Citation: Browning AP, Simpson MJ (2023) Geometric analysis enables biological insight from complex non-identifiable models using simple surrogates. PLoS Comput Biol 19(1): e1010844. https://doi.org/10.1371/journal.pcbi.1010844

Editor: Nicholas Mancuso, University of Southern California, UNITED STATES

Received: August 7, 2022; Accepted: December 26, 2022; Published: January 20, 2023

Copyright: © 2023 Browning, Simpson. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Code used to produce the results is available on GitHub at https://github.com/ap-browning/spheroid_geometry.

Funding: This work is funded by the Australian Research Council (https://www.arc.gov.au/) through the Discovery Project (DP200100177) awarded to MJS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction Mathematical models play an important role in the interpretation of data and the design of experiments. The complexity of many experiments and biological systems means that parameters relating to key biological mechanisms cannot be directly measured, but are rather quantified through the calibration of mechanistic mathematical models to experimental observations [1, 2]. Given that biological data are often limited and noisy, model parameters provide an objective means of quantifying observations and comparing behaviours across different types of experiments or different conditions within the same experiments [3, 4]. Minimising, or at least quantifying, parameter uncertainties is, therefore, of paramount importance for effective interpretation of experimental results. A critical step in the application of mathematical models to interpret biological experiments is that of model selection [5–7]. Complex models, traditionally associated with a large number of unknown parameters, have potential to provide insights about a correspondingly large number of biological mechanisms, but often result in large parameter uncertainties when calibrated to typical experimental data [8–10]. Conversely, simpler models, including canonical models such as the logistic and Gompertz growth models, typically involve parameters that can be tightly constrained by data, but provide limited direct mechanistic insight [11]. In practice, model selection is routinely guided by information criteria: statistical metrics that quantify model parsimony, the trade-off between model fit and model complexity [7, 12]. One of many criteria used is the Akaike information criterion (AIC), given by

$$\mathrm{AIC} = 2k - 2\ell(\hat{p}). \tag{1}$$

Here, $\hat{p}$ is the maximum likelihood estimate (MLE), the k-dimensional parameter vector, p, that produces the best model fit, and $\ell(\hat{p})$ is the maximum log-likelihood, a measure of goodness-of-fit.
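As a minimal sketch of the AIC comparison described above, the snippet below assumes the standard form AIC = 2k − 2ℓ(p̂) and independent normal measurement noise with a known standard deviation. The radius measurements and the two sets of fitted predictions are hypothetical values chosen for illustration only; they are not data from the study.

```python
import math

def gaussian_log_likelihood(observed, predicted, sigma):
    """Log-likelihood for independent normal errors with known standard deviation sigma."""
    n = len(observed)
    sse = sum((y - m) ** 2 for y, m in zip(observed, predicted))
    return -0.5 * n * math.log(2.0 * math.pi * sigma ** 2) - sse / (2.0 * sigma ** 2)

def aic(k, max_log_likelihood):
    """Akaike information criterion for a model with k fitted parameters."""
    return 2.0 * k - 2.0 * max_log_likelihood

# Hypothetical radius measurements (micrometres) and fitted predictions of two
# candidate models; these numbers are illustrative assumptions only.
observed = [102.0, 148.0, 205.0, 251.0, 279.0]
pred_simple = [100.0, 150.0, 200.0, 250.0, 280.0]   # simple model, k = 2
pred_complex = [101.0, 149.0, 204.0, 251.0, 279.0]  # complex model, k = 8
sigma = 20.0

aic_simple = aic(2, gaussian_log_likelihood(observed, pred_simple, sigma))
aic_complex = aic(8, gaussian_log_likelihood(observed, pred_complex, sigma))
print(aic_simple < aic_complex)  # the simpler model has the smaller AIC here
```

With noise this large relative to the residuals, the slightly better fit of the eight-parameter model does not offset its penalty of six extra parameters, which mirrors the situation the text describes.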
In essence, AIC and other information criteria penalise complex models that produce only marginally better goodness-of-fit than simpler models. Typically, AIC is computed for a range of candidate models that are ranked such that the model with the smallest AIC is the most favourable. To demonstrate, we consider the growth of multicellular tumour spheroids (Fig 1A), a complex, spatially heterogeneous biological system where often only simple measurements, related to the overall size or radius of spheroids, are available throughout an experiment. We generate synthetic radius measurements from a mathematical model of intermediate complexity (the Greenspan model with k = 4 parameters) that was recently validated against experimental data for the first time [13, 14]. We corrupt measurements with normally distributed measurement noise with standard deviation σ and attempt to distinguish between a range of spheroid growth models, with complexity ranging from the logistic growth model (k = 2) [15] to the complex multiphase spatial model of Ward and King (k = 8) [16, 17]. In Fig 1B we set σ = 20 μm and in Fig 1C we vary σ. Once calibrated, all models lead to predictions of the spheroid radius that are visually indistinguishable (Fig 1B), and all except for Ward and King’s model are indistinguishable using AIC for a sufficiently large, and biologically realistic, noise standard deviation (Fig 1C). Full mathematical details of all models are given in Models.

Fig 1. Mathematical models of tumour spheroid growth. (A) Microscopy images from tumour spheroid growth experiments. Spheroids are grown from WM983b cells (a human melanoma cell line) [18], harvested, and imaged using confocal microscopy at various time points. Cells are transduced with fluorescent cell cycle indicators, showing cells in gap 1 (purple) and gap 2 (green). From day 7, a necrotic core void of living cells is evident in the spheroid centre. (B) Synthetic spheroid data generated from Greenspan’s model [13] (black discs) with additive normal noise with standard deviation σ = 20 μm (red diamonds). (B–C) Several mathematical models, including the Greenspan model, are able to match synthetic data. (C) AIC results for the model fitting exercise in (B) repeated over several values of the noise standard deviation. Shown is the mean and standard deviation from 100 repeats for each model. (D) Spectrum of the observed Fisher information matrix. Eigenvalues are shown on the log-scale and scaled such that the spectral radius is unity. https://doi.org/10.1371/journal.pcbi.1010844.g001

Aside from being unable to distinguish between models in the tumour spheroid example, criterion-based choices cannot account for the biological question (or, more specifically, the biological mechanisms) of interest. For example, the logistic growth and Gompertz growth models produce an excellent match to synthetic tumour spheroid data and quantify behaviour in terms of a growth rate parameter and long-time limiting spheroid size. However, these models cannot provide information relating to the mechanisms that govern growth or determine the long-time limiting spheroid size; mechanisms such as sensitivity to and availability of oxygen and other essential nutrients. More recently, the mathematical modelling literature has moved toward tools such as parameter identifiability analysis to guide model selection [19–21].
Identifiability analysis can determine if model parameters are identifiable and can be estimated from data; both in a theoretical noise-free data limit (structural identifiability) [22–25], and in the more realistic case of a finite amount of noisy data (practical identifiability) [19, 26]. In comparison to model selection criteria like AIC, identifiability analysis provides information about the identifiability of individual model parameters. While a complex model may have a large number of non-identifiable parameters and a high AIC value, it may still prove useful provided the parameters of interest (for example, the oxygen sensitivity) are identifiable. In the vicinity of the MLE, the identifiability of model parameters can be assessed using the local curvature of the expected log-likelihood function, also known as the Fisher information matrix (FIM). The FIM is a k × k positive semi-definite matrix that quantifies the amount of information about the parameters contained in the data, and has both a statistical and geometric interpretation. Statistically, the inverse of the FIM provides a lower bound on the covariance of parameter estimates. Therefore, a FIM that is singular corresponds to at least one model parameter that can only be estimated with infinite variance and, therefore, cannot be determined from data. Geometrically, the FIM is related to the Hessian of the log-likelihood function and therefore contains information about the directions in parameter space in which the log-likelihood (and therefore the model) is sensitive and directions in which the log-likelihood is insensitive [27]. Specifically, the eigenvalues of the FIM correspond to the curvature in the direction of the corresponding eigenvectors; eigenvectors associated with zero or near-zero eigenvalues correspond to directions in parameter space (also referred to as eigenparameters) to which the model output is insensitive [28, 29].
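The link between a singular FIM and non-identifiability can be illustrated with a deliberately simple toy model, y(t) = a·b·t, in which only the product a·b affects the output. This toy model, its parameter values, and the assumption of independent normal noise with known σ (so that the FIM is JᵀJ/σ², where J is the Jacobian of the model output with respect to the parameters) are illustrative choices and are not taken from the study.

```python
import math

def fisher_information(jacobian, sigma):
    """FIM for independent normal noise with known sigma: J^T J / sigma^2."""
    k = len(jacobian[0])
    return [[sum(row[i] * row[j] for row in jacobian) / sigma ** 2
             for j in range(k)] for i in range(k)]

def eig_sym_2x2(m):
    """Eigenvalues (ascending order) of a symmetric 2x2 matrix."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    return mean - radius, mean + radius

# Toy model y(t) = a * b * t (an assumption for illustration only).
a, b, sigma = 2.0, 3.0, 1.0
times = [1.0, 2.0, 3.0]

# Each Jacobian row is [dy/da, dy/db] at one observation time.
jac = [[b * t, a * t] for t in times]
fim = fisher_information(jac, sigma)
smallest, largest = eig_sym_2x2(fim)

# Scale the spectrum so that the largest eigenvalue (the spectral radius) is one.
scaled_spectrum = [smallest / largest, 1.0]

# Moving along (a, -b) leaves the product a*b unchanged to first order,
# so the FIM maps this direction to (approximately) zero.
flat_direction = [a, -b]
fim_times_direction = [sum(fim[i][j] * flat_direction[j] for j in range(2))
                       for i in range(2)]
print(scaled_spectrum, fim_times_direction)
```

The zero eigenvalue corresponds to the direction in which a and b can be traded off without changing the model output, which is the sense in which an eigenvector with a zero or near-zero eigenvalue marks an insensitive direction.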
Conversely, eigenvectors associated with relatively large eigenvalues give informative directions; the directions to which the model is most sensitive. So-called analysis of model sloppiness is concerned with studying the spectrum of the FIM to determine the number of sloppy, or insensitive, eigenparameters in a model [8, 30–32]. To demonstrate, in Fig 1D we show the spectrum of the FIM for each tumour spheroid model. As the relative difference between eigenvalues is scale-dependent, it is difficult to interpret results from the two-parameter models. However, results for the Greenspan and Ward and King models show two disparate clusters of eigenvalues, indicating a group of informative directions (corresponding to eigenvalues that are relatively large), and a group of uninformative or sloppy directions (corresponding to eigenvalues that are closer to zero). For the Greenspan model, the single insensitive direction identified from analysis of model sloppiness corresponds to a one-dimensional manifold (i.e., a curve) in parameter space along which the parameters cannot be identified. At the core of identifiability and sloppiness analysis is the observation that data are unable to constrain the model parameter space to a point estimate, but rather to a one- or higher-dimensional manifold [33]. Of practical application, analysis of this manifold allows for model reduction, where the number of parameters in a model can be reduced by pre-constraining or removing sloppy eigenparameters without significantly reducing the predictive power of a model [34, 35]. However, to date, analysis of the interrelationship between models using the model parameter manifold has been constrained to simpler models nested within a complex model; that is, where simpler models can be recovered by placing constraints on the parameters in the complex model, for example by setting certain parameters to zero.
Examples of nested models include recovering the logistic growth model from the Fisher-Kolmogorov model by assuming the population is well mixed [36], and recovering the Gompertz or logistic growth models from the Richards growth model by constraining the shape parameter [21]. Moreover, the FIM is based on the expected log-likelihood, a one-dimensional measure of overall model fit that determines manifolds in parameter space to which parameters are constrained by data or to which the model output is insensitive. FIM-based tools cannot, therefore, provide information about how features of the model output change with parameters. Our contribution is to study models with non-identifiable parameters using identifiable models that produce quantitatively similar behaviour; models that may be indistinguishable from information-criterion-based analysis. To study the interrelationship between parameters in any two models (nested or non-nested), we define model equivalence in the least-squares sense, and study the associated map from the parameters in a complicated, possibly heavily-parameterised and non-identifiable model, to parameters in a simpler, identifiable model (Fig 2B). For example, we study identifiability of mechanistic ordinary and partial differential equation (ODE and PDE) models of tumour spheroid growth (relatively complicated models containing parameters quantifying nutrient sensitivities, oxygen diffusion, and oxygen consumption) through simple models like the well known logistic and Gompertz growth models that do not explicitly incorporate biophysical mechanisms that influence growth, but rather describe behaviour with largely phenomenological parameters such as the early-time growth rate and long-time limiting spheroid size.
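The least-squares notion of model equivalence can be sketched as follows: given the output of a more complex model, find the surrogate parameters whose output is closest in the sum-of-squares sense. The sketch below uses a Richards-type curve as a stand-in for the complex model and the logistic curve as the surrogate. The particular Richards parameterisation, the fixed initial radius, the time grid, and the damped Gauss-Newton solver are all assumptions made for illustration; they are not claimed to match the models or numerical methods used in the study.

```python
import math

R0 = 10.0  # assumed initial radius, shared by both curves

def logistic(t, lam, K):
    """Logistic curve with growth rate lam and limiting size K."""
    return K / (1.0 + (K / R0 - 1.0) * math.exp(-lam * t))

def richards(t, lam, K, beta):
    """Richards-type curve (assumed form); beta = 1 recovers the logistic curve."""
    return K / (1.0 + ((K / R0) ** beta - 1.0) * math.exp(-lam * t)) ** (1.0 / beta)

def sse(p, times, target):
    """Sum of squared differences between the logistic curve and the target output."""
    return sum((logistic(t, p[0], p[1]) - y) ** 2 for t, y in zip(times, target))

def model_map(target, times, start, iterations=50):
    """Least-squares logistic parameters (lam, K) via damped Gauss-Newton."""
    p = list(start)
    best = sse(p, times, target)
    for _ in range(iterations):
        base = [logistic(t, p[0], p[1]) for t in times]
        resid = [y - m for y, m in zip(target, base)]
        cols = []  # forward-difference Jacobian, one column per parameter
        for i in range(2):
            h = 1e-6 * max(1.0, abs(p[i]))
            q = list(p)
            q[i] += h
            cols.append([(logistic(t, q[0], q[1]) - m) / h
                         for t, m in zip(times, base)])
        # Normal equations (J^T J) d = J^T r, solved directly for the 2x2 case.
        a = sum(x * x for x in cols[0])
        b = sum(x * y for x, y in zip(cols[0], cols[1]))
        c = sum(y * y for y in cols[1])
        g0 = sum(x * r for x, r in zip(cols[0], resid))
        g1 = sum(y * r for y, r in zip(cols[1], resid))
        det = a * c - b * b
        if det <= 0.0:
            break
        d = ((c * g0 - b * g1) / det, (a * g1 - b * g0) / det)
        step, improved = 1.0, False
        for _ in range(30):  # halve the step until the fit improves
            q = [p[0] + step * d[0], p[1] + step * d[1]]
            if q[0] > 0.0 and q[1] > 0.0 and sse(q, times, target) < best:
                p, best, improved = q, sse(q, times, target), True
                break
            step *= 0.5
        if not improved:
            break
    return p, best

times = [float(i) for i in range(21)]

def surrogate_parameters(lam, K, beta):
    """Map Richards parameters to logistic parameters; also return start and final SSE."""
    target = [richards(t, lam, K, beta) for t in times]
    # Starting guess from data features: early-time log-slope and final size.
    start = [math.log(target[1] / target[0]) / (times[1] - times[0]), target[-1]]
    p, final = model_map(target, times, start)
    return p, final, sse(start, times, target)

nested, nested_sse, _ = surrogate_parameters(1.0, 300.0, 1.0)
non_nested, non_nested_sse, start_sse = surrogate_parameters(1.0, 300.0, 0.5)
print(nested, non_nested)
```

When beta = 1 the two curves coincide and the map returns the original growth rate and limiting size. For beta = 0.5 the logistic curve cannot reproduce the target exactly, so the returned parameters summarise features of the output rather than recover the generating values.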

Fig 2. Studying identifiability through between-model geometry. (A) Typically, the log-likelihood, ℓ(p), a one-dimensional metric of model fit, is considered as a function of the model parameters, p. Non-identifiability of model parameters is characterised by insensitivity of the likelihood to a parameter or a parameter combination. (B) We consider a range of models, each with its own parameter vector. We then study the functional relationships between parameters of different models (grey lines). https://doi.org/10.1371/journal.pcbi.1010844.g002

We demonstrate our framework through identifiability analysis of tumour spheroid data. Noisy measurements relating to the outer radius of tumour spheroids are collected (Fig 1A) and quantified with models ranging from the phenomenological logistic growth model, to detailed spatial models involving coupled nonlinear PDEs which require experimental measurements in addition to spheroid radius to parameterise [14]. We work with synthetic data generated from a model of intermediate complexity, the Greenspan model (Fig 1B), and present a series of new and existing models from the literature that produce similar agreement with the data. Initially, we focus on models with a small number of parameters so that model equivalence manifolds can be visualised graphically. Subsequently, we study Ward and King’s model, a model with a large number of parameters for which we must rely on non-graphical means, such as the sensitivity matrix and the Jacobian of the model link, for analysis. Aside from the requirement that the Jacobian of model outputs with respect to parameters be available, which may limit our analysis primarily to models that are deterministic, we expect our methodology to generalise to any hierarchy of models in biology and systems biology.
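To illustrate how a Jacobian of the model link can point to non-identifiable parameter combinations, consider a hypothetical three-parameter model whose surrogate parameters are available in closed form: growth rate λ = a·b and limiting size K = c. This closed-form link is an assumption made purely for illustration; for the models in this work the link is defined implicitly through least-squares equivalence. The sketch computes a finite-difference Jacobian, labels each complex-model parameter by the surrogate feature it most strongly affects, and checks a direction in which the surrogate parameters do not change.

```python
def model_link(theta):
    """Hypothetical closed-form link from (a, b, c) to surrogate parameters (lam, K)."""
    a, b, c = theta
    return [a * b, c]

def jacobian(f, theta, rel_step=1e-6):
    """Central-difference Jacobian; rows index outputs, columns index parameters."""
    n_out = len(f(theta))
    jac = [[0.0] * len(theta) for _ in range(n_out)]
    for j in range(len(theta)):
        h = rel_step * max(1.0, abs(theta[j]))
        up = list(theta)
        up[j] += h
        down = list(theta)
        down[j] -= h
        f_up, f_down = f(up), f(down)
        for i in range(n_out):
            jac[i][j] = (f_up[i] - f_down[i]) / (2.0 * h)
    return jac

theta = [2.0, 3.0, 300.0]      # illustrative values for (a, b, c)
feature_names = ["lam", "K"]   # surrogate parameters, read as data features

J = jacobian(model_link, theta)
values = model_link(theta)

# Relative sensitivities: d log(feature_i) / d log(theta_j).
rel = [[J[i][j] * theta[j] / values[i] for j in range(3)] for i in range(2)]

# Label each complex-model parameter by the feature it most strongly affects.
labels = [feature_names[max(range(2), key=lambda i: abs(rel[i][j]))]
          for j in range(3)]

# Moving along (a, -b, 0) changes neither feature to first order, so the
# surrogate cannot distinguish a from b individually.
direction = [theta[0], -theta[1], 0.0]
change = [sum(J[i][j] * direction[j] for j in range(3)) for i in range(2)]
print(labels, change)
```

In this toy setting a and b are both associated with the growth-rate feature and c with the limiting size. The Jacobian has two rows and three columns, so at least one direction in parameter space leaves both features unchanged.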

Discussion The nexus between model complexity and data quantity and quality is an ongoing challenge in computational biology that is often resolved subjectively rather than objectively. While new experimental technologies are rapidly increasing the detail and resolution obtainable in biological data, mathematical models can always be made arbitrarily complex. On the other hand, data are often limited in light of the biological questions that are posed. Identifiability and sloppiness analyses have been developed to harmonise model and data complexity, and to guide model selection and reduction, in order to ensure parameter identifiability [19, 56]. However, the complex, highly-detailed, heavily-parameterised models that are commonplace in mathematical and computational biology are often required to answer important biological questions: a model of tumour spheroid growth must incorporate nutrient dependencies to provide insight into the role nutrients play in growth [8, 57]. As we show, for some data, simple phenomenological models, such as the logistic and Gompertz growth models, are the only models that are identifiable. These models can provide excellent agreement with experimental data and allow the comparison and interpretation of experiments; however, not being constructivist, they provide only limited insights into underlying biological mechanisms. In many cases, simple phenomenological models produce a goodness-of-fit on par with that of a complex mechanistic model (Fig 1B and 1C). As a result, traditional model selection methodology will favour the simplicity and identifiability of the simple model, penalising the number of parameters in the complex model.
Where the non-identifiable parameters in complex mechanistic models carry direct biological interpretations (the nutrient sensitivity, for instance) of prime interest to experimental scientists and biologists, the identifiable parameters in the simple model carry interpretations relating to features of the data (the early-time rate of change or the maximum spheroid size, for instance). In this work, we utilise this key difference to draw biological insight from complex mechanistic models by studying the geometry of a map from the parameters in the complex model to those in the identifiable surrogate. One interpretation of our approach is to provide an intermediate mode of interpretation that sits between the model parameters and the likelihood (or other goodness-of-fit metric) that is traditionally studied in identifiability and model sloppiness analysis. In contrast to studying the sensitivity of the model in terms of the overall fit, we effectively decompose the fit into features and study the sensitivity of these features to the model parameters. This approach enables us to leverage mechanistic modelling to gain insights from data that would otherwise be lost if a one-dimensional goodness-of-fit metric, such as the likelihood, were to be studied directly. We demonstrate our approach by analysing common models and typical data of tumour spheroid growth. Mathematical models of tumour spheroid experiments range from the simplistic (yet routinely and effectively applied) logistic growth models [11, 58], to spatial models that can capture the density of arbitrary numbers of cell and nutrient species [13, 17, 59, 60], and to individual-based models that describe the individual behaviour of every cell in the spheroid [61–63]. Despite the complexity of even this simple experimental model of tumour growth, data often comprise only measurements of overall tumour spheroid size.
More complicated experimental systems, such as in vivo vascularised tumour growth, are accompanied by a corresponding menagerie of complex models [64–66]; however, data from these experiments can be even more limited, noisy, or sparse in comparison with experimental models of avascular tumour growth [18]. Even the relatively simple Greenspan model, which comprises only four unknown parameters, is non-identifiable without measurements of inner spheroid structure [14]. Our goal in this work is to draw insights from such models with complexity mismatched to that of the available data. The model-data relationship is typically explored with structural or practical identifiability analysis [26]; the former in an infinite-data, model-only frame of reference, the latter in consideration of the noisy observation process that ties the model to the data. While we first establish the practical identifiability of each model, our geometric analysis does not fall into either of these classifications for a number of reasons. First, the model-map is defined in the least-squares sense, and does not explicitly incorporate data. Second, as the surrogate model is not necessarily nested within the complex model of interest, the two are not equivalent in a meaningful infinite-noise-free-data limit. As a consequence, if the complex model is considered reality, the surrogate model produces predictions that are biased. In the context of data, we see this as an advantage, as even complex models are by definition abstractions of reality. As we utilise the surrogate model to characterise features of the data, our approach is overall robust to this bias. We demonstrate this by using the logistic model as a surrogate for the Greenspan model in the main text, despite the bounded Gompertz model having a crowding function far more similar to that of the Greenspan model. Analysis using the Gompertz and Richards models (S1 File) is similar to that using the logistic model.
A limitation of FIM-based identifiability and sloppiness analysis, and our sensitivity-matrix-based geometric analysis, is a restriction to providing only local information. Effectively, these techniques relate to a quadratic approximation and linearisation, respectively, about the MLE (or parameter values otherwise under consideration) of the complex model and are consequently sensitive in cases where the corresponding likelihood is multimodal. While the manifolds relating to the map between the logistic and Greenspan models (Fig 6A and 6B) are locally linear near the parameter combination of interest, globally the manifold relating to λ appears hyperbolic. Different points on the constant-likelihood curve have the potential to produce substantially different sensitivity matrices. One approach to address this is to incorporate prior knowledge to regularise the parameter fitting problem. Recent work considers identifiability and sloppiness analysis based directly on the parameter covariance matrix estimated from Bayesian methods such as Markov-chain Monte-Carlo to provide an overall snapshot of the global parameter sensitivities [35]. However, we expect this approach to be problematic in our geometric framework, since the model-map is based on an equivalence between models that may only apply locally in the vicinity of the points used to compute the between-model sensitivity matrix. The relatively small number of unknown parameters in the Greenspan model allows us to visually explore the geometry of the parameter space using surrogate models, providing insight into non-linearities that are not captured by the model-map sensitivity matrix. However, in contrast to traditional identifiability analysis where parameters are generally classified as identifiable or not, the model-map sensitivity matrix has the ability to further classify non-identifiable parameters by which feature they relate to.
In the vicinity of the parameter values of interest, this classification can allow for graphical geometric analysis even for models with more than three unknown parameters, by decomposing the parameter space into low-dimensional subsets that relate to individual features. For example, in the Ward and King model, the three parameters with the strongest correspondence to the maximum spheroid size, (λ, α, δ), can be prioritised for further analysis ahead of the full, eight-dimensional parameter space [29]. Aside from providing insight into the sensitivity of model features to parameters and predicting non-identifiability, we provide a simple demonstration of how the model-map relationship can be used to move in the parameter space to produce changes to specific model features (Fig 8C). Akin to moving in the sloppiest direction, these results show how to constrain movements in the parameter space to model feature manifolds. For heavily-parameterised models that are difficult to calibrate (perhaps, for example, due to multi-modal likelihoods), constraining movements in the optimisation algorithm used for model calibration to these manifolds allows successive matching of model features: for instance, first moving to parameter combinations that produce the desired maximum spheroid radius, and then moving on this manifold to match the growth curve shape and scale. More generally, applying surrogate models that are themselves candidate models raises interesting possibilities for future analysis. Generalisations of the logistic model, such as the three-parameter Richards model, provide a low-dimensional summary of model behaviour that can be characterised using machine learning or Gaussian processes [67]. Building up a global model-map between the complex and surrogate models is another approach to overcome the localisation limitation of our methods, and could be computationally advantageous in the case of computationally expensive complex models.
Experimental data are often limited in light of the biological questions posed of them. Likewise, in the mathematical and modelling literature, complex models are numerous and often more suitable to biological questions of interest, yet can be ill-suited for parameterisation from the available data. In this work, we develop a geometric analysis to gain insights from complex, non-identifiable models using simple surrogate models with parameters that relate to features in the data. We expect our analysis to apply to any hierarchy of non-identifiable and identifiable models of biological systems.

Acknowledgments We thank Nikolas Haass and Gency Gunasingh for training A.P.B. to perform the tumour spheroid experiments that motivated this work.

---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010844

Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
