(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:
https://journals.plos.org/plosone/s/licenses-and-copyright
------------
Splitting Gaussian processes for computationally-efficient regression
Authors: Nick Terry and Youngjun Choe, Department of Industrial & Systems Engineering, University of Washington, Seattle, WA, United States of America
Date: 2021-09
Abstract Gaussian processes offer a flexible kernel method for regression. While Gaussian processes have many useful theoretical properties and have proven practically useful, they suffer from poor scaling in the number of observations. In particular, the cubic time complexity of updating standard Gaussian process models can be a limiting factor in applications. We propose an algorithm for sequentially partitioning the input space and fitting a localized Gaussian process to each disjoint region. The algorithm is shown to have superior time and space complexity to existing methods, and its sequential nature allows the model to be updated efficiently. The algorithm constructs a model for which the time complexity of updating is tightly bounded above by a pre-specified parameter. To the best of our knowledge, the model is the first local Gaussian process regression model to achieve linear memory complexity. Theoretical continuity properties of the model are proven. We demonstrate the efficacy of the resulting model on several multi-dimensional regression tasks.
Citation: Terry N, Choe Y (2021) Splitting Gaussian processes for computationally-efficient regression. PLoS ONE 16(8): e0256470.
https://doi.org/10.1371/journal.pone.0256470
Editor: Enrico Scalas, University of Sussex, UNITED KINGDOM
Received: April 3, 2021; Accepted: August 6, 2021; Published: August 24, 2021
Copyright: © 2021 Terry, Choe. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data are available publicly at https://github.com/nick-terry/Splitting-GP. This URL can also be found within the manuscript.
Funding: Y.C. was awarded two grants from the National Science Foundation (NSF) for this project (see https://www.nsf.gov/). Specifically, funds from the grants CMMI-1824681 and DMS-1952781 were used for this research work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction

Gaussian process (GP) regression is a flexible kernel method for approximating smooth functions from data. Assuming there is a latent function which describes the relationship between predictors and a response, from a Bayesian perspective a GP defines a prior over latent functions. When conditioned on the observed data, the GP gives a posterior distribution over the latent function. In practice, GP models are fit to training data by optimizing the marginal log likelihood with respect to the hyperparameters of the kernel function, k.

A notable shortcoming of GP regression is its poor scaling in both memory and computational complexity. In what we will hereafter refer to as full GP regression, the posterior mean and variance can be directly computed at any finite number of points using linear algebra [1]. However, a computational bottleneck is caused by the necessary inversion of the kernel matrix K, which is well known to have a time complexity of O(n³), where n is the number of observations. The O(n²) memory complexity of storing K may also pose issues for massive data sets.

GP regression may also struggle to approximate latent functions which are non-stationary [2, 3]. A non-stationary latent function's mean or covariance may vary over its domain. Non-stationarity may also be induced in more subtle ways, such as heteroscedastic additive noise in the observed response.

Local GP regression is a class of models which address both of these problems, to varying extents. The commonality of these models is the assignment of observations to one of many local GPs during training, and the aggregation of the local models' individual predictions. Because each observation is assigned to a single local model, effectively only a block-diagonal approximation of the full kernel matrix K is maintained [4, 5], easing the time and memory complexity. The price of this flexibility and computational advantage is a potential decrease in predictive ability, relative to full GP regression, on tasks for which a full GP is well suited.

A successful method for assigning observations to local GPs is partitioning the input space, the domain of the latent function, and creating a local GP model for each cell of the partition [2–4, 6]. Existing local GP models may encounter various difficulties during the partitioning process and/or neglect the setting where only a sequential data source is available and partitioning must be performed sequentially. This sequential setting is important since it encompasses tasks with changing dynamics. For example, in applications such as process quality monitoring [7, 8] and motion tracking [9], a sequential approach allows the model to adapt to regimes which are yet unobserved.

The primary contribution of this paper is an algorithm which recursively partitions the input space to construct local GPs from sequential observations. The resulting model is dubbed the splitting GP. The algorithm is shown to have superior asymptotic time and memory complexity relative to other state-of-the-art local GP methods which can be used in this problem setting. By design of the algorithm, we also ensure an exact upper bound on the time complexity of updating the splitting GP model. We also prove theoretical properties of the model related to continuity of the predictions and empirically demonstrate the efficacy of the model.
Additionally, a software implementation of the algorithm is provided, which leverages the computational advantages of the GPyTorch library [10], to facilitate the use of the algorithm by others.
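For reference, the following is a minimal NumPy sketch (not the paper's GPyTorch implementation) of full GP posterior prediction, with an assumed squared-exponential kernel, noise level, and illustrative helper names. The Cholesky factorization of the n-by-n kernel matrix is the O(n³) step discussed above, and storing that matrix is the O(n²) memory cost.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between row-wise inputs A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def full_gp_posterior(X, y, X_star, noise=1e-2):
    """Posterior mean and variance of a zero-mean full GP at test inputs X_star."""
    n = X.shape[0]
    K = rbf_kernel(X, X)                              # O(n^2) memory
    L = np.linalg.cholesky(K + noise * np.eye(n))     # O(n^3) time: the bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_star = rbf_kernel(X_star, X)
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = rbf_kernel(X_star, X_star).diagonal() - (v ** 2).sum(axis=0)
    return mean, var

# Example: 500 noisy observations of a 1-D latent function.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(500)
mu, var = full_gp_posterior(X, y, np.linspace(-3, 3, 5).reshape(-1, 1))
```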
Related work

The first exploration of local GP regression is due to Rasmussen and Ghahramani [11], as a special case of their mixture of GP experts model, where prediction at a point is performed by one designated "expert" GP. Snelson and Ghahramani [12] further developed this idea, but only as a supplementary method to sparse GP regression.

An adjacent line of research on treed GP models was begun by Gramacy and Lee [2]. Treed GP models perform domain decomposition in the same manner as classification and regression trees [13], and fit distinct GP models to each resulting partition. Predictions are then formed using k-nearest neighbors. Gramacy and Lee [2] found that local GP regression methods, such as treed GP models, were well suited to non-stationary regression tasks, since each leaf GP model in the tree could specialize to local phenomena in the data. Since the inception of this method, further advances have been made by Gramacy and collaborators [3, 6], particularly with respect to large-scale computer experiments and surrogate modeling. It is acknowledged by Gramacy and collaborators that predictions by the treed GP model are not continuous in the input space [2, 3]. Additionally, the construction of the tree is performed probabilistically, and does not make use of the intrinsic structure of the data for domain decomposition.

Since the original proposal of local GP regression, several methods have been proposed which are adapted to the sequential setting. Shen et al. [14] reduced the prediction time and kernel matrix storage for isotropic kernels by building a kd-tree from the training data set. Nguyen-Tuong et al. [15] proposed a method of local GP regression for online learning, which assigns incoming data to local models by similarity in the feature space and forms mean predictions by weighting local predictions. This local GP model was among the first to consider a fully sequential setting, but it has two notable drawbacks: it suffers from discontinuities in its predictive mean, and it depends on a hyperparameter which is difficult to tune in practice and strongly affects prediction performance.

Another local GP method which may be used in the sequential setting is the robust Bayesian committee machine (rBCM) by Deisenroth et al. [16], which can be seen as a product of GP experts [17]. The rBCM emphasizes rapid, distributed computation over model flexibility. This is demonstrated by a) its assumption that the latent function is well modeled by a single GP and b) the consequent random assignment of observations to GP experts, rather than a partitioning-based approach. This modeling approach does not address potential non-stationarity of the latent function. Streaming sparse GP approximations [18] use variational inference to select n₀ ≪ n pseudo-inputs to create an approximation of the full GP with constant-time updating. However, this method also neglects the possibility of non-stationarity.

More recently, Park and collaborators have done significant work in this area, applying mesh-generation procedures from finite element analysis [19, 20] and the recursive Principal Direction Divisive Partitioning algorithm [4, 21] to partition the input space for fitting local models. However, these papers assume that a substantial number of observations are available during the initial model construction to perform the partitioning procedure, particularly when using mesh generation methods. These models also suffer from discontinuities at the boundaries of partition cells, an issue which the authors have creatively addressed by adding constraints to the hyperparameter optimization which force equality at finitely many boundary points, or by adding pseudo-points at the boundaries to induce continuity.
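To make the product-of-experts aggregation concrete, here is a minimal sketch of combining independent Gaussian expert predictions at a single test point. It implements the plain, unweighted product of Gaussian densities, not the rBCM's weighted variant with its prior-precision correction; the expert means and variances are assumed to come from separately fitted local GPs.

```python
import numpy as np

def product_of_experts(means, variances):
    """Combine independent Gaussian expert predictions at one test point.

    means, variances: arrays of shape (E,) holding each expert's predictive
    mean and variance. Returns the mean and variance of the product of the
    E Gaussian densities, i.e. a precision-weighted average of the means.
    """
    precisions = 1.0 / np.asarray(variances)
    combined_var = 1.0 / precisions.sum()
    combined_mean = combined_var * (precisions * np.asarray(means)).sum()
    return combined_mean, combined_var

# Two experts that disagree: the more confident expert dominates the result.
m, v = product_of_experts(means=[0.0, 1.0], variances=[0.01, 1.0])
```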
Theoretical properties of the splitting GP model

Continuity of the splitting GP

As a consequence of this weighting procedure and the choice of priors for the child GPs, the splitting GP model has some convenient properties. The proofs of the following propositions may be found in the Appendix.

Proposition 1. Let f be the full GP prior to splitting and let x* be a test input. The child GPs from the first split have the property that, prior to being updated with any new observations, their combined predictive distribution at x* equals that of the full GP. That is, the predictive distribution is preserved by the splitting procedure.

This result is somewhat intuitive, since no additional evidence has been obtained which might alter our Bayesian beliefs about the latent function; we have simply altered the model structure. In agreement with this idea, the equality of the parent's and children's predictive distributions no longer holds after any new data is assigned to one of the child GPs. Note that, in general, this relationship does not hold for any split after the first.

An important property of the splitting GP model is the preservation of continuity properties of the child GPs. We provide a result which gives the necessary and sufficient condition for continuity of the splitting GP model's predictions.

Proposition 2. Suppose k is a kernel function. Then the random field f*|x* given by the splitting GP's weighted prediction is mean square continuous in the input space if and only if the kernel function, k, is continuous. Under the same condition, the mean prediction is also continuous in the input space.

While it may seem obvious that the mean prediction is continuous in the input space, it is reassuring to know that the random field created by the splitting model has the same sufficient condition for mean square continuity as the underlying child GPs. Intuitively, we are computing a continuously weighted average of smooth random fields, so the resulting random field should also be smooth.

Note that it is not necessary to aggregate the predictions of each local model. Instead, one may elicit predictions only from the C₀ < C local models most similar to the point of prediction, as measured by the kernel. This variation of the prediction method, as used by Nguyen-Tuong et al. [15], may result in faster computation of predictions. However, we caution against this practice without careful consideration. This prediction method results in a loss of continuity of the mean prediction, and consequently of the mean square continuity of the random field f*|x*, since the mean prediction is computed using a maximum function, which is not continuous. The discontinuity manifests as sharp jumps in the mean predictions, as illustrated in Fig 4 and in the sketch that follows it.
Fig 4. Example of discontinuous predictive mean. The discontinuous predictive mean of Y in X is due to using only the closest child GP (left) instead of both child GPs (right). Note the center of each child GP (red vertical line). The dashed green vertical line denotes the location of the discontinuity.
https://doi.org/10.1371/journal.pone.0256470.g004
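The contrast in Fig 4 can be reproduced with a short sketch of the two prediction rules discussed above: a kernel-weighted average over all child GPs versus using only the most similar child. The child-mean functions and the Gaussian weight function below are illustrative assumptions, not fitted models.

```python
import numpy as np

def child_weight(x, center, lengthscale=1.0):
    """Kernel similarity between a test input x and a child GP's center."""
    return np.exp(-0.5 * (x - center) ** 2 / lengthscale ** 2)

def weighted_prediction(x, centers, child_means):
    """Aggregate all child means with weights proportional to kernel similarity."""
    w = np.array([child_weight(x, c) for c in centers])
    mu = np.array([f(x) for f in child_means])
    return (w * mu).sum() / w.sum()        # continuous in x if the kernel is continuous

def closest_child_prediction(x, centers, child_means):
    """Use only the most similar child; the argmax induces jumps at cell boundaries."""
    i = int(np.argmax([child_weight(x, c) for c in centers]))
    return child_means[i](x)

# Two hypothetical child GPs whose mean functions disagree near the boundary between them.
centers = [-1.0, 1.0]
child_means = [lambda x: np.sin(x), lambda x: np.sin(x) + 0.5]
xs = np.linspace(-3, 3, 601)
smooth = [weighted_prediction(x, centers, child_means) for x in xs]
jumpy = [closest_child_prediction(x, centers, child_means) for x in xs]   # jumps at x = 0
```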
Complexity analysis of the algorithm

The algorithm for the splitting GP model has improved complexity in both memory and training/prediction time compared to full GP regression. Additionally, it maintains some benefit in asymptotic complexity relative to other local GP methods. The complexity of updating the splitting GP model with a single datum is bounded above by O(m³), corresponding to a matrix inversion for one child GP. The contribution of the PCA-based splitting procedure to the time complexity is negligible: assuming the PCA is performed via naive singular value decomposition, its cost amortized over the m sequential observations accumulated between splits yields only a linear additive term, which is dominated by the O(m³) inversion since m < m³. While each child GP has at most m observations, the average case will clearly be lower.

It should be noted that we explicitly chose not to utilize a rank-one update of the Cholesky decomposition of the kernel matrix during the update procedure, a method which is used in other local GP methods [15]. We made this choice to permit updating the model with a batch of observations, in addition to fully sequential updating, which would require an update of rank greater than one. We also observed empirically that repeatedly updating a single child GP using rank-one updates can cause numerical issues when the kernel length-scale parameter is small.

Since the time complexity of the algorithm is characterized primarily by the parameter m, the largest number of observations which may be associated with a single child model, it is straightforward to select an appropriate parameter value for applications requiring rapid sequential updating. After empirically determining the wall-clock time needed for an update, the parameter may be adjusted appropriately.

The splitting algorithm also imposes a lower memory complexity in terms of the number of observations, n. It is well known that local GP methods effectively store only a block-diagonal approximation of the full covariance matrix [4, 5]. In particular, when the splitting algorithm has n observations, ⌊n/m⌋ child GPs have been created, each of which stores a kernel matrix with up to m² entries. The resulting memory complexity of the algorithm is then O(nm), compared with the O(n²) memory complexity of full GP regression. In contrast, the rBCM [16] has an asymptotic memory complexity of O(n²/E), where E is the (constant) number of experts specified as a parameter. Nguyen-Tuong et al. [15] did not present the memory complexity of their local GP model but, under the mild assumption that the input space is bounded, it can be shown that its asymptotic memory complexity is quadratic in n. Notably, the splitting GP model is, to the best of our knowledge, the only local GP model to achieve a linear memory complexity.
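As a back-of-the-envelope illustration of the memory argument above, the following snippet counts kernel-matrix entries for a full GP versus the block-diagonal storage of the splitting GP, along with the bound on the per-update cost; the values of n and m are arbitrary.

```python
# Counts of kernel-matrix entries (not bytes), following the block-diagonal argument.
n, m = 100_000, 500                       # n observations, splitting limit m (illustrative)

full_gp_entries = n ** 2                  # O(n^2): one n-by-n kernel matrix
splitting_gp_entries = (n // m) * m ** 2  # O(nm): about n/m children, each m-by-m

update_flops_bound = m ** 3               # O(m^3): inverting one child's kernel matrix

print(full_gp_entries / splitting_gp_entries)   # ratio is n/m = 200 in this example
```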
Conclusion

In this paper, we have developed an algorithm for constructing splitting GP regression models for potentially non-stationary latent functions using sequential data. We have shown that splitting GP models attain predictive performance comparable to existing approaches, while addressing critical shortcomings of other local GP models, such as discontinuity of predictions, lack of flexibility for modeling non-stationary latent functions, and opaque parameters which may be challenging to tune. Furthermore, splitting GP models are shown to enjoy linear memory complexity which, to the best of our knowledge, is the best among existing local GP methods, which typically have quadratic memory complexity. An implementation of the splitting GP model is available at
https://github.com/nick-terry/Splitting-GP.
Appendix

Proofs

Proposition 1. Proof. We consider the case of a single local model being split into two new local models. By the definition in (2), the posterior mean at an input x* is given by Eq (5), where α = S⁻¹ k(c_1, x*). Assuming that no data has been observed since the split, Eq (6) follows. Finally, applying Eq (1) yields the result.

Proposition 2. Proof. A common result on random fields gives that the random field f*|x* is mean square continuous if and only if its expectation and covariance functions are continuous; see [31], Theorem 5.2, for example. It then suffices to show that the predictive mean is continuous, as written in Eqs (7) and (8). Recall that the predictive mean of each child GP is a linear combination of kernel evaluations at x*, and is therefore continuous whenever k is continuous. Note from (3) that S⁻¹ is continuous if and only if k is continuous. In this case, the aggregated predictive mean is continuous, and hence f*|x* is mean square continuous.

The splitting GP algorithm

Here we specify three algorithms which describe the main operations of the splitting GP model: splitting a GP, updating the model, and computing the predictive mean. We make use of much of the same notation defined in the main paper. From a computational perspective, the necessary components to construct a GP model are the training inputs and training responses, X_i and Y_i, respectively. For convenience of notation, the following algorithms are described in terms of matrix operations on these variables. We use the triple (X_i, Y_i, c_i) to characterize the i-th child GP, where c_i is the center of the GP as defined in the main paper. The splitting GP model itself is given as a triple consisting of the set of child GPs, the kernel function k_θ with hyperparameters θ, and the splitting limit m. The following pseudo-code uses explicit for loops for clarity; our implementation uses vectorized versions of these algorithms for efficiency.

We first specify the splitting algorithm, which is used to divide a GP into two smaller child GPs. To keep the pseudo-code as general as possible, we define the function PrincipalDirection(X) as one which computes the first principal component vector of a matrix X.

Algorithm 1: Split((X_i, Y_i, c_i))
    initialize two empty children (X_a, Y_a) and (X_b, Y_b);
    v ← PrincipalDirection(X_i);
    for j ← 1 to n_i do
        I ← v^T (x_ij − c_i);
        if I > 0 then
            append (x_ij, y_ij) to (X_a, Y_a);
        else
            append (x_ij, y_ij) to (X_b, Y_b);
        end
    end
    set the centers c_a and c_b of the two children;
    Result: two child GPs (X_a, Y_a, c_a) and (X_b, Y_b, c_b)

Using Split, we can now define the algorithm Update for updating the splitting GP model given a new data point (x, y). We use the shorthand Train to mean the standard fitting of a GP model with the hyperparameter θ to data by means of maximizing the log marginal likelihood [1].

Algorithm 2: Update((x, y))
    if C = 0 then
        X_1 ← x^T; Y_1 ← y^T; c_1 ← x; n_1 ← 1; C ← 1;
    else
        I ← index of the most similar child GP (e.g., arg max_i k_θ(x, c_i));
        append x^T to X_I and y^T to Y_I;
        n_I ← n_I + 1;
        if n_I > m then
            (X_I, Y_I, c_I), (X_{C+1}, Y_{C+1}, c_{C+1}) ← Split((X_I, Y_I, c_I));
            C ← C + 1;
        end
    end
    Train the model's child GPs with hyperparameter θ;
    Result: the updated splitting GP model

Finally, the algorithm for computing the mean prediction of the response at an input x follows. Here we use the abbreviated Mean function to give the posterior mean for a child GP (X_i, Y_i, c_i). The posterior mean for each child GP is computed in closed form using linear algebra [1].

Algorithm 3: Predict(x)
    for i ← 1 to C do
        s_i ← k_θ(x, c_i);
        E_i ← Mean((X_i, Y_i, c_i), x);
    end
    compute the weighted average of the child means E_i with weights proportional to s_i;
    Result: the predictive mean at x
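For readers who prefer executable code to pseudo-code, the following is a compact NumPy sketch of the three operations above. It is a simplification rather than the paper's GPyTorch implementation: the kernel, noise level, and assignment-by-kernel-similarity rule are assumptions, hyperparameters are fixed instead of refit with Train, and each child's center is taken to be the centroid of its inputs.

```python
import numpy as np

def kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel; an assumed stand-in for k_theta."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

class Child:
    """One child GP, characterized by its data (X, Y) and center c."""
    def __init__(self, X, Y):
        self.X, self.Y = X, Y
        self.c = X.mean(axis=0)           # assumption: center = centroid of inputs

    def mean(self, x, noise=1e-2):
        """Closed-form posterior mean of a zero-mean GP at a single input x."""
        K = kernel(self.X, self.X) + noise * np.eye(len(self.X))
        k_star = kernel(x[None, :], self.X)
        return float((k_star @ np.linalg.solve(K, self.Y))[0])

def split(child):
    """Split a child's data by the sign of the projection on its first principal direction."""
    centered = child.X - child.c
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    side = centered @ vt[0] > 0           # vt[0] is the first principal direction
    return Child(child.X[side], child.Y[side]), Child(child.X[~side], child.Y[~side])

class SplittingGP:
    def __init__(self, m):
        self.m = m                        # splitting limit
        self.children = []

    def update(self, x, y):
        """Add one observation; split the receiving child if it exceeds m points."""
        if not self.children:
            self.children.append(Child(x[None, :], np.array([y])))
            return
        sims = [kernel(x[None, :], c.c[None, :])[0, 0] for c in self.children]
        i = int(np.argmax(sims))          # assign to the most similar child
        c = self.children[i]
        c.X = np.vstack([c.X, x]); c.Y = np.append(c.Y, y)
        c.c = c.X.mean(axis=0)
        if len(c.X) > self.m:
            self.children[i:i + 1] = list(split(c))

    def predict(self, x):
        """Kernel-weighted average of the child GPs' posterior means."""
        w = np.array([kernel(x[None, :], c.c[None, :])[0, 0] for c in self.children])
        mu = np.array([c.mean(x) for c in self.children])
        return float((w * mu).sum() / w.sum())

# Example usage on a 1-D stream of noisy observations.
rng = np.random.default_rng(1)
model = SplittingGP(m=50)
for _ in range(300):
    x = rng.uniform(-3, 3, size=1)
    model.update(x, np.sin(2 * x[0]) + 0.05 * rng.standard_normal())
y_hat = model.predict(np.array([0.5]))
```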
Hyperparameter tuning for the local GP model

In the experiment with the synthetic data set, we initially performed a grid search on the parameter w_gen of the local GP model [15] from 0.9 to 0.1 at increments of 0.05, and obtained the best results for w_gen = 0.1. However, the resulting MSE was much higher than that of the other models considered. In the interest of fair comparison, we conducted a second grid search ranging from 0.1 to 10⁻³ by increments of 5 × 10⁻⁴, and the best parameter was found to be 10⁻³, for which we show the results in Fig 6. We chose not to lower w_gen further, since the local GP model's MSE may be expected to decrease monotonically as w_gen decreases. The parameter w_gen defines a similarity threshold: if a new observation is sufficiently dissimilar from all existing local models (i.e., k(x*, c_i) ≤ w_gen for all i = 1, …, C), a new local model is created. If w_gen = 0, then only one local GP will ever be created, so that the model reduces to a full GP. This behavior can be seen in Fig 8, where we performed a grid search on w_gen ranging as low as 10⁻¹⁰. For values less than 10⁻⁸, limitations of floating-point precision caused the local GP model's prediction procedure to become numerically unstable, so Figs 8 and 9 do not include these parameter values.
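As an illustration of the allocation rule just described (for the local GP baseline of [15], not the splitting GP), the sketch below shows how w_gen gates the creation of new local models; the Gaussian similarity function and the helper name assign_or_create are assumptions for illustration.

```python
import numpy as np

def assign_or_create(x, centers, w_gen, lengthscale=1.0):
    """Return the index of the local model receiving x, creating a new one if
    every existing center is less similar to x than the threshold w_gen."""
    if centers:
        sims = [np.exp(-0.5 * np.sum((x - c) ** 2) / lengthscale ** 2) for c in centers]
        best = int(np.argmax(sims))
        if sims[best] > w_gen:
            return best                   # x joins an existing local model
    centers.append(np.array(x, dtype=float))
    return len(centers) - 1               # a new local model is created at x

# With w_gen close to 0 almost every point joins the first model (approaching a full GP);
# larger w_gen produces many small local models.
centers = []
for x in np.random.default_rng(2).uniform(-3, 3, size=(200, 1)):
    assign_or_create(x, centers, w_gen=0.5)
print(len(centers))
```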
[1] URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0256470