(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:
https://journals.plos.org/plosone/s/licenses-and-copyright
------------
SiGMoiD: A super-statistical generative model for binary data
['Xiaochuan Zhao', 'Department Of Physics', 'University Of Florida', 'Gainesville', 'Florida', 'United States Of America', 'Germán Plata', 'Elanco Animal Health', 'Greenfield', 'Indiana']
Date: 2021-08
In modern computational biology, there is great interest in building probabilistic models to describe collections of a large number of co-varying binary variables. However, current approaches to build generative models rely on modelers’ identification of constraints and are computationally expensive to infer when the number of variables is large (N~100). Here, we address both these issues with S uper-stat i stical G enerative Mo del for b i nary D ata (SiGMoiD). SiGMoiD is a maximum entropy-based framework where we imagine the data as arising from super-statistical system; individual binary variables in a given sample are coupled to the same ‘bath’ whose intensive variables vary from sample to sample. Importantly, unlike standard maximum entropy approaches where modeler specifies the constraints, the SiGMoiD algorithm infers them directly from the data. Due to this optimal choice of constraints, SiGMoiD allows to model collections of a very large number (N>1000) of binary variables. Finally, SiGMoiD offers a reduced dimensional description of the data, allowing us to identify clusters of similar data points as well as binary variables. We illustrate the versatility of SiGMoiD using several datasets spanning several time- and length-scales.
Collectively varying binary variables are ubiquitous in modern biology. Given that the number of possible configurations of these systems typically far exceeds the number of available samples, generative models have become an essential tool in quantitative descriptions of binary data. The state-of-the-art approaches to build generative models have several conceptual limitations. Specifically, they rely on the modeler choosing system-appropriate constraints, which can be challenging in systems with many complex interactions. Moreover, they are computationally expensive to infer when the number of variables is large (N~100). To address this issue, we propose a theoretical generalization of the maximum entropy approach that allows us to model very high dimensional data; at least an order of magnitude higher than what is currently possible. This framework will be a significant advancement in the computational analysis of covarying binary variables.
Funding: XZ and PD would like to thank the University of Florida startup fund for their salaries. XZ would like to thank the University of Florida Research Opportunity Fund RPD-ROSF2021 for his salary. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
In order to study covariation in a large number of binary variables in a constraint-agnostic and numerically efficient manner, we propose a novel dimensionality reduction framework inspired from statistical physics; S uper-stat i stical G enerative Mo del for b i nary D ata (SiGMoiD). SiGMoiD is a generalization of the max ent approach and has several salient features that distinguish it from the state-of-the-art models of binary variables. (1) In SiGMoiD, the modeler only specifies the total number of constraints. The constraints are optimally learned from the collected samples and a max ent model is fit to those constraints. (2) As a result of this optimal choice of constraints, SiGMoiD requires a much smaller number of constraints than traditional max ent. Consequently, the inference in SiGMoiD is significantly faster than typical max ent models, allowing us to analyze very high dimensional data sets (dimensions ≫1000) that remain well out of the reach of current max ent methods. (3) SiGMoiD does not assume that collected samples are drawn from the same distribution. Instead, motivated by superstatistics, it imagines each sample as arising from its own max ent probability distribution. Since each sample is approximated by a small set of Lagrange multipliers, SiGMoiD is also a non-linear dimensionality reduction method. Below, we first sketch the outline of SiGMoiD and then illustrate it utility by applying it to several data sets.
Unfortunately, estimating the frequency of occurrence of every possible binary configuration from available samples is not possible for any reasonably sized collection; a system with N co-varying binary variables has 2 N possible configurations and the number of collected samples is typically orders of magnitude lower than the number of configurations. At the same time, given the complexity of interactions, in most cases, it is infeasible to build bottom-up mechanistic models to describe these systems. A popular alternative is to derive approximate top-down probabilistic models and train those models on the data. Over the past two decades, the maximum entropy (max ent) method [ 4 ] has emerged as perhaps the only candidate for building approximate generative models across a variety of contexts [ 5 – 12 ]. Briefly, amongst all probability distributions (models) that are consistent with user-specified constraints, max ent chooses the least biased one; the max ent distribution does not disfavor any outcome unless warranted by the imposed constraints. However, traditional application of max ent has several drawbacks. (1) Perhaps the biggest limitation is that the modeler is required to a priori identify constraints that are appropriate for a given system. Depending on the complexity of interactions, these constraints may not be obvious. A work around is to impose a very large number of constraints. For example, a max ent model to analyze correlated firing N neurons will typically involve N constraints on mean firing rates of individual neurons and ~N 2 constraints on covariations in firing rates for all pairs of neurons. (2) These user-identified constraints are imposed using Lagrange multipliers and the multipliers need to be tuned such that model predictions numerically match imposed constraints. However, most often these multipliers cannot be determined analytically and have to be inferred numerically. The most common approach is to use Markov chain Monte Carlo (MCMC) methods to estimate the gradients of the log-likelihood of the data with respect to the Lagrange multipliers and then performing gradient ascent [ 10 , 13 , 14 ]. These calculations are computationally infeasible even when the number of dimensions is only moderately large (N~100). (3) The numerical values of the imposed constraints are often evaluated using experimental samples which implicitly assumes that samples points are statistically independent of each other. This assumption is not true in most practical applications, for example, in temporally correlated firing of neurons or phylogenetically related protein sequences [ 10 , 15 ]. These limitations taken together have severely limited the application of max ent when there an increasing interest in describing the collective behavior of thousands of cells, genes or microbial species, among others.
Recent technical advances allow us to collect high resolution and high dimensional data across several biological systems. In several cases, these data can be accurately represented as collectively varying binary variables. Significant examples can be found in genomics, where sequencing data is first mapped to gene families and then the presence or absence of thousands of genes across microbial genomes is investigated [ 1 ], in microbial ecology, where sequences of the 16s ribosomal gene are first mapped onto operational taxonomic units (OTUs) and then presence or absence of species across microbiomes is investigated to identify direct metabolic interactions [ 2 ], or in neuroscience, where electrical current recordings from hundreds of thousands neurons are binarized into spike trains which are then related to organismal level tasks [ 3 ].
Results
The model We assume that experimental measurements are in a form where individual samples (data points), indexed by subscript s, comprise N binary variables {σ is } (i∈[1,N], s∈[1, S]) that take values 0 or 1. Let us denote by π si the probability that σ si = 1 and by π s the vector of probabilities π s = {π s1 , π s2 ,…,π sN }. To motivate our model framework (Fig 1), we imagine the following physical process: for a fixed sample s, each binary variable i in the collection of N variables is interacting with the same bath that can exchange K types of extensive variables (energies). The kth type of energy (feature) for each binary variable in the state when it is active (σ si = 1) is E ki and zero when it is inactive (σ si = 0) (denoted collectively by E). Under these circumstances, the probability of the ith binary variable is equal to 1 in the sth sample is given by the Gibbs-Boltzmann distribution [16]: (1) In Eq 1, β s = {β s1 , β s2 ,…,β sK } are the intensive variables (latent space representation or latents) specific to sample s. The probabilities in Eq 1 are the maximum entropy probability distributions when averages 〈E ki 〉 s (k∈[1,K]) of the K types of energies are specified for each variable (i∈[1,N]) for every sample s. PPT PowerPoint slide
PowerPoint slide PNG larger image
larger image TIFF original image Download: Fig 1. Schematic of the SigMoiD approach. Probabilities π si for binary variables i in samples s are generated according to a Gibbs-Boltzmann distribution with energies (features) E and inverse temperatures (latents) β. The observed data (samples) is assumed to have arisen from Bernoulli trials based on the model probabilities. SiGMoiD infers the parameters E and β using maximum likelihood inference.
https://doi.org/10.1371/journal.pcbi.1009275.g001 We have set up the model such that the latents β sk depend on the sample index s but not on the index i of the binary variables. In contrast, the features E ki depend on the binary variables i but are shared across all samples s. Let us consider that we are given S samples {σ si } (i∈[1,N], s∈[1,S]) of the binary variables. From these samples, we infer sample-specific latents β s and sample-independent features E. To that end, we take a maximum likelihood approach. We write the log-likelihood of the data given the parameters: (2) The log-likelihood can be maximized to determine the parameters using gradient ascent. The gradients are given by: (3) SiGMoiD has several salient features. First, similar to other non-linear dimensionality reduction methods, if K≪N, SiGMoiD offers a reduced dimensional description of the data; the K dimensional vectors β s embed the N dimensional data point σ s in a K≪N dimensional space. In addition, since SiGMoiD is a fully probabilistic approach, it can also be used as a generative model. Random samples can be generated as follows. We first select a random set of latents β s , evaluate the probabilities π s and sample random variables σ as Bernoulli variables using those probabilities. SiGMoiD also allows us to evaluate the probability of a new set of binary variables σ given the other observations. Specifically, the probability is (4) where (5) is the probability of observing the binary variables σ when the latents are fixed at β s . We note that even though we have proposed to identify the parameters using a maximum likelihood approach, given a suitable prior p prior (β, E) over the parameters, we can also estimate the Bayesian uncertainty in parameter estimation using the posterior distribution (6) Finally, we comment on the degeneracies in SiGMoiD inference procedure. If we multiply the S×K matrix of latents β sk by a K×K invertible matrix M:β→βM and simultaneously multiply the K×N matrix of features E ki by M−1:E→M−1E, the SiGMoiD predictions do not change. Therefore, SiGMoiD- based inference of parameters will significantly depend on the initialization. These degeneracies can therefore be minimized or completely removed without changing model performance by imposing additional restrictions on the parameters, for example, by requiring that the latents or the features are orthogonal to each other, i.e. imposing βTβ = I K or EET = I K . We leave these explorations for future studies.
Accuracy of SiGMoiD as a probabilistic model: Modeling the collective firing of neurons Before illustrating SiGMoiD using high dimensional data sets, we first show a comparison between SiGMoiD and the standard approach to model binary variables; a max ent model. We use a previously collected data set measuring the collective firing of 160 retinal neurons for the duration of a movie that lasted 19 seconds [17,18] (see Supplementary Information). We note that inference of a max ent model for the collective firing of all 160 neurons is currently computationally prohibitive. We chose the 15 most active neurons in the data (15 highest firing propensities) to illustrate our approach. First, we inferred a max ent model from the data that constrained mean firing rates and pairwise correlations. The max ent model describes the probability of any configuration σ as: (7) In Eq 7, J ij are coupling constants (Lagrange multipliers) that need to be inferred from the data, typically using gradient ascent of the log likelihood of the data [10,13,14]. Given that there are only 215~3×104 states for 15 neurons, we could estimate model predictions and therefore the coupling constants by an exhaustive brute force summation over all possible states without resorting to MCMC simulations to estimate the gradients. This minimized the errors in max ent inference that arise due to inaccuracies in MCMC-based estimates of average firing rates and neuron-neuron correlations. The resulting max ent model perfectly captured the average firing rates and the pair correlations (S1 Fig). In Fig 2, we compare the max ent model with SiGMoiD. The max ent model has neuron-specific parameters. To match that number, we choose K = 7 in SiGMoiD. In panel (a), we show a comparison between the raw frequencies of individual configurations obtained from data (x-axis) to model predicted probabilities (y-axis, red: max ent, blue: SiGMoiD). The raw frequencies were obtained by counting the number of instances of individual configurations across all 104 samples. It is clear that SiGMoiD has smaller error compared to the max ent model (mean absolute error 8.6×10−6 vs 1.4×10−5). In panel (b), we plot the probability that n neurons fire at the same time as observed in the data (black), predicted using SiGMoiD (blue), and using the max ent model (red). Here too, the SiGMoiD model performs well when capturing the probability of simultaneous activity. In panels (c) and (d), we plot the absolute values of the three-body correlations |〈δσ i δσ j δσ k 〉| as observed in the data (x-axis) and as predicted by the model (y-axis, SiGMoiD, panel (c), max ent, panel (d)). Both models capture the three body correlations with reasonable accuracy; the mean absolute error is 8.2×10−4 vs 1.4×10−3 for the SiGMoiD and the max ent model respectively. This analysis illustrates that the SiGMoiD is better than the max ent based model at capturing the data and making predictions. Next, we move to systems that are currently well out of the reach of max ent methods. PPT PowerPoint slide
PowerPoint slide PNG larger image
larger image TIFF original image Download: Fig 2. Comparison of SiGMoiD with max ent modeling. (A) the frequencies of individual configurations estimated from the samples (x-axis) and from the two models (y-axis), (red: max ent, blue: SiGMoiD). Only the frequencies of the 1442 configurations observed at least once in the samples are shown. (B) the probability that n neurons fire in any given configuration as estimated from samples (black), the max ent model (red), and SiGMoiD (blue), (C) and (D) comparison between the absolute values of three variable correlations 〈δσ i δσ j δσ k 〉 estimated from data (x-axis) and those using the models (y-axis). There are such correlations.
https://doi.org/10.1371/journal.pcbi.1009275.g002
Inference of interactions from bacterial co-occurrences using SiGMoiD Gut microbiomes are complex ecosystems whose statistical properties have received significant attention in the last couple of years [19,20]. Gut bacteria live in species-rich communities where they compete for nutrients and also exchange metabolites with each other. Describing these interactions is critical to map the ecological networks of gut microbiomes and identify targets for controlling microbial communities [21]. However, many of the direct metabolic interactions between gut microbes are likely to occur at a micron length scale [2]. Therefore, it is infeasible to infer these interactions from macroscopic, community-wide abundance co-variation. To address this issue, Sheth et al. [2] recently probed the spatial organization of the gut microbiome at the micron length scale, allowing them to capture putative direct interactions between bacteria. In these experiments, Sheth et al. [2] fractionated mice guts into particles with a median diameter of 30 μm and quantified the membership of 347 operational taxonomic units (OTUs) across 1406 particles. However, given that co-occurrences are transitive (if A interacts and co-occurs with B, and B interacts and co-occurs with C, then A co-occurs with C even in the absence of interactions), it is not possible to use simple co-occurrence calculations to identify putative pairs of directly interacting OTUs [22]. Given that SiGMoiD can directly model occurrence of individual OTUs across particles, it can be used to identify clusters of OTUs that co-vary across particles as well as clusters of particles that show specific OTU occurrence profiles. We therefore analyzed the data collected by Sheth et al. [2] using SiGMoiD (see Supplementary Information). Each particle was characterized by a binary vector of dimension representing the OTUs present in that particle. It is evident that SiGMoiD will fit the data better as the number of components K increases. Unlike the neuron samples which were correlated in time and across different trials, the microbiome particles are likely to be closer to statistical independence. Therefore, we can use information theory-based criteria to select the optimal K that fits the data but avoids overfitting. In S2 Fig, we show the Akaike information criteria (AIC) vs. K for the OTU data. The model picks out K = 8 as the optimal value which we use in further analysis. When individual samples are correlated, one can use cross-validation; splitting the samples in a training vs. a validation set and then evaluating the probability of the validation set, to avoid overfitting. The number of species found in each particle, a gross descriptor of the complexity of the community [8], varies substantially from particle to particle. As shown in Fig 3, generative modeling of the particles using SiGMoiD accurately captures this quantifier of ecological complexity. In Fig 3A we show the probability of co-occurrence of multiple OTUs in any community as observed in the data (black circles) and as predicted by SiGMoiD (blue line). We compare these distributions with the null expectation given by the probabilities of occurrence of n OTUs in any given particle calculated using the occurrence frequencies of individual OTUs but neglecting the correlations between OTUs (red line). The significant difference between the two suggests that SiGMoiD can accurately capture the interactions between OTUs, which in turn allows it to predict the co-occurrence distribution. PPT PowerPoint slide
PowerPoint slide PNG larger image
larger image TIFF original image Download: Fig 3. SiGMoiD models bacterial co-occurrences and interactions. (A) the probabilities of co-occurrence of multiple OTUs in a single particle. Black circles represent the data, the blue line and shaded blue region represents the SiGMoiD predictions and standard deviations around the predictions, and the red line represents a prediction based on mean occupancies of OTUs. (B) Clustergram showing similarity in features between OTUs. The identified outgroup is marked red. (C) PCA of 3 clusters identified using particle-specific latents β s . (d) (upper half) Co-occurrence frequencies of 10 OTUs whose occurrence frequency was most significantly different in cluster 3 compared to the baseline co-occurrence frequencies of the same OTUs (bottom half).
https://doi.org/10.1371/journal.pcbi.1009275.g003 In fact, SiGMoiD can be used to identify specific bacteria with similar occurrence profiles across particles. SiGMoiD characterizes each binary variable (here, OTU presence/absence) using a K dimensional vector of features. OTUs with similar features will have similar co-occurrence profiles as well. Therefore, the feature vector can be used to identify clusters of co-occurring OTUs. SiGMoiD-based clustering of OTUs is a more direct way of identifying clusters by relying on inferred inherent properties of the OTUs rather than their co-occurrence profiles. Fig 3B shows a hierarchical clustering plot of all OTUs using SiGMoiD-inferred features. Among the several identified clusters, we focus on the cluster of 69 OTUs highlighted in the figure. The gut microbiome of mice is dominated by OTUs belonging to the family Lachnospiraceae; ~53% of all the OTUs in the analyzed data belonged to this family. However, these OTUs are not equally distributed across the particles. The cluster highlighted in the figure is statistically significantly enriched with the family Lachnospiraceae (46 out of 69, single tailed hypergeometric distribution p-value 0.009). Notably, the OTUs belonging to this cluster had predominantly positive correlations across different particles; 2330 out of the 2346 unique pairs had a positive correlation with 92% of pairs with a p-value less than 10−2 (84% of pairs with a p-value less than 10−4 and 74% of pairs with a p-value less than 10−6). In comparison, only 50% of unique pairs from other OTUs had a positive correlation and only 23% of those correlations had a p-value less than 0.01. These analyses suggest that SiGMoiD-based features can identify clusters of OTUs that significantly co-occur in a given ecology. There are two types of metabolic interactions between bacteria that lead to co-occurrence in an ecosystem [23], especially at the micron length scale [2]. Genetically related bacteria tend to co-occur because they have similar metabolic networks and can compete for the same resources. In contrast, genetically dissimilar bacteria have different metabolic networks and can cross-feed each other; one species utilizing the metabolic byproducts of another. Therefore, this cluster likely represents the co-occurrence of multiple species in the Lachnospiraceae family that compete with each other for the same resources in the mouse gut. In addition to identifying OTUs that have similar occurrence profiles across communities, SiGMoiD can also be used to identify communities that have similar OTU occurrence profiles. SiGMoiD embeds each high dimensional binary sample in a much lower dimensional space of sample-specific β latents. Using K-means clustering of sample-specific β latents, we identified 3 clusters of particles (S2 Fig). Principal component analysis (PCA)-based visualization of the particles clearly shows the three identified clusters (Fig 3C). Notably, several specific OTUs were co-present with much higher occurrence frequencies in the identified small cluster (cluster 3, comprising 47 particles). In Fig 3D, we compare the pairwise co-occurrence frequency of 10 OTUs whose occurrence frequency was identified to be most significantly different between particles in cluster 3 compared to the baseline using a hypergeometric test (S1 Table). It is clear that compared to the baseline co-occurrence frequency (sub-diagonal half of Fig 3D), the pairwise co-occurrence frequencies of the 10 OTUs are significantly elevated in the communities in cluster 3 compared to the rest of the communities. These analyses show that SiGMoiD can also identify specific communities that comprise strongly co-occurring bacteria that differentiate them from other communities. These significant clusters can potentially be investigated for direct co-operative or competitive interactions, as well as their association with distinct regions of the gut. Importantly, clusters of particles with these tightly correlated species were not detected when we clustered the samples (binary vectors) directly using the same approach (S3 Fig).
[END]
[1] Url:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009275
(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL:
https://creativecommons.org/licenses/by/4.0/
via Magical.Fish Gopher News Feeds:
gopher://magical.fish/1/feeds/news/plosone/