


A deep hierarchy of predictions enables online meaning extraction in a computational model of human speech comprehension [1]

Authors: Yaqing Su (Department of Fundamental Neuroscience, Faculty of Medicine, University of Geneva, Geneva; Swiss National Centre of Competence in Research Evolving Language, NCCR EvolvingLanguage); Lucy J. MacGregor (Medical Research Council Cognition …)

Date: 2023-03

Model for speech comprehension

We model speech perception by inverting a generative model of speech that is able to generate semantically meaningful sentences to express possible facts about the world. Since our main goal is to illustrate the cognitive aspect of speech comprehension, we use the model to simulate a semantic disambiguation task similar to that of MacGregor and colleagues [32]. The task introduces a semantic ambiguity early in a sentence, which is disambiguated later in the sentence on half of the trials. Speech inputs to the model were short synthesized sentences adapted from MacGregor and colleagues [32].

In the next section, we describe the speech stimuli, present the generative model, and briefly describe the approximate inversion of the generative model as well as the two information theoretic measures that could be related to measurable brain activity.

1. Speech stimuli

In the original design of MacGregor and colleagues, 80 sentence sets were constructed to test the subjects’ neural response to semantic ambiguity and disambiguation. Each set consists of four sentences in which two sentence-middle words were crossed with two sentence-final words. Of the two middle words, one was semantically ambiguous; of the two final words, one disambiguated the ambiguous middle word and the other did not resolve the ambiguity. For example:

The man knew that one more ACE might be enough to win the tennis.

The woman hoped that one more SPRINT might be enough to win the game.

The middle word was either semantically ambiguous (“ace” can be a special serve in a tennis game or a poker card) or not (“sprint” has only one meaning, fast running); the two ending words either resolved the ambiguity of the middle word (“tennis” resolves “ace” to mean the special serve, not the poker card) or not (“game” can refer to either a poker or a tennis game). We chose this set as part of the input stimuli to the model but reduced the sentences to their essential components for simplicity:

One more ACE/SPRINT wins the tennis/game.

The four sentences point to a minimum of two possible contexts, i.e., the nonlinguistic backgrounds where they might be generated: All combinations can result from a “tennis game” context, and the ACE-game combination can additionally result from a “poker game” context. Importantly, in our model, the context is directly related to the interpretation of the word “ace”.

To balance the number of plausible sentences for each context, we added another possible mid-sentence word “joker”, which unambiguously refers to a poker card in the model’s knowledge. We also introduced another possible sentence structure to add syntactic variability within the same contexts:

One more ACE/SPRINT is surprising/enough.

The two syntactic structures correspond to two different types of a sentence: The “win” sentences describe an event, whereas the “is” sentences describe a property of the subject.

We chose a total of two sentence sets from the original design. The other set (shortened version) is:

That TIE/NOISE ruined the game/evening.

In these sentences, the subject “tie” can either mean a piece of cloth to wear around the neck (“neckband” in the model) or equal scores in a game. The ending word “game” resolves it to the latter meaning, whereas “evening” does not disambiguate between the two meanings. Similar to set 1, we added the possibility of property-type sentences. Table 2 lists all possible sentences and their corresponding contexts within the model’s knowledge (ambiguous and resolving words are highlighted).

The input to the model consisted of acoustic spectrograms that were created using the Praat [103] speech synthesizer with a British accent (male speaker 1).

In this work, we do not focus on timing or parsing aspects, but rather on how information is incorporated into the inference process in an incremental manner and how the model’s estimates about a preceding word can be revised upon new evidence during speech processing. Therefore, we chose the syllable as the interface unit between continuous and symbolic representations and fixed the length of the input to simplify the model construction (see details in Generative model). Each sentence consists of four lemma items (single words or two-word phrases), and each lemma consists of three syllables. All syllables were normalized in length by reducing the acoustic signal to 200 samples.

Specifically, in Praat, we first synthesized full words and then separated out syllables using the TextGrid function. A 6-by-200 time-frequency (TF) matrix was created for each unique syllable by averaging its spectro-temporal pattern into 6 log-spaced frequency channels (roughly spanning from 150 Hz to 5 kHz) and 200 time bins in the same fashion as in Hovsepyan and colleagues [26]. Each sentence input to the model was then assembled by concatenating these TF matrices in the appropriate order. Since we fixed the number of syllables in each word (Ns = 3), words consisting of fewer syllables were padded with “silence” syllables, i.e., all-zero matrices. During simulation, input was provided online in that 6-by-1 vectors from the padded TF matrix representing the full sentence were presented to the model one after another, at the rate of 1,000 Hz. In effect, all syllables were normalized to the same duration of 200 ms. The same TF matrices were used for the construction of the generative model as speech templates (see section 2c for details).
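For concreteness, the following Python sketch assembles a padded sentence spectrogram and streams it sample by sample, following the layout described above (6 channels, 3 syllables per lemma, 4 lemmas, 200 ms per syllable, 1,000 Hz presentation). The syllable templates are random placeholders standing in for the Praat-derived TF matrices; only the padding, concatenation, and online presentation logic follow the text.

```python
import numpy as np

N_CHAN, N_BINS = 6, 200            # frequency channels, time bins per syllable
N_SYLL_PER_LEMMA, N_LEMMA = 3, 4   # fixed sentence layout described in the text

# Placeholder TF templates (in the study these come from Praat syntheses).
rng = np.random.default_rng(0)
tf_templates = {
    "/one/": rng.random((N_CHAN, N_BINS)),
    "/more/": rng.random((N_CHAN, N_BINS)),
    "/eis/": rng.random((N_CHAN, N_BINS)),
    "silence": np.zeros((N_CHAN, N_BINS)),   # padding for lemmas with < 3 syllables
}

def pad_lemma(syllables):
    """Pad a lemma's syllable list with 'silence' up to N_SYLL_PER_LEMMA items."""
    padded = list(syllables) + ["silence"] * (N_SYLL_PER_LEMMA - len(syllables))
    return padded[:N_SYLL_PER_LEMMA]

def assemble_sentence(lemmas):
    """Concatenate syllable TF matrices into one 6 x (4 * 3 * 200) sentence matrix."""
    cols = [tf_templates[syl] for syllables in lemmas for syl in pad_lemma(syllables)]
    return np.concatenate(cols, axis=1)

# Example skeleton: "one more ace ..." (the remaining two lemmas are left silent here).
sentence = assemble_sentence([["/one/", "/more/"], ["/eis/"], [], []])
assert sentence.shape[1] == N_LEMMA * N_SYLL_PER_LEMMA * N_BINS

# Online presentation at 1,000 Hz: one 6 x 1 spectral vector per millisecond,
# so every syllable lasts 200 ms.
stream = (sentence[:, [t]] for t in range(sentence.shape[1]))
first_vector = next(stream)                  # shape (6, 1), the model's input at t = 0
print(sentence.shape, first_vector.shape)    # (6, 2400) (6, 1)
```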

2. Generative model

The generative model goes from a nonlinguistic, abstract representation of a message defined in terms of semantic roles to a linearized linguistic sentence and its corresponding sound spectrogram. The main idea of the model is that listeners have knowledge about the world that explains how an utterance may be generated to express a message from a speaker.

In this miniature world, the modeled listener knows about a number of contexts, the scenarios under which a message is generated (to distinguish them from the names given to representation levels in the model, we will use italics to refer to factors at each level; see below). Each message can either be of an “event” type that describes an action within the context, or of a “property” type that expresses a characteristic of an entity that exists in the context. Context and type are nonlinguistic representations maintained throughout the message but make contact with linguistic entities via semantics and syntax, which jointly determine an ordered sequence of lemmas that then generates the acoustic signal of an utterance that evolves over time.

As in the real world, connections from context to semantics and semantics to lemma are not one-to-one, and ambiguity arises, for example, when two semantic items can be expressed as the same lemma. In this case, the model can output exactly the same utterance for two different messages. When the model encounters such an ambiguous sentence during inference, it will make its best guess based on its knowledge when ambiguity is present (see Model inversion). For illustrative purposes, we only consider a minimum number of alternatives, sufficient to create ambiguity, e.g., the word “ace” only has two possible meanings in the model. Also, while the model generates a finite set of possible sentences, they are obtained in a compositional fashion; they are not spelled out explicitly anywhere in the model and must be incrementally constructed according to the listener’s knowledge.

Specifically, the generative model (Fig 1A) is organized in three hierarchically related submodels that differ in their temporal organization, with each submodel providing empirical priors to the subordinate submodel, which then evolves in time according to its discrete or continuous dynamics for a fixed duration (as detailed below). Overall, this organization results in six hierarchically related levels of information carried by a speech utterance; from high to low (L1–L6), we refer to them as context, semantics and syntax, lemma, syllable, acoustic, and the continuous signal represented by TF patterns, which stands for the speech output signal.

Each level in the model consists of one or more factors representing the quantities of interest (e.g., context, lemma, syllable …), illustrated as rectangles in Fig 1A. We use the term “states” or hidden states to refer to the values that a factor can take (e.g., in the model the factor context can be in one of four states: {‘poker game’, ‘tennis game’, ‘night party’, ‘racing game’}). For a complete list of factors and their possible states from the context to the lemma level, see Table 1.

As an example, to generate a sentence to describe an event under a “tennis game” context, the model picks “tennis serve” as the agent, “tennis game” as the patient, and “win” as their relationship. When the syntactic rule indicates that the current semantic role to be expressed should be the agent, the model selects the lemma “ace”, which is then sequentially decomposed into three syllables /eis/, /silence/, /silence/. Each syllable corresponds to eight 6-by-1 spectral vectors that are deployed in time over a period of 25 ms each. The generative model therefore generates the output of continuous TF patterns as a sequence of “chunks” of 25 ms.
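The following sketch walks through one such generative pass in code, from a chosen context and type down to lemmas, syllables, and 25-ms chunks. The knowledge tables (semantics_for, lemma_for, syllables_for) are hypothetical miniatures of the model’s knowledge (the full versions are in Table 2 and the S1 and S2 Appendices); only the hierarchical order of the steps follows the description above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fragments of the model's knowledge (not the paper's full tables).
semantics_for = {
    ("tennis game", "event"): {"agent": "tennis serve", "relation": "win",
                               "patient": "tennis game"},
}
lemma_for = {"tennis serve": "ace", "win": "wins", "tennis game": "the tennis"}
syllables_for = {"one more": ["/one/", "/more/", "sil"],
                 "that": ["/that/", "sil", "sil"],
                 "ace": ["/eis/", "sil", "sil"],
                 "wins": ["/wins/", "sil", "sil"],
                 "the tennis": ["/the/", "/ten/", "/nis/"]}

def generate_sentence(context="tennis game", sent_type="event"):
    roles = semantics_for[(context, sent_type)]
    # Linearization: attribute first, then SVO for an 'event' (SVadj for a 'property').
    attribute = str(rng.choice(["one more", "that"]))
    lemmas = [attribute,
              lemma_for[roles["agent"]],
              lemma_for[roles["relation"]],
              lemma_for[roles["patient"]]]
    # Each lemma -> 3 syllables, each syllable -> 8 spectral windows of 25 ms.
    chunks = [(lemma, syl, window)
              for lemma in lemmas
              for syl in syllables_for[lemma]
              for window in range(8)]        # placeholders for 6 x 1 spectral vectors
    return lemmas, chunks

lemmas, chunks = generate_sentence()
print(lemmas)        # e.g. ['one more', 'ace', 'wins', 'the tennis']
print(len(chunks))   # 4 lemmas * 3 syllables * 8 windows = 96 chunks of 25 ms
```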

We next describe in detail the three submodels:

a. Discrete nonnested: context to lemma via semantic (dependency) and syntax (linearization)

The context level consists of two independent factors: the context c and the sentence type Ty. Together, they determine the probability distribution of four semantic roles: the agent sA, the relation sR, the patient sP, and the modifier sM. An important assumption of the model is that the states of context, type, and semantic roles are maintained throughout the sentence, as if they had memory. These semantic roles generate a sequence of lemmas in the subordinate level, whose order is determined by the syntax, itself determined by the sentence type. This generative model for the first to the nth lemma is ($\vec{s}$ denotes the collection of all semantic factors {sA, sR, sP, sM}):

$$p(w_{1:n}, \mathit{syn}_{1:n}, \vec{s}, Ty, c) = p(c)\, p(Ty)\, p(\vec{s} \mid c, Ty) \prod_{i=1}^{n} p(w_i \mid \vec{s}, \mathit{syn}_i)\, p(\mathit{syn}_i \mid Ty) \tag{1}$$

Here, p(c) is the prior distribution for the context. The prior probability for the sentence type p(Ty) was fixed to be equal between “property” and “event”.

The terms on the right-hand side of Eq 1 can be further expanded as: (2) (3)

When Ty = ‘event’, the sentence consists of an agent, a patient, a relation between the agent and the patient, and a null (empty) modifier. When Ty = ‘property’, the sentence consists of an agent, a modifier that describes the agent, a relation that links the agent and the modifier, and a null patient.
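As a minimal illustration of this expansion, the sketch below shows how the sentence type gates the semantic roles: the modifier is forced to be null for an “event” and the patient is forced to be null for a “property”. The state spaces and uniform probabilities are hypothetical placeholders; the model’s actual distributions (which also depend on the context) are defined in S1 Appendix.

```python
# Hypothetical state spaces; the model's actual lists are in Table 1 / S1 Appendix.
AGENTS    = ["tennis serve", "poker card", "fast run"]
RELATIONS = ["win", "be"]
PATIENTS  = ["tennis game", "poker game"]
MODIFIERS = ["surprising", "enough"]

def uniform(names):
    return {n: 1.0 / len(names) for n in names}

def role_priors(context, sent_type):
    """Return p(agent), p(relation), p(patient), p(modifier) given context and type.
    In the full model the context further constrains these distributions; it is
    ignored here to keep the example minimal."""
    p_agent, p_rel = uniform(AGENTS), uniform(RELATIONS)
    if sent_type == "event":
        p_pat, p_mod = uniform(PATIENTS), {"null": 1.0}    # modifier is empty
    else:                                                  # "property"
        p_pat, p_mod = {"null": 1.0}, uniform(MODIFIERS)   # patient is empty
    return p_agent, p_rel, p_pat, p_mod

print(role_priors("tennis game", "event")[3])     # {'null': 1.0}
print(role_priors("tennis game", "property")[2])  # {'null': 1.0}
```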

To translate the static context, type, and semantic states into ordered lemma sequences, we constructed a minimal (linear) syntax model consistent with English grammar. We constrain all possible sentences to have four syntactic elements syn1–syn4; their possible values are {‘attribute’, ‘subject’, ‘verb’, ‘object’, ‘adjective’}. The probability of syn_n depends solely on Ty.

The syntactic element syn_i is active during the ith epoch, and each possible value of the syntax (except ‘attribute’, which directly translates to a lemma item randomly chosen from {‘one more’, ‘that’}) corresponds to one semantic factor (the correspondence between syntactic values and semantic factors is):

Subject—agent; Verb—relation; Object—patient; Adjective—modifier.

Thus, sentences of the “event” type are always expressed in the form subject-verb-object (SVO), and those of the “property” type in the form subject-verb-adjective (SVadj). In the ith lemma epoch, the model picks the current semantic factor via the value of syn_i and finds a lemma to express the value (state) of this semantic factor, using its internal knowledge of the mapping from abstract, nonlinguistic concepts to lexical items (summarized in the form of a dictionary in S2 Appendix, Table 1). Note that the same meaning can be expressed by more than one possible lemma, and several different meanings can result in the same lemma, causing ambiguity. The mapping from L2 to L3 can be defined separately for each lemma as follows (a code sketch of this mapping is given after the list):

The first lemma (w1, the attribute) does not depend on semantics or syntax, and the model generates “one more” or “that” with equal probability (p = 0.5).

w2 and w3 are selected according to the agent and patient values, respectively, which are themselves constrained by the context.

w4 can be either a patient or a modifier, depending on Ty.
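The sketch referenced above illustrates the L2-to-L3 mapping: the sentence type fixes the syntactic sequence (SVO or SVadj), each syntactic value selects one semantic factor, and a dictionary maps that factor’s state to a lemma. The dictionary is a hypothetical fragment chosen so that two meanings (‘tennis serve’ and ‘poker card’) share the lemma ‘ace’, reproducing the kind of ambiguity discussed in the text; the model’s actual dictionary is in S2 Appendix.

```python
import random

rng = random.Random(0)

SYNTAX_FOR_TYPE = {"event":    ["attribute", "subject", "verb", "object"],
                   "property": ["attribute", "subject", "verb", "adjective"]}
SYN_TO_ROLE = {"subject": "agent", "verb": "relation",
               "object": "patient", "adjective": "modifier"}

# Hypothetical fragment of the meaning-to-lemma dictionary. Two different meanings,
# 'tennis serve' and 'poker card', share the lemma 'ace', which is what creates the
# ambiguity discussed in the text.
LEMMA_FOR_MEANING = {"tennis serve": "ace", "poker card": "ace",
                     "fast run": "sprint", "win": "wins",
                     "tennis game": "the tennis", "surprising": "surprising"}

def linearize(sent_type, roles):
    """Turn semantic role states into an ordered lemma sequence via the syntax."""
    lemmas = []
    for syn in SYNTAX_FOR_TYPE[sent_type]:
        if syn == "attribute":                       # not tied to any semantic role
            lemmas.append(rng.choice(["one more", "that"]))
        else:
            lemmas.append(LEMMA_FOR_MEANING[roles[SYN_TO_ROLE[syn]]])
    return lemmas

roles = {"agent": "tennis serve", "relation": "win", "patient": "tennis game"}
print(linearize("event", roles))   # e.g. ['one more', 'ace', 'wins', 'the tennis']
```

Running linearize with the agent set to ‘poker card’ instead would produce the same lemma sequence in this toy dictionary, which is exactly the ambiguity the listener later has to resolve.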

Prior probabilities of context and type, as well as probabilistic mappings between levels (Eqs 2–4), are all defined in the form of multidimensional arrays. Detailed expressions and default values can be found in S1 Appendix.

b. Discrete nested: lemma to spectral

Over time, factors periodically make probabilistic transitions between states (not necessarily different ones). Different model levels are connected in that, during the generative process, discrete hidden (true) states of factors in a superordinate level (Ln) determine the initial state of one or more factors in the subordinate level (Ln+1). The Ln+1 factors then make a fixed number of state transitions. When the Ln+1 sequence is finished, Ln makes one state transition and initiates a new sequence at Ln+1. State transitioning of different factors within the same level occurs at the same rate. We refer to the time between two transitions within each level as one epoch of the level. Thus, model hierarchies are temporally organized in that lower levels evolve at higher rates and are nested within their superordinate levels.

The formal definition of the discrete generative model is shown in Eq 4, where the joint probability distribution of the mth outcome modality (here generally denoted by $o^m$, specified in the following sections) and the hidden states (generally denoted by $s^n$) of the nth factor up to a time point τ is determined by the priors over hidden states at the initial epoch $P(s^n_1)$, the likelihood mapping from states to outcomes $P(o^m_t \mid s_t)$ over time 1:τ, and the transition probabilities between hidden states at two consecutive time points $P(s^n_t \mid s^n_{t-1})$ up to t = τ:

$$P\big(o^m_{1:\tau}, s^n_{1:\tau}\big) = P\big(s^n_1\big) \prod_{t=1}^{\tau} P\big(o^m_t \mid s_t\big) \prod_{t=2}^{\tau} P\big(s^n_t \mid s^n_{t-1}\big) \tag{4}$$
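To make Eq 4 concrete, the sketch below evaluates the joint probability of a short state and outcome sequence for a single hidden factor and a single outcome modality, given likelihood (A), transition (B), and initial-prior (D) arrays. The toy matrices are arbitrary and are not the model’s parameters.

```python
import numpy as np

# Arbitrary toy specification: 2 hidden states, 3 possible outcomes.
A = np.array([[0.8, 0.1],      # p(o = i | s = j); columns sum to 1
              [0.1, 0.8],
              [0.1, 0.1]])
B = np.array([[0.9, 0.2],      # p(s_t = i | s_{t-1} = j)
              [0.1, 0.8]])
D = np.array([0.5, 0.5])       # p(s_1)

def joint_prob(states, outcomes):
    """P(o_{1:T}, s_{1:T}) = P(s_1) * prod_t P(o_t | s_t) * prod_{t>1} P(s_t | s_{t-1})."""
    p = D[states[0]]
    for t, (s, o) in enumerate(zip(states, outcomes)):
        p *= A[o, s]
        if t > 0:
            p *= B[s, states[t - 1]]
    return p

print(joint_prob(states=[0, 0, 1], outcomes=[0, 0, 1]))
```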

For lower discrete levels, representational units unfold linearly in time, and a sequence of subordinate units can be entirely embedded within the duration of one superordinate epoch. Therefore, the corresponding models are implemented in a uniform way: The hidden state consists of a “what” factor that indicates the value of the representation unit (e.g., the lemma ‘the tennis’) and a “where” factor that points to the location of the outcome (syllable) within the “what” state (e.g., the second location of ‘tennis’ generates syllable ‘/nis/’). During one epoch at each level (e.g., the entire duration of the lemma “the tennis”), the value of the “what” factor remains unchanged with its transition probabilities set to the unit matrix. The “where” factor transitions from 1 to the length of the “what” factor, which is the number of its subordinate units during one epoch (three syllables per lemma). Together, the “what” and “where” states at the lemma level generate a sequence of syllables by determining the prior for “what” and “where” states in each syllable. In the same fashion, each syllable determines the prior for each spectral vector. Thus, the syllable level goes through 8 epochs, and for each epoch, the output of the syllable level corresponds to a spectral vector of dimension (1 × 6, number of frequency channels). This single vector determines the prior for the continuous submodel.
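The “what”/“where” bookkeeping can be sketched as follows: the “what” factor keeps its state across the epoch (identity transition matrix) while the “where” factor steps through positions (an off-diagonal shift matrix), and each (what, where) pair deterministically selects the subordinate unit. The lemma and syllable tables are hypothetical fragments; in the model the same scheme is reused between the syllable and spectral levels (with 8 positions).

```python
import numpy as np

N_POS = 3                                     # three syllable positions per lemma

# Transition matrices for one lemma epoch.
B_what  = np.eye(4)                           # 'what' (lemma identity) never changes
B_where = np.roll(np.eye(N_POS), -1, axis=1)  # 'where' moves position 1 -> 2 -> 3

LEMMAS = ["one more", "ace", "wins", "the tennis"]
SYLLABLES = {"one more": ["/one/", "/more/", "sil"],
             "ace": ["/eis/", "sil", "sil"],
             "wins": ["/wins/", "sil", "sil"],
             "the tennis": ["/the/", "/ten/", "/nis/"]}

def unfold_lemma(lemma_index):
    """Generate the syllable sequence for one lemma epoch."""
    what = np.eye(4)[lemma_index]             # one-hot 'what' state
    where = np.eye(N_POS)[0]                  # 'where' always starts at position 1
    out = []
    for _ in range(N_POS):
        lemma = LEMMAS[int(what.argmax())]
        out.append(SYLLABLES[lemma][int(where.argmax())])
        what, where = B_what @ what, B_where @ where   # identity / shift transitions
    return out

print(unfold_lemma(3))    # ['/the/', '/ten/', '/nis/']
```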

Such temporal hierarchy is roughly represented in Fig 1B (downward arrows).

Unlike the L1 and L2 states that are maintained throughout the sentence, states of the lemma level and below are “memoryless”, in that they are generated anew by superordinate states at the beginning of each epoch. This allows us to simplify the model inversion (see next section) using a well-established framework that exploits the variational Bayes algorithm for model inversion [70]. The DEM framework of Friston and colleagues [70] consists of two parts: hidden state estimation and action selection. In our model, the listener does not perform any overt action (the state estimates do not affect state transitioning); therefore, the action selection part is omitted.

Using the notation of Eq 4, the parameters of the generative model are defined in the form of multidimensional arrays:

Probabilistic mapping from hidden states to outcomes:

$$A^m_{ij} = P\big(o^m_t = i \mid s_t = j\big) \tag{5}$$

Probabilistic transitions among hidden states:

$$B^n_{ij} = P\big(s^n_t = i \mid s^n_{t-1} = j\big) \tag{6}$$

Prior beliefs about the initial hidden states:

$$D^n_i = P\big(s^n_1 = i\big) \tag{7}$$

For each level, we define A, B, D matrices according to the above description of hierarchical “what” and “where” factors:

Probability mappings (matrix A) from a superordinate “what” state to a subordinate “what” state are deterministic, e.g., p(sylb = ‘/one/’ | lemma = ‘one more’, where = 1) = 1, and no mapping is needed for “where” states;

Transition matrices (B) for “what” factors are all identity matrices, indicating that the hidden state does not change within single epochs of the superordinate level; transition matrices for “where” factors are off-diagonal identity matrices, allowing the transition from one position to the next;

Initial states (D) for “what” factors are set by the superordinate level; “where” factors always start at position 1.

c. Continuous: acoustic to output

The addition of an acoustic level between the syllable and the continuous levels is based on a recent biophysically plausible model of syllable recognition, Precoss [26]. In that model, syllables were encoded with continuous variables and represented, as is the case here, by an ordered sequence of 8 spectral vectors (each vector having 6 components corresponding to 6 frequency channels). In the current model, we only implemented the bottom level of the Precoss model (see also [28]), which deploys spectral vectors into continuous temporal patterns. Specifically, the outcome of the syllable level sets the prior over the hidden cause, a spectral vector I that drives the continuous model. It represents a chunk of the TF pattern determined by the “what” and “where” states of the syllable level, sω and sγ, respectively: (8) (9)

The noise term ε′ is a random Gaussian fluctuation. TF_ωγ stands for the average of the 6 × 200 TF matrix of syllable ω in the γth 25-ms window. G and W are 6 × 6 connectivity matrices that ensure that the spectral vector I determines a global attractor of the Hopfield network that sets the dynamics of the 6 frequency channels. The values of G, W, and the scalar rate constant κ in Eqs 9 and 10 are the same as in Precoss: (10)

The continuous state x determines the final output of the generative model, v, which is compared to the speech input during model inversion. Like x, v is a 6 × 1 vector: (11)
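Since the exact forms of Eqs 8–11 are given in the paper (and in Precoss), the sketch below only conveys the general idea under an assumed linear attractor dynamics of the form dx/dt = κ(G·I − W·x) plus Gaussian fluctuations, with the output v read out noisily from x. The matrices, rate constant, and noise levels are arbitrary choices for illustration; only the role of the spectral vector I as the attractor of the 6 frequency channels follows the description.

```python
import numpy as np

rng = np.random.default_rng(2)
kappa = 0.2                      # arbitrary rate constant (per ms) for this sketch
G = np.eye(6)                    # arbitrary 6 x 6 connectivity matrices (stand-ins)
W = np.eye(6)

def run_chunk(I, x0, dt=1.0, steps=25, noise=0.01):
    """Integrate the assumed dynamics dx/dt = kappa * (G @ I - W @ x) for one
    25-ms chunk; the spectral vector I acts as the attractor of the 6 channels."""
    x = x0.copy()
    v_trace = []
    for _ in range(steps):
        dx = kappa * (G @ I - W @ x) + noise * rng.standard_normal(6)
        x = x + dt * dx
        v_trace.append(x + noise * rng.standard_normal(6))   # noisy output v ~ x
    return x, np.array(v_trace)

I = rng.random(6)                # one spectral vector (prior from the syllable level)
x_end, v = run_chunk(I, x0=np.zeros(6))
print(np.round(x_end - I, 2))    # x has relaxed toward the attractor set by I
```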

The precision of the output signal depends on the magnitude of the random fluctuations in the model (ε in Eqs 8, 10, and 11). During model inversion, the discrepancy between the input and the prediction of the generative model, i.e., the prediction error, is weighted by the corresponding precision and used to update model estimates in generalized coordinates [41]. We manipulated the precisions for the continuous state x and the activities of the frequency channels v to simulate anything from an intact (HP) to an impaired (LP) periphery. The precision of the top-down priors from the syllable level, PI, was kept high for all simulations (see Table 1 for the values used in different conditions).

The continuous generative model and its inversion were implemented using the ADEM routine in the SPM12 software package [104], which integrates a generative process of action. Because we focus on passive listening rather than interacting with the external world, this generative process was set to be identical to the generative model, without an action variable. Precisions for the generative process were the same for all simulations (Table 4).

3. Model inversion

The goal of the modeled listener is to estimate the posterior probabilities of all hidden states given the observed evidence, p(s|o), where the observation is the speech input to the model, here represented by TF patterns sampled at 1,000 Hz. This is achieved by inverting the above generative model using the variational Bayesian approximation under the principle of minimizing free energy [105]. Although this same computational principle is applied throughout all model hierarchies, the implementation is divided into three parts corresponding to the division of the generative model. Because the three “submodels” are hierarchically related, we follow and adapt the approach proposed in [70], which shows how to invert models with hierarchically related components through Bayesian model averaging.

Overall, the scheme results in a nested estimation process (Fig 1B). For a discrete-state level Ln, probability distributions over possible states within each factor are estimated at discrete times over multiple inference epochs. Each epoch at level Ln starts as the estimated Ln states generate predictions for initial states in the subordinate level Ln+1 and ends after a fixed number of state transitions (epochs) at Ln+1. State estimates for Ln are then updated using the discrepancy between the predicted and observed Ln+1 states. The Ln factors make transitions into the next epoch immediately following the update, and the same process is repeated with the updated estimates. Different model hierarchies (from L2 on) are nested in that the observed Ln+1 states are themselves state estimates integrating information from Ln+2 through the same alternating prediction–update scheme, but at a faster timescale. A schematic of this hierarchical prediction–update process is illustrated in Fig 1B.
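The alternation of top-down prediction and bottom-up update across timescales can be sketched as a nested loop: each level predicts the initial states of the level below, lets it run for its fixed number of epochs, and then updates its own estimates from what comes back. The sketch below captures only this scheduling (with the model’s epoch counts of 4 lemmas, 3 syllables, and 8 spectral chunks); the actual belief updates are the variational ones described in the following subsections.

```python
# Epochs per level, from lemma down to spectral chunks (as in the model).
EPOCHS = {"lemma": 4, "syllable": 3, "spectral": 8}
ORDER = ["lemma", "syllable", "spectral"]

def run_level(level, beliefs, depth=0):
    """Alternate top-down prediction and bottom-up update at one level."""
    log = []
    for epoch in range(EPOCHS[level]):
        log.append(f"{'  ' * depth}{level} epoch {epoch + 1}: predict downward")
        if depth + 1 < len(ORDER):                     # descend to the next level
            log += run_level(ORDER[depth + 1], beliefs, depth + 1)
        log.append(f"{'  ' * depth}{level} epoch {epoch + 1}: update from below")
        beliefs[level] = beliefs.get(level, 0) + 1     # placeholder belief update
    return log

trace = run_level("lemma", beliefs={})
print(len(trace))   # 2*4 + 4*(2*3 + 3*2*8) = 224 prediction/update events
```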

Since the levels from “lemma” down to the continuous acoustic output conform to the class of generative models considered in [70], we use their derived gradient descent equations and implementation. The levels “context” and “semantics and syntax” do not conform to the same class of discrete models (due to their memory component and nonnested temporal characteristics); we therefore derived the corresponding gradient descent equations based on free energy minimization for our specific model of the top two levels (Eqs 2–4; see S3 Appendix for the derivation) and incorporated them into the general framework of DEM [70].

The variational Bayes approximation for each of the three submodels is detailed below.

a. Lemma to context

For all discrete-state levels, the free energy F is generally defined as [105]:

$$F = \mathbb{E}_{Q(s)}\big[\ln Q(s) - \ln P(o \mid s) - \ln P(s)\big] \tag{12}$$

$$F = D_{\mathrm{KL}}\big[Q(s) \,\|\, P(s \mid o)\big] - \ln P(o) \tag{13}$$

In Eqs 12 and 13, Q(s) denotes the estimated posterior probability of hidden state s, P(o|s) the likelihood mapping defined in the generative model, and P(s) the prior probability of s. The variational equations to find the Q(s) that minimizes free energy can be solved via gradient descent. We limit the number of gradient descent iterations to 16 in each update to reflect the time constraint in neuronal processes.
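As an illustration, the sketch below minimizes the free energy of Eqs 12 and 13 for a single categorical hidden state by iteratively accumulating the state prediction error on the variational log-probabilities, capped at 16 iterations as in the model. The prior, likelihood, and step size are toy values, and the softmax parameterization is an implementation choice for this sketch rather than a claim about the paper’s exact update equations.

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    return np.exp(v) / np.exp(v).sum()

# Toy problem: 3 hidden states, a prior P(s), and the likelihood of the observed o.
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
log_like  = np.log(np.array([0.1, 0.7, 0.2]))   # ln P(o | s) for the observed o

def free_energy(q):
    """F = E_q[ln q(s) - ln P(o | s) - ln P(s)]  (Eq 12)."""
    return float(np.sum(q * (np.log(q + 1e-16) - log_like - log_prior)))

v = np.zeros(3)                        # unnormalized log Q(s)
for _ in range(16):                    # iteration cap used in the model
    q = softmax(v)
    eps = log_prior + log_like - np.log(q + 1e-16)   # state prediction error
    v = v + 0.5 * eps                  # descends the free energy F
q = softmax(v)

print(round(free_energy(q), 4))                    # close to -ln P(o)
print(np.round(q, 3))                              # approximate posterior Q(s)
print(np.round(softmax(log_prior + log_like), 3))  # exact posterior, for comparison
```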

Although context/type and semantic/syntax are modeled as two hierarchies, we assign them the same temporal scheme for the prediction–update process at the rate of lemma units, i.e., they both generate top-down predictions prior to each new lemma input and fulfill bottom-up updates at each lemma offset. Therefore, it is convenient to define their inference process in conjunction.

The posterior distribution over the L1 and L2 states is approximated by a factorized (mean-field) distribution and is parameterized as follows:

Here, the model observation is the probability of the word being wτ given the observed outcome oτ, p(wτ | oτ), which is gathered from the lower-level models described in the next sections. We denote p(wτ | oτ) by a vector W^i_τ, where τ indexes the epoch and i indexes the word in the dictionary. At the beginning of the sentence, the model predicts the first lemma input, which is, by definition, one of the two possible attributes, ‘one more’ or ‘that’:

$$P(w_1 = i) = \begin{cases} 0.5, & i \in \{\text{‘one more’}, \text{‘that’}\} \\ 0, & \text{otherwise} \end{cases} \tag{14}$$

The lower levels then calculate p(w1 | o1) and provide an updated W^i_1 that incorporates the observation made from the first lemma. This is passed to the top levels to update the L1 and L2 states. Following this update, the next epoch is initiated with the prediction for w2. Because w2 does not directly depend on the lemma inputs before and after itself, we can derive the following informed prediction of w2 from Eq 2, where the priors over the L1 and L2 factors are replaced by their updated posterior estimates: (15)

During the second epoch, the model receives the input of the second lemma and updates the estimate of W^i_2. The updated W^i_2 is then exploited to update the L1 and L2 states, which, in turn, provide the prediction for w3. The process is repeated until the end of the sentence.
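A simplified sketch of this prediction–update cycle at the top levels: before each lemma the model predicts the lemma distribution by marginalizing its current beliefs, and after the lemma it updates Q(context) from the word-likelihood vector W passed up from the lower levels. The context set, word set, and conditional table are hypothetical miniatures, the update shown is exact Bayes rather than the free-energy scheme of the paper, and the semantic and type factors are omitted for brevity.

```python
import numpy as np

CONTEXTS = ["poker game", "tennis game"]
WORDS = ["one more", "ace", "sprint", "wins", "the tennis", "the game"]

# Hypothetical p(word at epoch 2 | context): 'ace' is compatible with both contexts,
# 'sprint' only with the tennis context here, mirroring the ambiguity in the text.
P_W2_GIVEN_C = np.array([
    # one more  ace   sprint  wins  the tennis  the game
    [0.0,       0.9,  0.1,    0.0,  0.0,        0.0],   # poker game
    [0.0,       0.5,  0.5,    0.0,  0.0,        0.0],   # tennis game
])

def predict_w2(q_context):
    """Top-down prediction: p(w2) = sum_c p(w2 | c) Q(c)."""
    return q_context @ P_W2_GIVEN_C

def update_context(q_context, W):
    """Bottom-up update from the word-likelihood vector W (one entry per word)."""
    posterior = q_context * (P_W2_GIVEN_C @ W)
    return posterior / posterior.sum()

q_c = np.array([0.5, 0.5])                    # prior beliefs over contexts
print(np.round(predict_w2(q_c), 2))           # prediction before hearing w2

W_obs = np.zeros(len(WORDS)); W_obs[WORDS.index("ace")] = 1.0
q_c = update_context(q_c, W_obs)              # hearing 'ace' favours 'poker game' here
print(np.round(q_c, 2))                       # [0.64, 0.36] with these toy numbers
```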

The updating of the L1 and L2 states, i.e., the estimation of their posterior probabilities after receiving the nth lemma input, relies on the minimization of the total free energy F1,2 of the two levels (L1, L2): (16)

The expanded expression of F 1,2 and derivation of the gradient descent equations can be found in S3 Appendix.

b. Spectral to lemma

The memoryless property of the lower-level (lemma and below) states implies that the observation from the previous epoch does not directly affect the prediction for the new epoch, only indirectly through the evidence accumulated at superordinate levels. The framework from Friston and colleagues [70] is suitable for such a construction. It uses the same free energy minimization algorithm for posterior estimation (inserting Eqs 5–7 into Eqs 12 and 13), but this time there is conditional independence between factors at the same level. We implemented this part of the model by adapting the variational Bayesian routine in the DEM toolbox from the SPM12 software package.

c. Continuous to spectral

To enable the information exchange between the continuous and higher discrete levels that was not accounted for in [26], we implemented the inversion of the spectral-to-continuous generative model using the “mixed model” framework in [70]. Essentially, the dynamics of spectral fluctuation determined by each spectral vector I (Eq 8) is treated as a separate model of continuous trajectories, and the posterior estimation of I constitutes a post hoc model comparison that minimizes free energy in the continuous format. For a specific model m represented by the spectral vector I_m, the free energy F(t)_m can be computed as (adapted from [70]): (17) (18)
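The gist of this mixed-model step can be sketched as follows: each candidate spectral vector I_m is treated as a separate model of the incoming 25-ms chunk, a free energy is accumulated for it over the chunk, and the posterior over candidates is a softmax of the negative accumulated free energies. The quadratic, precision-weighted error used below is a simplification standing in for Eqs 17 and 18; it conveys the model-comparison logic rather than the exact expressions.

```python
import numpy as np

rng = np.random.default_rng(3)

def accumulated_free_energy(candidate_I, observed_chunk, precision=8.0):
    """Accumulate a precision-weighted squared prediction error over one chunk
    (a simplified stand-in for the free energy of Eqs 17 and 18)."""
    errors = observed_chunk - candidate_I[None, :]        # steps x 6 channels
    return 0.5 * precision * np.sum(errors ** 2)

# Three candidate spectral vectors (e.g., predictions for different syllables).
candidates = [rng.random(6) for _ in range(3)]

# Simulated 25-step input chunk generated from candidate 1 plus noise.
observed = candidates[1][None, :] + 0.05 * rng.standard_normal((25, 6))

F = np.array([accumulated_free_energy(I, observed) for I in candidates])
posterior = np.exp(-(F - F.min()))
posterior /= posterior.sum()
print(np.round(posterior, 3))     # mass concentrates on the generating candidate
```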

[END]
---
[1] Url: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3002046

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.
