(C) PLOS One. This unaltered content originally appeared in journals.plosone.org.
Licensed under a Creative Commons Attribution (CC BY) license.
URL: https://journals.plos.org/plosone/s/licenses-and-copyright
------------
Understanding degraded speech leads to perceptual gating of a brainstem reflex in human listeners
Authors: Heivet Hernández-Pérez (Department of Linguistics, The Australian Hearing Hub, Macquarie University, Sydney), Jason Mikiel-Hunter, David McAlpine, and Sumitrajit Dhar (Department of Communication Sciences and Disorders)
Date: 2021-11
The ability to navigate “cocktail party” situations by focusing on sounds of interest over irrelevant, background sounds is often considered in terms of cortical mechanisms. However, subcortical circuits such as the pathway underlying the medial olivocochlear (MOC) reflex modulate the activity of the inner ear itself, supporting the extraction of salient features from the auditory scene prior to any cortical processing. To understand the contribution of auditory subcortical nuclei and the cochlea to complex listening tasks, we made physiological recordings along the auditory pathway while listeners engaged in detecting nonsense words (non-words) in lists of words. Both naturally spoken and intrinsically noisy, noise-vocoded speech—filtering that mimics processing by a cochlear implant (CI)—significantly activated the MOC reflex, but this was not the case for speech in background noise, which instead engaged midbrain and cortical resources more strongly. A model of the initial stages of auditory processing reproduced specific effects of each form of speech degradation, providing a rationale for goal-directed gating of the MOC reflex based on enhancing the representation of the energy envelope of the acoustic waveform. Our data reveal the coexistence of 2 strategies in the auditory system that may facilitate speech understanding in situations where the signal is either intrinsically degraded or masked by extrinsic acoustic energy. Whereas intrinsically degraded streams recruit the MOC reflex to improve the peripheral representation of speech cues, extrinsically masked streams rely more on higher auditory centres to denoise signals.
Funding: H.H.P. was supported in this study by an International Macquarie University Excellence Scholarship ( https://www.mq.edu.au/research/phd-and-research-degrees/scholarships/scholarship-search/data/international-hdr-main-scholarship-round ) and the HEARing Cooperative Research Centre ( https://www.hearingcrc.org/ ). J.M.H. was supported in this study by an Australian Research Council Laureate Fellowship (FL160100108) awarded to D.M. ( https://www.arc.gov.au/grants/discovery-program/australian-laureate-fellowships ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We found that when task difficulty was maintained across speech manipulations, measures of hearing function at the level of the cochlea, brainstem, midbrain, and cortex were modulated depending on the type of degradation applied to speech sounds and on whether speech was actively attended. Specifically, the MOC reflex, assessed in terms of the suppression of click-evoked otoacoustic emissions (CEOAEs), was activated by noise-vocoded speech—an intrinsically degraded speech signal—but not by otherwise “natural” speech presented in either babble noise (BN; i.e., noise consisting of 8 talkers) or speech-shaped noise (SSN; i.e., noise sharing the long-term average spectrum of our speech corpus). Further, neural activity in the auditory midbrain was significantly increased in active versus passive listening for speech in BN and SSN, but not for noise-vocoded speech. This increase was associated with elevated cortical markers of listening effort for the speech-in-noise conditions. A model of the peripheral auditory system and its processes, including the MOC reflex, confirmed the stimulus-dependent role of the MOC reflex in enhancing neural coding of the speech envelope—a feature strongly correlated with the decoding and understanding of speech [ 47 – 51 ]. Our data suggest that otherwise identical performance in active listening tasks may invoke quite different efferent circuits, requiring not only different levels, but also different kinds, of listening effort, depending on the type of stimulus degradation experienced.
Physiological recordings in the central auditory pathway, including brainstem, midbrain, and cortical responses, were made while listeners performed an active listening task (detecting non-words in a string of Australian-English words and non-words). Importantly, our experimental paradigm was designed to maintain fixed levels of task difficulty. This allowed us to preserve comparable task relevance across different speech manipulations and avoid confounding effects of task difficulty on attention-gated activation of the MOC reflex. Additionally, visual and auditory scenes were identical across conditions to control for differences in alertness between active and passive listening.
Here, we sought to determine whether cochlear gain is modulated in a task-dependent manner by selective recruitment of the MOC reflex. If the MOC reflex is sensitive to goal-directed control and improves understanding of degraded speech, then increases in task difficulty should be accompanied by reduced cochlear gain. We therefore assessed the extent to which the MOC reflex controls cochlear gain in active versus passive listening, i.e., when participants were required to attend to speech stimuli in order to complete a listening task compared to when they were not required to attend and instead watched a silent, non-subtitled film. In order to manipulate task difficulty, we employed speech sounds in background noise—stimuli traditionally used to evoke the MOC reflex [ 39 – 41 ]—and noise vocoding of “natural,” clean speech—filtering that mimics processing by a cochlear implant (CI) [ 42 ]. Unlike speech in noise, noise-vocoded speech allows for manipulation of intelligibility without the addition of spectrally broadband acoustic energy that intrinsically activates the MOC reflex [ 43 – 46 ]. Noise-vocoded speech should therefore enable better detection of any modulation of the MOC reflex by perceptual gating as task difficulty increases.
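To make the noise-vocoding manipulation concrete, the sketch below shows a minimal channel vocoder of the kind typically used in such studies (band-pass filter bank, envelope extraction, noise carrier). It is illustrative only, not the authors' implementation; the channel count, band edges, filter orders, and envelope cutoff are assumptions.

```python
# Minimal noise-vocoder sketch (illustrative; not the study's exact code).
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(x, fs, n_channels=8, f_lo=100.0, f_hi=8000.0, env_cutoff=300.0):
    """Replace the fine structure in each band with noise, keeping the band envelope."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)    # log-spaced band edges
    carrier = np.random.randn(len(x))                   # broadband noise carrier
    out = np.zeros(len(x))
    env_sos = butter(2, env_cutoff, btype="lowpass", fs=fs, output="sos")
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = sosfilt(env_sos, np.abs(hilbert(sosfilt(band_sos, x))))  # band envelope
        out += sosfilt(band_sos, sosfilt(band_sos, carrier) * env)     # refilter to band
    return out * (np.max(np.abs(x)) / (np.max(np.abs(out)) + 1e-12))   # match peak level
```

Under this sketch, the single `n_channels` parameter (16, 12, or 8) maps directly onto the Voc16/Voc12/Voc8 difficulty levels used below.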
MOC reflex–mediated changes in cochlear gain can be assessed by measuring otoacoustic emissions (OAEs), energy generated by the active OHCs and measured noninvasively as sound in the ear canal [ 31 ]. When transient sounds such as clicks are delivered to one ear in the presence of noise in the opposite ear, OAE amplitudes are expected to be reduced, reflecting increased MOC reflex activity [ 32 ]. However, the extent to which OAEs are suppressed has been reported as positively correlated [ 29 , 33 , 34 ], negatively correlated [ 27 , 35 ], or even uncorrelated [ 36 , 37 ] with performance in speech-in-noise tasks. Modulation of cochlear gain through the MOC reflex could depend on factors such as task difficulty or relevance (e.g., speech versus nonspeech tasks) and even on methodological differences in the way in which inner ear signatures such as OAEs are recorded and analysed [ 29 , 38 ].
MOC fibres (ipsilateral and contralateral to each ear) originate in medial divisions of the superior olivary complex in the brainstem and synapse on the basal aspects of OHCs to modulate mechanical sensitivity to sound directly [ 16 , 17 ] and neural sensitivity indirectly [ 18 , 19 ]. MOC neurons are also innervated by descending fibres from AC and midbrain neurons [ 11 , 20 , 21 ], providing a potential means by which the MOC reflex might be gated perceptually, either by directly exciting/inhibiting MOC fibres or by modulating their stimulus-evoked reflexive activity [ 22 – 25 ]. Although it has been speculated that changes in cochlear gain mediated by the MOC reflex might enhance speech coding in background noise [ 26 – 28 ], its role in reducing cochlear gain during goal-directed listening in normal-hearing human listeners (i.e., those with physiologically normal OHCs) remains unclear. In particular, it is unknown under which conditions the MOC reflex is active, including whether listeners must be actively engaged in a listening task for this to occur [ 27 , 29 , 30 ].
A potential mechanistic pathway supporting cocktail party listening is the auditory efferent system, whose multisynaptic connections extend from the auditory cortex (AC) to the inner ear [ 11 – 13 ]. In particular, reflexive, sound-evoked activation of fibres of the medial olivocochlear (MOC) reflex innervating the outer hair cells (OHCs—electromotile elements responsible for the cochlea’s active amplifier) is known to reduce cochlear gain [ 14 ], increasing the overall dynamic range of the inner ear and facilitating sound encoding in high levels of background noise [ 15 ].
Robust cocktail party listening, the ability to focus on a single talker in a background of simultaneous, overlapping conversations, is critical to human communication and a long-sought goal of hearing technologies [ 1 , 2 ]. Problems listening in background noise are a key complaint of many listeners with even mild hearing loss and a stated factor in the non-use and non-uptake of hearing devices [ 3 – 5 ]. However, despite their importance in everyday listening tasks and relevance to hearing impairment, the physiological mechanisms that enhance attended speech remain poorly understood. In addition to local circuits in the auditory periphery and brainstem that have evolved to automatically enhance the neural representation of ecologically relevant sounds [ 6 – 8 ], it is likely that such a critical goal-directed behaviour as cocktail party listening also relies on top-down, cortically driven processes to emphasise perceptually relevant sounds and suppress those that are irrelevant [ 9 , 10 ]. Nevertheless, the specific roles of bottom-up and top-down mechanisms in complex listening tasks remain to be determined.
Our ERP data are consistent with differential cortical contributions to the processing of noise-vocoded and masked speech, being larger in magnitude for speech manipulations in which the MOC reflex was less efficiently recruited, i.e., BN and SSN, and largest overall for the manipulation requiring the most listening effort—speech in a background of multitalker babble. From the auditory periphery to the cortex, our data suggest that 2 different strategies coexist to achieve similar levels of performance when listening to single undegraded/degraded streams of speech compared to speech masked by additional noise ( Fig 3B ). The first involves enhanced sensitivity to energy fluctuations through recruitment of the MOC reflex to generate a central representation of the stimulus sufficient and necessary for speech intelligibility of single streams ( Fig 3B , left panel). The second, implemented when processing speech in background noise ( Fig 3B , right panel), preserves cochlear gain by gating the MOC reflex off; suppression of cochlear gain by an active MOC reflex does not provide any added benefit to the encoding of the stimulus envelope in the periphery. This places the onus of “denoising” on midbrain and cortical auditory structures and processes, including loops between them, to maximise speech understanding for masked speech.
Later ERP components, such as P2, N400, and the late positivity complex (LPC), are associated with speech- and task-specific, top-down (context-dependent) processes [ 103 , 104 ]. Speech masked by BN or SSN elicited significantly larger P2 components during active listening compared to the noise-vocoded condition, but the two masked conditions did not differ significantly from each other [F (2,79) = 5.08, p = 0.008, η 2 = 0.11]; post hoc tests with Bonferroni corrections for 3 multiple comparisons: [BN5 versus Voc8 (p = 0.012, d = 0.78); SSN3 versus Voc8 (p = 0.041, d = 0.69); BN5 versus SSN3 (p = 1.00, d = 0.13)]. Similarly, the magnitude of the LPC—thought to reflect the involvement of cognitive resources including memorisation, understanding [ 105 ], and post-decision closure [ 106 ] during speech processing—differed significantly across conditions: [F (2,79) = 4.24, p = 0.018, η 2 = 0.10]. Specifically, LPCs were greater during active listening to speech in BN compared to noise-vocoded speech (p = 0.02, d = 0.85), with LPCs generated during active listening to speech in SSN intermediate to both, but not significantly different from either ( Fig 3A ). Consistent with effortful listening varying across speech manipulations even when isoperformance was maintained [ 107 , 108 ], the speech manipulation generating the clearest signature of cortical engagement was speech in a background of BN. This was considered the most difficult of the masked conditions ( Fig 1D ), where isoperformance was achieved at +5 dB SNR (BN5) compared to only +3 dB SNR for speech in SSN (SSN3). In contrast to P2 and the LPC, the N400 component of the ERP—associated with the processing of meaning [ 93 ]—did not differ between conditions [F (2,81) = 0.22, p = 0.81, η 2 = 0.005]. This is unsurprising: because isoperformance had been achieved, participants were equally able to differentiate non-words in the Voc8, BN5, and SSN3 conditions.
We analysed early auditory cortical responses (P1 and N1 components, Fig 3A ) that are largely influenced by acoustic features of the stimulus such as intensity and spectral content [ 98 , 99 ]. Noise-vocoded words elicited well-defined P1 and N1 components compared to the speech-in-noise conditions ( Fig 3A ), despite the masked words and noises having onsets similar to those of the noise-vocoded tokens. This likely reflects the relatively high precision of the envelope components of noise-vocoded speech at stimulus onset compared to the masked conditions, in which the competing noises interfere with the speech envelope, producing less precise neural responses [ 100 – 102 ].
(A) ERP components (from electrodes FZ, F3, F4, CZ, C3, C4, TP7, TP8, T7, T8, PZ, P3, and P4) during active listening to Voc8, BN5, and SSN3. Electrode selection was based on their relevance to attentional and language brain networks [ 93 – 97 ]. Thick lines and shaded areas represent the mean and SEM, respectively. Boxplots on the right show statistical comparisons between speech conditions for the P2, N400, and LPC components. (B) Proposed auditory efferent mechanisms for speech processing. The “single stream” mechanism shows how degraded tokens such as noise-vocoded speech are processed in a mostly feed-forward manner (thick black arrows), as should also be the case for natural speech. The activation of the MOC reflex (dark green arrow) improves the AN representation of the speech envelope (black arrow, from the cochlea shown as a spiral). This information passes up the auditory centres without much need to “denoise” the signal (represented as black arrows from the brainstem to midbrain to cortex). Given our observation that cochlear gain suppression increased with task difficulty, we included the possibility of enhanced MOC reflex drive from higher auditory regions via corticofugal connections (dark red arrows). By contrast, “multiple streams” such as speech in BN or SSN do not recruit the MOC reflex (light green arrow) because it negatively affects envelope encoding of speech signals (light grey arrows from cochlea–brainstem–midbrain). We therefore propose that corticofugal drive to the MOC reflex is suppressed (shaded red arrow), resulting in weaker MOC reflex activation (light green arrow). This leaves greater responsibility for speech signal extraction to the midbrain, cortex, and the efferent loop therein (corticofugal connections from AC to midbrain: dark red arrow). Both mechanisms ultimately lead to equal behavioural performance across speech conditions. The underlying data can be found in
https://doi.org/10.5061/dryad.3ffbg79fw . AC, auditory cortex; AN, auditory nerve; ERP, event-related potential; LPC, late positivity complex; MOC, medial olivocochlear.
To determine the degree of cortical engagement in the active listening task, we recorded cortical evoked potentials from all 56 participants—simultaneously with CEOAE and ABR measurements—using a 64-channel, EEG-recording system. Grand averages of event-related potentials (ERPs) to speech onset ( Fig 3A , S6 Fig ) for the most demanding speech manipulations [Voc8 ( S6A Fig ), BN5 ( S6B Fig ), and SSN3 ( S6C Fig )] were analysed to test the hypothesis that greater cortical engagement occurred when listening to speech in background noise compared to noise-vocoded speech, despite their being matched in task difficulty.
The seeming lack of any contribution from the MOC reflex during active listening to speech masked by speech-like sounds (i.e., BN and SSN) compared to noise-vocoded speech suggests that other compensatory brain mechanisms must contribute to listening tasks if isoperformance is maintained across conditions. We therefore explored whether higher brain centres—providing top-down, perhaps attention-driven, enhancement of speech processing in background noise—contribute to maintaining isoperformance across the different speech degradations. In particular, the significant increase observed in wave V of the ABR for active speech-in-noise conditions suggests greater activity in the IC—the principal midbrain nucleus receiving efferent feedback from auditory cortical areas. Levels of cortical engagement might therefore be expected to differ depending on the form of speech manipulation, despite similar task performance.
Overall, the pattern of neural envelope enhancement observed in our model AN fibres (both low and high threshold) when the MOC reflex was introduced to the different stimulus degradations (right, Fig 2F ) mirrored the observed suppression of CEOAEs for corresponding active listening conditions (left, Fig 2F ). Where activation of the MOC reflex was evident experimentally for noise-vocoded speech, enhancement of neural envelopes was observed when the same degraded stimuli were presented to the model with the MOC reflex present. This was not the case for active listening to masked speech. Here, we found no evidence that MOC activity was modulated by active listening to masked speech, and this was consistent with the poorer neural representations of the stimulus envelope when the MOC reflex was included in the model for these stimulus conditions.
We also examined the effects of efferent feedback on the encoding of temporal fine structure (TFS, the instantaneous pressure waveform of a sound)—a stimulus cue at low frequencies associated with speech understanding [ 50 , 88 – 92 ]—and observed a small, mean improvement in the model with the MOC reflex at low frequencies (<1.5 kHz) for the masked conditions with the lowest SNRs (i.e., BN5 and SSN3; S5 Fig ). Although this improvement in TFS encoding is consistent with other studies whose simulations support a role for efferent unmasking in speech-in-noise processing [ 57 – 59 , 61 ], it cannot explain the lack of CEOAE suppression we observed experimentally for these, most difficult, masked speech tasks ( Fig 1F ).
Although we assessed high-threshold fibres due to their importance at high sound levels [ 70 – 72 ], the majority of AN fibres possess high spontaneous rates (HSR) and low thresholds (i.e., these fibres respond preferentially to low-intensity sounds but saturate at higher intensities [ 80 – 82 ]). These low-threshold fibres may not only play an important role in envelope processing of speech at low intensities but also contribute at high intensities thanks to their dynamic range adaptation and response fluctuations [ 78 , 83 – 87 ]. We therefore also assessed how these low-threshold, HSR fibres processed the most difficult stimulus degradations (Voc8, BN5, and SSN3; Fig 1D ) in the presence and absence of the MOC reflex ( S4 Fig ). As for AN fibres with high thresholds, including the MOC reflex improved envelope encoding by low-threshold AN fibres for noise-vocoded speech and impaired it for speech masked by BN or SSN ( S4A and S4B Fig ). This is despite the poorer dynamic range of low-threshold fibres at 75 dBA (the normalised sound presentation level across manipulations) likely impacting their overall ability to encode the stimulus envelope.
The stimulus-specific changes in envelope encoding we observed were also evident, but with reduced magnitudes, when we lowered the fixed attenuation of the MOC from 15 dB to 10 dB ( S3A Fig ), demonstrating that manipulating the strength of the MOC reflex may provide a means of selectively enhancing or impairing neural encoding of the stimulus envelope. Increasing the SNR of the masked speech [i.e., from +5 dB SNR (BN5) to +10 dB SNR (BN10) for BN, and from +3 dB SNR (SSN3) to +8 dB SNR (SSN8) for SSN] not only diminished the detrimental effects of the MOC reflex on envelope encoding seen at the lower SNRs ( S3B Fig ), but also led to an overall improvement in envelope encoding with the MOC reflex for BN10 stimuli. This suggests that efferent feedback through the MOC reflex may be unable to enhance the neural representation of speech in background noise at low SNRs. By contrast, introducing the MOC reflex to the neural coding of stimuli with more noise-vocoded channels generated greater benefit ( S3B Fig ).
Given the range of acoustic waveforms in our speech corpus, we included 100 words (50 stop/nonstop consonants) in our analysis ( Fig 2E ). Despite the diversity of speech tokens, the effects of including the MOC reflex (on ρENV) were consistently dependent on the form of stimulus manipulation. Neural encoding of speech envelopes improved significantly with the simulated MOC reflex for noise-vocoded words (pink circles, Fig 2E ) (mean ΔρENV for Voc8 for freqs <1.5 kHz = +4.22 ± 0.30%, [Z(99) = 8.66, p < 0.0001, r = 0.86]; mean ΔρENV for Voc8 for freqs >1.5 kHz = +6.88 ± 0.28%, [Z(99) = 8.67, p < 0.0001, r = 0.87]), with the largest enhancements observed for words with the lowest ρENV values in the absence of the MOC reflex. By contrast, no such relationship was observed for words presented in BN (BN5, light blue squares, Fig 2E ) or SSN (SSN3, green diamonds, Fig 2E ). Moreover, envelope coding in both speech-in-noise conditions was significantly impaired, on average, when the MOC reflex was included (mean ΔρENV for BN5 for freqs <1.5 kHz = −0.89 ± 0.11%, [Z(99) = −6.67, p < 0.001, r = 0.67]; mean ΔρENV for BN5 for freqs >1.5 kHz = −1.91 ± 0.31%, [Z(99) = −5.704, p < 0.001, r = 0.57]; mean ΔρENV for SSN3 for freqs <1.5 kHz = −2.62 ± 0.12%, [Z(99) = −4.10, p = 0.001, r = 0.41]; mean ΔρENV for SSN3 for freqs >1.5 kHz = −3.90 ± 0.26%, [Z(99) = −8.66, p < 0.0001, r = 0.87]). Given the apparent similarities between single polarity responses for corresponding noise-vocoded and natural stimuli in Fig 2B (i.e., Voc8 ANonly versus Nat ANonly and Voc8 AN+MOC versus Nat AN+MOC , Fig 2B ), we tested how values of ΔρENV were affected when the neural envelopes for degraded speech tokens were compared with their corresponding natural conditions (i.e., Voc8/BN5/SSN3 ANonly versus Nat ANonly and Voc8/BN5/SSN3 AN+MOC versus Nat AN+MOC ) ( S2 Fig ). The stimulus-specific effects we observed in Fig 2 not only remained with this new “control” template for ρENV AN ( S2A Fig ) but were also enhanced for all 3 speech manipulations ( S2B Fig ). The inclusion of the MOC reflex had similar stimulus-dependent effects in the low (<1.5 kHz) and high (>1.5 kHz) frequency bands. However, given the increased importance carried by the stimulus envelope for acoustic stimuli at high frequencies [ 66 , 78 , 79 ], only neural envelope encoding in the high-frequency band was considered in subsequent simulations and analysis.
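The test behind the reported Z and r values is not spelled out here, but a paired Wilcoxon signed-rank test across the 100 words, with effect size r = Z/√N, reproduces the reported pattern (e.g., Z = 8.66 with N = 100 gives r ≈ 0.87). A minimal sketch under that assumption:

```python
# Signed-rank Z and effect size r = Z/sqrt(N) for per-word changes in rho_ENV.
# An assumption about the paper's procedure; zero differences are dropped and
# no tie or continuity correction is applied.
import numpy as np
from scipy.stats import rankdata

def signed_rank_z_and_r(delta_rho_env):
    d = np.asarray(delta_rho_env, dtype=float)
    d = d[d != 0]                                   # drop zero differences
    n = len(d)
    ranks = rankdata(np.abs(d))                     # rank the |changes|
    w_plus = ranks[d > 0].sum()                     # signed-rank statistic W+
    mu = n * (n + 1) / 4.0                          # E[W+] under H0
    sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mu) / sigma                       # normal approximation
    return z, z / np.sqrt(n)                        # Z and effect size r
```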
To assess how the energy envelope of degraded speech tokens was represented in the output of the population of AN fibres, both normal and polarity-inverted copies of each token were presented to the model for processing ( Fig 2A and 2B ). The polarity-tolerant component, associated with the stimulus envelope [ 50 , 73 – 77 ], was extracted from the responses of AN fibres using sumcor analysis ( Fig 2C ). We then quantified the similarity of neural envelopes between conditions by computing the neural cross-correlation coefficient, ρENV, in each frequency channel ( Fig 2D ) [ 50 , 73 ]. Values of ρENV, ranging from 0 to 1 for independent to identical neural envelopes, respectively, were calculated for the 3 speech manipulations without (ρENV AN ) and with (ρENV AN+MOC ) the MOC reflex included in the model. In each case, the neural envelope for natural speech acted as the “control” template for comparison. “Control” simulations of natural speech were performed with the MOC reflex included, based on our observations that a steady CEOAE suppression—indicative of an active MOC reflex—occurred for natural speech experimentally ( Fig 1F ) and that neural envelopes for natural sentence stimuli were enhanced in model AN fibres with an MOC reflex present ( S1 Fig ).
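As we read ref [ 73 ], ρENV is computed from peak values of the normalized sumcor functions (where 1.0 corresponds to chance coincidence); a minimal sketch under that assumption, together with one plausible reading of the ΔρENV percentages reported above:

```python
import numpy as np

def rho_env(sc_AA, sc_BB, sc_AB):
    """Neural cross-correlation coefficient for envelopes.
    sc_AA, sc_BB: peak sumcor of each condition against itself
    sc_AB:        peak sumcor between the two conditions
    Returns 0 for independent and 1 for identical neural envelopes."""
    return (sc_AB - 1.0) / np.sqrt((sc_AA - 1.0) * (sc_BB - 1.0))

def delta_rho_env_pct(rho_an, rho_an_moc):
    """Percentage change after adding the simulated MOC reflex (one plausible
    reading of Delta-rho_ENV; the paper's exact normalization may differ)."""
    return 100.0 * (rho_an_moc - rho_an) / rho_an
```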
(A) Schematic of model showing input stimulus and neural output. Normal and inverted polarity waveforms of words (here, the word “Like”) were presented to the MAP_BS model that incorporates a cascade of stages from outer and middle ear to AN output. Attenuation of cochlear gain by the simulated MOC reflex was implemented at the “modified DRNL filter bank” stage [ 58 , 63 ]. Responses of 400 LSR fibres (shown here in the form of raster plots for normal and polarity-inverted waveforms of the word “Like”) constituted the model output at the AN stage. (B) Presentation of natural and degraded versions of the word “Like” with and without simulated MOC reflex. Normal “Like” waveforms for natural (dark grey, far left), Voc8 (pink, second left), BN5 (light blue, second right), and SSN3 (green, far right) conditions are shown in the top row. Post-stimulus time histograms (average response of 400 fibres binned at 1 ms) were calculated for LSR AN fibres (characteristic frequency: 2.33 kHz) with (bottom row) and without (middle row) simulated MOC reflex. Including the simulated MOC reflex reduced activity during quiet periods for the natural condition (and, to a lesser extent, Voc8) while maintaining high spiking rates at peak sound levels (e.g., at 0.075, 0.3, and 0.45 s). No changes in neural representation of the signal were visually evident for BN5 and SSN3 “Like”. (C and D) Quantifying ρENV for Voc8 “Like” without simulated MOC reflex in the 2.33-kHz channel. Sumcor plots (bottom row, C) were generated by adding shuffled autocorrelograms (thick lines, top left/middle panels, C) or shuffled cross-correlograms (thick line, top right panel, C) to shuffled cross-polarity correlograms (thin lines, top row, C) to compare neural envelopes of the control condition, naturally spoken “Like” with simulated MOC reflex (left/right columns, C), and the test condition, Voc8 “Like” without simulated MOC reflex (middle/right columns, C). ρENV for Voc8 “Like” without simulated MOC reflex (AN, solid-pink bar, D) was calculated from sumcor peaks in C [ 73 ]. The value of ρENV with simulated MOC reflex (AN+MOC, speckled pink bar, D) is also displayed. (E and F) Comparing ΔρENVs for 100 words after introduction of simulated MOC reflex. Mean percentage changes in ρENVs (calculated in 2 frequency bands: below and above 1.5 kHz) after adding simulated MOC reflex were plotted as a function of ρENV without simulated MOC reflex for degraded versions of 100 words (each symbol represents one word). ΔρENVs were positive for all Voc8 words except 1 (pink circles, E) (Max-Min ΔρENV for Voc8 for <1.5 kHz: +17.62 to −0.78%; Max-Min ΔρENV for Voc8 for >1.5 kHz: +16.14 to −0.62%), appearing largest for words with the lowest ρENVs without simulated MOC reflex. This relationship was absent for BN5 (light blue squares, E) and SSN3 (green diamonds, E) words, whose ΔρENV ranges spanned the baseline (Max-Min ΔρENV for BN5 for <1.5 kHz: +3.06 to −4.09%; Max-Min ΔρENV for BN5 for >1.5 kHz: +6.08 to −13.56%; Max-Min ΔρENV for SSN3 for <1.5 kHz: +4.52 to −3.58%; Max-Min ΔρENV for SSN3 for >1.5 kHz: +0.50 to −11.39%). Progression of mean ΔρENVs (± SEM) for model data >1.5 kHz (checkerboard bars, right, F) mirrored that of the active listening task, CEOAE data (mean ± SEM) (solid colour bars, left, F). The underlying data can be found in
https://doi.org/10.5061/dryad.3ffbg79fw . AN, auditory nerve; CEOAE, click-evoked otoacoustic emission; DRNL, dual resonance nonlinear; LSR, low spontaneous rate; MAP_BS, Matlab Auditory Periphery and Brainstem; MOC, medial olivocochlear.
Here, we tested the hypothesis that efferent suppression of cochlear gain differentially impacts the neural encoding of masked and noise-vocoded stimulus envelopes in AN fibres. Natural and degraded speech tokens (those generating the lowest level of isoperformance in the active task: Voc8, BN5, and SSN3; Fig 1C ) were presented to the model at 75 dBA, with and without a fixed 15 dB attenuation generated by the MOC reflex ( Fig 2A ). Responses of 400 AN fibres with low spontaneous rate (LSR) and high thresholds were simulated in each of 30 frequency channels, logarithmically spaced between 0.1 and 4.5 kHz, forming the model’s output ( Fig 2A and 2B ). We chose this type of AN fibre for our model because of their apparent critical role in processing sounds in high levels of background noise [ 70 – 72 ].
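The stated simulation parameters can be summarized as a configuration sketch (the parameter names are ours; the actual MAP_BS model is MATLAB code and is not reproduced here):

```python
# Illustrative configuration mirroring the simulation described in the text.
import numpy as np

N_FIBRES = 400                  # LSR, high-threshold AN fibres per channel
N_CHANNELS = 30                 # frequency channels forming the model output
CF_LO_HZ, CF_HI_HZ = 100.0, 4500.0
LEVEL_DBA = 75.0                # presentation level, matched across stimuli
MOC_ATTEN_DB = (0.0, 15.0)      # cochlear-gain attenuation: MOC off / MOC on

# Logarithmically spaced characteristic frequencies, as described:
cfs = np.geomspace(CF_LO_HZ, CF_HI_HZ, N_CHANNELS)

for atten in MOC_ATTEN_DB:      # run each stimulus with and without the reflex
    for polarity in (+1, -1):   # normal and polarity-inverted presentations
        pass                    # MAP_BS-equivalent AN simulation goes here
```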
Previous modelling studies have supported the ability of the MOC reflex to “unmask” signals in noise in the auditory nerve (AN) [ 55 – 62 ] and therefore offer no rationale for the absence of CEOAE suppression (i.e., a lack of MOC reflex activity) when participants actively listened to speech in BN or SSN ( Fig 1E and 1F ). To determine how the neural representation of degraded speech differs in the AN with and without the MOC reflex, and whether this might explain the stimulus dependence of CEOAE suppression in Fig 1E and 1F , we implemented a model of the initial auditory stages (outer, middle, and inner ear with the AN) that includes an MOC reflex [ 58 , 63 – 65 ]. We focused on how the MOC reflex affects the encoding of the energy envelope of acoustic waveforms, a cue considered critical to speech understanding (especially for noise-vocoded speech [ 42 , 50 ]) [ 42 , 50 , 66 , 67 ], strongly correlated with the cortical tracking and decoding of speech [ 47 – 49 ], and the basis of several successful speech intelligibility models [ 51 , 68 , 69 ].
Together with the effect on CEOAEs, these data suggest that the magnitude of auditory midbrain activity for the different speech manipulations reflects cochlear output. While this is evident for both listening conditions in masked speech, the similarity of ABR magnitudes in the midbrain for active and passive listening of noise-vocoded stimuli is indicative of feed-forward amplification that compensates for reduced cochlear gain during active listening. This highlights increased emphasis of peripheral processing for noise-vocoded, compared to noise-masked, speech and suggests that processing by higher-order auditory centres may be involved in decoding masked speech.
To exclude this stimulus-dependent pattern of inner ear and brainstem responses arising from intrinsic differences in the 2 populations of listeners tested (noise-vocoded versus masked speech experiments), we compared CEOAE suppression as well as the amplitude of ABR waves between the 2 groups for active and passive listening of natural speech. No statistical differences were observed for either active or passive listening between the 2 groups (active natural condition: suppression of CEOAEs [t (23) = −0.21, p = 0.83, d = −0.04; wave III [t (23) = −0.45, p = 0.65, d = 0.09]; wave V [t (23) = 0.09, p = 0.93, d = 0.02]; passive natural condition: suppression of CEOAEs [t (24) = −0.36, p = 0.72, d = 0.07]; wave III [t (24) = −0.16, p = 0.88, d = 0.03]; wave V [t (26) = 0.40, p = 0.69, d = 0.05]). We conclude from this that the differences observed in cochlear gain and auditory brainstem/midbrain activity can be attributed to the specific form of speech degradation.
Click-evoked ABRs—measured during presentation of speech in noise—showed similar effects to those observed for CEOAE measurements. Specifically, in both masked conditions, wave V of the ABR—corresponding to neural activity generated in the midbrain nucleus of the inferior colliculus (IC)—was significantly enhanced in the active, compared to the passive, listening condition ( Fig 1E ) (speech in BN: [F (1, 26) = 5.66, p = 0.025, η 2 = 0.20] and SSN: [F (1, 26) = 9.22, p = 0.005, η 2 = 0.26]). No changes in brainstem or midbrain activity were observed between active and passive listening of noise-vocoded speech.
The effects of active versus passive listening on cochlear gain were evident in the activity of subcortical auditory centres when we simultaneously measured auditory brainstem responses (ABRs) to the same clicks used to evoke CEOAEs. Click-evoked ABRs largely reflect summed activity of higher-frequency regions of the cochlea (3 to 8 kHz [ 53 , 54 ]). However, as CEOAE suppression in the 1 to 2 kHz band is used here as a marker for MOC reflex activity across the entire length of the cochlea (see Materials and methods ), we can relate observed changes in cochlear gain to amplitudes of ABR waves.
By contrast, speech embedded in SSN elicited the opposite pattern of results to noise-vocoded speech ( Fig 1E and 1F ). The suppression of CEOAEs was significantly stronger during passive, compared to active, listening ( Fig 1F ): rANOVA: [F (1, 24) = 4.44, p = 0.046, η 2 = 0.16], and we observed a significant interaction between condition and stimulus type: [F (2, 48) = 4.67, p = 0.014, η 2 = 0.16] for both SNRs: SSN8 [t (27) = 2.71, p = 0.01, d = 0.51] and SSN3 [t (28) = 2.67, p = 0.012, d = 0.50]. We also observed a mild suppression of CEOAEs for speech masked by BN, with CEOAEs significantly smaller than their own baseline measures only during passive listening (shown in the planned t test, S1 Table ), but not when active and passive conditions were compared ( Fig 1E and 1F ) (rANOVA n.s.: F (1, 25) = 1.21, p = 0.28, η 2 = 0.05). Cochlear gain was, therefore, suppressed during active listening to noise-vocoded speech, slightly but significantly suppressed during passive listening in BN, and strongly suppressed during passive listening in SSN. Together, our data suggest that the MOC reflex is modulated by task engagement and depends strongly on the way in which signals are degraded, including the type of noise used to mask speech.
We calculated the reduction in CEOAEs between baseline and experimental conditions (CEOAE suppression, a proxy for activation of the MOC reflex) to quantify efferent control of cochlear gain in active and passive listening. For noise-vocoded speech, suppression of CEOAEs was significantly greater when participants were actively engaged in the lexical task compared to when they were asked to ignore the auditory stimuli: rANOVA: [F (1, 22) = 8.49, p = 0.008, η 2 = 0.28] ( Fig 1E and 1F ). Moreover, we observed a significant interaction between conditions and stimulus type: [F (3, 66) = 2.80, p = 0.046, η 2 = 0.12], indicating that the suppression of CEOAEs was stronger for all vocoded conditions in which listeners were required to make lexical decisions, compared to when they were not ( Fig 1F )—Voc16: [t (23) = −2.16, p = 0.04, d = 0.44]; Voc12: [t (24) = −2.19, p = 0.038, d = 0.44] and Voc8: [t (25) = 3.51, p = 0.002, d = 0.69]. Engagement in the task did not modulate CEOAE suppression for the natural speech condition: [t (24) = 0.62, p = 0.54, d = 0.12].
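In signal terms, the suppression measure described here is simply the change in CEOAE level relative to the within-condition baseline; a minimal sketch (variable names and the RMS-level definition are our assumptions):

```python
# CEOAE suppression as a level change re: baseline; negative values indicate
# stronger suppression (i.e., MOC reflex activation). Illustrative only.
import numpy as np

def ceoae_suppression_db(baseline, experimental, p_ref=20e-6):
    """Both inputs: CEOAE pressure waveforms in Pa; returns the dB SPL change."""
    level = lambda p: 20.0 * np.log10(np.sqrt(np.mean(np.square(p))) / p_ref)
    return level(experimental) - level(baseline)   # < 0 means suppression
```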
To determine whether auditory attention modulates cochlear gain via the auditory efferent system in a task-dependent manner, we assessed the effect of active versus passive listening and speech manipulation on the activity of the inner ear. CEOAEs were recorded continuously while participants either actively performed the lexical task or passively listened to the same corpus of speech tokens ( Fig 1A ). We first confirmed that CEOAE amplitudes were significantly reduced relative to a baseline measure (obtained in the absence of speech, Fig 1A ) within each stimulus manipulation (planned t test comparisons, S1 Table ). CEOAEs were significantly reduced in magnitude when actively listening to natural speech and all noise-vocoded stimuli (natural: [t (24) = 2.33, p = 0.03, d = 0.50]; Voc16: [t (23) = 3.40, p = 0.002, d = 0.69]; Voc12: [t (24) = 3.98, p = 0.001, d = 0.80] and Voc8: [t (25) = 5.14, p = 0.001, d = 1.00]). Conversely, during passive listening, CEOAEs obtained during natural, but not noise-vocoded, speech were significantly smaller than baseline: [t (25) = 2.29, p = 0.03, d = 0.44] ( S1 Table ). The same held for CEOAEs recorded during the 2 masked conditions at all SNRs, including the natural speech condition of that participant group (natural: [t (26) = 2.17, p = 0.04, d = 0.42]; BN10: [t (28) = 2.80, p = 0.009, d = 0.52] and BN5: [t (28) = 2.36, p = 0.02, d = 0.44]; SSN8: [t (28) = 3.37, p = 0.002, d = 0.63] and SSN3: [t (28) = 3.50, p = 0.002, d = 0.65]). This suggests that the MOC reflex is gated differently in active and passive listening, and by the different types of speech manipulation, despite listeners achieving isoperformance across experimental conditions (i.e., comparable levels of lexical discrimination).
Isoperformance was achieved across speech manipulations, with the best performance observed in the 2 natural speech conditions—one employed during the noise-vocoding experiment and the other during the masking experiments: one-way ANOVA [F (1, 54) = 0.43, p = 0.84, η 2 = 0.001]. A moderate and similar level of performance (significantly lower than performance for natural speech) was achieved across the Voc16/Voc12 (Voc16 versus Voc12: nonsignificant [n.s.] post hoc t test: [t (26) = 2.53, p = 0.11, d = 0.34]), BN10, and SSN8 conditions: one-way ANOVA: [F (3, 108) = 0.67, p = 0.57, η 2 = 0.018]. The poorest performance, significantly lower than the high and moderate performance levels, was observed for Voc8, BN5, and SSN3: one-way ANOVA: [F (3, 81) = 0.07, p = 0.72, η 2 = 0.008].
Distinct levels of task difficulty were achieved by altering either the number of noise-vocoded channels—16 (Voc16), 12 (Voc12), and 8 (Voc8) channels—or by altering the signal-to-noise ratio (SNR) when speech was masked by BN—+10 (BN10) and +5 (BN5) dB SNR—or SSN—+8 (SSN8) and +3 (SSN3) dB SNR—( Fig 1D ). This modulation of task difficulty was statistically confirmed in all 56 listeners (n = 27 in the vocoded condition and n = 29 in the 2 masked conditions) who showed consistently better performance—a higher rate of detecting non-words—in the less degraded conditions, i.e., more vocoded channels or higher SNRs in the masked manipulations: repeated measures ANOVA (rANOVA) [vocoded: [F (3, 78) = 70.92, p = 0.0001, η 2 = 0.73]; BN: [F (3, 78) = 70.92, p = 0.0001, η 2 = 0.80]; and SSN: [F (2, 56) = 86.23, p = 0.0001, η 2 = 0.75]; see Fig 1D for post hoc analysis with Bonferroni corrections (6 multiple comparisons for the noise-vocoded experiment and 3 for BN and SSN manipulations).
(A) Schematic representation of the experimental paradigm. Clicks were presented continuously for 12 minutes (grey-striped rectangles) in each experimental condition (e.g., active listening of natural speech). A 1-minute recording of baseline CEOAE magnitudes (black rectangles) was made at the beginning and at the end of each experimental condition. However, due to artefacts in CEOAE recordings, only the initial CEOAE baseline proved to be of sufficient quality for analysis. Speech tokens (blue and black-striped rectangles) were presented for 10 minutes to the ear contralateral to the ear receiving clicks. ABRs and ERPs were also recorded during this time frame. Following presentation of a word or non-word, participants had 3 seconds to make a lexical decision in the active listening condition (i.e., to determine whether an utterance was a word or a non-word) by pressing a button, or they remained silent in the passive condition, ignoring all auditory stimuli while watching a movie. (B) Schematic showing how click stimuli generated both CEOAEs (analysis window enclosed in rectangle) and brainstem activity (ABRs). (C) Time course of natural and degraded speech stimuli in relation to cortical activity (ERPs). (D) Performance during the lexical decision task. Mean d′ (a measure of accuracy calculated as Z(correct responses) − Z(false alarms), where Z is the inverse of the standard normal cumulative distribution [i.e., Z(p) = NORMSINV(p)]) is denoted by white circles (n = 27 in the noise-vocoded condition and n = 29 in the 2 masked conditions). The horizontal line denotes the median. Upper and lower limits of the boxplot represent the first (q1) and third (q3) quartiles, respectively, while whiskers denote the upper (q3+1.5*IQR) and lower (q1−1.5*IQR) bounds of the data, where IQR is the interquartile range calculated as (q3−q1). Post hoc pairwise comparisons showed that the highest performance was always achieved for natural speech compared to the noise-vocoded, BN, and SSN manipulations (Bonferroni corrected p = 0.001). Performance was moderately high (statistically lower than natural speech but higher than Voc8, BN5, and SSN3) for Voc16/Voc12 (combined due to n.s. differences, Bonferroni corrected p = 0.11) as well as for BN10 and SSN8, respectively. The lowest level of performance was predictably observed for Voc8, BN5, and SSN3 (p = 0.001). (E) CEOAEs and ABRs collapsed across stimuli (rANOVA main effect of condition: active versus passive). CEOAEs and ABRs are reported as means ± SEM (**Bonferroni corrected p < 0.01; *Bonferroni corrected p < 0.05). (F) CEOAE suppression. The figure shows mean CEOAE magnitude (dB SPL) changes relative to the baseline for all conditions, where negative values represent an increase in suppression (activation of the MOC reflex). The underlying data can be found in
https://doi.org/10.5061/dryad.3ffbg79fw . ABR, auditory brainstem response; BN, babble noise; CEOAE, click-evoked otoacoustic emission; ERP, event-related potential; IQR, interquartile range; n.s., nonsignificant; rANOVA, repeated measures ANOVA; SSN, speech-shaped noise.
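The d′ in panel D follows the standard signal-detection formula given in the caption; a minimal sketch (the clipping of extreme rates is our addition):

```python
# d' = Z(correct responses) - Z(false alarms), with Z the inverse standard
# normal CDF (Excel's NORMSINV). Rates of exactly 0 or 1 are clipped.
from scipy.stats import norm

def d_prime(hit_rate, fa_rate, eps=1e-3):
    z = lambda p: norm.ppf(min(max(p, eps), 1.0 - eps))
    return z(hit_rate) - z(fa_rate)

# e.g., d_prime(0.9, 0.1) ~= 2.56
```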
We assessed speech intelligibility—specifically the ability to discriminate between Australian-English words and non-words—when speech was degraded by 3 different manipulations: noise vocoding the entire speech waveform; adding 8-talker BN to “natural” speech in quiet; or adding SSN to “natural” speech. Participants were asked to make this lexical decision (by means of a button press) when they heard a non-word in a string of words and non-words ( Fig 1A and 1C ).
Discussion
We assessed the role of attention in modulating the contribution of the cochlear efferent system in a lexical task—detecting non-words in a stream of words and non-words. Employing 3 speech manipulations to modulate task difficulty—noise vocoding of words, words masked by multitalker BN, or SSN (i.e., noise with the same long-term spectrum as speech)—we find that these manipulations differentially activate the MOC reflex to modulate cochlear gain. Activation of the cochlear efferent system also depends on whether listeners are performing the lexical task (active condition) or are not required to engage in the task and instead watch a silent, stop-motion film (passive condition). Specifically, with increasing task difficulty (i.e., fewer noise-vocoded channels), noise vocoding increasingly activates the MOC reflex in active, compared to passive, listening. The opposite is true for the 2 masked conditions, where words presented at increasingly lower SNRs more strongly activate the MOC reflex during passive, compared to active, listening. By adjusting parameters of the 3 speech manipulations—the number of noise-vocoded channels or the SNR for the speech-in-noise conditions—we find that lower MOC reflexive activity is accompanied by heightened cortical activation, possibly to maintain isoperformance in the task. A computational model incorporating efferent feedback to the inner ear demonstrates that improvements in the neural representation of the amplitude envelope of sounds provide a rationale for either suppressing or maintaining cochlear gain during the perception of noise-vocoded speech or speech in noise, respectively. Our data suggest that a network of brainstem and higher brain circuits maintains performance in active listening tasks and that different components of this network, including reflexive circuits in the lower brainstem and the relative allocation of attentional resources, are differentially invoked depending on specific features of the listening environment.
Attentional demands reveal differential recruitment of the MOC reflex

Our data highlight a categorical distinction between active and passive processing of single, degraded auditory streams (e.g., noise-vocoded speech) and parsing a complex acoustic scene to hear out a stream from multiple competing, spectrally similar sounds (multitalker babble and SSN). Specifically, task difficulty during active listening appears to modulate cochlear gain in a stimulus-specific manner. The reduction in cochlear gain with increasing task difficulty for noise-vocoded speech and, conversely, the preservation of cochlear gain when listening to speech in background noise suggest that attentional resources might gate the MOC reflex differently depending on how speech is degraded. In contrast to active listening, when participants were asked to ignore the auditory stimuli and direct their attention to a silent film, the MOC reflex was gated in a direction consistent with the auditory system suppressing irrelevant and expected auditory information while (presumably) attending to visual streams [24,25,109]. Nevertheless, auditory stimuli may capture attention differently depending on how easy they are to detect, i.e., in a saliency-based (bottom-up) manner [110]. For example, the pitch conveyed in the fundamental frequency, F0, here the voice pitch, is a highly salient cue that plays an important role in the perceptual segregation of speech sources [111–113]. When speech is noise vocoded, the saliency carried by the envelope periodicity of the speech is diminished [114–117]. In the context of our passive conditions, then, it is possible that noise-vocoded speech was not salient (or distracting) enough to elicit a reduction in cochlear gain sufficient to suppress irrelevant auditory information when attention was presumably focused elsewhere (i.e., on watching a silent film). Interestingly, activation of the MOC reflex was observed for natural speech—further evidence that activation is not limited to tones and broadband noise [118–120]—and did not depend on whether participants were required to attend in a lexical decision task. This is consistent with natural speech being particularly salient as an ethologically relevant and nondegraded stimulus, as well as with the low attentional load required when passively watching a film permitting continued monitoring of unattended speech [121]. To explain the variable reduction in cochlear gain between noise-vocoded and noise-masked speech across active and passive conditions (Fig 1E), both top-down and bottom-up mechanisms can be posited as candidates to modify the activity of the MOC reflex. Bottom-up mechanisms, such as increased reflexive activation of the MOC reflex by wideband stimuli with flat power spectra [43], may explain why CEOAEs were suppressed during passive listening of speech masked by SSN (Fig 1E). The weaker activation of the reflex observed for speech in BN during passive listening may arise from BN activating the MOC reflex more poorly than “stationary” noises such as white noise or SSN [122,123]. However, if only bottom-up mechanisms were involved, then noise-vocoded speech with relatively fewer channels (i.e., Voc8 stimuli) might also have been expected to activate the MOC reflex more effectively due to their more “noise-like” spectra.
The lack of suppression of CEOAEs in the passive noise-vocoded conditions, as well as the stimulus-specific pattern of MOC reflex activity in active listening conditions (i.e., enhanced suppression of cochlear gain for noise-vocoded versus preservation of cochlear gain for masked speech), suggests that a perceptual, top-down categorisation of speech sounds is necessary to appropriately engage or disengage the MOC reflex. Top-down mechanisms may include direct descending control of MOC fibre activity (either excitation or inhibition) or descending modulation of their sound-driven reflexive activity [22–24].
Descending control of the MOC reflex for speech stimuli is likely bilateral

A central premise of our study, and of those exploring the effects of attention on the MOC reflex, is that OAEs recorded in one ear can provide a direct measure of top-down modulation of cochlear gain in the opposite ear. However, it has also been suggested that activation of the MOC reflex may differ between ears to expand interaural cues associated with sound localisation (while also enhancing amplitude modulations and suppressing noise in the ear with the better acoustic SNR) [124,125]. This process could be independently modulated by the largely ipsilateral corticofugal pathways evident anatomically (see [23] for review). Had the activation of the MOC reflex been independently controlled at either ear—for example, to suppress irrelevant clicks in one ear while preserving cochlear gain in the ear stimulated by speech—we would have expected similar suppression of CEOAEs across active and passive conditions for all speech manipulations, since click stimuli were always irrelevant to the task. Instead, however, suppression of CEOAEs—a biomarker for activation of the MOC reflex—was both stimulus and task dependent, reducing the likelihood that dichotic presentation of sounds engaged top-down modulation of cochlear gain differentially at either ear. Moreover, anatomical evidence of purely ipsilateral corticofugal pathways overlooks the possibility that, even when sounds are presented monaurally, descending control of the MOC reflex for speech stimuli may actually be bilateral. Unlike pure tones, speech activates both left and right auditory cortices even when presented monaurally [126]. In addition, cortical gating of the MOC reflex in humans does not appear restricted to direct, ipsilateral descending processes that impact cochlear gain control in the opposite ear [127]. Rather, cortical gating of the MOC reflex likely incorporates polysynaptic, decussating processes that influence/modulate cochlear gain in both ears.
Stimulus-specific enhancement of the neural representation of the speech envelope explains the pattern of CEOAE suppression

Beyond enhancing spatial listening [124] and protecting from noise exposure [128–130], the function most commonly attributed to activation of the MOC reflex is “unmasking” of transient acoustic signals in the presence of background noise [19,118,124,131]. By increasing the dynamic range of mechanical [16,17] and neural [18,19] responses to amplitude-modulated (AM) components in the cochleae [15,87] (see [28] for review), suppression of cochlear gain by the MOC reflex might preferentially favour the neural representation of syllables and phonemes in speech over masking noises. While models of the auditory periphery that include efferent feedback typically demonstrate improved word recognition for a range of masking noises [pink noise [58–60], SSN [61], and BN [59]], experimental studies only sporadically report positive correlations between increased activation of the MOC reflex and improved speech-in-noise perception [29,33,34], with some even reporting negative correlations [27,35,132] or no effect at all [36,37]. Although this variability has generally been attributed to methodological differences in the measurement of OAEs [29,38], the majority of these studies have assessed contralateral activation of the MOC reflex and performance in a speech-in-noise task in separate sessions. As a result, the possibility that the MOC reflex’s automatic activation is modulated by top-down processes to maximise its functional relevance in auditory tasks has not been fully challenged. Here, we simultaneously employed degraded speech tokens as both contralateral activators of the MOC reflex and targets in a lexical decision task and, therefore, were able to ascertain directly the involvement of the MOC reflex in both active and passive tasks. The stimulus- and task-dependent suppression of cochlear gain we observed suggests that automatic activation of the MOC reflex is indeed gated on or off by top-down modulation, with the direction of this modulation dependent on whether or not the reflex functionally benefits performance in either task, i.e., facilitating a lexical decision in an active listening task or ignoring the auditory stimulus in a passive one. Our modelling data support this conclusion by accounting for the stimulus-dependent suppression of CEOAEs through enhancement of the neural representation of the speech envelope in the AN. The apparent benefit of suppressing cochlear gain in response to envelopes of noise-vocoded words, compared to a disbenefit for words masked by background noise, is consistent with noise-vocoded words retaining relatively strong envelope modulations and with these modulations being extracted effectively through expansion of the dynamic range as cochlear gain is reduced. For both noise maskers, however, any improvement to envelope coding due to dynamic expansion of the mechanical (and neural) range applies to both speech and masker, since these signals overlap spectrotemporally. Thus, the reduction of cochlear gain results in a poorer representation of the speech envelope.
Dependence of the MOC reflex on SNR

The results of our simulations are based on average changes in the neural correlation coefficient with efferent feedback calculated for 100 words. For individual words, however, the effect of suppressing cochlear gain was highly dependent on both the word itself and the SNR at which it was presented (Fig 2E, S3B Fig). If top-down modulation gates activation of the MOC reflex, it must account for the statistical likelihood that suppression of cochlear gain can improve the neural coding of the stimulus envelope throughout the task. A potential criterion for predicting whether activation of the MOC reflex can improve speech intelligibility is the extent to which target speech is “glimpsed” in the spectrotemporal regions least affected by a background noise [133–135]. If computing the SNR of the speech envelope in short-time windows underpins speech intelligibility—as proposed by several models [51,68,69]—then this could be a suitable metric by which top-down modulation of the MOC reflex is adjusted (a minimal sketch of such a metric follows below). This metric could explain why previous studies of CEOAE suppression, which share identical paradigms aside from stimuli with different spectrotemporal content (i.e., consonant–vowel pairs), can generate opposing correlations between discrimination in noise and the strength of the MOC reflex [27,34]. Additionally, presenting words at different SNRs should impact any benefit to speech intelligibility of activating the MOC reflex and can explain reported correlations between the strength of the MOC reflex and task performance at a range of SNRs [30,38]. In our model, switching from the lowest to intermediate stimulus SNRs for both noise maskers led to an increase in the average ΔρENV when efferent feedback was applied (S3B Fig). This contributed to a small improvement in envelope encoding for +10 dB SNR speech in BN. While the absence of CEOAE suppression experimentally in the active task for BN at +10 dB SNR did not reflect the modelling results (Fig 1F), only a slight majority (59/100) of ΔρENV values were positive for speech in BN at +10 dB SNR, and this may have changed had different words been selected. The modelling results for the intermediate SNR in the SSN condition (+8 dB SNR) were, however, consistent with the lack of CEOAE suppression observed experimentally in the active task (Fig 1F). Where an active MOC reflex does not functionally benefit the neural representation of the speech envelope (e.g., at negative SNRs), it remains possible that it is activated in another capacity, for example, to prevent damage from prolonged exposure to loud sounds [128–130].
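For concreteness, the sketch below implements one simple version of a short-time envelope-SNR (“glimpsing”) metric in the spirit of refs [133–135]; the window length, criterion, and function names are illustrative assumptions, not the cited models.

```python
# Proportion of time-frequency "glimpses" where target speech exceeds the
# masker by a local-SNR criterion. Illustrative; not the cited models.
import numpy as np
from scipy.signal import stft

def glimpse_proportion(speech, masker, fs, criterion_db=3.0, win_s=0.020):
    nperseg = int(win_s * fs)                        # ~20-ms analysis windows
    _, _, S = stft(speech, fs=fs, nperseg=nperseg)
    _, _, M = stft(masker, fs=fs, nperseg=nperseg)
    local_snr_db = 20.0 * np.log10((np.abs(S) + 1e-12) / (np.abs(M) + 1e-12))
    return float(np.mean(local_snr_db > criterion_db))  # fraction of glimpses
```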
Reconciling stimulus-specific effects of the MOC reflex with previous studies using the MAP model and efferent feedback

Resolving the differences between our stimulus-dependent observations in the Matlab Auditory Periphery and Brainstem (MAP_BS) model and those of previous MAP models incorporating efferent feedback and an automatic speech recogniser (ASR) (i.e., consistent improvements to speech-in-noise recognition, irrespective of the type of noise masker) requires understanding the models’ different outputs and analyses. While the MAP and MAP_BS models share many similarities [58–60,62–64,136,137], they differ in their AN fibre outputs. The MAP model generates estimated spike probabilities across populations of simulated AN fibres [137], whereas the spike train output of the MAP_BS model is more stochastic in nature and incorporates the effects of internal noise in individual AN fibres [136,138]. Here, we took advantage of the MAP_BS’s output to present stimuli of opposing polarity and compare coincident events in the resulting stochastic spike trains. Consequently, by extracting and quantifying these envelope-sensitive components with sumcor analysis (Fig 2C and 2E), we could identify how envelope encoding was affected by the suppression of cochlear gain for differently degraded words [73–76] and provide a coherent rationale for our experimental observations (Fig 2E and 2F). The ASR, on the other hand, has previously been used as a proxy for human hearing, predicting the intelligibility of test digit sequences from their AN spike probabilities using a hidden Markov model [58–60,62,139]. While both stimulus envelope and TFS cues are available to the ASR during digit classification, their respective weighting is not transparent, as the ASR exploits any available cue to correctly identify the digit sequence. Therefore, a small enhancement of TFS encoding by the MOC reflex (as observed in our simulations; S5 Fig) could potentially outweigh any disbenefits to envelope encoding. This would result in improved digit identification by the ASR without explaining experimental CEOAE results for masked speech, such as we present here. Even if envelope encoding had been weighted strongly by the ASR in previous studies, its representation of the neural envelope was likely very different to the sumcor analysis. This is particularly evident when comparing the effects of the MOC reflex on “natural,” clean speech in this study and that of Brown and colleagues, who also used a fixed attenuation of the MOC reflex [58]. Whereas we observe improved encoding of the stimulus envelope in the 4 to 8 Hz range of modulation frequencies for natural sentences (S1B Fig), the ASR’s recognition of digits in silence was reduced when the MOC reflex was active (Fig 7A in [58]). This decrease in classification accuracy arises from the ASR interpreting the effects of the MOC reflex as an overall reduction in AN output (due to suppression of cochlear gain) rather than the generation of sparser yet more precise neural envelopes, as suggested by sumcor analysis and calculation of ρENV (Fig 2E and 2F). In addition, were we to present noise-vocoded speech—a single stream like “natural,” clean speech—to the ASR, its AN output would also appear reduced with the MOC reflex, impacting the ASR’s word recognition and further highlighting its inability to explain our experimental CEOAE data.
Considerations when modelling the MOC reflex

Many aspects of the MOC reflex and its circuitry remain poorly understood. For example, while we represent the active MOC reflex here as a fixed attenuation of cochlear gain (allowing a more controlled examination of the reflex’s effects on envelope encoding), it is in fact a dynamic process with multiple time courses (spanning fifty milliseconds to tens of seconds) whose purposes remain unknown [45,140–142]. Previous MAP models incorporating dynamic efferent feedback have gravitated towards slow time constants to optimise speech recognition by their ASR [59,60,62]. However, since digit recognition by the ASR always improved with the MOC reflex, whatever its chosen time constant [60], we are confident that introducing a dynamic MOC reflex (a minimal dynamic variant is sketched below) would not alter the stimulus dependence of the effects we observed in our model.

An added complexity of modelling MOC reflex activity is considering how top-down gating modulates the reflex and over what timescale. It remains unknown whether corticofugal inputs to the MOC reflex produce tonic activation/inhibition or only modulate the reflex’s strength, as has been shown for awake versus anaesthetised animals [20,23,143–145]. Given anatomical evidence that MOC neurons sparsely innervate broad regions of the cochlea [146,147], it is possible that, through a combination of tonic top-down modulation and poor frequency tuning of efferent innervation, the fixed attenuation we implemented across all frequency channels closely reflects the actual recruitment of the MOC reflex under active listening conditions.

Finally, our model does not include neural adaptation to stimulus statistics as a potential contributor to speech-in-noise discrimination [86,87]. We therefore cannot discount its involvement in the lexical decision task, nor show that it is sufficient for robust recognition of degraded speech without the MOC reflex, as has been suggested previously [86,87,148].
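As a point of contrast with the fixed attenuation used in our simulations, the sketch below implements a first-order dynamic MOC attenuation of the kind explored in previous MAP models with efferent feedback [59,60,62]. The time constant, the normalised drive signal, and the linear mapping to attenuation in dB are assumptions for illustration only.

import numpy as np

def dynamic_moc_attenuation(an_drive, fs, tau_s=0.3, max_atten_db=15.0):
    """Leaky integration of a normalised (0-1) measure of AN activity with
    time constant tau_s, mapped to a time-varying cochlear-gain attenuation.
    All parameter values here are illustrative assumptions."""
    alpha = 1.0 / (tau_s * fs)              # per-sample smoothing coefficient
    state = 0.0
    atten_db = np.empty(len(an_drive))
    for n, drive in enumerate(an_drive):
        state += alpha * (drive - state)    # first-order low-pass of the drive
        atten_db[n] = max_atten_db * state  # attenuation applied to cochlear gain
    return atten_db

For sustained stimuli, a long tau_s drives the attenuation towards a steady value, approximating the fixed attenuation we implemented; this is consistent with our expectation that a dynamic reflex would not alter the stimulus dependence of the effects reported here.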
Higher auditory centre activity supports coexistence of multiple strategies to achieve similar levels of performance

The impact of modulating the MOC reflex was observed in the activity of the auditory midbrain and cortex. Increased midbrain activity in the active, noise-masked conditions was consistent with previously reported changes in the magnitudes of ABRs during unattended versus attended listening to speech [149] or clicks [150,151]. This highlights the potential for subcortical levels either to enhance attended signals or to filter out distracting auditory information. At the cortical level, recorded potentials were larger for all attended, compared with ignored, speech (S6A–S6C Fig), consistent with previous reports [152,153]. However, later cortical components were larger for masked than for noise-vocoded speech when listeners attended to the most extreme manipulations. Late cortical components have been associated with the evaluation and classification of complex stimuli [104], as well as with the degree of mental allocation during a task [103]. Differing cortical activity therefore likely reflects a greater reliance on, or at least an increased contribution from, context-dependent processes for speech masked by noise than for noise-vocoded speech.

Together, differences in physiological measurements from higher auditory centres and the auditory periphery highlight the possibility of diverging pathways for processing noise-vocoded and masked speech. Evidence for systemic differences in the processing of single, degraded streams of speech, compared with masked speech, has been reported in the autonomic nervous system [107], where, despite similar task difficulty across conditions, masked speech elicited stronger physiological reactions than single, unmasked streams. Here, we propose that 2 strategies enable isoperformance to be maintained even when stimuli are categorically different. The processing of single, intrinsically degraded streams selectively recruits auditory efferent pathways from the AC to the inner ear, which, in turn, “denoise” the representation of the stimuli in the periphery (Fig 3B). By contrast, multiple streams, represented by speech in BN or SSN, appear to rely much more on higher auditory centres, such as the midbrain and AC, for the extraction of foreground, relevant signals (Fig 3B). Given the evidence for denoised auditory signals in the cortex [47,154–156], the role in everyday listening of the extensive loops and chains of information linking the cortex, subcortical regions, and the auditory periphery [20,157–159] remains underacknowledged, as does their candidacy as targets for hearing technologies.
[END]
[1] Url:
https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001439
(C) PLOS One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL:
https://creativecommons.org/licenses/by/4.0/