(C) PLOS One [1]. This unaltered content originally appeared in journals.plosone.org.
Licensed under Creative Commons Attribution (CC BY) license.
url:
https://journals.plos.org/plosone/s/licenses-and-copyright
------------
Human discrimination and modeling of high-frequency complex tones shed light on the neural codes for pitch
['Daniel R. Guest', 'Department Of Psychology', 'University Of Minnesota', 'Minneapolis', 'Minnesota', 'United States Of America', 'Andrew J. Oxenham']
Date: 2022-04
Accurate pitch perception of harmonic complex tones is widely believed to rely on temporal fine structure information conveyed by the precise phase-locked responses of auditory-nerve fibers. However, accurate pitch perception remains possible even when spectrally resolved harmonics are presented at frequencies beyond the putative limits of neural phase locking, and it is unclear whether residual temporal information, or a coarser rate-place code, underlies this ability. We addressed this question by measuring human pitch discrimination at low and high frequencies for harmonic complex tones, presented either in isolation or in the presence of concurrent complex-tone maskers. We found that concurrent complex-tone maskers impaired performance at both low and high frequencies, although the impairment introduced by adding maskers at high frequencies relative to low frequencies differed between the tested masker types. We then combined simulated auditory-nerve responses to our stimuli with ideal-observer analysis to quantify the extent to which performance was limited by peripheral factors. We found that the worsening of both frequency discrimination and F0 discrimination at high frequencies could be well accounted for (in relative terms) by optimal decoding of all available information at the level of the auditory nerve. A Python package is provided to reproduce these results, and to simulate responses to acoustic stimuli from the three previously published models of the human auditory nerve used in our analyses.
Pitch, the quality of sound that distinguishes “low” sounds from “high” sounds, is of critical importance for human hearing. In addition to the role of pitch in defining musical melodies and harmony, the pitch of the human voice helps us identify talkers, attend to a particular talker in a noisy acoustic environment, and understand a talker’s intent and emotional state. Prevailing theories posit that the auditory system relies on the stimulus-driven timing of spikes in the auditory nerve, termed phase locking, to estimate pitch. Recent behavioral results, however, suggest that pitch can still be perceived at high frequencies, where phase-locked information should be highly degraded or nonexistent. To address this discrepancy, we combined behavioral testing methods with computational models of the early auditory system to probe how listeners can achieve accurate pitch discrimination at high frequencies. Optimal decoding of all available auditory-nerve information resulted in a pattern of predictions that matched (but greatly outperformed) human perceptual performance. Understanding how pitch is coded across the frequency range may help in the quest to restore accurate pitch perception in people with impaired hearing and cochlear implants.
Funding: DG was funded by University of Minnesota College of Liberal Arts Graduate Fellowship (https://cla.umn.edu/), University of Minnesota Department of Psychology Summer Graduate Fellowship (https://cla.umn.edu/psychology), and National Institute on Deafness and Other Communication Disorders F31 DC019247-01 (https://www.nidcd.nih.gov/). DG and AO were funded by National Science Foundation NRT-UtB1734815 (https://www.nsf.gov/) and National Institute on Deafness and Other Communication Disorders R01 DC005216 (https://www.nidcd.nih.gov/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
In the present study, we tested the hypothesis that high-frequency F0 coding is based on a rate-place code using a mixture of behavioral and computational modeling methods. First, we present simulations of auditory-nerve responses to HCTs at low and high frequencies to develop a better understanding of the types of temporal and rate-place cues available for F0 discrimination at low and high frequencies. Next, we present behavioral data that were collected to test the hypothesis that mixtures of HCTs should result in particularly poor F0 discrimination at high frequencies, a pattern of results that would be consistent with the use of a rate-place code at high frequencies. Finally, we combine the auditory-nerve models with ideal-observer analysis to generate optimal frequency difference limens (FDLs) and F0 difference limens (F0DLs) for isolated pure and complex tones, over a wide range of frequencies, and compare these predictions to behavioral thresholds from the literature.
If a rate-place code is used at high frequencies, then the presence of spectrally resolved harmonics ought to be a necessary condition for accurate F0 discrimination at high frequencies. Consistent with this prediction, listeners do not achieve accurate F0 discrimination for stimuli at high frequencies when all stimulus harmonics are unresolved [ 33 ]. Another way to reduce access to resolved harmonics is to present target HCTs in the context of spectrally overlapping masker HCTs [ 43 – 47 ]. In cases where a sufficient number of harmonics are presented simultaneously in the same frequency region, rate-place cues for resolved harmonics may be reduced or eliminated, insofar as any peaks in an average rate profile would not unambiguously reflect the presence of a single target component, even though the harmonic numbers of the target remain the same. Such stimuli should result in particularly poor F0 discrimination thresholds at high frequencies if listeners are using a rate-place code for F0 discrimination, as compared to F0 discrimination at low frequencies where a temporal code with access to TFS information may be more robust to the presence of HCT maskers.
Whereas the link between coding of TFS information in the auditory nerve and frequency discrimination of pure tones is relatively straightforward, the link between coding of temporal information and F0 discrimination of HCTs is considerably more complicated. Here, we use the term “temporal information” to refer both to TFS at the harmonic component frequencies and to other (slower) periodicities, primarily at the F0, evoked by peripheral interactions between components. Discrimination thresholds for such HCTs, composed of harmonics from the 6th to the 10th, are typically poorer by approximately a factor of 5 if the component frequencies are all above ~8 kHz than if they are at lower frequencies [ 33 , 36 – 38 ]. Qualitatively, this effect is consistent with temporal theories of pitch, which predict that performance should degrade as phase locking to component frequencies (and thus the availability of TFS information about component frequencies) degrades at higher frequencies. Conversely, this effect is qualitatively inconsistent with rate-place theories of pitch. In relative terms, cochlear filters remain sharp at high frequencies; thus, the pattern of average auditory-nerve firing rate across the tonotopic axis should be informative about F0 at both low and high frequencies. For spectrotemporal theories of pitch, predictions are less clear because loss of phase locking to TFS may be counterbalanced by relatively sharper auditory filters at high frequencies. Although the difference in performance at low and high frequencies is qualitatively consistent with temporal theories, the magnitude of degradation in F0 discrimination performance at high frequencies (about a factor of 5) is surprisingly small, given that phase-locked responses to TFS above 8 kHz seem unlikely to convey sufficient information to derive accurate estimates of F0 in HCTs [ 13 ].
In the absence of TFS information at high frequencies, listeners must instead be relying on temporal-envelope periodicities at the F0, evoked by peripheral interactions between stimulus components, or they must switch to a rate-place code [ 38 , 39 ]. However, it is generally believed that listeners cannot perform pitch discrimination by comparing rates of temporal-envelope cues for F0s (or envelope rates) above about 700 Hz [ 38 – 42 ].
Harmonic complex tones (HCTs), which are composed of pure tones whose frequencies are integer multiples of a common fundamental frequency (F0), are a more complicated but more natural pitch-evoking stimulus. Voiced speech and musical instrument sounds are examples of HCTs. Rate-place information is generally thought to be available for lower-ranked harmonics, but not for higher-ranked harmonics, due to the filtering that occurs in the cochlea. The transition between these lower, spectrally resolved, harmonics and the higher, spectrally unresolved, harmonics is also subject to debate but, depending on the definition, is thought to occur somewhere between the 7th and 10th harmonic, at least for F0s of 100 Hz and above [ 13 , 31 – 35 ]. Behaviorally, F0 discrimination is best when some harmonics lower than the 10th are present [ 31 – 34 ]. For this reason, we concentrate on HCTs that are restricted to a limited number of harmonics in the range from the 6th to the 10th.
The simplest pitch-evoking stimulus is the pure tone, and it is well known that frequency discrimination of pure tones degrades as the stimulus frequency increases beyond 2–3 kHz [ 21 , 22 ]. Because phase locking in the auditory nerve also weakens with increasing stimulus frequency beyond 2–3 kHz [ 23 – 26 ], it has often been argued that frequency discrimination relies on a temporal code. Ideal-observer analysis of simulated auditory-nerve responses suggests that the rolloff of phase locking in the auditory nerve can account well for the dependence of pure-tone frequency discrimination on stimulus frequency in humans [ 27 , 28 ]. However, no direct evidence regarding the lowpass characteristic of phase locking is available in humans, with estimates based on comparative studies, electrophysiology, and psychophysics of the “upper limit” of useful phase locking ranging quite widely from 1.5 kHz up to 8–12 kHz [ 21 , 23 , 25 , 29 ]. In addition, new behavioral results [ 30 ] have resulted in considerable uncertainty surrounding the extent to which the deterioration of frequency discrimination at high frequencies truly reflects the underlying rolloff of auditory-nerve phase locking to TFS. Nevertheless, at a sufficiently high (although unknown) frequency, no usable phase-locked information should be available in the auditory nerve. At such a point, it is generally believed that a rate-place code for frequency becomes dominant [ 21 ].
Pitch is a primary perceptual dimension of sound. It plays a key role in the perception of music, where it constitutes the basis of melody and harmony [ 1 ], as well as in the perception of speech, where it has important suprasegmental functions and conveys information about talker identity [ 2 – 4 ]. Pitch also facilitates auditory scene analysis, helping listeners to segregate simultaneous harmonic sounds [ 5 , 6 ] or to understand speech in complex backgrounds [ 7 ]. Although sensitivity to pitch and regular harmonic structure has been demonstrated in auditory cortex of humans [ 8 – 10 ] and other mammals [ 11 , 12 ], theories of the neural basis of pitch perception diverge as early as the auditory nerve. “Place” or “rate-place” theories contend that pitch is derived by analysis of the spatial pattern of average firing rates of auditory-nerve fibers, in which information about the frequency content of a stimulus is encoded via the basilar membrane’s frequency-to-place (or tonotopic) mapping [ 13 , 14 ]. “Temporal” theories suggest instead that pitch is derived from temporal information, including temporal fine structure (TFS) information encoded in inter-spike intervals by the phase-locking properties of auditory-nerve fibers and other temporal information, such as envelope modulation [ 13 , 15 , 16 ]. “Spatiotemporal” or “spectrotemporal” theories, motivated by the fact that neither place nor temporal theories account well for all pitch phenomena, propose that both the frequency-to-place mapping and TFS information play crucial roles in pitch perception [ 13 , 17 – 20 ].
Results
Peripheral representation of HCTs at high frequencies
Although many studies have recorded auditory-nerve responses to the types of HCT stimuli used in pitch experiments [13,20,48,49], these have not included HCTs at the very high frequencies used in recent human psychophysical work [36–38]. Moreover, recent work has revealed significant differences in peripheral coding between humans and the smaller mammals commonly used in auditory physiology experiments [29,50,51], raising questions as to how useful auditory-nerve recordings of pitch-evoking stimuli in animals such as guinea pigs or chinchillas are in understanding how pitch stimuli are represented in the human auditory periphery. To develop a better understanding of the availability and quality of different F0-related cues at high frequencies as compared to low frequencies in the human auditory periphery, we simulated human auditory-nerve responses to HCTs at low and high frequencies using a modern phenomenological model of the auditory nerve [52], with parameters adjusted to match what is known about human cochlear tuning. This was done after first validating that the cat version of the model could qualitatively replicate key data from relevant studies in cat (S1 Text).
Low-numbered harmonics were at least partially resolved by the model filterbank (emulating the mechanical filtering of the basilar membrane) at both low and high F0s, as reflected by prominent peaks in the pattern of average firing rates over characteristic frequency (CF) in the auditory-nerve population (Fig 1A and 1B, right panel). At low F0s, model fibers tuned to resolved components also demonstrated robust phase locking to the underlying TFS, whereas fibers tuned to unresolved components instead responded with a prominent modulation at F0 (Fig 1A, bottom panel). At high F0s, component frequencies were too high to produce phase-locked responses in the model auditory nerve.
As a result, model fibers tuned to resolved components showed responses with little in the way of temporal structure, whereas fibers tuned to unresolved harmonics showed strong modulations at F0 (Fig 1B, bottom panel). At both low and high frequencies, model fibers tuned between resolved components showed responses modulated at F0 (Fig 1A and 1B, bottom panel).
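To illustrate why a fiber that responds to several unresolved harmonics can carry a modulation at F0, the following minimal Python sketch sums two adjacent equal-amplitude harmonics and checks that the envelope of the sum fluctuates at F0. It operates on the acoustic components directly and does not simulate cochlear filtering or the auditory-nerve model [52]; the function name, the sampling rate, the duration, and the choice of harmonics 12 and 13 are assumptions made purely for illustration.

```python
import numpy as np


def envelope_peak_frequency(f0, harmonics=(12, 13), fs=100_000, dur=0.05):
    """Return the dominant envelope frequency (Hz) of a sum of harmonics.

    Illustrative only: fs, dur, and the harmonic numbers are assumed values.
    """
    t = np.arange(int(round(dur * fs))) / fs
    # Complex (analytic) sum of the components; its magnitude is the envelope.
    analytic = sum(np.exp(2j * np.pi * h * f0 * t) for h in harmonics)
    envelope = np.abs(analytic)
    # Remove the mean so the DC term does not dominate the spectrum.
    spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
    freqs = np.fft.rfftfreq(envelope.size, d=1 / fs)
    return float(freqs[np.argmax(spectrum)])


# The two F0s used in the simulations of Fig 1.
low_f0_envelope = envelope_peak_frequency(280.0)
high_f0_envelope = envelope_peak_frequency(1400.0)
print(low_f0_envelope, high_f0_envelope)
```

In both cases the envelope of the two-component sum repeats at the F0, even though the components themselves lie at 12 and 13 times that frequency. Whether such a modulation survives in a given fiber depends on the peripheral filtering, which this sketch deliberately leaves out.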
Fig 1. Representation of complex tones in a simulated auditory nerve. (A) Simulated responses of a population of high-spontaneous-rate auditory-nerve fibers [52] for five periods of a sine-phase HCT composed of harmonics 4–13 at a level of 50 dB SPL per component and an F0 of 280 Hz. The middle panel shows a “neurogram”, or a plot of instantaneous firing rate as a function of time and characteristic frequency. In the neurogram, color (from purple [low] to yellow [high]) indicates the instantaneous firing rate, analogous to color indicating intensity in a spectrogram. The top and left panels show the temporal waveform and spectrum, respectively, of the acoustic stimulus. In the left panel, the yellow box highlights the frequency ranges used in the behavioral experiments. The bottom two panels and the right panel show, respectively, the responses of individual nerve fibers over time and the response profile averaged over time. The bottom two panels show responses for auditory-nerve fibers tuned to component frequencies (4F0, purple; 12F0, blue) or tuned between component frequencies (5.5F0, pink; 9.5F0, maroon). (B) Same as A, except for an F0 of 1400 Hz.
https://doi.org/10.1371/journal.pcbi.1009889.g001
The model simulations suggest that if temporal information is utilized at high frequencies, then it is likely to be based on the envelope modulations at F0, rather than on the phase-locked responses to individual harmonics (an observation further reinforced by our ideal-observer modeling below). However, several lines of evidence suggest that humans cannot utilize that envelope information effectively. First, studies using HCTs composed of unresolved harmonics (i.e., all harmonics above the 10th) or modulated noises show that the pitch of such stimuli, which is conveyed exclusively by temporal-envelope cues, is weak, yields poor F0 discrimination, and is non-existent for F0s above about 700 or 800 Hz [32–34,40,42]. Second, melody discrimination is possible at very high frequencies (>7.5 kHz) for HCTs with F0s between 1 and 2 kHz, but when the harmonics are shifted to produce inharmonic tones, performance drops to near chance, even though the temporal-envelope cues are maintained [39]. Third, both melody perception and F0 discrimination remain accurate at high frequencies when odd and even harmonics are presented to opposite ears, even though the temporal-envelope repetition rate in each ear is thereby doubled [38,39]. These lines of evidence, which we evaluate and reconsider further in light of our own results below, suggest that at least some information in the temporal response pattern of the auditory nerve at high frequencies may not be utilized perceptually.
F0 discrimination is affected by HCT maskers at low and high frequencies
If listeners are not using temporal information to perform F0 discrimination at high frequencies, then they must be relying on rate-place information, which in turn relies on the presence of some spectrally resolved harmonics. We tested this idea by attempting to restrict the availability of resolved harmonics via the addition of concurrent HCT maskers. We hypothesized that this stimulus manipulation would yield particularly poor performance at high frequencies (where resolved harmonics are necessary due to the use of a rate-place code) as compared to low frequencies (where resolved harmonics may not be necessary, due to the additional availability of TFS cues, even in the presence of the masker).
Experiment 1
We began by measuring F0 discrimination at both low and high frequencies and by determining how a single HCT masker impaired F0 discrimination. Experiment 1 measured F0 discrimination thresholds for bandpass-filtered HCTs at low frequencies (F0 = 280 Hz, frequency range = ~1.5–3.0 kHz) and high frequencies (F0 = 1400 Hz, frequency range = ~8–14 kHz). Two conditions were tested: ISO, in which test tones (target and reference) were presented in isolation (except for masking noise in the background), and GEOM, in which target tones with masking noise were presented concurrently with a spectrally overlapping HCT masker with an F0 that was geometrically centered between the F0s of the reference (lower-F0) and the target (higher-F0) tone. The masking noise was broadband threshold-equalizing noise (TEN) [53], whose level within the estimated equivalent rectangular bandwidth (ERB) of the human auditory filter around 1 kHz [54] was 10 dB below the level per component of the HCTs. Two stimulus variants were tested. In the first variant (Experiment 1a), the test tones contained only harmonics 6–10 of the F0.
To help rule out the possibility that listeners were using the spectral edge of the stimulus, rather than the F0, to complete the task, a second variant (Experiment 1b) was tested, in which the test tones contained all harmonics of the F0 up to the Nyquist frequency (i.e., half the sampling rate), but were bandpass filtered with a zero-phase 12th-order Butterworth bandpass filter passing harmonics 6–10 of the nominal F0.
The results of Experiment 1 are plotted in Fig 2. An analysis of variance (ANOVA) revealed significant main effects of F0 (low or high) [F(1, 22.98) = 54.57, p<0.001] and masker (present or absent) [F(1, 22.46) = 149.37, p<0.001] as well as significant two-way interactions between F0 and masker [F(1, 22.29) = 44.71, p<0.001] and between masker and experiment (Experiment 1a or Experiment 1b) [F(1, 130.42) = 9.02, p = 0.013]. No other model terms reached significance. The significant main effects reflected the trends observed in Fig 2 that listeners achieved better (lower) F0 discrimination thresholds at low frequencies than at high frequencies both for the ISO condition [estimated ratio = 5.14, F(1, 22.82) = 63.41, p<0.001] and the GEOM condition [estimated ratio = 2.14, F(1, 22.82) = 29.80, p<0.001] and that they achieved better F0 discrimination thresholds in the absence (ISO) than in the presence (GEOM) of the masker for both low frequencies [estimated ratio = 3.70, F(1, 22.69) = 210.15, p<0.001] and for high frequencies [estimated ratio = 1.54, F(1, 22.49) = 17.00, p<0.001].
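The Experiment 1a stimulus construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' stimulus code: the function names, the 48 kHz sampling rate, the 100 ms duration, the sine starting phases, the equal component amplitudes, and the 5% F0 difference are all hypothetical choices, and the TEN background and the Butterworth filtering of Experiment 1b are omitted.

```python
import numpy as np


def harmonic_complex(f0, harmonics=range(6, 11), fs=48_000, dur=0.1):
    """Equal-amplitude, sine-phase complex tone (phase and level are assumed)."""
    t = np.arange(int(round(dur * fs))) / fs
    return sum(np.sin(2 * np.pi * h * f0 * t) for h in harmonics)


def geom_masker_f0(reference_f0, target_f0):
    """F0 geometrically centered between the reference and target F0s."""
    return float(np.sqrt(reference_f0 * target_f0))


reference_f0 = 1400.0
target_f0 = reference_f0 * 1.05  # hypothetical 5% F0 difference
masker_f0 = geom_masker_f0(reference_f0, target_f0)

# ISO-style target tone and GEOM-style mixture (target plus masker HCT).
target = harmonic_complex(target_f0)
mixture = target + harmonic_complex(masker_f0)

# Component frequencies of harmonics 6-10 at the two nominal F0s.
low_band = [h * 280 for h in range(6, 11)]    # 1680 to 2800 Hz
high_band = [h * 1400 for h in range(6, 11)]  # 8400 to 14000 Hz
print(masker_f0, low_band, high_band)
```

The component frequencies fall within the approximate ranges quoted in the text (~1.5–3.0 kHz and ~8–14 kHz), and the masker F0 sits the same number of semitones above the reference as it does below the target.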
Fig 2. Behavioral results from Experiment 1. The left panel (Exp1a) shows results with only harmonics 6–10 present in the target; the right panel (Exp1b) shows results with the bandpass filtered target, with harmonics 6–10 in the passband. Large filled circles and error bars indicate the average F0DL and ±1 standard error of the mean (SEM). The small filled circles and error bars indicate individual F0DLs and ±1 SEM for each participant.
https://doi.org/10.1371/journal.pcbi.1009889.g002
Our initial hypothesis was that the addition of a masker (i.e., the GEOM condition) would reduce the availability of rate-place cues and so would result in particularly poor F0 discrimination at high frequencies. In fact, contrary to our hypothesis, an interaction contrast test revealed that the difference in performance between the ISO and GEOM conditions was larger at low frequencies than at high frequencies [estimated ratio of ratios = 2.40, F(1, 22.29) = 42.71, p<0.001]. There are many possible explanations as to why the GEOM masker worsened performance more at low frequencies than at high frequencies. One possibility is that the smaller effect of the GEOM masker at high frequencies reflects a ceiling effect for pitch discrimination in our task. That is, for large F0 differences, listeners may have relied on changes in gross spectral cues instead of on TFS information or spectral details. As a result, there may have been an upper limit to how poor F0DLs could be in our stimulus conditions, and if listeners reached such a limit in the high-frequency GEOM condition, the ISO-GEOM ratio in the high-frequency condition may have underestimated the true impact of the GEOM masker at high frequencies. Another possibility is that the GEOM masker did not achieve its intended goal of eliminating representations of resolved harmonics in the neural response to the stimulus; this possibility is explored further via modeling described below.
Somewhat different stimuli were used in Experiment 1a and Experiment 1b. Specifically, the strong spectral edge cues present in the stimuli for Experiment 1a (which were systematically related to the F0 on each trial) were replaced with sloping spectral edges in Experiment 1b (which were not systematically related to the F0 on each trial) by using a bandpass filter.
As indicated by the significant interaction between masker and experiment, this difference affected performance, with the effect of the masker being approximately 1.4 times larger on average in Experiment 1a than in Experiment 1b [estimated ratio of ratios = 1.4, F(1, 130.42) = 9.02, p = 0.016]. From visual inspection of Fig 2, the difference between Experiments 1a and 1b appears constrained to the high-frequency GEOM condition. However, the three-way interaction between F0, masker, and experiment was not significant, and a series of pairwise contrasts between Experiments 1a and 1b for each of the conditions did not reveal any significant differences after correction for multiple comparisons (all p>0.054). Collectively, the small size of the observed differences between Experiment 1a and Experiment 1b, and the fact that performance was, if anything, better in Experiment 1b than in Experiment 1a, suggest that listeners were not using spectral edge cues in Experiment 1a.
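The "ratio of ratios" reported for the F0-by-masker interaction in Experiment 1 can be checked against the four simple-effect ratios with a few lines of arithmetic. The sketch below uses only the estimated ratios quoted in the text; the variable names, and our reading of the direction of each ratio (poorer threshold divided by better threshold), are assumptions.

```python
import math

# Estimated threshold ratios reported for Experiment 1.
high_vs_low_iso = 5.14    # high-/low-frequency F0DL ratio, ISO condition
high_vs_low_geom = 2.14   # high-/low-frequency F0DL ratio, GEOM condition
geom_vs_iso_low = 3.70    # GEOM/ISO F0DL ratio at low frequencies
geom_vs_iso_high = 1.54   # GEOM/ISO F0DL ratio at high frequencies

# The interaction can be computed from either pair of simple effects.
ratio_of_ratios_a = high_vs_low_iso / high_vs_low_geom
ratio_of_ratios_b = geom_vs_iso_low / geom_vs_iso_high

# On a logarithmic scale, a ratio of ratios is a difference of differences.
log_interaction = math.log(geom_vs_iso_low) - math.log(geom_vs_iso_high)

print(round(ratio_of_ratios_a, 2), round(ratio_of_ratios_b, 2))
```

Both routes give a value that rounds to 2.40, matching the reported interaction contrast: the masker raised thresholds by a larger factor at low frequencies than at high frequencies.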
Experiment 2
It is possible that the GEOM masker did not achieve its intended goal of eliminating rate-place representations of resolved target harmonics in the neural response. Experiment 2 attempted to address this possibility by measuring the target-to-masker ratio (TMR) that listeners required to discriminate the F0 of HCTs presented concurrently with two spectrally overlapping HCT maskers. The test tones (reference and target) had F0s that were separated by 1.5 or 2.5 times the F0DL measured for each participant individually without a masker. The masker tones had F0s that were below and above the F0 of the test tones (by 5.25–7.25 semitones, selected randomly on each trial with a uniform distribution), and auditory-nerve simulations (see below) confirmed that the target harmonics were unlikely to be spectrally resolved at a TMR of 0 dB (equal-amplitude target and masker components). The targets and maskers were both synthesized in the same way as the tones in Experiment 1b (i.e., containing all harmonics of their F0 but bandpass filtered to attenuate all but harmonics 6–10 of the target or 5–11 of the maskers).
The results of Experiment 2 are shown in Fig 3. An ANOVA revealed significant main effects of F0 (low or high) [F(1, 10.00) = 11.78, p = 0.013] and interval size (1.5 or 2.5 F0DL) [F(1, 10.00) = 17.43, p = 0.0057] as well as a significant interaction between F0 and interval size [F(1, 10.00) = 6.77, p = 0.026]. Listeners achieved considerably lower TMRs in low-frequency conditions than in high-frequency conditions [estimated difference = −3.66 dB, F(1, 10.00) = 11.78, p = 0.019]. This frequency effect was present even though the difference in F0 was set for each listener based on their own F0DLs from the corresponding ISO condition in Experiment 1, thus nominally equating difficulty across low- and high-frequency conditions in the absence of the masker.
In other words, the presence of two HCT maskers interfered more with pitch discrimination at high frequencies than at low frequencies. Under the assumption that rate-place cues for F0 were successfully eliminated by the two concurrent maskers (the DBL masker), this finding is qualitatively consistent with our hypothesis that F0 discrimination at high frequencies is based on a rate-place code and so should be more strongly disrupted by the reduction or elimination of spectrally resolved harmonics. This conclusion, and the assumptions underlying it, are considered in more detail in the following section.
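The two stimulus parameters that define Experiment 2, the masker F0 offsets in semitones and the TMR in dB, translate into frequencies and amplitude ratios as sketched below. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the two masker offsets are drawn independently (the text does not say whether they were), and the TMR is applied per component as an amplitude ratio.

```python
import random


def masker_f0s(test_f0, rng, min_st=5.25, max_st=7.25):
    """Draw one masker F0 below and one above the test F0 (offsets in semitones).

    Drawing the two offsets independently is an assumption of this sketch.
    """
    below = test_f0 * 2 ** (-rng.uniform(min_st, max_st) / 12)
    above = test_f0 * 2 ** (rng.uniform(min_st, max_st) / 12)
    return below, above


def tmr_to_amplitude_ratio(tmr_db):
    """Target/masker amplitude ratio per component for a TMR given in dB."""
    return 10 ** (tmr_db / 20)


rng = random.Random(0)  # fixed seed so the example is repeatable
below, above = masker_f0s(1400.0, rng)
gain = tmr_to_amplitude_ratio(-6.0)  # hypothetical TMR of -6 dB
print(below, above, gain)
```

A TMR of 0 dB corresponds to an amplitude ratio of 1, that is, equal-amplitude target and masker components as described in the text; negative TMRs mean the target components are weaker than the masker components.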
Fig 3. Behavioral results from Experiment 2. Large filled circles and error bars indicate the average TMR and ±1 standard error of the mean (SEM). The small filled circles and error bars indicate individual TMRs and ±1 SEM for each participant.
https://doi.org/10.1371/journal.pcbi.1009889.g003
As expected, listeners generally performed better with the larger interval size (2.5 F0DL) than with the smaller interval size (1.5 F0DL), as confirmed by a contrast test [estimated difference = −1.52 dB, F(1, 10.00) = 17.43, p = 0.0076]. However, an unexpected interaction between F0 and interval size revealed that the larger interval size yielded better performance at low frequencies [estimated difference = −2.60 dB, F(1, 10.00) = 21.21, p = 0.0049] but not at high frequencies [estimated difference = −0.44 dB, F(1, 10.00) = 0.68, p = 0.43]. An interaction contrast test comparing the size of the interval effect at low and high frequencies was, after correction, marginally significant [estimated difference of differences = 2.16 dB, F(1, 10.00) = 6.77, p = 0.053], providing modest evidence that the size of the interval effect differed between low and high frequencies, although in both cases the trend was in the same direction, with a larger interval producing a lower TMR at threshold.
F0 discrimination
Ideal-observer predictions for F0DLs with harmonics 6–10 are shown in Fig 8. The all-information observer predicted that F0 discrimination should be best for F0s in the range of 200–500 Hz (with the precise range depending on the underlying auditory-nerve model) and then steadily worsen with increasing F0 before plateauing around 600–1000 Hz (Fig 8A), at least in the Heinz et al. [27] and Zilany et al. [52] models. For the present stimuli, the 6th harmonic was the lowest present, so that an F0 of 600 Hz corresponded to a lowest harmonic of 3600 Hz, whereas an F0 of 1000 Hz corresponded to a lowest harmonic of 6000 Hz. The model fibers do not phase lock strongly to frequency components above about 4 kHz (S1 Text); thus, this pattern of results is suggestive of a transition from more accurate TFS-based coding at lower F0s to less accurate envelope-based coding at higher F0s in the all-information observer (see no-masker simulations in Fig 5C).
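The arithmetic linking F0 to the component frequencies, and hence to the approximate phase-locking limit of the model fibers, can be made explicit with a short Python sketch. The 4 kHz value is taken from the text, where it is described only as approximate, so treating it as a hard cutoff is a simplifying assumption; the function and constant names are hypothetical.

```python
PHASE_LOCKING_LIMIT_HZ = 4000.0  # "about 4 kHz" in the text; a soft limit


def components_above_limit(f0, harmonics=range(6, 11),
                           limit=PHASE_LOCKING_LIMIT_HZ):
    """Harmonic numbers whose frequency exceeds the assumed phase-locking limit."""
    return [h for h in harmonics if h * f0 > limit]


# For each F0, list which of harmonics 6-10 lie above the assumed limit.
summary = {f0: components_above_limit(f0) for f0 in (280, 400, 600, 1000, 1400)}
for f0, above in summary.items():
    print(f0, "Hz: lowest component", 6 * f0, "Hz; above limit:", above)
```

Under this simplification, all five components are below the limit for F0s up to 400 Hz, some exceed it by 600 Hz, and all exceed it once the F0 is above roughly 667 Hz (4000/6), which lines up with the F0 range over which the all-information predictions worsen and then plateau.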
Fig 8. Ideal-observer predictions for F0 discrimination. Results of the F0 discrimination simulations. (A) Simulated F0DLs versus F0 of the ISO HCT stimulus in each auditory-nerve model. Simulations in this panel include no parameter roving. Points indicate the simulated F0DLs at a particular F0 while lines indicate a locally estimated scatterplot smoothing (LOESS) fit to the simulated F0DLs. (B) Simulated all-information F0DLs and vector strength (top row) and simulated rate-place F0DLs and Q10 (bottom row) versus frequency with a double y-axis. To choose the warping on the y-axis for vector strength and Q10, linear models were fit to predict log-transformed F0DLs as a function of log-transformed reciprocals of vector strength or Q10 for the all-information F0DLs or rate-place F0DLs, respectively. The fitted regression equations were then used to warp the y-axes. In other words, we warped the y-axes for vector strength and Q10 to maximize overlap with the model predictions (across all three models) in order to visually demonstrate the relationship between vector strength and Q10 and the simulated F0DLs. (C) Ratio of simulated F0DLs at 1.4 kHz and 0.28 kHz in the non-roved simulation at 30 dB re: threshold for each model (left) and ratio of behavioral estimates of F0DLs at 1.4 kHz and 0.28 kHz from various studies (right). Simulated F0DLs were interpolated using LOESS while behavioral F0DLs were linearly interpolated on log-log coordinates.
https://doi.org/10.1371/journal.pcbi.1009889.g008
As expected, and consistent with the results for frequency discrimination, the rate-place observer thresholds were higher than the all-information observer thresholds. Nevertheless, as with the pure-tone FDL predictions, the rate-place thresholds were still better than the average behavioral thresholds, suggesting that, in principle, sufficient rate-place information is available at the level of the auditory nerve to account for the accuracy of behavioral F0 discrimination at both low and high frequencies. Performance for the rate-place observer was generally flat with frequency or slightly improved with higher frequencies, depending on the exact model in question. These inter-model differences are a direct result of differences in peripheral tuning between the models (Fig 8B and S1 Text). The rate-place observer tended to be more sensitive to changes in level than the all-information observer (Fig 8A). Generally, rate-place thresholds were poorer at higher levels, consistent with the fact that higher levels degrade representations of stimulus harmonics in the average rate responses of auditory-nerve fibers [61]. At much higher levels, rate-place thresholds worsened significantly, consistent with saturation of the simulated high-spontaneous-rate fibers at high levels (S3 Text). There is some evidence that F0 discrimination (in noise) worsens with increasing level [62]; however, this effect was observed at higher levels behaviorally than those tested in the present model simulations. Both the all-information observer and the rate-place observer models were affected by parameter roving. Level roving typically had a negative impact on the rate-place observer at higher frequencies but little impact on the all-information observer (S2 Text).
In comparison, phase randomization (i.e., randomizing the relative phases of the components from stimulus to stimulus instead of always synthesizing components in sine phase) had no impact on the rate-place observer but had a negative impact on the all-information observer (S2 Text). Generally, across a fairly wide range of levels and frequencies, phase randomization elevated all-information thresholds by a factor of two, consistent with the predictions of Siebert [28] and the frequency discrimination results. In behavioral data, phase randomization only affects thresholds when all the stimulus components present are unresolved, as expected [31,32]. To compare the model results to behavioral data, we extracted discrimination thresholds from three behavioral studies that tested the same F0s as in our study [36–38] and then calculated the ratio between F0 discrimination thresholds for F0s of 1.4 kHz and 0.28 kHz for each study as well as the present study (Fig 8C; for the present study, data were pooled across Experiment 1a and Experiment 1b, see Materials and Methods for details). In contrast to the varying stimuli and methods of the selected pure-tone frequency discrimination studies, the selected F0 discrimination studies all used essentially the same stimuli and methods. The ratios between F0 discrimination thresholds at 1.4 kHz and 0.28 kHz for each study ranged from approximately 3 [36] to approximately 5 [37]. These values are within a range that can be plausibly explained by the degradation of the all-information observer with increasing frequency, with the same ratio for the all-information observer ranging from approximately 3 for the Verhulst et al. [55] model to over 10 for the Heinz et al. [27] model. In contrast, the rate-place observer model again predicted threshold ratios of around 1, which is outside the range observed in behavioral data.
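The ratio computation for behavioral data can be sketched as follows. The threshold values below are hypothetical placeholders, not data from any of the cited studies, and `interp_loglog` is an illustrative helper name. The only elements taken from the text are that behavioral F0DLs were linearly interpolated on log-log coordinates (Fig 8 caption) and that the ratio compares 1.4 kHz with 0.28 kHz.

```python
import numpy as np

def interp_loglog(x_query, x, y):
    """Linearly interpolate y(x) on log-log coordinates.

    Equivalent to assuming a power-law segment between adjacent points.
    """
    return np.exp(np.interp(np.log(x_query), np.log(x), np.log(y)))

# Hypothetical F0s (Hz) and F0DLs (%), for illustration only
f0s = np.array([200.0, 400.0, 800.0, 1600.0])
f0dls = np.array([0.5, 0.6, 1.2, 2.4])

low = interp_loglog(280.0, f0s, f0dls)    # interpolated F0DL at 0.28 kHz
high = interp_loglog(1400.0, f0s, f0dls)  # interpolated F0DL at 1.4 kHz
ratio = high / low
print(f"F0DL ratio (1.4 kHz / 0.28 kHz): {ratio:.2f}")
```

Interpolating in log-log space is a natural choice here because thresholds and frequencies both span wide ranges and are usually plotted on logarithmic axes, so a straight segment between two plotted points corresponds to what a reader sees in the figure.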
As was the case for frequency discrimination, the all-information observer provided the best overall match to human data in terms of relative performance at low and high frequencies. The plateau in the predicted F0 discrimination thresholds of the all-information observer above about 600 Hz in the Heinz et al. [27] and Zilany et al. [52] models (in contrast to the progressive degradation of predicted pure-tone frequency discrimination thresholds at high frequencies; Figs 7 and 8) suggests that, at higher F0s, once phase locking to TFS becomes unavailable, the all-information observer relies on temporal cues produced by peripheral interaction between adjacent harmonics rather than phase locking to TFS at the component frequencies (Fig 1B). The correspondence between all-information trends and human behavior could be interpreted as support for the use of temporal-envelope cues at high F0s by humans. However, as discussed earlier, several lines of evidence from pitch psychophysics pose challenges to this interpretation. Specifically, in the case of the high-F0 HCTs with harmonics in the range of the 6th to the 10th, pitch perception is robust to manipulations in which the even and odd harmonics are presented to different ears (which would decrease modulation depth and double the modulation rate in each ear [38]) but is impaired by harmonicity manipulations in which the frequencies of all components are shifted up by a constant amount in Hz (which preserves modulation rates while rendering the tone inharmonic [38,39]). Although beyond the scope of this paper, auditory-nerve simulations and ideal-observer analysis could help further clarify our understanding of these conditions. More generally, for HCTs composed of unresolved harmonics and amplitude-modulated noises (for which pitch is thought to be exclusively conveyed by temporal-envelope cues), humans are not able to achieve accurate pitch discrimination for F0s (or modulation rates) beyond about 600–700 Hz [40,42].
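The two stimulus manipulations described above can be made concrete with a small numerical check. The sketch below uses an F0 of 1400 Hz with harmonics 6 to 10; the 100 Hz shift is an arbitrary value chosen for illustration, and the function names are hypothetical. It treats the spacing between adjacent components as a proxy for the envelope modulation rate produced by their interaction, which is a simplification that ignores peripheral filtering.

```python
import numpy as np

F0 = 1400.0                    # Hz
harmonics = np.arange(6, 11)   # harmonic numbers 6..10
freqs = harmonics * F0         # component frequencies in Hz

def beat_rate(component_freqs):
    """Distinct spacings (Hz) between adjacent components.

    Used here as a proxy for the envelope modulation rate.
    """
    return np.unique(np.diff(np.sort(component_freqs)))

def is_harmonic(component_freqs, f0, tol=1e-9):
    """True if every component is an integer multiple of f0."""
    ratios = np.asarray(component_freqs) / f0
    return bool(np.all(np.abs(ratios - np.round(ratios)) < tol))

even_ear = freqs[harmonics % 2 == 0]  # harmonics 6, 8, 10 to one ear
odd_ear = freqs[harmonics % 2 == 1]   # harmonics 7, 9 to the other ear
shifted = freqs + 100.0               # hypothetical constant shift in Hz

print("all components:", beat_rate(freqs), is_harmonic(freqs, F0))
print("even ear:", beat_rate(even_ear))
print("odd ear:", beat_rate(odd_ear))
print("shifted:", beat_rate(shifted), is_harmonic(shifted, F0))
```

Splitting even and odd harmonics across ears doubles the within-ear component spacing, whereas the constant shift leaves the spacing unchanged but means the components are no longer integer multiples of F0. A purely envelope-rate-based account would therefore predict an effect of the first manipulation and none of the second, the opposite of the behavioral pattern described in the text.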
These lines of evidence seem to rule out a class of simple models wherein listeners compare temporal-envelope rates to discriminate F0 in the present task. However, there are several notable differences between the present HCT stimuli and the stimuli used to probe temporal-envelope pitch (unresolved HCTs and modulated noises) that are worth considering further. First, whereas for unresolved HCTs and modulated noises the only reliable pitch cue is the temporal-envelope rate, the present HCT stimuli consist of partially resolved harmonics, which provide rate-place cues in addition to any temporal cues elicited by peripheral interaction between stimulus components. Second, whereas for unresolved HCTs and modulated noises all auditory-nerve fibers tuned to the stimulus should have strongly modulated responses, in the present HCT stimuli modulation power varies substantially over the range of CFs, with deeper modulations in channels tuned between stimulus components (Figs 1 and 5). As a result, for unresolved HCTs or modulated noises, the auditory system is restricted to comparing temporal-envelope modulation rates to perform discrimination. In contrast, for the present HCT stimuli, the auditory system could derive F0 estimates by combining information provided by average ANF rates (the excitation pattern) with information provided by the distribution of envelope modulation power across CFs (the so-called “fluctuation profile” [63]). Recent modeling studies suggest that such a strategy, based on decoding of simulated fluctuation profiles at the level of the midbrain, can in principle account for behavioral F0 discrimination in stimuli similar to the present low-frequency HCT stimuli [64] as well as performance in other psychoacoustical tasks [65].
At the same time, these prior modeling studies have focused on midbrain neurons tuned to modulation rates on the order of 100 Hz, while stimulus-envelope modulations in the present high-frequency HCT stimuli are at much higher rates (> 1 kHz), where sensitivity to amplitude modulations is thought to be limited [66–68]. Further investigation will be needed to determine whether the fluctuation-profile approach can be successfully extended to our high-frequency stimuli and reconcile the trends in the ideal observer with what is known from pitch psychophysics.
[END]
[1] Url:
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009889
(C) Plos One. "Accelerating the publication of peer-reviewed science."
Licensed under Creative Commons Attribution (CC BY 4.0)
URL:
https://creativecommons.org/licenses/by/4.0/