Measuring access to high-modulation-rate envelope speech cues in clinically fitted auditory prostheses.

The signal processing used to increase intelligibility for the hearing-impaired listener introduces distortions in the modulation patterns of a signal. Trade-offs have to be made between improved audibility and the loss of fidelity. Acoustic hearing impairment can cause reduced access to temporal fine structure (TFS), while cochlear implant processing, used to treat profound hearing impairment, has reduced ability to convey TFS, hence forcing greater reliance on modulation cues. Target speech mixed with a competing talker was split into 8-22 frequency channels. From each channel, separate low-rate (EmodL, <16 Hz) and high-rate (EmodH, <300 Hz) versions of the envelope modulation were extracted, which resulted in low or high intelligibility, respectively. The EmodL modulations were preserved in channel valleys and cross-faded to EmodH in channel peaks. The cross-faded signal modulated a tone carrier in each channel. The modulated carriers were summed across channels and presented to hearing aid (HA) and cochlear implant users. Their ability to access high-rate modulation cues, and the dynamic range of this access, were assessed. Clinically fitted hearing aids resulted in 10% lower intelligibility than simulated high-quality aids. Encouragingly, cochlear implantees were able to extract high-rate information over a dynamic range similar to that for the HA users.

Target speech mixed with a competing talker was split into 12-22 frequency channels. From each channel, separate low-rate (EmodL, < 16 Hz), and high-rate (EmodH, < 300 Hz) versions of the envelope modulation were extracted, which resulted in low or high intelligibility, respectively. The EmodL modulations were preserved in channel valleys, and cross-faded to EmodH in channel peaks. The cross-faded signal modulated a tone carrier in each channel. The modulated carriers were summed across channels and presented to hearing-aid and cochlear-implant users. Their ability to access high-rate modulation cues, and the dynamic range of this access, was assessed. Clinically fitted hearing aids resulted in 10% lower intelligibility than simulated high-quality aids. Encouragingly, cochlear implantees were able to extract high-rate information over a dynamic range similar to that for the hearing-aid users.

The dynamic range of acoustic signals processed by the healthy human hearing system, spanning the range between audibility and discomfort, is around 100 dB, but less below 200 Hz and above 10 kHz.

With sensorineural hearing impairment, discomfort thresholds exhibit a wide scatter for the same degree of impairment, but show little increase with increasing impairment until hearing threshold exceeds 60 dB HL (Storey & Dillon, 1998). This reduced dynamic range between audibility and discomfort causes recruitment (Fowler, 1936; Steinberg & Gardner, 1937), remediated in hearing aids (HA) by the use of multi-channel dynamic range compression (DRC) (Villchur, 1973; White, 1986). For profound hearing losses, direct electrical stimulation of the cochlea can be used to replace acoustic stimulation by using a cochlear implant (CI). Since the dynamic range between threshold of hearing and threshold of discomfort for the electrical signals presented is typically between 5 and 30 dB (Fu & Shannon, 1998; Loizou et al., 2000), DRC is also essential in CIs (Dillier et al., 1980; Wilson et al., 1988).

The effectiveness of DRC in the remediation of hearing impairment has resulted in debate (Plomp, 1988; Villchur, 1988) and much experimentation (see reviews in Souza, 2002; Moore, 2008).

Although the concept of DRC is old, the flexibility of digital signal processing has seen multiple fields of applicability and configurations proposed for the design of DRC circuits, initially in broadcast audio (McNally, 1984; Stikvoort, 1986; Giannoulis et al., 2012), but extending to hearing aids (reviews in Kates, 2005; Dillon, 2012a), and cochlear implants (Stöbich et al., 1999

Generally, the restoration-of-audibility-promotes-intelligibility argument proposes the use of fast time constants in multiple channels of DRC (e.g. Villchur, 1988) in order to promote audibility of low-level portions of the signal. The low-distortion-promotes-intelligibility argument favours slower time constants, preserving the fidelity of envelope modulations (Plomp, 1988) and, in the process, sacrificing short-term audibility. However, this sacrifice appears beneficial in noisy listening situations and where the richness of acoustic cues is impoverished, such as in noise-carrier vocoding (Stone & Moore, 2007).

It should also be noted that, apart from DRC, other non-linear signal processing that is used in hearing prostheses, such as adaptive noise reduction and adaptive directional microphones, distorts envelope modulation. Also, like DRC, these are implemented on a multi-channel basis and with varying time constants (Dillon, 2012b). Depending on choice of time constants, these additional signal processing strategies may also be expected to contribute to modulation distortion on perceptually-relevant timescales.

non-linear signal processing as contributing to the "noise", so their model requires a reference signal of the clean (unprocessed) target, which is not always practical. The alternative approach, used in the Jørgensen and Dau family of models (which also require a reference signal, but of the noise alone), only generates an estimate of the audible speech modulations, thereby ignoring intermodulations between target and background produced by non-linear processing. However, the estimated preserved speech modulations will incorporate effects of any non-linear processing, such as the reduced dynamics, and hence altered (long-term) SBR (Naylor & Johannesson, 2009). The successful prediction of results from these models indicates that reduction of the signal envelope power, as well as the addition of distortions and noise, is a major contributor to speech intelligibility, even without consideration of the supporting role of TFS (reviewed in Moore, 2014).

For a wearable hearing prosthesis, there is a need to assess the perceptual consequences, if any, of the device, and to assign the cause of any cost. In doing so, it is possible to identify areas where the technical design of the prosthesis, rather than perceptual limitations of the participant, affects performance.

For example, in a speech-intelligibility task, normal-hearing listeners were assessed either wearing hearing aids or unaided (Cubick & Dau, 2016; Cubick et al., 2018). The wearing of high-fidelity hearing aids appeared to show little or no disadvantage in a co-located masker condition, but produced worse performance in a separated masker condition (Cubick et al., 2018), when compared to unaided listening.

This pattern of results was attributed to distortion of spatial cues due to the non-ideal location of the microphones (behind-the-ear location and omni-directional pattern). These differences were smaller than those measured when a lower-bandwidth, lower-fidelity hearing aid was used (Cubick & Dau, 2016).

Besides the non-linearities often produced by analogue acoustic transducers and their amplifiers, the distortions introduced by signal processing can also be expected to produce modulation- and inter-modulation-distortion components. While these are physical components, their sources of origin, and the relationship between them, can also cause perceptual confusion. Stone et al. (2009) reported that the action of fast-acting DRC on a two-talker mixture required greater effort on the part of young, normal-hearing (NH) listeners to separate the keywords from the mixture. This they attributed to a loss of independence in the separate modulation patterns of the component talkers: previously independent sound sources had acquired a common component of modulation due to the fast-acting DRC, perceptually making them appear to be less separate.

One way to assess the degree of the perceptual consequences possibly produced by signal processing on modulations is by manipulating the ability of the listener to access them. High-rate envelope modulations (greater than about 15-30 Hz) appear to be an important contributor to speech intelligibility, at least in vocoder processing (Dudley, 1936; Whitmal, 2007; Souza & Rosen, 2009). Stone et al. (2010) reported measures of manipulating this perceptual access. They band-pass filtered a speech-in-competing-talker signal into either 8 or 15 contiguous channels. Within each channel, they low-pass filtered the full-modulation-bandwidth envelope to produce a restricted-modulation-bandwidth version of the same envelope. The resulting envelope was used to modulate a tone carrier at the centre of the respective channel, before recombining the individual channels. When the full-bandwidth envelope signal was used, the resulting intelligibility was high, but it fell markedly when the restricted-bandwidth version was used. Additionally, Stone et al. selectively switched in the restricted-bandwidth version as a function of short-term signal level within a channel. In one of their configurations, as the channel signal valleys were progressively filled with restricted-bandwidth information, intelligibility progressively decreased, mapping the ability of the listener to access information in the signal dips. This mapping, relating intelligibility to the relative level of the switch from restricted to full bandwidth, can be used to define an intensity-importance function (IIF, Boothroyd, 1990), a description of the relative importance of speech information as a function of level in a signal channel. Boothroyd

The second part of the experiment used the same processing technique with a group of CI users.

Compared to HA processing, the envelope distortion produced by CI processing is more complicated. In a CI, at least two stages of DRC are employed, the first acting on the short-term signal in a small number of frequency bands, and the second applying instantaneous DRC to the extracted channel envelope applied at each electrode (Wilson et al., 1988; Fu & Shannon, 1998). Instantaneous DRC can be expected to generate a whole series of distortion components in the modulation frequency domain. However, the ability to present the channel signal directly to frequency-specific regions of the cochlea, bypassing the channel mixing that has to occur prior to acoustic presentation, means that the distortion components may be at least partly cancelled out by the instantaneous non-linearity at the electrode-neural interface, if the correct mapping function is chosen. Early work with CIs showed that, although this mapping function does affect intelligibility, an exact match of the function mapping envelope amplitude to electrode current was not important to produce the highest intelligibility (Fu & Shannon, 1998

Additional prerequisites for eligibility for this group were:
1) that they should be successful users of their device, capable of achieving moderate to high sentence intelligibility (> 70%) at SBRs of +10 dB or less (speech-spectrum weighted noise).
2) that they had been using the device regularly for at least 1 year (> 8 hours/day, 5 days/week) and were happy with the fit.

The requirement for CI participants to be successful users of their device was necessary due to the need to test at SBRs where:
(a) within each channel, the (fluctuating) background would overlap with the dynamic range of the speech-plus-background signal that had previously been shown to be relevant for HI listeners (Stone et al., 2012a), typically between about -8 and +8 dB relative to the channel rms, and
(b) their word-intelligibility scores were sufficiently high (> 50%) that the participant was not demotivated by their apparent lack of ability in some (necessarily) low-scoring conditions.

CI participants with unilateral implants were tested through their implant alone. Eighteen CI participants (8 female) completed the testing. Their details are given in Table II

In each channel a logical 'switching signal' (binary-valued, 0 or 1) was created by comparing the instantaneous value of the L-filtered envelope with an adjustable 'switching threshold'. The switching signal was defined as being 1 if the L-filtered envelope was above the switching threshold and 0 otherwise. The switching signal was then filtered with a 2-pole, minimum-overshoot Bessel-derived low-pass filter, whose corner frequency was twice that of the L filter, to give a 10-90% rise time of 11.5 ms, except for low-bandwidth channels, where the corner frequency was scaled so that the rise time was three times the reciprocal of the channel centre frequency, so as to reduce the potential for production of high-level in-channel modulation products.
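The switching and cross-fading steps described above can be sketched as follows. This is a minimal illustration, not the processing actually used in the study: the function names are hypothetical, the envelopes and threshold are assumed to be in the same linear units, a one-pole low-pass stands in for the 2-pole Bessel-derived filter, and the cross-fade is assumed to be a linear weighting of the two envelopes.

```python
import math

def switching_signal(env_l, threshold):
    """Binary switch: 1.0 where the L-filtered envelope exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in env_l]

def smooth(x, fs_hz, corner_hz):
    """One-pole low-pass smoother (assumption: a simple stand-in for the
    2-pole, minimum-overshoot Bessel-derived filter described in the text)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * corner_hz / fs_hz)
    state = 0.0
    out = []
    for v in x:
        state += alpha * (v - state)
        out.append(state)
    return out

def cross_fade(env_l, env_h, fade):
    """Linear cross-fade (assumed form): fade = 0 keeps EmodL, fade = 1 gives EmodH."""
    return [(1.0 - f) * lo + f * hi for lo, hi, f in zip(env_l, env_h, fade)]

# Illustrative data: a channel envelope with a valley followed by a peak.
fs_hz = 1000.0
env_l = [0.2] * 50 + [1.0] * 50          # low-rate envelope
env_h = [0.25] * 50 + [1.1] * 50         # high-rate envelope (made-up values)
switch = switching_signal(env_l, 0.5)    # 0 in the valley, 1 in the peak
fade = smooth(switch, fs_hz, 32.0)       # corner at twice a 16-Hz L filter
mixed = cross_fade(env_l, env_h, fade)   # envelope that would modulate the tone carrier
```

Raising the threshold in this sketch extends the region in which only the low-rate envelope is delivered, which is the manipulation that the switching threshold controls in the experiment.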

These percentages are averages across channels 4 to 12 of the 16-channel HA processing, using an SBR of +8 dB, the average across the HA participants when tested using their own aids. These channels span the frequency range 400 to 4000 Hz (see Table III

Two versions of the processed training and test speech signals were generated, one with, and one without, the REIG applied to the source speech+interfering speaker signal before the vocoder processing.

The participant was encouraged to set the processor controls in anticipation, such that conversational speech from the experimenter was at a comfortable level, and to choose their regular clinical program for coping with speech in a moderately noisy environment. Once selected, these settings were checked for comfort during initial training, but not changed throughout the duration of the experiment itself. All signal input to the processor came via the device microphone(s).

Table IV(a) for the HA participants, and Table IV(b) for the CI participants. Table IV also

In general the data for the CI participants were much noisier than for the HA participants, with some data difficult to interpret (C1 and C14). C9 and C13 showed an excellent ability to use the high-rate cues, with a near-50% change in intelligibility, in the same range as achieved by the best HA participants. It was more common for the CI participants to exhibit changes of 25% or less (10/18 participants). The mean

The perceptual 'cost' of a clinically fitted versus a high-quality simulated hearing aid

Due to the bypassing of the miniature microphone and receiver in the SIM aid, as well as a lack of non-linear signal processing, it was expected that the SIM aid would perform at least as well as the OWN aid and usually better. The protocol therefore called for the use of either the same, or lower, SBR as used in the OWN condition. Consequently, 10 out of 19 participants were tested at a lower SBR in the SIM condition, but with only a 1 or 2 dB difference.

In order to compare the data from the two HA systems at an equal SBR, the scores in the SIM

Of the 19 points of the data set, only one point lies below the diagonal line of performance equality. The mean score for the OWN aid was 61.5%, while the mean corrected score for the SIM aid was 71.1%. The mean difference in scores was 9.6%, with an SD of 6.7%. However, the difference data were not normally distributed, so a Wilcoxon signed-rank test was performed, which revealed a significant difference, z(18) = 8, p < 0.0001. Using the normalised P-I function with the slope defined by β, the mean difference is equivalent to an SBR benefit of 1.0 dB in the SIM compared to the OWN condition.

A comparison of the difference scores, (PH - PL), between the SIM and the OWN conditions showed a mean difference of 1.8%, with an SD of 7.7%, which was not significant, t(18) = 1.00. The 1.0 dB performance benefit of SIM over OWN therefore seems to be due to better access to all modulation rates rather than to just the high-rate cues.

Despite the experiment being intended to measure individual performance, the individual data shown in Figs. 3 and 4 were 'noisy', making some results hard to interpret. Since PH and PL in Eqn. (2) represent asymptotic values that were well sampled in the data, there was a risk that β and MP largely co-varied so as to minimise the fitting error. We therefore created a perceptually relevant measure of how far below channel RMS level the participants were able to extract high-rate envelope information, which, at the same time, linked β and MP. This 'Valley' measure, V10, was defined as the value of switching threshold at which a 10% decrease in the relative change between PH and PL had been reached, i.e., the introduction of low-rate cues in the valleys was just starting to reduce intelligibility. Across participants, the range of (PH - PL) varied between 13.9 and 57.2%. Defining

Values for V10 are given in Table IV for all participants.

Data from three HA participants (H10, H12 and H18) were removed from this analysis. From Fig. 3 it is seen that these participants had at least one trace that did not reach an asymptotic value of PH, even by the lowest switching threshold tested. In previous work (Fig. 2), we observed asymptotic performance at about -15 dB relative to the channel RMS, and the associated V10 would be several dB

The Pearson correlation coefficient for these data was r = 0.601, 14 df (p < 0.02, two-tailed). This significant correlation implies that the participants were performing in a similar fashion between conditions. However, there was no significant difference between the V10 measures for the OWN and SIM conditions (mean = 1.59, SD = 4.64 dB, t(15) = 1.38, NS). We interpret this to mean that the similarity in performance was due to factors related to the participant and, disappointingly, not the change in processing between SIM and OWN conditions in permitting perceptual access to the valleys of the channel envelopes.
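The test statistics quoted above can be cross-checked from the summary values alone. The sketch below assumes that the reported SD is the standard deviation of the paired OWN-SIM differences and that 16 participants contributed (consistent with the 15 and 14 degrees of freedom); the function names are hypothetical.

```python
import math

def paired_t(mean_diff, sd_diff, n):
    """t statistic for a paired comparison from the mean and SD of the differences."""
    return mean_diff / (sd_diff / math.sqrt(n))

def t_from_r(r, df):
    """t statistic equivalent to a Pearson correlation r with df = n - 2."""
    return r * math.sqrt(df / (1.0 - r * r))

t_v10 = paired_t(1.59, 4.64, 16)   # V10 difference, OWN versus SIM
t_corr = t_from_r(0.601, 14)       # correlation between OWN and SIM V10 values
```

Under these assumptions the paired statistic comes out at about 1.37, close to the reported t(15) = 1.38 (the small gap is plausibly due to rounding of the summary values), and the correlation corresponds to t of about 2.81 on 14 df, in line with the reported p < 0.02.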

The possible influence of participant-related factors from the HA data set
Partial correlations were performed between Test SBR and a number of variables, while controlling for the score on the digit span test. The aim of this analysis was to establish the extent to which cognitive factors may mask some interesting relations. The different partial correlations assessed whether Test SBR was related to PH, access to high-rate envelope cues (PH - PL), mean low-frequency audiogram (averaged over 250 and 500 Hz), mean high-frequency audiogram (averaged over 2, 3 and 4 kHz), and difference between the low- and high-frequency mean audiogram (a measure of audiogram slope). These revealed a correlation between Test SBR and (PH - PL) (OWN aid, r = -0.530, 16 df, p = 0.024; SIM aid, r = -0.601, 16 df, p = 0.008, uncorrected), indicating that participants tested at a high SBR received less benefit from the high-rate envelope cues. We will return to this in the Discussion.

In the same set of partial correlations, a correlation was observed between PH and (PH - PL) (OWN, r = 0.597, 16 df, p = 0.009; SIM, n.s.). This hints that achieving a high difference score was limited by the starting value of PH, and that, with a low starting value, a possible floor effect was introduced in the all-L condition, despite the goal of adjusting the test SBR so that the all-H condition achieved a score fairly high on the psychometric function. In practice this goal was not always met; e.g., H15 had a PH of 28% in the OWN condition (as seen in Table IV

10.9% respectively. Using a two-tailed, unpaired t-test, the mean difference in (PH - PL) between the HA (OWN) and CI groups was 9.0% (standard deviation, SD, 3.5%), giving t(35) = 2.55, p = 0.015. CI users do not appear to be as able as the HA users to make use of high-rate modulation information. We will qualify this interpretation later.

The same V10 measure was generated from the CI data set using Eqn. (3). Data from participants C2 and C11 were excluded because their V10 measure (-28.6 and -18.2 dB, respectively) was less than -15 dB, and likely erratic. Figure 8 shows the histograms of the V10 measures for the three hearing prostheses (OWN, SIM and CI). The mean (SD) values in dB for each device were: OWN, -5.9 (4.9) with 17 participants; SIM, -6.6 (5.5) with 18 participants; and CI, -3.6 (5.7) with 16 participants. Pooling the data for the OWN and SIM conditions due to the non-significant difference reported in III.C.2 above, the difference between the HA and CI conditions gave a value of t(27) = -1.59, p = 0.10, also non-significant.
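The way V10 ties β and MP together can be illustrated with a short sketch. Eqns. (2) and (3) are not reproduced in this text, so the code assumes a logistic psychometric function that falls from PH to PL as the switching threshold sw rises through the midpoint MP with slope β > 0; this functional form, the parameter values and the function names are illustrative assumptions, not the authors' fitted model.

```python
import math

def logistic_score(sw_db, p_h, p_l, beta, mp_db):
    """Assumed psychometric function: score falls from p_h to p_l as sw rises past mp_db."""
    return p_l + (p_h - p_l) / (1.0 + math.exp(beta * (sw_db - mp_db)))

def v10(beta, mp_db, fraction=0.10):
    """Switching threshold at which the score has dropped by `fraction` of
    (p_h - p_l) below p_h, under the assumed logistic form."""
    return mp_db - math.log((1.0 - fraction) / fraction) / beta

# Illustrative (made-up) parameters: PH = 70%, PL = 30%, slope 0.5 per dB, MP = -2 dB.
threshold_db = v10(0.5, -2.0)
score_at_v10 = logistic_score(threshold_db, 70.0, 30.0, 0.5, -2.0)
```

In this assumed form V10 lies ln(9)/β dB below MP, so a shallow slope or a low midpoint both push V10 further into the valleys, which is why a single measure can stand in for the pair of co-varying parameters.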

The perceptual cost in accessing envelope modulations by using a clinically-fitted non-linear hearing aid compared to a high-quality simulated linear hearing aid was measured as being 1 dB in SBR. This is similar to the disadvantage found for discriminating co-located speech-in-speech masking when comparing binaural linear HAs against unaided listening (Cubick et al., 2018). Although these individual costs, 1 dB, appear small, because they come from differing aspects of the acoustic scenario (non-linear and binaural respectively), they have the potential to add up to a more significant disadvantage, especially if there are similar small disadvantages associated with other changes in the acoustic scenario.

Encouragingly, the benefit to intelligibility from high-rate modulation cues to both HA and CI users was similar on some measures, such as the perceptually relevant dynamic range over which these cues were available, but differed on others, such as the gain in intelligibility possible from these cues. In HA users, this gain in intelligibility was very similar to that previously reported by Stone et al. (2012), around 36%, but much less in CI users, around 26%. As with many studies, these interpretations need to be qualified.

However, the fact that this was observed while the SBR was still positive, compared to previously at negative SBRs, is probably related to our use of a speech, rather than a steady, masker. The speech masker has a wider distribution of short-term (10-ms or greater) levels than a continuous masker. The peaks of the speech masker extend up to 11 dB above the mean level (1% exceedance level, the level exceeded by the signal for only 1% of the time, Moore et al., 2008) and so can reach levels sufficient to interfere with the target speech in its mid-range levels while at positive SBRs.
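An exceedance level of the kind quoted above can be estimated from a sequence of short-term levels. The sketch below is illustrative only: the non-overlapping 10-ms rectangular frames and the function names are assumptions, and it is not the analysis method of Moore et al. (2008).

```python
import math

def short_term_levels_db(samples, fs_hz, win_s=0.010):
    """RMS level in dB of consecutive non-overlapping frames (assumed 10-ms frames)."""
    n = max(1, int(round(fs_hz * win_s)))
    levels = []
    for start in range(0, len(samples) - n + 1, n):
        frame = samples[start:start + n]
        power = sum(v * v for v in frame) / n
        levels.append(10.0 * math.log10(max(power, 1e-12)))  # floor avoids log(0)
    return levels

def exceedance_level(levels_db, percent):
    """Level exceeded by at most `percent` percent of the short-term levels."""
    ordered = sorted(levels_db, reverse=True)
    k = min(int(len(ordered) * percent / 100.0), len(ordered) - 1)
    return ordered[k]
```

Subtracting the long-term mean level from the 1% exceedance level gives the peak-to-mean figure referred to in the text, which is larger for a speech masker than for a steady masker.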

Using this finding (observed in the HA users), it is therefore possible that the benefit to intelligibility reported here by the CI users from the high-rate envelope cues may have been underestimated since CI users were, on average, tested at higher SBRs than those used for the HA users. A more nuanced comparison of test SBRs between the two groups is left to Section IV.B below.

previously noted, some of this difference was driven by a small number of the older CI participants tested at SBRs exceeding +15 dB. Excluding these participants altered the CI Test SBR mean and (SD) to 9.6 and (2.1) dB, reducing the difference to 1.7 dB (t(28) = 2.45, p = 0.02). Despite the difference in Test SBR, and apart from the difference in amount of benefit of access to high-rate envelope cues (Fig. 7), there was a lack of other differences between other measures of HA and CI performance, such as the across-group scatter in benefit of high-rate cues, and the similarity of the V10 measure. This suggests that, on average, the signal processing in CIs is leading to a perceptual performance using envelope modulations that is similar to the delivery of modulation information via an acoustic hearing aid. The perceptual cost of the simplification of modulation information for electrical stimulation does not render too much modulation information inaccessible (as inferred from the loss of about 1.7 dB SBR). It should be noted, however, that the CI participants were pre-selected to be at the better end of performance on clinical tests of speech intelligibility (greater than 70% word intelligibility at an SBR of +10 dB for BKB sentences presented in speech-spectrum shaped noise).

Conversely, there are a few participants whose data also exhibit this step change, but are poorly fitted to the data (H11 SIM, H12 SIM, H15 OWN, H15 SIM, and C1, C3, C10, C14, C15, C17). It is noteworthy that this degree of heterogeneity in response is similar across both groups. One possible explanation is variation in attention or fatigue during the test. Each condition was tested with 30 sentences, and took at

(2) Further interleaving of test conditions to overcome short-term variations in possible fatigue.

The ability to benefit from high-rate envelope modulations (> 16 Hz) in a two-talker separation task using tone-carrier vocoded processing was explored as a function of depth in the channel envelopes at which the high-rate information was made available. The Signal-to-Background Ratio (SBR) was adjusted for each participant in order to set best performance to about 70% so that the effect of processing was measured on the steepest part of the Performance-Intensity functions.

For HA participants, the 'cost' of a clinically fitted non-linear hearing aid over a simulated linear aid with the same insertion gain in accessing these higher-rate modulations was estimated as a 1.0 dB loss of SBR (for a competing talker background). The dynamic range of modulations made accessible by the processing did not seem to differ between the clinical and the simulated aids. There was no evidence of a distortion of high-rate cues by the OWN aid over and above the generally poorer performance across all rates of modulation.

The finding of a negative correlation between the Test SBR and the degree of benefit obtained from the high-rate modulations by the HA participants appears to be another example where the statistics of speech perception (as measured by modulation-based intensity importance functions, IIFs), and not just hearing ability, influence results. This is similar to the findings of Bernstein and Grant (2009), who explained their results in terms of full-audio IIFs. The similarities between the two unrelated experiments re-emphasise the need to control for Test SBR when comparing results across participants.

CI participants were much less able than HA participants to make use of high-rate envelope cues.

Apart from one star performer, C15, CI participants were generally tested at higher SBRs. The demonstration among HA participants that Test SBR influences the degree of benefit obtained suggests that even the lower degree of benefit from the high-rate cues among the CI participants could partly be due to speech statistics rather than the underlying deficits caused by the more severe hearing losses that are a pre-requisite for cochlear implantation.

A further note of optimism for the CI population was that the dynamic range over which they could access high-rate modulation cues was very similar to that of the HA population. Despite the highly synthetic signal delivered via a CI, and uncertainties about matching stimulation to the non-linearities at the neural interface, it is encouraging to see that CI processing is able to enable a functionally similar access to speech cues.