Improving hearing-aid gains based on automatic speech recognition

This study provides proof of concept that automatic speech recognition (ASR) can be used to improve hearing aid (HA) fitting. A signal-processing chain consisting of a HA simulator, a hearing-loss simulator, and an ASR system normalizing the intensity of input signals was used to find HA-gain functions yielding the highest ASR intelligibility scores for individual audiometric profiles of 24 listeners with age-related hearing loss. Significantly higher aided speech intelligibility scores and subjective ratings of speech pleasantness were observed when the participants were fitted with ASR-established gains than when fitted with the gains recommended by the CAM2 fitting rule.
© 2020 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). https://doi.org/10.1121/10.0001866
[Editor: Douglas D. O’Shaughnessy] Pages: EL227–EL233
Received: 14 March 2020; Accepted: 13 August 2020; Published Online: 8 September 2020


Introduction
Hearing-aid (HA) fitting is typically a two-stage process during which an initial gain fitting is applied according to a proprietary or device-independent method, followed by behavioral tests with the patient to fine-tune the fitting with the aim of maximizing speech intelligibility and listening comfort.
Fitting methods usually prescribe frequency-dependent amplification on the basis of the patient's audiometric thresholds. These target gains generally represent a compromise between providing enough amplification to restore audibility and taking into account the patient's reduced dynamic range in order to avoid presenting sounds at uncomfortable levels. Some fitting methods, such as DSL V5 (Scollie et al., 2005), are based on a loudness-normalization (LN) rationale, aimed at restoring the loudness perception of the patient to that of a normal-hearing listener. Other methods, such as NAL-NL2 (Keidser et al., 2011) and CAM2 (initially referred to as CAMEQ2-HF; Moore et al., 2010b), are based on a loudness-equalization (LE) rationale, aimed at equalizing loudness across frequency bands while maintaining the overall loudness at a comfortable level. While LN-based methods aim at maximizing sound quality and comfort, LE-based methods generally favor speech intelligibility. For example, CAM2 takes into account the general shape of the long-term average speech spectrum (Moore et al., 2008) and defines gains so that, for a given overall loudness, as much as possible of the speech signal falls above the listener's absolute hearing thresholds. In contrast, the NAL-NL2 gain-prescription rule was defined by using an adaptive computer-controlled process to find optimal gains in terms of loudness and speech intelligibility, as measured by the Speech Intelligibility Index (ANSI, 1997).
Even though initial fittings based on these gain-prescription rules lead to satisfactory outcomes (e.g., Moore and Füllgrabe, 2010; Valente et al., 2018), further fine-tuning is generally required to achieve the best results for a given HA user (Moore, 2007). This often involves the use of speech intelligibility tests, which can be time-consuming and, thus, make it impossible to evaluate many different HA settings (e.g., gain functions) to find their optimal combination in terms of speech intelligibility and listening comfort.
a) Author to whom correspondence should be addressed, ORCID: 0000-0001-7895-8567. b) Also at: Service d'Oto-Rhino-Laryngologie, d'Oto-Neurologie et d'ORL Pédiatrique, Centre Hospitalier Universitaire de Toulouse, 31059 Toulouse, France. c) ORCID: 0000-0003-2252-9258. d) ORCID: 0000-0001-9127-8136.
Automatic speech recognition (ASR) techniques represent a novel approach that, without being limited by the patient's time, motivational and attentional levels, and prior knowledge of the speech material, allows testing of a large number of combinations of HA gains in order to find the gain function yielding the highest speech intelligibility for a given audiometric profile. Moreover, contrary to other models used to evaluate HA settings, ASR does not require task-specific reference signals or conditions [see Kollmeier et al. (2016) for a more detailed discussion].
Recent studies demonstrated that ASR can be used to predict trends in speech intelligibility in listeners with age-related hearing loss (ARHL; Fontan et al., 2020; Kollmeier et al., 2016; Schädler et al., 2018). For example, Fontan et al. (2020) used a speech-intelligibility prediction system consisting of a signal-processing algorithm to simulate the effects of ARHL and an ASR system to predict the unaided identification performance of 24 listeners with ARHL for different speech materials (logatoms, words, and sentences). Based on each listener's audiogram, the speech materials were processed to simulate three perceptual consequences of ARHL (elevation of hearing thresholds, loss of frequency selectivity, and loudness recruitment), and the ASR system was used to compute intelligibility scores. Strong to very strong correlations were observed between human and ASR speech-intelligibility scores.
The aim of the present study was to explore whether the same speech-intelligibility prediction system could be used to improve HA gains for individual listeners. This was done by first generating for each listener a number of alternative gain functions based on the gain fitting rule CAM2 (i.e., variations around the CAM2 prescription), amplifying speech tokens using those functions, and determining their intelligibility using ASR. Note that the ASR system normalized the intensity of the input signals as a function of their maximum energy, with the energy being defined as the sum of the squared magnitude of the signal contained in the 16-ms analysis windows used by the ASR system. Consequently, ASR performance was not affected by the changes in overall intensity that occur when switching between different gain functions. The gain function yielding the highest intelligibility score (henceforth referred to as "OPRA" gains, where OPRA stands for "objective prescription rule based on ASR") and CAM2 gains were then used to produce amplified speech tokens that were presented to participants with ARHL for identification. Subjective ratings of speech pleasantness were also collected to verify that potential improvements in terms of speech intelligibility would not be obtained at the expense of reduced perceived quality of the speech signals.
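The effect of the energy normalization described above can be illustrated with a minimal sketch. It assumes non-overlapping 16-ms windows and a scaling that sets the most energetic window to unit energy; the actual framing and scaling used by the ASR system are not specified here, and all function names are hypothetical.

```python
import math


def frame_energies(signal, sample_rate, window_ms=16.0):
    """Energy (sum of squared samples) in consecutive, non-overlapping windows.

    Non-overlapping windows are an assumption made for this sketch.
    """
    if not signal:
        return []
    win = max(1, int(round(sample_rate * window_ms / 1000.0)))
    return [sum(x * x for x in signal[i:i + win])
            for i in range(0, len(signal), win)]


def normalize_by_max_energy(signal, sample_rate, window_ms=16.0):
    """Scale the signal so that its most energetic window has energy 1."""
    energies = frame_energies(signal, sample_rate, window_ms)
    peak = max(energies) if energies else 0.0
    if peak == 0.0:
        return list(signal)  # empty or silent input: nothing to normalize
    scale = 1.0 / math.sqrt(peak)
    return [x * scale for x in signal]


def apply_uniform_gain(signal, gain_db):
    """Apply the same gain (in dB) to every sample, i.e., at all frequencies."""
    factor = 10.0 ** (gain_db / 20.0)
    return [x * factor for x in signal]


# Toy input: a 440-Hz tone with a linear fade-in (not a stimulus from the study).
sample_rate = 16000
tone = [min(1.0, n / 1600.0) * 0.1
        * math.sin(2.0 * math.pi * 440.0 * n / sample_rate)
        for n in range(3200)]

reference = normalize_by_max_energy(tone, sample_rate)
boosted = normalize_by_max_energy(apply_uniform_gain(tone, 6.0), sample_rate)

# Largest sample-wise difference between the two normalized signals.
max_diff = max(abs(a - b) for a, b in zip(reference, boosted))
```

Under these assumptions, a uniform gain change is cancelled by the normalization (max_diff is zero up to rounding error), whereas a frequency-dependent gain change alters the spectral shape and therefore survives it.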

Participants
Twenty-four (10 female) right-handed, older (age range: 63-82 yr; mean age = 72.3 yr; standard deviation, SD = 5.5) native French speakers were recruited from a HA dispensing center in Montauban (France). Participants were selected for (i) being experienced HA users, having been fitted (for the most part according to the NAL-NL2 fitting rule; Keidser et al., 2011) in the test (i.e., right) ear for at least 7 months (on average 3 yr, 1 month), and wearing their HAs for at least 5 h per day (on average 10.2 h according to data logging); (ii) having symmetrical (≤ 10 dB interaural difference in pure-tone average, PTA, for frequencies of 0.5, 1, 2, and 4 kHz) sloping high-frequency mild-to-moderate hearing losses, not exceeding 75 dB hearing level (HL) at 4 kHz in the test ear; and (iii) having air-bone gaps ≤ 10 dB at 0.5, 1, 2, and 4 kHz in the test ear. Air- and bone-conduction audiometry was conducted with calibrated standard audiometric equipment in a sound-treated room in the HA dispensing center. Air-conduction pure-tone audiograms are shown in Fig. 1.
None of the participants reported any neurological disorder, and all passed the Cognitive Disorders Examination (CODEX), a validated short cognitive screen for dementia (Belmin et al., 2007). Prior to participation in the 1-h test session, each participant provided informed written consent. The study was approved by the ethical committee of the Honoré Cave Hospital (Montauban, France).

Stimuli and procedure
Speech intelligibility was assessed in quiet for the two speech materials most frequently used by HA audiologists in France for speech audiometry (Rembaud et al., 2017):
(i) Words consisting of disyllabic masculine nouns, each preceded by the French masculine definite article (e.g., "le soldat" - "the soldier"; Fournier, 1951).
(ii) Sentences taken from the French version of the Hearing in Noise Test (HINT; Vaillancourt et al., 2005) using simple words and a single assertive clause (e.g., "Le camion est rouge." - "The truck is red.").
Stimuli were calibrated recordings distributed by the Collège National d'Audioprothèse (2007). They were spoken by a single adult male native French speaker and recorded using a 44.1-kHz sampling rate and 32-bit quantization. They were presented to the participant seated in the same sound-treated room as used for the audiometric assessment, using a laptop running a custom presentation program written in Python, a Presonus Audiobox 44 VSL external sound card (Baton Rouge, LA), and the right earpiece of Sennheiser HD650 headphones (Wedemark, Germany).
In the "unaided" condition, stimuli were presented unprocessed at 60 dB sound pressure level (SPL).
In the CAM2 condition, based on each participant's audiogram and assuming an input level of 60 dB SPL, the gain for 12 center frequencies was computed using the CAM2B-v2 fitting software (Cambridge Enterprise, University of Cambridge, UK). Using a software loudness model of impaired hearing, the algorithm aims to produce a flattened specific loudness pattern over the frequency range from 0.5 to 4 kHz, which is the most important for speech perception (ANSI, 1997). For a detailed description, see Moore et al. (2010b). The gains were implemented in a HA simulator (Moore et al., 2010a), using five processing channels equally spaced on the ERB-number scale (Glasberg and Moore, 1990). The 12-point insertion gain response was implemented as a Finite Impulse Response filter before separation into channels. Two dynamic range compressors were implemented in series in each channel. The first performed the bulk of the prescribed wide dynamic range compression (WDRC) function, while the second was configured as a fast-acting limiter whose compression threshold tracked at a fixed offset above the running mean level measured in the first compressor. The limiter function was therefore only activated by occasional peaks (< 1% of the time) in the channel speech signal. The attack times were 200, 100, 100, 100, and 100 ms, while the release times were 2000, 1500, 1200, 1000, and 1000 ms for channels 1 to 5, respectively. The channel compression thresholds were 47, 44, 40, 35, and 45 dB SPL, respectively. For an input level of 60 dB SPL and speech in quiet, channel 5 would be activated mainly by the peaks within the channel. A small error in the programming of the HA insertion gain meant that the insertion gains used were suitable for a 55 dB SPL input. This meant that the actual replay level per channel was up to 3.3 dB higher than prescribed, depending on the channel compression ratio.
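As an illustration of what "equally spaced on the ERB-number scale" means, the sketch below converts frequency to ERB-number using the Glasberg and Moore (1990) formula and derives edge frequencies for five channels. The overall frequency limits (100 Hz and 10 kHz) are assumptions chosen for the example, not the values used in the HA simulator, and the function names are hypothetical.

```python
import math


def hz_to_erb_number(freq_hz):
    """ERB-number (in Cams) for a frequency in Hz (Glasberg and Moore, 1990)."""
    return 21.4 * math.log10(4.37 * freq_hz / 1000.0 + 1.0)


def erb_number_to_hz(erb_number):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (erb_number / 21.4) - 1.0) * 1000.0 / 4.37


def channel_edges_hz(low_hz, high_hz, n_channels):
    """Edge frequencies of n_channels bands of equal width in ERB-number."""
    low_erb = hz_to_erb_number(low_hz)
    high_erb = hz_to_erb_number(high_hz)
    step = (high_erb - low_erb) / n_channels
    return [erb_number_to_hz(low_erb + k * step) for k in range(n_channels + 1)]


# Assumed limits for illustration only; the simulator's actual band edges
# are not given in the text.
edges = channel_edges_hz(100.0, 10000.0, 5)
for lower, upper in zip(edges[:-1], edges[1:]):
    print(f"{lower:8.1f} Hz to {upper:8.1f} Hz")
```

Because the ERB-number scale is roughly logarithmic at high frequencies and closer to linear at low frequencies, equal steps on it give narrow channels at low frequencies and progressively wider channels toward high frequencies.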
In the OPRA condition, for each participant, the 12 frequency-specific gains prescribed by CAM2 were split into four frequency ranges (0.125-0.5 kHz, 0.75-1.5 kHz, 2-4 kHz, 6-10 kHz). Within each frequency range, the CAM2 gains were systematically varied by 0, ±3, or ±6 dB and applied to each of the three frequencies in a given frequency range, implemented with smoothed transitions between step changes in gain. This resulted in 625 (= 5^4) gain functions, including the original CAM2 gain function. The compression ratios were identical for all gain functions and corresponded to those used in the CAM2 condition. Each of the gain functions was implemented in the HA simulator to amplify 50 disyllabic words (Fournier, 1951) that were not used in the identification task performed by the participants. The processed speech was then used as input to a hearing-loss simulator implemented in MATLAB, mimicking some of the perceptual consequences of ARHL (Nejime and Moore, 1997). In the present study, only loss of audibility and loudness recruitment were simulated since the simulation of loss of frequency selectivity (which is also typically associated with hearing loss and leads to the "smearing" of spectral components) has been shown to lead to weaker associations between human and machine intelligibility (see Table III in Fontan et al., 2020). The severity of the degradation imposed by the hearing-loss simulator was dependent on the participant's audiometric thresholds. The amplified and then degraded stimuli were finally fed into an ASR system run on the OSIRIM platform (http://osirim.irit.fr/site/en), which is a cluster of 928 central processing units and 28 graphical processing units. The acoustic and language models used and the general functioning of the ASR system are described in Fontan et al. (2020). The ASR system used a lexicon consisting of 6491 disyllabic words [corresponding to the "large" lexicon described in Fontan et al. (2020)].
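The exhaustive search over gain functions described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the smoothed transitions between frequency ranges are omitted, the example gains are hypothetical values rather than CAM2 outputs, and score_fn is a hypothetical placeholder for the full chain (HA simulator, hearing-loss simulator, and ASR system). The assignment of the 12 gains to ranges assumes they are ordered by frequency, three per range.

```python
from itertools import product

OFFSETS_DB = (-6, -3, 0, 3, 6)   # variations applied within each frequency range
N_RANGES = 4                     # 0.125-0.5, 0.75-1.5, 2-4, and 6-10 kHz
FREQS_PER_RANGE = 3              # three prescription frequencies per range


def candidate_gain_functions(cam2_gains_db):
    """Yield (offsets, gains) for every combination of per-range offsets.

    cam2_gains_db holds 12 gains ordered by frequency. Step changes are
    applied without the smoothing used in the study.
    """
    if len(cam2_gains_db) != N_RANGES * FREQS_PER_RANGE:
        raise ValueError("expected 12 frequency-specific gains")
    for offsets in product(OFFSETS_DB, repeat=N_RANGES):
        gains = [gain + offsets[index // FREQS_PER_RANGE]
                 for index, gain in enumerate(cam2_gains_db)]
        yield offsets, gains


def best_gain_function(cam2_gains_db, score_fn):
    """Return the (offsets, gains) pair with the highest score.

    score_fn is a hypothetical stand-in for the ASR word recognition rate
    obtained after amplification and hearing-loss simulation.
    """
    return max(candidate_gain_functions(cam2_gains_db),
               key=lambda candidate: score_fn(candidate[1]))


# Hypothetical prescription gains in dB, for illustration only.
example_cam2_gains_db = [5, 8, 12, 15, 18, 22, 25, 28, 30, 30, 28, 25]
candidates = list(candidate_gain_functions(example_cam2_gains_db))
print(len(candidates))
```

Note that max returns the first candidate encountered when several share the top score, so any real use would need an explicit tie-breaking rule, for instance among gain functions that differ only by a uniform offset.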
The range of ASR performance (i.e., the difference between the maximum and minimum word recognition rates) across the 625 gain functions tested for each participant was, on average, 33.7 percentage points (SD = 14.9). In each of the four frequency ranges used, increases in ASR performance were observed for both positive (i.e., +3 or +6 dB) and negative (i.e., -3 or -6 dB) gain changes. As expected, given the use of the intensity-normalization process in the ASR system, all uniform variations of CAM2 gain functions (i.e., -3, +3, -6, or +6 dB at all frequencies) yielded the same ASR performance as that obtained with the original CAM2 gain functions.
The amplified speech obtained using the gain function yielding the highest machine intelligibility was then presented to the older hearing-impaired (OHI) participants for identification. Figure 2 illustrates the absolute and relative differences between OPRA and CAM2 gains. Compared to CAM2 gains, OPRA tended to provide more amplification for frequencies up to and including 4 kHz (+5.0 dB for frequencies ≤ 0.5 kHz, and +2.9 dB for frequencies > 0.5 kHz and ≤ 4 kHz), while providing less amplification at higher frequencies (-3.4 dB for frequencies > 4 kHz).
Participants were tested in all three listening conditions (unaided, CAM2, and OPRA), using both sets of speech materials. For each set of speech material, participants first completed the test in the unaided condition and then in the two aided conditions. The participants were pseudo-randomly allocated to one of the four possible orders (2 aided conditions × 2 speech materials) to obtain the same number of participants (N = 6) for each order. Prior to data collection in each of the test conditions, practice was provided in the form of 10 words or 10 sentences that were not used in the test phase.
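A balanced pseudo-random allocation of this kind can be sketched as below. The sketch assumes that the four orders are formed by crossing the order of the speech materials with the order of the two aided conditions; the function name, participant identifiers, and seed are hypothetical.

```python
import random
from collections import Counter


def allocate_orders(participant_ids, orders, seed=None):
    """Assign each participant to one order so that all orders are used equally often."""
    participant_ids = list(participant_ids)
    if len(participant_ids) % len(orders) != 0:
        raise ValueError("number of participants must be a multiple of the number of orders")
    rng = random.Random(seed)
    # Build a pool with each order repeated equally often, then shuffle it.
    pool = list(orders) * (len(participant_ids) // len(orders))
    rng.shuffle(pool)
    return dict(zip(participant_ids, pool))


# Assumed structure of the four orders: material order x aided-condition order.
ORDERS = [(materials, aided)
          for materials in (("words", "sentences"), ("sentences", "words"))
          for aided in (("CAM2", "OPRA"), ("OPRA", "CAM2"))]

allocation = allocate_orders(range(1, 25), ORDERS, seed=0)
counts = Counter(allocation.values())
```

Shuffling a pre-balanced pool, rather than drawing an order independently for each participant, guarantees equal group sizes (six per order for 24 participants).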
Participants reported back verbally the words they had heard and were instructed to guess in case they were uncertain. No feedback was provided. For both sets of speech materials, words correctly identified were scored manually by the experimenter. Word intelligibility in each listening condition was computed as the average identification performance for 50 words, corresponding to five of the 10-word lists (Lists 1-5, 7-11, or 13-17) developed by Fournier (1951). This number of words has been shown to yield satisfactory reproducibility of behavioural intelligibility outcomes (Moulin et al., 2016). Sentence intelligibility in each listening condition was computed as the average word-intelligibility score for one of the 20-sentence HINT lists (List 1, 2, or 3; Vaillancourt et al., 2005). Lists were counterbalanced across listening conditions. At the end of each aided listening task, the participants were asked to rate the pleasantness of the speech tokens on an 11-point Likert scale, ranging from 0 ("Very unpleasant/artificial") to 10 ("Very pleasant/natural").

Results
The top panel of Fig. 3 shows intelligibility scores for words and sentences for each of the three listening conditions. In the unaided condition, word-intelligibility scores were widely distributed, ranging from 0 to 92% (median: 46%). Unaided sentence-intelligibility scores were higher (median: 68%), and their distribution was bimodal, with five participants scoring below 10% and the others scoring above 38% (maximum: 96.9%).
For both sets of speech materials, intelligibility increased markedly with HA amplification, leading to ceiling effects in the two aided conditions. Median scores were marginally higher for OPRA gains than for CAM2 gains: 97% vs 96%, and 99.5% vs 97.1% for words and sentences, respectively. According to Wilcoxon signed-rank tests, the difference between gain prescriptions was significant for words (Z = 217, p = 0.003; two-tailed) and sentences (Z = 247, p = 0.001; two-tailed).
The bottom panel of Fig. 3 shows the subjective ratings of speech pleasantness for words and sentences in the two aided conditions. Ratings were lower in the CAM2 condition (median: 7.0 for both words and sentences) than in the OPRA condition (median: 8.3 and 8.5 for words and sentences, respectively). According to Wilcoxon signed-rank tests, this difference was significant for words (Z = 220, p < 0.001; two-tailed) and sentences (Z = 221, p = 0.002; two-tailed).
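For readers unfamiliar with the Wilcoxon signed-rank test used for these paired comparisons, the sketch below computes its rank sums. It is a minimal illustration with made-up scores, not the study data, and it does not compute p values; the function name is hypothetical.

```python
def signed_rank_sums(scores_a, scores_b):
    """Return (W_plus, W_minus, n) for paired samples.

    Zero differences are discarded and tied absolute differences receive
    the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        # Find the end of the current group of tied absolute differences.
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)
    w_minus = sum(rank for rank, diff in zip(ranks, diffs) if diff < 0)
    return w_plus, w_minus, len(diffs)


# Made-up percent-correct scores for six hypothetical listeners.
toy_opra = [98, 96, 100, 94, 99, 97]
toy_cam2 = [96, 96, 97, 95, 95, 92]
w_plus, w_minus, n_nonzero = signed_rank_sums(toy_opra, toy_cam2)
print(w_plus, w_minus, n_nonzero)
```

The two rank sums always add up to n(n + 1)/2 for n non-zero differences, so a large imbalance between them indicates a consistent direction of the paired differences. For actual analyses, a tested implementation such as scipy.stats.wilcoxon is preferable.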

Summary and discussion
This study provides evidence that a reference-free ASR system can be used to improve individually prescribed HA gains. Based on the audiograms of 24 OHI participants, two of the perceptual consequences of ARHL (elevation of hearing thresholds and loudness recruitment) were simulated, and a HA simulator was used to find the HA gain function yielding the highest ASR performance for each participant. A possible concern was that any increase in the level at the output of the HA simulation would result in better ASR performance, by compensating for the "noise" induced by the ARHL simulation and thus improving the "signal-to-noise ratio." This would have led to the maximization of the amplification gains (i.e., +6 dB at all frequencies) for each participant. However, uniform gain variations yielded the same outcome as the original CAM2 gain functions, showing that the ASR performance was not directly affected by the presentation level at the output of the HA simulation. Indeed, as the ASR system normalized the energy of the input signals, its performance was only affected by changes in the relative distribution of the signal energy across frequencies, which affects the probability of speech phones being correctly identified. In this study, such spectral distortions included the effects of the ARHL simulation, as well as non-uniform changes in the HA gain functions aiming at (partially) compensating these effects. As a result, the optimal variations of CAM2 gains found for each participant implied both gain increases and gain decreases in each of the four frequency ranges used in this study (i.e., 0.125-0.5, 0.75-1.5, 2-4, and 6-10 kHz). A significant difference in ASR performance was observed between OPRA gains and the gains prescribed by the CAM2 fitting rule (mean CAM2: 30.0%; mean OPRA: 43.3%; t[23] = 10.5, p < 0.001; two-tailed).
Fig. 3. (Color online) Speech intelligibility scores (top panel) and subjective judgments of speech pleasantness (bottom panel) for word and sentence materials (left and right side of the figure, respectively), using no amplification ("Unaided"), and the gains prescribed by CAM2 and OPRA. The nominal input level was 60 dB SPL. Individual and median scores are shown by the circles (with overlapping data points displaced laterally for better visibility) and the thick horizontal lines, respectively. The grey areas represent inter-quartile ranges.
OPRA gains yielded significantly higher human word- and sentence-intelligibility scores than the gains recommended by the CAM2 fitting rule. Even though these differences were statistically significant, they were small, possibly due to near-perfect performance of most participants in both aided conditions. In future studies, ceiling effects in human performance could be avoided by using a lower presentation level, which, however, would be less representative of standard conversational levels. Employing an adaptive procedure to track speech recognition thresholds in quiet is not an option, given the use of signal-intensity normalization techniques in ASR systems, making them insensitive to the overall level of the speech signal. Another possible way to avoid ceiling effects is to use linguistically more challenging speech materials than those used in the present study, or to present the target speech in interfering background noise. The applicability of ASR to the prediction of speech-in-noise intelligibility is currently under investigation.
While the statistically significant improvements in speech intelligibility with OPRA gains are probably not clinically significant due to their small size, the simultaneously obtained pleasantness judgements revealed markedly higher preference scores for speech amplified according to OPRA. This can be explained by the fact that, compared to CAM2, OPRA called for less amplification at high (> 4 kHz) and more amplification at low (≤ 4 kHz) frequencies, which is known to lead to higher judgements of pleasantness (e.g., Moore et al., 2011).
It should be noted that in the current study the speech material was presented in quiet at a single level. Hence, it remains to be shown whether OPRA benefits would also be observed for other presentation levels for speech in quiet and for speech in noise.
Also, this preliminary study was limited to the comparison of CAM2 and OPRA gains. Further work is warranted to compare OPRA outcomes (in terms of both machine and human speech intelligibility) to the outcomes obtained with other speech-intelligibility prediction systems, which are often less complex, such as the Speech Intelligibility Index used to define NAL-NL2 prescription gains (Keidser et al., 2011).
It is possible that ASR-based fitting could be further improved by refining the gain prescription, using narrower frequency ranges, and varying the compression ratio in each of these ranges. The present study aimed to avoid excessive computational costs by testing a single set of fixed compression ratios and by splitting the 12 frequency-specific CAM2 gains into four frequency ranges that were systematically varied to determine OPRA gains. Increasing processing power and/or processing optimization would allow for the use of a higher number of frequency ranges and compression settings (i.e., compression ratios, attack and release times of the compressor), as well as longer speech items (ideally produced by different speakers of both genders), which might improve the certainty of OPRA settings and the generalizability of the associated behavioural benefits. These benefits might be further increased by using other (e.g., more physiologically inspired) models of human peripheral auditory processing in the hearing-loss simulator and/or as the front end of the ASR system itself (e.g., Holmberg et al., 2007). Finally, the effects of normalizing input signals based on their loudness instead of their maximum energy should be explored.
In conclusion, the results obtained in this study provide proof of concept that ASR techniques can be used to improve, at no "experimental cost" to the HA user, the gains recommended by an established fitting rule, such as CAM2. Promisingly, this approach has led to the improvement of both speech intelligibility and speech pleasantness.