Reliability and critical differences for an implementation of the coordinate response measure in speech-shaped noise

This study established test-retest reliability and critical differences for an implementation of the coordinate response measure (CRM) for the purpose of detecting significant changes in task performance. In normal-hearing adults, speech stimuli were presented monaurally at 50 dB sound pressure level in speech-shaped noise at signal-to-noise ratios (SNRs) of -12, -9, and -6 dB. Two runs were obtained. Intrasubject and intersubject variability were examined. Performance increased significantly with increasing SNR and in the second run. High variability was observed at each SNR. Critical differences indicated that only large changes in performance would be significant for the CRM as implemented in this study.
© 2021 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). [Editor: Martin Cooke] https://doi.org/10.1121/10.0003050 Received: 11 September 2020 Accepted: 18 November 2020 Published Online: 13 January 2021


Introduction
The coordinate response measure (CRM; Bolia et al., 2000) is a closed-set speech discrimination task. It consists of a corpus of sentences spoken by male and female adult speakers of American English. Each sentence consists of a call sign followed by one of 32 color-number combinations (e.g., "Ready Charlie go to blue one now"). The listener's task is to correctly identify the color-number combination produced by a target talker, typically in the presence of noise. The task has been utilized for applications including speech discrimination in the presence of competing talkers (Brungart, 2001) and spatial release from masking (Jakien et al., 2017).
The current study was motivated by our recent work (Mertes et al., 2019) that utilized the CRM to investigate how activation of the contralateral medial olivocochlear (MOC) reflex contributes to speech-in-noise performance (for a recent review of the MOC reflex, see Lopez-Poveda, 2018). We utilized the CRM due to its low chance performance and limited contextual cues. Under the hypothesis that contralateral MOC reflex activation improves speech-in-noise understanding, we predicted that scores on the CRM would be significantly higher in the presence of a contralateral MOC reflex elicitor than in the absence of an elicitor. Some participants showed higher scores in the presence of the elicitor, but we did not establish test-retest reliability and critical differences, so we could not determine whether these changes in score represented true changes due to contralateral MOC reflex activity or fell within test-retest variability. The test-retest reliability and critical differences are also relevant for applications such as detecting significant decreases in score due to progression of hearing loss or detecting significant increases in score due to an intervention.
The purpose of the current study was to establish test-retest reliability and critical differences for the CRM using a similar implementation as our recent investigation (Mertes et al., 2019). The current study was considered descriptive in nature. The results will guide our future studies that examine the contribution of the MOC reflex to hearing in noise. The results will also serve as a reference for researchers interested in repeated measurements of the CRM.

Participants
The research protocol was approved by the Institutional Review Board of the University of Illinois at Urbana-Champaign. Participants were recruited from the University of Illinois campus using flyers, in-class announcements, and word-of-mouth advertising. Inclusion criteria consisted of the following: 18 to 30 yr old, English as a first language, normal or corrected-to-normal vision, unremarkable otologic history (no hearing loss, hearing asymmetry, family history of permanent childhood hearing loss, noise exposure within the past six months, tinnitus, vertigo, use of ototoxic medication, otalgia, otorrhea, or aural fullness), and passing a pure-tone air-conduction screening at 20 dB hearing level (HL) in both ears for octave frequencies 250-8000 Hz and interoctave frequencies of 3000 and 6000 Hz. Written informed consent was obtained from all participants prior to enrollment. Study visits lasted approximately one hour. Participants received either monetary compensation or extra credit for an approved university course.
a) ORCID: 0000-0002-8754-2122.
A total of 34 participants were enrolled. Two participants did not pass the pure-tone screening, and the first two eligible participants completed fewer trials per block than the remaining participants due to a change in the experimental protocol (described in Sec. 2.3), so their data were excluded. Therefore, data from 30 participants were included in the final analysis (mean age ± SD = 21.1 ± 1.4 yr, 29 females).

Stimuli
The CRM corpus was stored on a PC as .wav files with a sampling rate of 44 100 Hz and a bit depth of 16 bits per sample. To be consistent with our previous work (Mertes et al., 2019), we utilized the first male talker of the corpus (talker 0) and the call sign "Charlie" for each sentence. Stimuli were delivered to right ears using HD 600 circumaural headphones (Sennheiser) interfacing with a Windows-based PC, custom code written in MATLAB (The MathWorks, Inc.), and a Babyface Pro USB audio interface (RME). It should be noted that our previous study (Mertes et al., 2019) used insert earphones. Each sentence waveform had the same root-mean-square amplitude. Levels were calibrated to 50 dB sound pressure level (SPL) by presenting the concatenated sentences with no pauses through the headphones while coupled to a flat plate coupler assembly and a System 824 sound level meter (Larson Davis, Inc.).
Background noise consisted of the commercially available "Four Talker Noise" recording of one male and three female talkers speaking continuously (Auditec, Inc.). This noise stimulus has been utilized in a previous investigation of the CRM (Eddins and Liu, 2012) but differs from the noise used in Mertes et al. (2019). To eliminate informational masking for our investigations of the MOC reflex, we transformed the noise to speech-shaped noise by computing a fast Fourier transform (FFT) of the waveform, randomizing the phase values, and computing an inverse FFT. The amplitude of the noise waveform was scaled digitally to yield SNRs of -12, -9, and -6 dB. These SNRs were selected to avoid floor and ceiling effects based on preliminary testing conducted in six normal-hearing laboratory members.
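The noise processing described above can be sketched in a few lines. This is a minimal illustration and not the MATLAB code used in the study: it relies on a naive discrete Fourier transform from the Python standard library for clarity (an FFT routine would be used in practice), it assumes that SNR is defined from root-mean-square amplitudes, and the function names `phase_randomize` and `scale_noise_to_snr` as well as the stand-in signals are hypothetical.

```python
import cmath
import math
import random


def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def dft(x):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def phase_randomize(x, rng):
    """Keep the magnitude spectrum of x but draw new random phases."""
    n = len(x)
    spectrum = dft(x)
    shaped = [0j] * n
    shaped[0] = spectrum[0]  # DC bin is left unchanged
    for k in range(1, (n - 1) // 2 + 1):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        shaped[k] = abs(spectrum[k]) * cmath.exp(1j * phase)
        shaped[n - k] = shaped[k].conjugate()  # conjugate symmetry keeps the output real
    if n % 2 == 0:
        shaped[n // 2] = spectrum[n // 2]  # Nyquist bin is left unchanged
    return idft_real(shaped)


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale noise so that 20*log10(rms(speech) / rms(noise)) equals snr_db."""
    target_noise_rms = rms(speech) / (10.0 ** (snr_db / 20.0))
    gain = target_noise_rms / rms(noise)
    return [gain * v for v in noise]


rng = random.Random(0)
babble = [rng.gauss(0.0, 1.0) for _ in range(64)]  # stand-in for the talker recording
speech = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]  # stand-in for a sentence
noise = scale_noise_to_snr(speech, phase_randomize(babble, rng), -12.0)
achieved_snr_db = 20.0 * math.log10(rms(speech) / rms(noise))
```

Because only the phases are altered, the long-term magnitude spectrum of the original recording is retained while its temporal structure, and hence any intelligible speech, is removed.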
During testing, the noise was presented for 500 ms prior to the onset of the speech, which in our previous investigation (Mertes et al., 2019) was intended to allow for the onset of the MOC reflex (Backus and Guinan, 2006). At the end of the sentence, the noise was turned off after an additional 500 ms, which in our previous investigation was intended to allow for the offset of the MOC reflex (Backus and Guinan, 2006). During each sentence presentation, a random segment of the full noise waveform was presented, which reduced potential perceptual learning of the noise (Felty et al., 2009).
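The timing of a single presentation can be illustrated as follows. This is a sketch under stated assumptions rather than the study's implementation: the helper `build_trial`, the placeholder waveforms, and the low demonstration sampling rate are hypothetical, and only the 500-ms lead and lag and the random choice of noise segment come from the description above.

```python
import random

LEAD_S = 0.5  # noise-only interval before speech onset (from the text)
LAG_S = 0.5   # noise-only interval after speech offset (from the text)


def build_trial(speech, noise, fs, rng):
    """Mix speech into a randomly chosen noise segment with 500-ms lead and lag."""
    lead = int(round(LEAD_S * fs))
    lag = int(round(LAG_S * fs))
    total = lead + len(speech) + lag
    if total > len(noise):
        raise ValueError("noise waveform is shorter than one trial")
    start = rng.randrange(len(noise) - total + 1)  # new random segment on every call
    mixture = list(noise[start:start + total])
    for i, sample in enumerate(speech):
        mixture[lead + i] += sample
    return mixture, start


fs = 1000  # low rate to keep the demo small; the corpus itself was sampled at 44 100 Hz
rng = random.Random(1)
noise = [rng.uniform(-0.1, 0.1) for _ in range(10 * fs)]  # placeholder noise waveform
speech = [1.0] * (2 * fs)                                  # placeholder 2-s "sentence"
trial, start = build_trial(speech, noise, fs, rng)
```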

Experimental procedure
For this study, trial refers to the presentation of the carrier phrase "Ready Charlie go to" followed by a color-number combination that was randomly selected with replacement as in Brungart (2001). A trial was counted as correct when the participant selected the correct color-number combination. Block refers to a series of 20 trials. There were two blocks for each SNR. Run refers to a set of six blocks (3 SNRs × 2 blocks). Within a run, the order of blocks was randomized for each participant. For each SNR, the participant's performance was pooled across the two blocks (40 trials). For each participant, one run yielded three scores, one for each SNR. Scores were computed as percentage correct then transformed to rationalized arcsine units (RAU; Studebaker, 1985). This transform was performed to meet the assumption of homogeneity of variance for a repeated-measures analysis of variance (ANOVA).
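For readers unfamiliar with the RAU transform, a minimal sketch is given below. It uses the form of the rationalized arcsine transform as it is commonly stated for Studebaker (1985); the function name `rau` is hypothetical. With 40 trials per score, this form reproduces the floor and ceiling values of -15.71 and 115.71 RAU quoted in the Results.

```python
import math


def rau(correct, trials):
    """Rationalized arcsine transform of `correct` items out of `trials`."""
    theta = (math.asin(math.sqrt(correct / (trials + 1)))
             + math.asin(math.sqrt((correct + 1) / (trials + 1))))
    return (146.0 / math.pi) * theta - 23.0


floor_rau = rau(0, 40)     # score when no trials are correct
ceiling_rau = rau(40, 40)  # score when all trials are correct
mid_rau = rau(20, 40)      # 50% correct maps close to 50 RAU
```

The transform is nearly linear in percent correct over the middle of the range and stretches the extremes, which is what stabilizes the variance near 0% and 100%.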
All testing was conducted in a single-walled sound-treated booth while the participant was seated at a desk. Participants first underwent a practice session to familiarize themselves with the CRM task. Brief verbal instructions were provided by the experimenter, followed by on-screen instructions. Participants were encouraged to guess whenever they were unsure of the correct answer.
The practice consisted of 20 trials. The first 5 trials were in quiet, followed by 5 trials each at 0 dB SNR, -6 dB SNR, and -12 dB SNR (i.e., in order of increasing difficulty). After each trial, participants selected the color-number combination they heard using a touch screen monitor displaying a graphical user interface. Participants were required to obtain at least nine correct on the first 10 trials before proceeding to the remainder of the trials; otherwise, the practice started over. During the practice and the experimental sessions, participants were provided with on-screen text after each trial that indicated whether the response was correct or incorrect.
After the practice session, the experimental session began with an initial run ("run 1"). After run 1, the headphones were removed, and participants took a mandatory five-minute break. Participants then underwent a second run ("run 2") that was identical to run 1, except that the order of blocks was randomized again and the color-number combination at each trial was randomly selected with replacement.

Analysis
Statistical analyses were conducted using the MATLAB Statistics and Machine Learning Toolbox ver. 2019b (The MathWorks, Inc.), SPSS ver. 26.0.0.0 (IBM Corporation), and SAS ver. 9.4 (SAS Institute Inc.). An α value of 0.05 was selected for all statistical tests, with correction for multiple comparisons.
A two-way repeated-measures ANOVA was run to determine the effect of the factors of SNR (-12, -9, and -6 dB) and run (1 and 2) on score. Critical differences were computed to determine the minimum difference between two scores in an individual that can be attributed to a true change (e.g., due to MOC reflex activation) rather than measurement variability. Following the methods described in Xu and Cox (2014), 95% critical differences were obtained by multiplying the standard deviation of the difference scores (i.e., scores in run 2 minus run 1 collapsed across SNR) by 1.96.
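The critical difference computation can be sketched as follows. This is an illustration, not the analysis code used in the study: the function name `critical_difference` and the example scores are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption. The multiplier is taken from the standard normal distribution, which gives approximately 1.96 for the 95% level.

```python
from statistics import NormalDist, stdev


def critical_difference(run1, run2, level=0.95):
    """Two-sided critical difference: z multiplier times SD of the difference scores."""
    diffs = [b - a for a, b in zip(run1, run2)]  # run 2 minus run 1
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level=0.95
    return z * stdev(diffs)


# Hypothetical RAU scores for four listeners (not data from the study).
run1 = [50.0, 60.0, 70.0, 40.0]
run2 = [55.0, 58.0, 80.0, 45.0]
cd95 = critical_difference(run1, run2)

# Applying the same multiplier to the SD of 11.202 RAU reported in the Results.
cd95_from_reported_sd = NormalDist().inv_cdf(0.975) * 11.202
```

Changing `level` to 0.80, 0.90, or 0.99 gives the multipliers for the other critical differences listed in the Results.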
Test-retest reliability was assessed through Bland-Altman plots (Bland and Altman, 1986). Bland-Altman plots display the difference between two scores in an individual (run 1 minus run 2) against the average of the two scores. The mean difference scores across participants represent the bias or systematic deviation between measurements. A negative bias value indicates that scores tended to be higher in run 2 than run 1. When the 95% confidence interval (CI) for the bias does not contain zero, it indicates statistically significant bias. The Bland-Altman plots also display the 95% limits of agreement as the mean difference ± 1.96 standard deviations. The limits of agreement demarcate where 95% of the data points are expected to fall.
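The quantities shown in a Bland-Altman plot can be computed as in the sketch below. The function name `bland_altman` and the example scores are hypothetical. The text does not state how the CI for the bias was obtained, so the t-based interval here is an assumption; the caller supplies the critical t value (2.045 for the 29 degrees of freedom of a 30-participant sample, 3.182 for the four-listener example).

```python
import math
from statistics import mean, stdev


def bland_altman(run1, run2, t_crit):
    """Bias, CI for the bias, and 95% limits of agreement for paired scores."""
    diffs = [a - b for a, b in zip(run1, run2)]  # run 1 minus run 2, as in the text
    bias = mean(diffs)
    sd = stdev(diffs)
    half_ci = t_crit * sd / math.sqrt(len(diffs))  # assumed t-based CI for the mean
    return {
        "bias": bias,
        "bias_ci": (bias - half_ci, bias + half_ci),
        "limits_of_agreement": (bias - 1.96 * sd, bias + 1.96 * sd),
        "pair_means": [(a + b) / 2.0 for a, b in zip(run1, run2)],  # x axis of the plot
    }


# Hypothetical RAU scores for four listeners (not data from the study).
run1 = [50.0, 60.0, 70.0, 40.0]
run2 = [55.0, 58.0, 80.0, 45.0]
result = bland_altman(run1, run2, t_crit=3.182)
bias_is_significant = not (result["bias_ci"][0] <= 0.0 <= result["bias_ci"][1])
```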

Results
Mean scores across participants are plotted at each SNR and run in Fig. 1(A). Floor and ceiling effects (scores of -15.71 and 115.71 RAU, respectively) were not present, as intended. The assumptions of a two-way repeated-measures ANOVA were verified. No outliers were present, as indicated by all studentized residuals falling within ±3. The assumption of normality was met as assessed by Shapiro-Wilk tests of normality (p > 0.05 in all cases). The assumption of sphericity was met for the two-way interaction between SNR and run as assessed using Mauchly's test of sphericity [χ²(2) ...].
The standard deviation of the difference scores was 11.202 RAU, resulting in a 95% critical difference of 21.955 RAU. Figure 1(B) plots the scores in run 2 against run 1 with the 95% critical difference shown as dashed lines. Five percent of the data points are expected to fall outside of the 95% critical difference by chance, and it can be seen that four out of 90 data points (4.44%) fell outside. For reference, additional critical difference values were as follows: 80% = 14.361 RAU; 90% = 18.427 RAU; 99% = 28.856 RAU.
Figure 2 displays Bland-Altman plots for each SNR. Bias is shown as the dashed horizontal lines and the limits of agreement are shown as solid horizontal lines. At each SNR, 29 of 30 data points (96.7%) fell within the limits of agreement. Table 1 shows the bias, 95% CIs for the bias, and the 95% limits of agreement at each SNR. The CIs for the bias at -9 dB SNR did not contain zero, indicating significant bias. The direction of the bias indicated that scores tended to be higher in run 2 than run 1.
Several exploratory analyses were also conducted. One outcome of interest was to compare the standard deviation in scores to the standard deviation expected from the binomial distribution (Thornton and Raffin, 1978). Due to the small sample size, a bootstrapping procedure was conducted to obtain distributions of standard deviations.
At each SNR, 30 pairs of scores (in percent correct) from runs 1 and 2 were randomly sampled with replacement. Means and standard deviations were computed for each pair then averaged across the 30 values. This process was repeated 1000 times for each SNR. The bootstrapped standard deviations were compared to the standard deviation of a binomial distribution, $SD = 100 \times \sqrt{p(1 - p)/n}$, where p is the score expressed as a proportion and n is the number of trials (Thornton and Raffin, 1978). Figure 3(A) shows the bootstrapped standard deviations as a function of mean score, along with the standard deviation of the binomial distribution. It can be seen that the standard deviations tended to be largest for scores around 50% correct, as expected from the binomial distribution. Additionally, the average values lie above the binomial standard deviation, indicating higher than expected variability.
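The bootstrap comparison can be sketched as follows. This is one reading of the procedure described above, not the code used in the study: the function names and the simulated scores are hypothetical, and the use of the sample (n - 1) standard deviation for each pair of scores is an assumption.

```python
import math
import random
from statistics import mean, stdev


def binomial_sd(p, n):
    """Standard deviation (in percentage points) of a score of proportion p over n trials."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def bootstrap_pair_sd(run1, run2, n_boot, rng):
    """Resample (run 1, run 2) pairs with replacement; average per-pair mean and SD."""
    pairs = list(zip(run1, run2))
    results = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        avg_mean = mean(mean(pair) for pair in sample)
        avg_sd = mean(stdev(pair) for pair in sample)
        results.append((avg_mean, avg_sd))
    return results


rng = random.Random(0)
# Simulated percent-correct scores for 30 listeners (not data from the study).
run1 = [rng.choice(range(30, 75, 5)) / 1.0 for _ in range(30)]
run2 = [rng.choice(range(30, 75, 5)) / 1.0 for _ in range(30)]
boot = bootstrap_pair_sd(run1, run2, n_boot=1000, rng=rng)
expected_sd_at_half = binomial_sd(0.5, 40)  # binomial SD for 50% correct over 40 trials
```

Each bootstrap replicate yields one point (average mean score, average standard deviation) that can be compared with `binomial_sd` evaluated at the same mean score.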
Potential learning effects were examined by comparing scores across blocks separately for run 1 [Fig. 3(B)] and for run 2 [Fig. 3(C)]. A Friedman test indicated no statistically significant difference in scores across blocks for run 1 or for run 2.
Finally, the effect of color and number on performance at run 1 and run 2 was examined. The outcome of interest was correct selection of the color-number combination (a binary variable). A generalized linear mixed model analysis was conducted using the GLIMMIX procedure in SAS. The model included fixed effects of run (1 and 2), color (coded as 0-3), and number (coded as 0-7). The model also included two-way interactions for run × color, run × number, and color × number (the three-way interaction for run × color × number was not significant and therefore not included in this model), and a random intercept for each participant. The results are shown in Table 2. The main effects of run, color, and number were statistically significant. However, the interactions run × color [plotted in Fig. 3(D)] and color × number [plotted in Fig. 3(E)] were also statistically significant. For both interaction terms, all possible pairwise comparisons were computed. Due to the large number of comparisons, the p-values were adjusted for a 5% false discovery rate (Benjamini and Hochberg, 1995). Figure 3(D) demonstrates a significant increase in mean probability for the color blue from run 1 to run 2 [t(87) = -4.15, adjusted p-value = 0.0286] but not for the other colors. Figure 3(E) demonstrates a complex interaction between color and number on the probability of correctly selecting color-number. Of note, the color-number combination "red 8" had the lowest probability while "white 5" had the highest probability.
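The false discovery rate adjustment applied to the pairwise comparisons can be illustrated with the standard Benjamini-Hochberg step-up procedure. The sketch below is a generic implementation with a hypothetical function name and made-up p-values; it is not the SAS code used in the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for illustration
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]  # comparisons retained at a 5% false discovery rate
```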

Discussion
The purpose of this study was to establish test-retest reliability and critical differences for an implementation of the CRM to study effects of a contralateral MOC reflex elicitor on speech-in-noise perception. In the current study, scores increased significantly with increasing SNR as expected. Additionally, the mean scores at each SNR were consistent with our previous results (see Fig. 2 of Mertes et al., 2019).
It was expected that mean scores at a given SNR would not be significantly different between runs due to the short time interval between runs. Unexpectedly, we found a significant main effect of run where score was significantly higher at run 2 than run 1. Jakien et al. (2017) found a small but significant improvement in performance between the first and second runs that may have been due to learning effects. In the current study, scores were not significantly different across blocks [Figs. 3(B) and 3(C)], which could suggest minimal learning effects on score. An alternative explanation for the difference in performance between runs is a difference in the difficulty of the speech materials (Dillon, 1982). Brungart (2001) demonstrated that percentage correct varies for the different colors and numbers of the CRM corpus. The results of the linear mixed model analysis were consistent with these findings. Interestingly, the run × color interaction was significant because only the color blue demonstrated a higher probability of correctly selecting color-number combination for run 2 compared to run 1. It is unclear why only the color blue demonstrated this effect. Across all participants, there was a similar number of presentations of the color blue across runs (969 versus 924). It may be possible that there was perceptual learning of the talker's voice for the color blue, but it is unclear why learning would be isolated to one color.
Fig. 2. Bland-Altman plots for each SNR. Difference scores between runs are plotted against the average score between runs. The dashed horizontal line is the mean difference score. The solid horizontal lines represent the 95% limits of agreement. Unfilled circles are individual data points. Due to overlapping data points, the plotted values were randomly jittered within ±2% to improve visualization. Note the different abscissa values for each panel.
Other subject factors such as motivation and sleep may have also impacted the variability (Dillon, 1982) and should be considered in future studies.
It is of note that test-retest reliability for the British English version of the CRM was found to be acceptable when the researchers adjusted the intensity level of the individual color-number combinations to yield similar performance at a given SNR (Semeraro et al., 2017). We did not attempt to adjust the relative intensity of the individual sentences, but this approach could be considered in future studies.
The Bland-Altman analysis revealed that the bias was significantly different from zero only for -9 dB SNR. These results suggest that the main effect of run seen in the repeated-measures ANOVA was primarily driven by the results at -9 dB SNR. The Bland-Altman plots also demonstrated that the expected percentage of data points (95%) fell within the limits of agreement. The negative bias values indicated that scores tended to be higher in run 2, consistent with the main effect of run described above. Additionally, the limits of agreement were narrowest at -12 dB SNR, suggesting that low SNRs may allow for detection of smaller changes relative to higher SNRs.
For our investigations of the MOC reflex, we are interested in determining if the introduction of a contralateral MOC reflex elicitor can significantly improve CRM scores when the speech and masking noise are presented to the ipsilateral ear. The current results suggest that large changes in score due to a contralateral MOC reflex elicitor would be required in order to exceed the 95% critical difference. Specifically, only changes in score that exceed 21.955 RAU could be attributed to the MOC reflex, at least for the current implementation of the CRM. Mertes et al. (2019) acknowledged that they could not determine if the changes in score in their participants were due to contralateral MOC reflex activation. When applying the critical differences in the current study to the results shown in Fig. 3 of Mertes et al. (2019), only one to two participants out of 30 (depending on SNR) had changes in CRM score that could be attributed to contralateral MOC reflex activation rather than measurement variability. It must be noted that there were methodologic differences between Mertes et al. (2019) and the current study (noise type, number of trials, and transducers). Therefore, the current critical differences can only be used as an estimate for our previous study.
We chose a relatively low number of trials compared to other studies (e.g., Brungart, 2001; Eddins and Liu, 2012; Mertes et al., 2019). It has been established that increasing the number of trials reduces the standard deviation under the binomial distribution and also decreases the width of the critical difference (Thornton and Raffin, 1978). Additionally, we found that the standard deviations were higher than expected under the binomial distribution [Fig. 3(A)]. The number of trials in the current study was selected based on feedback from normal-hearing laboratory members who were previously unfamiliar with the task. These listeners reported that additional trials were fatiguing, especially at the lower SNRs. The tradeoff between number of trials and fatigue effects therefore should be carefully considered in future investigations. Additionally, we acknowledge that more trials could have been obtained across multiple sessions (e.g., as in Brungart, 2001). However, in studies of the MOC reflex, it is typical to obtain measures of speech-in-noise perception with and without a contralateral MOC reflex elicitor in the same session along with physiologic measures of MOC reflex activity (e.g., Mertes et al., 2019).
It must be noted that the reliability and critical differences should be empirically determined for other implementations of the CRM. Our future work in this area will establish test-retest reliability and critical differences for the CRM using other SNRs and masker types. Additionally, the use of other speech tasks could be considered. For example, Xu and Cox (2014) reported that the American Dialect Four Alternative Auditory Feature Test had a 95% critical difference of 12 RAU, which is smaller than the critical differences obtained in the current study. When the goal is to determine if a significant change in performance occurred (e.g., due to activation of a contralateral MOC reflex elicitor or due to a progression in hearing loss), the test-retest reliability and critical differences will be important to establish.