The impact of exploiting spectro-temporal context in computational speech segregation

Computational speech segregation aims to automatically segregate speech from interfering noise, often by employing ideal binary mask estimation. Several studies have tried to exploit contextual information in speech to improve mask estimation accuracy by using two frequently-used strategies that (1) incorporate delta features and (2) employ support vector machine (SVM) based integration. In this study, two experiments were conducted. In Experiment I, the impact of exploiting spectro-temporal context using these strategies was investigated in stationary and six-talker noise. In Experiment II, the delta features were explored in detail and tested in a setup that considered novel noise segments of the six-talker noise. Computing delta features led to higher intelligibility than employing SVM based integration and intelligibility increased with the amount of spectral information exploited via the delta features. The system did not, however, generalize well to novel segments of this noise type. Measured intelligibility was subsequently compared to extended short-term objective intelligibility, hit-false alarm rate, and the amount of mask clustering. None of these objective measures alone could account for measured intelligibility. The findings may have implications for the design of speech segregation systems, and for the selection of a cost function that correlates with intelligibility.


I. INTRODUCTION
The overall goal of computational speech segregation systems is to automatically segregate a target speech signal from interfering noise. These systems are relevant for many practical applications, e.g., as pre-processors in communication devices such as hearing aids or cochlear implants (Brungart et al., 2006; Li and Loizou, 2008; Wang et al., 2008) or front-ends in speech and speaker recognition systems for human-computer interfaces (Cooke et al., 2001; May et al., 2012a,b). One frequently used single-channel approach, termed the ideal binary mask (IBM) technique (Wang, 2005), separates a time-frequency (T-F) representation of noisy speech into target-dominated and interference-dominated T-F units. Given a priori knowledge about the target and the interfering signal, the IBM is constructed by comparing the signal-to-noise ratio (SNR) in individual T-F units to a local criterion (LC). The resulting IBM is a binary matrix in which T-F units with SNRs exceeding the LC are considered target-dominated and labeled one, and all other units are labeled zero. Many studies have used IBMs to segregate a target speech signal from a noisy mixture and demonstrated large intelligibility improvements (Brungart et al., 2006; Wang et al., 2008; Kjems et al., 2009). However, a priori knowledge about the target and the interfering noise is rarely available in realistic conditions, and therefore, the goal of segregation systems is to estimate the IBM based on the noisy speech signal. This challenge of obtaining an estimated IBM is typically approached by employing supervised learning strategies (Wang, 2005), which generally consist of a feature extraction front-end and a classification back-end. The front-end extracts a set of acoustic features that attempt to capture speech- and interference-specific properties. The distributions of speech- and interference-dominated T-F units are then learned by a classification back-end through an initial training stage (Kim et al., 2009; Han and Wang, 2012; Healy et al., 2013; May and Dau, 2014a).
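The IBM construction described above can be illustrated with a minimal Python sketch. It assumes that the target and noise energies per T-F unit are already available as nested lists; the function name `ideal_binary_mask` and the toy values are hypothetical and serve only as an illustration, not as the processing used in the cited studies.

```python
import math

def ideal_binary_mask(target_energy, noise_energy, lc_db):
    """Label a T-F unit 1 if its local SNR (in dB) exceeds the LC, else 0."""
    mask = []
    for target_row, noise_row in zip(target_energy, noise_energy):
        row = []
        for s, n in zip(target_row, noise_row):
            if s <= 0.0:
                snr_db = -math.inf          # no target energy in this unit
            elif n <= 0.0:
                snr_db = math.inf           # no noise energy in this unit
            else:
                snr_db = 10.0 * math.log10(s / n)
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Hypothetical toy energies for a 2 x 2 grid of T-F units.
target = [[4.0, 0.1], [1.0, 0.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
mask_lc0 = ideal_binary_mask(target, noise, lc_db=0.0)
mask_lc_minus5 = ideal_binary_mask(target, noise, lc_db=-5.0)
```

Lowering the LC labels more units as target-dominated: the unit with a local SNR of 0 dB is rejected for an LC of 0 dB but retained for an LC of -5 dB.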
When analyzing binary mask patterns, speech-dominated T-F units tend to cluster in spectro-temporal regions, forming so-called glimpses, and the size of these glimpses, denoted the glimpse proportion in the model by Cooke (2006), has been shown to correlate with speech intelligibility scores from normal-hearing (NH) listeners (Cooke, 2006; Barker and Cooke, 2007). Consequently, several studies have tried to exploit spectro-temporal contextual information in speech to improve the performance of computational speech segregation systems, predominantly using two strategies. One strategy is to exploit the context in the front-end by calculating so-called delta features (Kim et al., 2009; Hu and Loizou, 2010; May and Dau, 2014b), which capture feature variations across time and frequency. Alternatively, the context can be exploited in the back-end, where the posterior probability of speech presence obtained from a first classifier can be learned by a second classifier across a spectro-temporal window of T-F units, and the amount of spectro-temporal context is controlled by the size of the window function (Han and Wang, 2012; Healy et al., 2013; May and Dau, 2014a). Some studies have combined both strategies in the front-end and in the back-end (Healy et al., 2013; May and Dau, 2013).
The performance of computational speech segregation systems and the effectiveness of different system configurations have primarily been evaluated based on the hit-false alarm (H-FA) rate, which calculates the difference between the percentage of correctly classified speech-dominated T-F units [hit rate (H)] and the percentage of incorrectly classified noise-dominated T-F units [false alarm rate (FA)] (Kim et al., 2009; Han and Wang, 2012; Healy et al., 2013; May and Dau, 2013, 2014a,b). However, it has recently been shown that speech intelligibility scores strongly depend on both the distribution of mask errors and the H-FA rate (Kressner and Rozell, 2015, 2016). Specifically, Kressner and Rozell (2015) developed a graphical model to systematically measure the influence of clustering of T-F units on the intelligibility of binary-masked speech and showed that intelligibility was reduced when masks contained an increased amount of clustering among T-F units at the same mask error rates. Thus, the applicability of the H-FA rate as the sole objective measure to optimize or evaluate computational segregation systems has come into question. However, the impact of the different strategies for exploiting spectro-temporal context on the amount of clustering of T-F units, or on speech intelligibility, has not yet been analyzed.
Kim et al. (2009) were the first to report speech intelligibility improvements for a computational speech segregation system based on Gaussian mixture models (GMMs). They considered a high-complexity GMM classifier with 256 components in the back-end for modeling the distribution of the feature vectors in a restricted setup in which the same short noise recording was used during training and testing. By using such a setup, it was possible to achieve high H-FA rates and improve speech intelligibility scores by up to 60% compared to unprocessed noisy speech for NH subjects (Kim et al., 2009). A high-complexity classifier is able to learn all spectro-temporal characteristics of the noise if the same short noise recording is used during training and testing, resulting in high H-FA rates (May and Dau, 2014b) and, most likely, also the high intelligibility scores observed in Kim et al. (2009). The restricted setup therefore has a high potential to improve speech intelligibility and can be used to investigate the behavior of the segregation system by comparing different system configurations. The ability of segregation systems to generalize to unseen acoustic conditions, such as novel segments of the same noise and novel noise types, is, however, an important and active research field (Healy et al., 2015; Chen et al., 2016b) and needs to be addressed at the same time.
In the present study, two experiments were conducted by measuring word recognition scores (WRSs) in NH listeners. In Experiment I, the impact of exploiting spectro-temporal context in the front-end and the back-end of a segregation system, based on GMMs, was systematically investigated to identify the best performing strategy for the system. Specifically, the extraction of the delta features (Kim et al., 2009) was considered in the front-end, and the two-layer classification stage from May and Dau (2014a) was employed in the back-end. Different system configurations were compared, which incorporated spectro-temporal context only in the front-end, only in the back-end, or in both. These configurations were compared to a baseline configuration that did not include either strategy. This experiment was conducted in a restricted setup, similar to Kim et al. (2009), with high potential to improve speech intelligibility. Furthermore, the effect of the GMM classifier complexity in a segregation system was investigated by comparing the results obtained with 16 GMM components and 64 GMM components. In Experiment II, the best performing strategy from Experiment I was explored in detail, and the generalization ability was subsequently evaluated in a less restricted setup that considered a mismatch in noise segments between training and testing. Finally, the intelligibility scores from both experiments were related to predictions from three objective measures: the extended short-term objective intelligibility (ESTOI) (Jensen and Taal, 2016), the H-FA rate (Kim et al., 2009), and the amount of clustering among T-F units in binary masks (Kressner and Rozell, 2015). The primary focus of the latter analysis was to guide the selection of a cost function that correlates with speech intelligibility for future applications in computational speech segregation systems.

II. THE SEGREGATION SYSTEM
The segregation system consisted of a feature extraction front-end and a classification back-end (May et al., 2015). Figure 1 illustrates the processing stages of the system. Each of these stages is described in more detail below.

A. Front-end
The noisy speech was sampled at a rate of 16 kHz and decomposed into $K = 31$ frequency channels by employing an all-pole version of the gammatone filterbank (Lyon, 1996), whose center frequencies were equally spaced on the equivalent rectangular bandwidth (ERB) scale between 80 and 7642 Hz. Previous studies (Kim et al., 2009; May and Dau, 2014a; May et al., 2015) have successfully exploited modulations in the speech and the interferer by extracting amplitude modulation spectrogram (AMS) features (Kollmeier and Koch, 1994; Tchorz and Kollmeier, 2003). To derive the AMS features in each subband, the envelope was extracted by half-wave rectification and low-pass filtering with a cutoff frequency of 1 kHz. Then, each envelope was normalized by its median computed over the entire envelope signal. The normalized envelopes were then processed by a modulation filterbank that consisted of one first-order low-pass and five band-pass filters with logarithmically spaced center frequencies and a constant Q-factor of 1. The cutoff frequency of the modulation low-pass filter was calculated as the inverse of the window duration to ensure that at least one full period of the modulation frequency was included in the window, and subsequently adjusted to the nearest power of 2 integer (May et al., 2015). Using a time frame duration of 32 ms (1/0.032 s = 31.25 Hz) then resulted in a cutoff frequency of 32 Hz. The root-mean-square value of each modulation filter was then calculated across each time frame with a 75% overlap. The extraction of the AMS features resulted in a six-dimensional feature vector for each T-F unit, $A(t,f) = \{M_1(t,f), \ldots, M_6(t,f)\}^T$. The delta features across time ($\Delta_T$) and frequency ($\Delta_F$) can be appended to the feature vector $A(t,f)$ according to previous studies (Kim et al., 2009; Han and Wang, 2012; May and Dau, 2013), resulting in a feature vector $X(t,f)$ for each individual T-F unit at time frame $t$ and subband $f$ that consists of

$X(t,f) = [A(t,f), \Delta_T A(t,f), \Delta_F A(t,f)]$. (1)

Instead of the calculation in Eq. (1), delta features that only operate across frequency can be considered and appended symmetrically to the AMS features for a resulting feature vector $X(t,f)$, $\forall k \in \{n \in [1, K] \mid f + n \le K\}$. (2)

In Eq. (2), $k$ indicates the considered number of subbands in the calculation, and $K$ the number of gammatone filters. Appending the delta features to the feature vector in Eqs. (1) and (2) increased the amount of exploited spectro-temporal context, but also the size of the feature vector; e.g., appending $\Delta_T A(t,f)$ and $\Delta_F A(t,f)$ from Eq. (1) to $A(t,f)$ would increase the feature vector from 6 to 18 dimensions.
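Two steps of the front-end, the choice of the modulation low-pass cutoff and the growth of the feature vector when delta features are appended, can be sketched in Python as follows. The sketch rests on assumptions that are not stated in the text: the delta features are taken as first-order differences to the previous time frame or subband (with the first frame or subband using the difference to the next one), and "nearest power of 2" is evaluated on a logarithmic scale. The function names are hypothetical.

```python
import math

def modulation_lowpass_cutoff_hz(frame_duration_s):
    """Inverse of the frame duration, adjusted to the nearest power of two.
    ASSUMPTION: 'nearest' is evaluated on a log2 scale."""
    raw_hz = 1.0 / frame_duration_s              # 31.25 Hz for a 32 ms frame
    return 2 ** round(math.log2(raw_hz))

def _difference(a, b):
    return [x - y for x, y in zip(a, b)]

def append_deltas(ams):
    """ams[t][f] holds the 6 AMS values of one T-F unit (at least 2 frames
    and 2 subbands). Returns X[t][f] = A + delta_T A + delta_F A (18 values).
    ASSUMPTION: deltas are simple first-order differences."""
    n_frames, n_bands = len(ams), len(ams[0])
    features = []
    for t in range(n_frames):
        row = []
        for f in range(n_bands):
            a = ams[t][f]
            d_t = _difference(a, ams[t - 1][f]) if t > 0 else _difference(ams[1][f], a)
            d_f = _difference(a, ams[t][f - 1]) if f > 0 else _difference(ams[t][1], a)
            row.append(a + d_t + d_f)            # list concatenation: 6 + 6 + 6
        features.append(row)
    return features

cutoff_hz = modulation_lowpass_cutoff_hz(0.032)

# Hypothetical AMS values: 2 frames x 3 subbands x 6 modulation filters.
ams = [[[float(t + f + m) for m in range(6)] for f in range(3)] for t in range(2)]
X = append_deltas(ams)
```

In this toy example every AMS value increases by one per frame and per subband, so all twelve appended delta values equal 1.0, and each T-F unit carries 18 feature dimensions.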

B. Back-end
Similar to previous studies, the classification back-end consisted of a two-layer segregation stage (Healy et al., 2013; May and Dau, 2014a; May et al., 2015). In the first layer, a GMM classifier was trained to represent the speech- and noise-dominated AMS feature distributions ($\lambda_{1,f}$ and $\lambda_{0,f}$) for each subband $f$. To separate the feature vectors into speech- and noise-dominated T-F units, the LC was applied to the a priori SNR, and the a priori probabilities $P(\lambda_{1,f})$ and $P(\lambda_{0,f})$ were computed by counting the number of feature vectors for each of the classes $\lambda_{1,f}$ and $\lambda_{0,f}$ during training. The GMM classifier output was given as the posterior probability of speech and noise presence, $P(\lambda_{1,f} \mid X(t,f))$ and $P(\lambda_{0,f} \mid X(t,f))$, respectively. For each subband, the computed posterior probabilities of speech $P(\lambda_{1,f} \mid X(t,f))$ were processed by a linear support vector machine (SVM) classifier (Chang and Lin, 2011) across a spectro-temporal window $W$ (May and Dau, 2014a),

$\bar{X}(t,f) = \{P(\lambda_{1,u} \mid X(u,v)) : (u,v) \in W(t,f)\}$.

The size of the window $W$ determined the amount of spectro-temporal context exploited around the considered T-F unit. A causal and plus-shaped window function $W$ was used here, where the window size with respect to time and frequency was controlled by $\Delta t$ and $\Delta f$, respectively. Further details regarding the choice of the second-layer classifier and the size and shape of the window function $W$ can be found in May and Dau (2014a).
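The two back-end steps can be sketched in Python: combining class likelihoods and a priori probabilities into a posterior probability of speech via Bayes' rule, and collecting the T-F units covered by a causal, plus-shaped window. The exact window geometry is not given in the text, so the sketch assumes that the window spans the current frame and the preceding frames in the same subband (causal temporal arm) and a symmetric set of neighboring subbands in the current frame (spectral arm). The function names and this geometry are assumptions; see May and Dau (2014a) for the actual definition.

```python
def speech_posterior(lik_speech, lik_noise, prior_speech, prior_noise):
    """Bayes' rule for the posterior probability of speech presence,
    given the two class likelihoods of one feature vector and the priors."""
    numerator = prior_speech * lik_speech
    evidence = numerator + prior_noise * lik_noise
    # ASSUMPTION: fall back to 0.5 if both likelihoods vanish.
    return numerator / evidence if evidence > 0.0 else 0.5

def plus_window(t, f, delta_t, delta_f, n_bands):
    """Indices (u, v) of a causal plus-shaped window centered on unit (t, f).
    ASSUMPTION: delta_t counts frames up to and including t, and delta_f
    counts subbands centered on f; units outside the grid are dropped."""
    half = (delta_f - 1) // 2
    temporal_arm = [(u, f) for u in range(max(0, t - delta_t + 1), t + 1)]
    spectral_arm = [(t, v)
                    for v in range(max(0, f - half), min(n_bands, f + half + 1))
                    if v != f]
    return temporal_arm + spectral_arm

posterior = speech_posterior(lik_speech=0.2, lik_noise=0.1,
                             prior_speech=0.5, prior_noise=0.5)
window_units = plus_window(t=5, f=10, delta_t=3, delta_f=9, n_bands=31)
unity_window = plus_window(t=0, f=0, delta_t=1, delta_f=1, n_bands=31)
```

Under these assumptions, a window with a size of 3 frames and 9 subbands covers 11 T-F units, whereas a window size of unity reduces to the considered T-F unit itself, i.e., no spectro-temporal integration.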

A. Configurations
To systematically analyze the impact of spectro-temporal context strategies in the front-end and the back-end, four system configurations were tested in Experiment I (see Table I). The "No context" configuration denotes the baseline configuration with no delta features in the front-end and no spectro-temporal integration in the back-end, corresponding to setting the window size $W$ to unity ($\Delta t = 1$, $\Delta f = 1$). The "Front-end" configuration includes the delta features, while the "Back-end" configuration includes the second-layer classification stage in the back-end ($\Delta t = 3$, $\Delta f = 9$). The "Front-& back-end" configuration employs both the front-end and the back-end spectro-temporal context strategies.
FIG. 1. (Color online) Block diagram of the speech segregation system. The system consists of a feature extraction front-end and a classification back-end. In the front-end, the noisy speech is first decomposed by a gammatone filterbank. Then, AMS features are extracted and delta features are computed. The back-end consists of two layers with a GMM classifier in the first layer and an SVM classifier in the second layer. Finally, the estimated ideal binary mask is applied to the subband signals of the noisy speech, as illustrated by the dashed line, in order to reconstruct the target signal.
In Experiment II, the delta features were explored in detail in order to investigate the potential of this strategy in the segregation system. Four configurations were selected (see Table II). The system configuration "Front-end" is the baseline configuration for the analysis across frequency and appends only $\Delta_F A(t,f)$ to $A(t,f)$. The configurations "3 subbands," "7 subbands," and "11 subbands" append $k = 1$, $k = 3$, and $k = 5$ lower and upper subbands to $A(t,f)$, respectively, i.e., they span $2k + 1$ subbands in total.

B. Stimuli
The speech material came from the Danish Conversational Language Understanding Evaluation (CLUE) database (Nielsen and Dau, 2009). It consists of 70 sentences in seven lists for training and 180 sentences in 18 balanced lists for testing, and is spoken by a male Danish talker. Noisy speech mixtures were created by mixing individual sentences with a stationary (ICRA1) and a fluctuating six-talker (ICRA7) noise (Dreschler et al., 2001). A Long Term Average Spectrum (LTAS) template was computed based on the CLUE corpus and the LTAS of each noise masker was adjusted to the template LTAS. A randomly-selected noise segment was used for each sentence. In order to avoid onset effects in the intelligibility test (Nielsen and Dau, 2009), the noise segment started 1000 ms before the speech onset and ended 600 ms after the speech offset. However, the objective measures were computed only for the regions between speech onset and offset.
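The mixing procedure can be sketched in Python as follows. The sketch assumes that the SNR is defined from the mean power of the sentence and of the noise in the region between speech onset and offset, which is not stated explicitly above; the helper names and the toy signals are hypothetical.

```python
import math

def mean_power(x):
    return sum(v * v for v in x) / len(x)

def noise_gain_for_snr(speech, noise, snr_db):
    """Gain for the noise so that speech power / scaled noise power = snr_db."""
    return math.sqrt(mean_power(speech) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))

def mix_at_snr(speech, noise, snr_db, fs, start=0, lead_s=1.0, tail_s=0.6):
    """Mix a sentence with a noise segment that starts lead_s before the
    speech onset and ends tail_s after the speech offset.
    ASSUMPTION: the gain is derived from the speech-active region only."""
    lead, tail = int(lead_s * fs), int(tail_s * fs)
    total = lead + len(speech) + tail
    segment = noise[start:start + total]
    if len(segment) < total:
        raise ValueError("noise recording too short for the requested segment")
    gain = noise_gain_for_snr(speech, segment[lead:lead + len(speech)], snr_db)
    padded = [0.0] * lead + list(speech) + [0.0] * tail
    return [s + gain * n for s, n in zip(padded, segment)], gain

# Hypothetical toy signals at a sampling rate of 4 Hz, for illustration only.
speech = [1.0, -1.0] * 4
noise = [2.0, -2.0] * 10
mixture, gain = mix_at_snr(speech, noise, snr_db=-5.0, fs=4)
scaled_noise = [gain * n for n in noise[4:12]]
achieved_snr_db = 10.0 * math.log10(mean_power(speech) / mean_power(scaled_noise))
```

A randomly selected segment, as used in the study, would correspond to drawing `start` at random within the noise recording.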

C. System training and evaluation
In Experiment I, the segregation system was trained separately for the two noise types, each limited to 10 s in duration. Originally, the ICRA1 consists of a 60 s noise recording and ICRA7 of a 600 s recording (Dreschler et al., 2001). The first layer of the classification back-end consisted of a subband GMM classifier with either 16 or 64 components and full covariance matrices. The classifiers were first initialized by 15 iterations of the K-means clustering algorithm, followed by five (for 16 GMMs) or 50 (for 64 GMMs) iterations of the expectation-maximization algorithm. The classifiers were trained with the 70 training sentences that were each mixed three times with a randomly selected noise segment from the 10 s noise recordings at −5, 0, and 5 dB SNR. The subsequent linear SVM classifier was trained for each subband with only ten sentences mixed at −5, 0, and 5 dB SNR. Afterwards, a re-thresholding procedure was applied (Han and Wang, 2012; May and Dau, 2014a) using a validation set of ten sentences, where new SVM decision thresholds were obtained that maximized the H-FA rates. Both the first- and second-layer classifiers employed an LC of −5 dB, in line with previous studies (Han and Wang, 2012; May and Dau, 2014b). The segregation system was evaluated with the 180 CLUE sentences. Each sentence was mixed with the noises at −5 dB SNR using the same limited noise recordings as in training.
Experiment II only tested the highly non-stationary ICRA7 noise type in a less restricted setup. This noise type is more likely to challenge a speech segregation system than the stationary ICRA1. The full noise recording of 600 s was divided into two halves, one for training and one for testing. The training and evaluation were otherwise similar to Experiment I. The first layer of the classification back-end had a complexity of 16 GMMs with full covariance matrices. The complexity choice is discussed in Sec. V B.

D. Test procedure and subjects
In Experiment I, the following 24 conditions were tested: (Noisy speech, No context, Front-end, Back-end, Front-& back-end, IBM) × (ICRA1, ICRA7) × (16 GMMs, 64 GMMs). The total number of conditions (24) exceeded the number of available CLUE test lists (18). Therefore, to be able to randomly assign one condition to one test list, the experiment was conducted with two subject groups, each with $n = 15$ NH listeners. The first subject group was tested with the 12 conditions corresponding to the classifier complexity of 16 GMMs, and the second group was tested with the 12 conditions corresponding to 64 GMMs. The following five conditions were tested in Experiment II: Noisy speech, Front-end, 3 subbands, 7 subbands, and 11 subbands. The experiment was conducted with one subject group of $n = 20$ NH listeners that differed from the subject groups used in Experiment I. In this experiment, 13 other conditions were also tested that were not relevant to this study.
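The condition count and the split into two subject groups can be checked with a short Python sketch. The condition labels follow Table I; the function `assign_lists`, the seed, and the randomization scheme are hypothetical and only illustrate one way of assigning each condition of a group to a distinct test list.

```python
import random
from itertools import product

processing = ["Noisy speech", "No context", "Front-end", "Back-end",
              "Front-& back-end", "IBM"]
noise_types = ["ICRA1", "ICRA7"]
gmm_components = [16, 64]

# Full factorial design: 6 x 2 x 2 conditions.
conditions = list(product(processing, noise_types, gmm_components))

# One subject group per classifier complexity.
groups = {m: [c for c in conditions if c[2] == m] for m in gmm_components}

def assign_lists(group_conditions, n_lists=18, seed=0):
    """Randomly pair each condition with a distinct test list (hypothetical)."""
    rng = random.Random(seed)
    lists = rng.sample(range(1, n_lists + 1), len(group_conditions))
    order = list(group_conditions)
    rng.shuffle(order)
    return list(zip(order, lists))

assignment = assign_lists(groups[16])
```

With 12 conditions per group, each condition can be paired with its own list out of the 18 available test lists, which would not be possible for all 24 conditions within a single group.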
The listener age was between 20 and 32 yr with a mean of 24.5 yr in Experiment I and a mean of 26.7 yr in Experiment II. Requirements for participation were: (1) age between 18 and 40 yr, (2) audiometric thresholds of less than or equal to 20 dB hearing level (HL) in both ears (0.125 to 8 kHz), (3) Danish as native language, and (4) no previous experience with the Hearing In Noise Test (HINT) (Nielsen and Dau, 2011) or CLUE (Nielsen and Dau, 2009). The total experimental time was about 2 h in Experiment I and about 1.5 h in Experiment II, including the screening process. The subjects were paid for their participation. The experiments consisted of a training and a testing session. During the training session, five randomly selected sentences from the training set were presented for each of the 12 conditions to familiarize the subject with the task. Subsequently, each subject heard one list per condition, and conditions and lists were randomized across subjects. The sentences were presented diotically to the listener via headphones (Sennheiser HD650) in an acoustically and electrically shielded booth. Prior to the actual experiments, the headphones were calibrated by first adjusting to a reference sound pressure level (SPL) value and then performing a headphone frequency response equalization. During the experiment, the sentences were adjusted to the desired presentation level, and the equalization filters were applied. The SPL was set to a comfortable level of 65 dB. The presentation level was only increased after the training session if the subject reported that the level was too low. The level never exceeded 70 dB SPL for any subject. For each sentence, the subjects were instructed to repeat the words they heard, and an operator scored the correctly understood words via a MATLAB interface. The subjects were told that guessing was allowed. They could listen to each sentence only once, and breaks were allowed according to the subject's preference.

E. Statistical analysis
Intelligibility scores were reported as a percentage of correctly scored words, i.e., the WRS, at −5 dB SNR. The WRSs were computed per sentence and averaged across sentences per list. The averaged WRSs were used to construct a linear mixed effect model for each experiment. In Experiment I, the three fixed factors of the mixed model were the system configuration (four levels), the noise type (two levels), and the classifier complexity (two levels). The subjects were treated as a random factor, as is standard in a repeated-measures design. The intelligibility scores in Experiment I followed a normal distribution. All fixed effects, all interactions between fixed effects, and the random effect were initially included in the model. The model was then reduced by performing a backward elimination of all random and fixed interactions that were non-significant. This included all of the interaction terms between the random effect (subjects) and the fixed factors (configuration, noise type, and classifier complexity) and the interaction term between all three fixed factors. In Experiment II, the only fixed factor was system configuration (four levels) and subjects were treated as a random factor. The intelligibility scores in Experiment II also followed a normal distribution.
All levels were tested at a 5% significance level. To visualize the data, the least-squares means and 95% confidence intervals were extracted from the model. To assess any difference between conditions, the differences of the least-squares means were computed and the p values were adjusted using Tukey's multiple comparison procedure.
To evaluate potential speech intelligibility improvements, paired Student's t-tests between the noisy speech and each of the system configurations were constructed and tested at a 5% significance level.
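The WRS computation per list and the paired t statistic can be sketched in plain Python. The sketch does not reproduce the linear mixed effect model or the p values; it only shows how per-list scores and the paired test statistic could be obtained. The function names and the example numbers are hypothetical and are not data from the study.

```python
import math

def list_wrs(correct_words, total_words):
    """WRS (%) per sentence, averaged across the sentences of one list."""
    scores = [100.0 * c / n for c, n in zip(correct_words, total_words)]
    return sum(scores) / len(scores)

def paired_t(x, y):
    """Paired Student's t statistic and degrees of freedom for x - y."""
    differences = [a - b for a, b in zip(x, y)]
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    return mean / math.sqrt(variance / n), n - 1

# Hypothetical list with two five-word sentences: 5 and 4 words correct.
example_wrs = list_wrs(correct_words=[5, 4], total_words=[5, 5])

# Hypothetical per-subject scores for two conditions (three subjects).
t_value, dof = paired_t([1.0, 2.0, 4.0], [0.0, 0.0, 1.0])
```

The sign of the statistic depends on the order of the two conditions: if the noisy-speech scores are passed first and a system configuration second, an improvement yields a negative t value.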

F. Objective measures
Three different objective measures were compared to the intelligibility scores in each experiment: ESTOI (Jensen and Taal, 2016), H-FA rate (Kim et al., 2009), and the clustering parameter $\gamma$ (Kressner and Rozell, 2015). The ESTOI (Jensen and Taal, 2016) is a modified version of the short-term objective intelligibility (STOI) index (Taal et al., 2011) designed to better account for modulated noise maskers. The STOI metric is based on a short-term correlation analysis between the clean and the degraded speech (Taal et al., 2011), mapped to a value between 0 and 1. The ESTOI improvements ($\Delta$ESTOI) were reported here as the relative difference between the predicted ESTOI values for the processed and the unprocessed noisy speech baselines. To compute the H-FA rate, the correctly classified speech-dominated T-F units and incorrectly classified noise-dominated T-F units were derived by comparing the estimated IBM with the IBM. The H-FA rates and the ESTOI improvements were averaged across all 180 test sentences. The clustering parameter $\gamma$ was learned across all 180 test sentences by the graphical model described in Kressner and Rozell (2015). Given a set of binary masks, the graphical model estimates the amount of clustering $\gamma$ between T-F units within the masks as a single number. $\gamma$ quantifies how much more likely neighboring T-F units are to have the same label (speech-dominated or noise-dominated) as opposed to different labels. Therefore, binary masks with T-F units that are twice as likely to have the same label as their neighboring units than a different label would be described by $\gamma = 2.0$. Binary masks with T-F units that are equally likely to be in the same state as their neighbors would have $\gamma = 1.0$, indicating that the labels of the T-F units would be uniformly and randomly distributed. Therefore, a mask with $\gamma = 2.0$ will contain more clustering among the T-F units than a mask with $\gamma = 1.0$ (Kressner and Rozell, 2015). To illustrate the $\gamma$ parameter, Fig. 2 shows binary masks for one particular CLUE sentence mixed with ICRA7 noise at −5 dB SNR with the respective $\gamma$ values shown in parentheses. Figure 2(a) shows the IBM and Figs. 2(b)-2(e) present the estimated IBMs for the four tested system configurations listed in Table I. The two mask error types, misses and false alarms, are shown on top of the binary masks to visualize the error distributions. Comparing the masks for the four tested system configurations, the masks in Figs. 2(d) and 2(e) contain a larger amount of clustering than the masks in Figs. 2(b) and 2(c).
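The H-FA rate can be computed directly from an estimated mask and the IBM, as in the following Python sketch. The second function is only a crude, hypothetical proxy for clustering: it counts how often adjacent T-F units share a label relative to how often they differ. It is not the graphical model of Kressner and Rozell (2015), and its values should not be read as estimates of the clustering parameter reported in this study. Both functions and the toy masks are illustrative assumptions.

```python
def hit_minus_false_alarm(estimated, ideal):
    """H-FA rate (%) of an estimated binary mask relative to the IBM.
    ASSUMPTION: the IBM contains both speech- and noise-dominated units."""
    hits = false_alarms = n_speech = n_noise = 0
    for est_row, ideal_row in zip(estimated, ideal):
        for e, i in zip(est_row, ideal_row):
            if i == 1:
                n_speech += 1
                hits += int(e == 1)
            else:
                n_noise += 1
                false_alarms += int(e == 1)
    return 100.0 * hits / n_speech - 100.0 * false_alarms / n_noise

def same_label_odds(mask):
    """Ratio of same-label to different-label adjacent unit pairs.
    NOTE: an illustrative proxy only, not the graphical-model estimate."""
    pairs = []
    for t in range(len(mask)):
        for f in range(len(mask[t])):
            if t + 1 < len(mask):
                pairs.append((mask[t][f], mask[t + 1][f]))
            if f + 1 < len(mask[t]):
                pairs.append((mask[t][f], mask[t][f + 1]))
    same = sum(a == b for a, b in pairs)
    different = len(pairs) - same
    return same / different if different else float("inf")

# Hypothetical toy masks (2 x 4 T-F units).
ideal = [[1, 1, 0, 0], [1, 1, 0, 0]]
estimated = [[1, 0, 0, 0], [1, 1, 1, 0]]
h_fa = hit_minus_false_alarm(estimated, ideal)
odds_ideal = same_label_odds(ideal)
```

In the toy example, three of four speech-dominated units are hit (75%) and one of four noise-dominated units is a false alarm (25%), giving an H-FA rate of 50%.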

IV. RESULTS
A. Experiment I: Impact of exploiting spectro-temporal context
Figure 3 shows intelligibility scores obtained with the four system configurations ("No context," "Front-end," "Back-end," and "Front-& back-end") in the two noise types (ICRA1 and ICRA7) considered in Experiment I. Results are shown for the two classifier complexities, namely 16 GMMs in Fig. 3(a) and 64 GMMs in Fig. 3(b). The condition with the unprocessed noisy speech (diamonds) represented the baseline, and the IBM condition (stars) was considered as the ideal reference. For the baseline and the ideal reference, sample means across subjects and 95% Student's t-based confidence intervals of the mean were computed. For the system configurations, the least-squares means and 95% confidence intervals from the fitted linear mixed effect model were considered.
The baseline in Fig. 3 differed across noise types, with WRS of about 50%-55% for the stationary ICRA1 and 65% for the fluctuating ICRA7, presumably because the participants were able to listen in the dips in the six-talker noise. For the IBM conditions, WRS of close to 100% was achieved for both noise types. This was expected as the IBM exploited the a priori information about the speech and the noise signals.
There was an effect of system configuration depending on the classifier complexity and on the noise type. Most importantly, the "Front-end" configuration led to significantly higher intelligibility scores than the "Back-end" configuration for both noise types and both classifier complexities (p < 0.0001). Specifically, the WRS increased by 18.0% in ICRA1 and 23.1% in ICRA7 with 16 GMMs [Fig. 3(a)], and 28.8% in ICRA1 and 34.0% in ICRA7 with 64 GMMs [Fig. 3(b)]. This particular finding suggests that extracting and appending the delta features to the AMS features in the front-end is a more effective way of exploiting spectro-temporal contextual information than using the SVM-based integration strategy in the back-end. In all four combinations, except with 16 GMMs in the case of the ICRA1 noise, the "Front-end" configuration led to significantly higher scores than the "No context" configuration, which emphasizes that exploiting contextual information in the front-end of the system is more effective than not considering any strategy at all. Finally, the "Front-& back-end" configuration also led to significantly higher scores than the "Back-end" configuration in all four combinations of noise type and classifier complexity. However, the mean scores for the "Front-& back-end" configuration were generally lower than for the "Front-end" configuration. This suggests that employing both strategies exploits spectro-temporal context more effectively than employing the SVM-based integration strategy in the back-end alone, but that the combination of the two strategies does not lead to better results than the front-end strategy alone.
There was also an effect of the classifier complexity that depended on the system configuration and the noise type. By comparing the results in Figs. 3(a) and 3(b), significantly higher scores were obtained for the "Front-end" configuration with 64 GMMs than with 16 GMMs for both noise types. Specifically, the WRS increased by 12.6% in ICRA1 (p < 0.05) and 19.5% in ICRA7 (p < 0.0001). However, the scores for the "Back-end" configuration did not change significantly across classifier complexity for either noise type. Most importantly, the ranking of the system configurations remained unchanged across classifier complexity.
The measured intelligibility scores from Fig. 3 were converted into WRS improvements relative to the unprocessed noisy speech, $\Delta$WRS. Figures 4(a) and 4(b) show $\Delta$WRS as a function of the system configuration, noise type, and classifier complexity. Significant improvements, based on the paired Student's t-tests, are indicated by an asterisk (*). Significant improvements of about 50% for ICRA1 and 35% for ICRA7 over noisy speech were obtained with the IBM. For 64 GMMs in Fig. 4(b), the configurations "No context" (t[14] = −2.16, p = 0.02), "Front-end" (t[14] = −4.29, p < 0.001), and "Front-& back-end" (t[14] = −2.82, p = 0.007) led to significant improvements for ICRA1, and for ICRA7, only the "Front-end" configuration (t[14] = −7.44, p < 0.001) led to a significant improvement. To evaluate the potential of the objective measures, the measured intelligibility scores were related to predictions from each of the objective measures described in Sec. III F.

4(f)] and $\gamma$ [Figs. 4(g) and 4(h)] in Experiment I. $\Delta$ESTOI indicates the increase in ESTOI relative to the unprocessed noisy speech. The largest predicted improvement was observed for the configuration "Front-& back-end," and the lowest predicted improvement was found for the "No context" configuration in all combinations of noise type and classifier complexity level. This is in conflict with the measured $\Delta$WRS in Figs. 4(a) and 4(b), where the "Front-end" configurations led to the largest improvements. By comparing Figs. 4(c) and 4(d), it can be seen that larger ESTOI improvements were generally observed with 64 GMMs compared to 16 GMMs. This is consistent with the measured WRS improvements in Figs. 4(a) and 4(b).
Figures 4(e) and 4(f) show the H-FA rates. The segregation system generally produced higher H-FA rates in the presence of the stationary noise than in the presence of the non-stationary six-talker noise. The six-talker noise contains spectro-temporal modulations similar to those in the target speech signal, which makes it more difficult for the classifier to separate the speech modulations from the noise modulations. In all combinations of noise type and classifier complexity, the lowest H-FA rates were observed for the "No context" configuration and the highest H-FA rates were found for the "Front-& back-end" configuration. Also, larger H-FA rates were obtained for the "Back-end" than for the "Front-end" configuration, which is not consistent with Figs. 4(a) and 4(b). Furthermore, higher H-FA rates were obtained with 64 GMMs in Fig. 4(f) than with 16 GMMs in Fig. 4(e). A comparison with the measured WRS improvements in Figs. 4(a) and 4(b) indicated a conflict with the H-FA prediction, since the "Front-end" configuration led to the highest intelligibility scores, but not the highest H-FA rates.
The γ values from system configurations that exploited spectro-temporal context through the SVM-based integration strategy in the back-end ("Back-end" and "Front-& back-end") were consistently larger than the γ values learned over masks from the "Front-end" and the "No context" configurations. Furthermore, the "Front-end" configuration did not lead to larger γ values than the "No context" configuration. This suggests that computing delta features in the front-end does not increase the amount of clustering, in contrast to employing a spectro-temporal SVM-based integration strategy in the back-end. The effect of exploiting spectro-temporal context in binary masks was visualized in Fig. 2 in Sec. III. Figures 2(d) and 2(e) showed masks with a larger amount of T-F clustering than the masks in Figs. 2(b) and 2(c), and a visual inspection of the example utterance indicated that the erroneous T-F units became more clustered in Figs. 2(d) and 2(e). Finally, a comparison of Figs. 4(g) and 4(h) suggests that the amount of clustering in the mask is not affected by the classifier complexity in the segregation system, as γ remains unchanged.
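The H-FA rate discussed above is the hit rate (the percentage of target-dominated T-F units in the IBM that the estimated mask also labels one) minus the false-alarm rate (the percentage of interference-dominated units that the estimated mask wrongly labels one). A minimal sketch of that computation is given below, assuming both masks are available as binary arrays of identical shape. The function name `hit_false_alarm_rate` and the array layout are illustrative assumptions and are not taken from the study.

```python
import numpy as np


def hit_false_alarm_rate(estimated_mask, ideal_mask):
    """Return (hit rate, false-alarm rate, H-FA), all in percent.

    Both inputs are binary T-F masks of the same shape, where 1 marks a
    target-dominated unit. The names and layout are illustrative assumptions.
    """
    est = np.asarray(estimated_mask, dtype=bool)
    ibm = np.asarray(ideal_mask, dtype=bool)
    if est.shape != ibm.shape:
        raise ValueError("masks must have the same shape")

    n_target = ibm.sum()           # units labeled one in the IBM
    n_interference = (~ibm).sum()  # units labeled zero in the IBM

    # Hits: target-dominated units that the estimate also labels one.
    hit = (est & ibm).sum() / n_target if n_target else 0.0
    # False alarms: interference-dominated units wrongly labeled one.
    fa = (est & ~ibm).sum() / n_interference if n_interference else 0.0

    return 100.0 * hit, 100.0 * fa, 100.0 * (hit - fa)
```

Because the two error types are only counted, the measure is blind to where the errors fall in the T-F plane, which is why two masks with similar H-FA rates can differ in how strongly their errors are clustered.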
FIG. 3. (Color online) Experiment I's WRSs at −5 dB SNR of the four different system configurations ("No context," "Front-end," "Back-end," and "Front-& back-end") for the two noise types (ICRA1 and ICRA7) and for the two classifier complexities plotted in panel (a) (16 GMMs) and panel (b) (64 GMMs). The condition with the unprocessed noisy speech represented the baseline and the IBM condition was considered as the ideal reference. For the baseline and the ideal reference, sample means across subjects and 95% Student's t-based confidence intervals of the mean were computed. For all system configurations in all combinations of noise type and classifier complexity, the least square means and 95% confidence intervals from the fitted linear mixed effect model were plotted.

B. Experiment II: Exploring delta features and the system generalization ability
Figure 5 shows intelligibility scores obtained in Experiment II with the four system configurations ("Front-end," "3 subbands," "7 subbands," and "11 subbands") tested in the less restricted setup in ICRA7 noise. For all four configurations, the Δ_T A(t, f) from Eq. (1) was not appended to the feature vector in Eq. (2). This decision was based on an analysis of the objective measures prior to Experiment II, which showed no change in the objective measures when Δ_T A(t, f) was left out. In Fig. 5, the level of the noisy speech was consistent with the level in Experiment I for ICRA7 (see Fig. 3). In this experiment, there was an effect of system configuration. The intelligibility scores were significantly higher for the "3 subbands" configuration than for the "Front-end" configuration, by 10.7% (p < 0.01), and increased by a further 8.2% (p < 0.05) from the "3 subbands" to the "7 subbands" configuration. The "7 subbands" and the "11 subbands" configurations did not differ significantly. This finding indicated that appending more subbands, as proposed in Eq. (2), can lead to significantly higher intelligibility until a plateau at k = 5 with "11 subbands."
Figure 6 presents the intelligibility improvements and objective measure predictions for Experiment II. In Fig. 6(a), the paired Student's t-tests showed that all system configurations led to significantly lower intelligibility scores than the noisy speech, despite an increase in intelligibility over appended subbands. Therefore, none of the system configurations were able to improve speech intelligibility in the less restricted setup. Since this setup included novel noise segments in testing not seen during training, this suggested that the segregation system did not generalize well to unseen noise segments of the six-talker noise.
In Fig. 6(b), all predicted ΔESTOI values were positive, and the largest predicted improvements were observed for the configurations "7 subbands" and "11 subbands." This was not consistent with results from the listener study in Fig. 6(a), where no WRS improvements were observed, which highlights the discrepancy between predicted and measured intelligibility improvements in this study. The H-FA rate in Fig. 6(c) increased with the number of appended subbands, although the rates were comparable for "7 subbands" and "11 subbands." As observed in Experiment I, a small change in H-FA had a large impact on the measured intelligibility scores. This was illustrated by comparing Fig. 4(e) for the ICRA7 noise and Fig. 6(c). An H-FA rate of 35.3% in Fig. 4(e) corresponded to a 4.5% decrease in WRS for the "Front-end" configuration, whereas an H-FA rate of 33.6% in Fig. 6(c) corresponded to a 31.1% decrease in WRS over noisy speech. With respect to clustering [Fig. 6(d)], γ did not change with the system configuration, suggesting that the amount of clustering in the mask is not affected by appending more subbands to the AMS features. This is in contrast to Experiment I, where the SVM integration stage in the back-end increased both H-FA and γ.
FIG. 4. (Color online) Experiment I's ΔWRS relative to noisy speech (first row of panels), ΔESTOI relative to noisy speech (second row of panels), H-FA rates (third row of panels), and γ values (fourth row of panels) for the four different system configurations with the two noise types (ICRA1 and ICRA7) and with the two classifier complexities in (a) and in (b). The IBM has been included as the ideal reference. WRS improvements are derived from the paired Student's t-tests and significant improvements (on a 5% significance level) are marked with an asterisk (*). All objective measures are evaluated at −5 dB SNR.

V. DISCUSSION
A. The impact of exploiting spectro-temporal context
The measured intelligibility scores in Experiment I (Sec. IV A) showed that the front-end strategy, where the system was given access to both the AMS features and the delta features, led to significantly higher intelligibility scores than employing the back-end strategy, which incorporated the SVM-based spectro-temporal integration. The scores were consistently higher for the front-end strategy than the back-end strategy, regardless of the noise type and classifier complexity. Moreover, compared to the unprocessed noisy speech, the back-end strategy actually had a detrimental effect on the intelligibility scores. The comparison of the objective measures in Fig. 4 (Sec. IV A) indicated that the back-end strategy increased the H-FA rates over the front-end strategy but, at the same time, increased the amount of clustering of individual T-F units. The visual inspection of the illustrated mask examples in Fig. 2 (Sec. III F) furthermore suggested that the increased amount of clustering implied an increased clustering of the misses and false alarms. Previously, it was shown that clustering of the two error types results in reduced intelligibility scores despite having the same classification accuracy (Kressner and Rozell, 2015), which may explain the detrimental effect of the back-end strategy on the present intelligibility scores.
Furthermore, computing delta features in the front-end had a positive effect on speech intelligibility. The intelligibility scores were significantly higher than the scores with the configuration that did not employ any of the strategies, and improvements over noisy speech were significant for the higher complexity classifier of 64 GMMs. Because of the detrimental effect of the back-end strategy on intelligibility, combining both strategies simultaneously in the front-end and in the back-end did not lead to the largest measured intelligibility scores in Sec. IV A. This contradicted the findings in Figs. 4(e) and 4(f) (Sec. IV A), where a higher H-FA rate was found when combining the strategies than when employing only one of the strategies, consistent with the literature (Healy et al., 2013; May and Dau, 2013). The results from Experiment I therefore suggest that, in the considered segregation system, a better spectro-temporal strategy is to compute delta features of the AMS features in the front-end rather than employing the selected SVM-based integration strategy in the back-end. This study, however, did not consider the effects of changing the shape and the size of the window in the back-end on measured intelligibility. Also, the effect of employing a different second-layer classifier is currently unknown. Healy et al. (2013) considered a similar two-layer classification stage, but they employed deep neural networks (DNNs) in a DNN-DNN layer with an integration window of size five time frames and 17 subbands of the 64 channels. They reported significant improvements in intelligibility scores with this system, but did not quantify the impact of the back-end strategy alone.
FIG. 5. (Color online) Experiment II's WRSs at −5 dB SNR with the four different system configurations ("Front-end," "3 subbands," "7 subbands," and "11 subbands") in ICRA7. The condition with the unprocessed noisy speech represented the baseline. For the baseline, sample means across subjects and 95% Student's t-based confidence intervals of the mean were computed. For all system configurations, the least square means and 95% confidence intervals from the fitted linear mixed effect model were plotted.
FIG. 6. (Color online) Experiment II's ΔWRS relative to noisy speech (first row of panels), ΔESTOI relative to noisy speech (second row of panels), H-FA rates (third row of panels), and γ values (fourth row of panels) with the four different system configurations in ICRA7. WRS improvements are derived from the paired Student's t-tests and significant improvements (on a 5% significance level) are marked with an asterisk (*). All objective measures are evaluated at −5 dB SNR.
In Experiment II, the front-end strategy was explored in detail by appending delta features computed from symmetrical subbands. Results in Sec. IV B showed that the intelligibility scores increased with the number of appended subbands up to k = 5 bands where the improvement reached a plateau. This indicated that intelligibility increased with the amount of spectral information in the speech that was exploited up to k = 5 subbands. The same trend was observed for the H-FA rate in Fig. 6. Appending the delta features across frequency increased the size of the feature vector, and the larger amount of training data led to improvements in H-FA rate for the higher complexity classifier of 64 GMMs compared to the 16 GMMs classifier. Moreover, the amount of clustering among the T-F units in Experiment II was equal to the amount of clustering for the front-end strategy in Experiment I and remained constant with the number of appended subbands. This is in line with the notion from Experiment I that increased accuracy without increased clustering among the T-F units can lead to higher intelligibility scores.
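To make the idea of appending delta features from symmetrical subbands concrete, the sketch below shows one possible reading of it. The exact definition is given by Eq. (2), which is not reproduced here, so everything in the code is an assumption: the delta is taken as the difference between the features of the centre subband and those of each neighbour at offsets ±1, ..., ±k, the edges are handled by repeating the outermost subband, and the function name `append_subband_deltas` and the array layout are hypothetical. Under this reading, the configuration labels count the centre subband plus its 2k neighbours, so "11 subbands" corresponds to k = 5.

```python
import numpy as np


def append_subband_deltas(ams, k):
    """Append across-frequency delta features from k subbands on each side.

    ams : array of shape (n_frames, n_subbands, n_features) holding the
          per-subband features (hypothetical layout).
    k   : number of neighbouring subbands used on each side (k >= 0).

    Returns an array of shape (n_frames, n_subbands, n_features * (1 + 2k)).
    This is an assumed formulation, not the study's Eq. (2).
    """
    ams = np.asarray(ams, dtype=float)
    n_subbands = ams.shape[1]

    # Repeat the edge subbands so every subband has k neighbours per side.
    padded = np.pad(ams, ((0, 0), (k, k), (0, 0)), mode="edge")

    parts = [ams]
    for offset in range(-k, k + 1):
        if offset == 0:
            continue
        start = k + offset
        neighbour = padded[:, start:start + n_subbands, :]
        # Delta between the centre subband and the neighbour at this offset.
        parts.append(ams - neighbour)

    return np.concatenate(parts, axis=-1)
```

The feature dimension grows linearly with k in this sketch, which illustrates why a larger k gives the classifier more spectral context per T-F unit but also a larger feature vector to model.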
Other strategies exist that exploit the contextual information in speech. In contrast to the delta features, which work on a subband level, temporal context can also be exploited by stacking feature frames as input to broadband DNNs for classification (Wang et al., 2014; Chen et al., 2016b). However, the impact of this particular strategy on intelligibility scores, or any of the objective measures, has not been quantified, which makes a comparison to the strategies in the present study challenging.

B. The generalization ability of the segregation system
In Experiment I, a restricted setup from Kim et al. (2009), with matched noises during training and testing, was used in order to facilitate a comparison of the system configurations, and for a comparison across GMM classifier complexity. May and Dau (2014b) compared H-FA rates for matched and mismatched noise segments of the same noise type in training and testing as a function of the number of GMMs in the classification stage. A high complexity classifier of 256 GMMs employed in Kim et al. (2009) was able to learn all spectro-temporal characteristics of the noise, when the same short noise segment was used in training and testing. This was due to an over-fitting of the segregation system, which resulted in high H-FA rates (May and Dau, 2014b) and potentially explains the high intelligibility scores obtained in that study. In Experiment I, these observations from May and Dau (2014b) were verified. The measured intelligibility scores of the front-end strategy were higher with 64 GMMs in Fig. 3(b) compared to the lower complexity classifier of 16 GMMs in Fig. 3(a). Employing the same number of components as in Kim et al. (2009) would likely result in intelligibility scores at ceiling and close to the IBM.
The ability of segregation systems to generalize to acoustic conditions not seen during training is a very important aspect. In Experiment II, novel noise segments in testing not seen during training were considered. Despite the fact that intelligibility increased with appended subbands in Fig. 6(a), none of the configurations were able to improve speech intelligibility over noisy speech, suggesting that the system did not generalize well to unseen noise segments of the six-talker noise. This noise type contains spectro-temporal modulations very similar to modulations in the target speech signal. Therefore, the task of improving intelligibility in a realistic setup is non-trivial. According to May and Dau (2014b), the H-FA rates were generally lower when the considered segregation system was tested with unseen noise segments of the same noise recording, and the rates decreased with increasing GMM classifier complexity. Therefore, in a more realistic setup like in Experiment II, choosing a lower complexity classifier will reduce the risk of over-fitting the system (May and Dau, 2014b), although at the expense of lower H-FA rates and lower intelligibility outcomes.
Other studies have successfully demonstrated a generalization ability to acoustical mismatches by employing DNNs because of their predictive power and the ability to benefit from large-scale training for feature learning (Healy et al., 2015; Chen et al., 2016a, 2016b). In Healy et al. (2015), a four-hidden layer DNN was applied and tested on novel segments of the same noise type, which led to a 25% improvement in WRS in 20-talker babble at −5 dB SNR in NH listeners, but no improvement in cafeteria noise. In Chen et al. (2016b), a multi-conditional training set was introduced, and a classifier was trained using a five-hidden layer DNN and tested for a range of novel noise types. For the same 20-talker noise at −5 dB SNR, they were able to improve the WRS by approximately 10% in NH listeners. The amount of training employed in these two studies, however, differs from the current study. In Healy et al. (2015), 560 × 50 = 28 000 utterances were used for each noise type and SNR, and in Chen et al. (2016b), 640 000 utterances were used in the multi-conditional training set. In the current study, only 210 utterances were used for training of the GMM classification stage. The capability of the DNNs to handle large-scale training data is most likely key to an increased ability to generalize to the unseen acoustical conditions.

C. Implications for cost function design
Previous work highlighted potential limitations of STOI in predicting the intelligibility of binary-masked speech. In the present study, ESTOI was employed instead of STOI, but several observations indicated that ESTOI has similar limitations as STOI. First of all, in Experiment I, the ranking of the system configurations for the ESTOI improvements conflicted with the ranking of the configurations for the measured intelligibility improvements, as was observed in Fig. 4.
Second, in Experiments I and II, ESTOI predicted improvements of the system configurations when no intelligibility improvements were actually present. In Experiment I, the listener study only revealed improvements for configurations with the 64 GMMs classifier, and in Experiment II, no improvements were observed at all. Therefore, ESTOI alone is not able to account for the observations in this study. Furthermore, the H-FA metric was also not able to correctly predict the ranking of the system configurations in Experiment I. Specifically, the H-FA rate was consistently higher for the back-end strategy than the front-end strategy, despite the fact that the intelligibility study revealed the opposite effect. Therefore, it is possible to construct a segregation system that is able to improve H-FA and ESTOI, but at the same time fails to improve speech intelligibility scores in noisy conditions. This reveals the limitations of the two measures and emphasizes the need for a single objective measure that comprehensively predicts segregation performance and correlates well with intelligibility for speech segregation systems.
The findings from Experiments I and II have important implications for the design of cost functions in computational speech segregation systems. Monitoring the amount of mask clustering γ in the estimated IBMs seems critical, as the clustering among erroneously labeled T-F units should be minimized. The IBM itself inherently contains clustering, and the γ value obtained for the IBM can be regarded as the accepted amount of clustering among the correctly labeled T-F units. Therefore, an appropriate cost function should maximize the H-FA rate and bring the γ of the estimated mask as close as possible to the γ of the IBM.
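As an illustration only, the two requirements above could be combined into a single objective to be minimized over the estimated mask. The study does not propose a specific functional form, so the symbol J, the absolute-difference penalty, and the trade-off weight λ below are all assumptions introduced for this sketch.

```latex
% Hypothetical cost: reward a high H-FA rate and penalize any deviation of the
% clustering of the estimated mask \hat{M} from that of the IBM.
J(\hat{M}) = -\bigl(\mathrm{HIT}(\hat{M}) - \mathrm{FA}(\hat{M})\bigr)
             + \lambda \,\bigl|\gamma(\hat{M}) - \gamma(M_{\mathrm{IBM}})\bigr|,
\qquad \lambda \ge 0
```

With λ = 0 this reduces to maximizing the H-FA rate alone, which Experiment I suggests is not sufficient, since the back-end strategy raised H-FA while lowering intelligibility.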

VI. CONCLUSION
In this study, two experiments were conducted with NH listeners. In Experiment I, the impact of spectro-temporal context in a computational speech segregation system was investigated by considering two strategies in the system front-end and back-end, respectively. The experiment showed that computing delta features in the front-end led to higher speech intelligibility than employing an SVM-based integration strategy in the back-end. The results were consistent across different noise types and for different classifier complexities. In Experiment II, the delta features were explored in detail and tested in a setup that considered novel noise segments of the same six-talker noise. Intelligibility scores increased with the amount of spectral information exploited, but the segregation system did not generalize well to novel noise segments of this particular noise type. The intelligibility scores were subsequently compared to predictions from several objective measures. The comparison showed that no single measure could account for all intelligibility scores, which emphasizes the need for a single objective measure that comprehensively predicts segregation performance and correlates well with intelligibility. The findings from the present study may have implications for the design of computational speech segregation systems, in which spectro-temporal context should be incorporated without increasing the amount of clustering among erroneously labeled T-F units. Furthermore, the findings can help select a cost function that correlates with intelligibility. According to the results in the present study, the cost function should maximize the H-FA rate and bring the γ value of the estimated mask as close as possible to the γ of the IBM.