Enhancement by postfiltering for speech and audio coding in ad-hoc sensor networks

Enhancement algorithms for wireless acoustic sensor networks (WASNs) are indispensable given the increasing availability and usage of connected devices with microphones. Conventional spatial filtering approaches for enhancement in WASNs approximate quantization noise with an additive Gaussian distribution, which limits performance due to the non-linear nature of quantization noise at lower bitrates. In this work, we propose a postfilter for enhancement based on Bayesian statistics to obtain a multidevice signal estimate, which explicitly models the quantization noise. Our experiments using PSNR, PESQ and MUSHRA scores demonstrate that the proposed postfilter can be used to enhance signal quality in ad-hoc sensor networks.


Introduction
The emergence of connected and portable devices like smartphones, and the rising popularity of voice user interfaces and devices equipped with microphones, provide the necessary infrastructure for ad-hoc wireless acoustic sensor networks (WASNs). The dense, ad-hoc positioning and collaboration in a WASN leads to efficient sampling of the acoustic space, thereby yielding higher-quality signal estimates than single-channel estimates [1]. Typical applications of ad-hoc WASNs use microphones on low-resource devices, so we need low-complexity methods that use bandwidth efficiently to compress and transmit the acoustic signals. This involves quantization at the encoder, whereby the received signal at the decoder is usually degraded by quantization noise [2,3,4,5,6].
Past works on WASN often overlook the variability in maximum capacity of sensors [7]. However, rate-constrained spatial filtering methods such as beamforming and multichannel Wiener filtering have been used in binaural hearing aids (HAs) [8,9,10,11,3]. A study on rate-constrained optimal beamforming showed the advantage of using spatially separated microphones in HAs, although the method assumes that the joint statistics of signals are available at the processing nodes [8]. Subsequently, sub-optimal strategies for noise reduction which do not use the joint statistics at the nodes have been proposed [8,9,10,11,12]. While the above methods are effective in reducing noise, they are either limited to, or are most efficient with, two nodes (HAs) only. In a recent work on multi-node WASN, a linearly-constrained minimum variance beamformer was used to optimize rate allocation and sensor selection over nodes, based on spatial location and frequency content [13,14]. However, due to the dynamic nature of an ad-hoc WASN, information about sensor distribution, location, and the number of target and interference sources may be unavailable, or its exchange between nodes further adds to the bandwidth consumption and communication complexity. Further, the above methods assume an additive quantization noise model, which is accurate only at higher bitrates. Lastly, while all the above methods are optimized on Wyner-Ziv coding, their suitability in combination with existing speech and audio coding has not been demonstrated yet. Their performance in single-channel mode can therefore not compete with conventional single-channel codecs.
In this paper, we propose a Bayesian postfilter for enhancement in ad-hoc WASNs, which explicitly models the quantization noise within the optimization framework of the filter, and can be applied on top of existing codecs with minimal modifications. Thus, the main contribution of the current work is the postfilter, which takes quantization into account through truncation while retaining the conventional assumption of additive Gaussian background noise, thereby resulting in a truncated Gaussian representation of the clean speech distribution. To evaluate the proposed methodology, we assume that each device is dominantly degraded either by background noise and reverberation or by coding noise due to quantization, and that each device operates at its maximum capacity. In line with past works, we show that by distributing the total available bitrate between the two sensors, the output gain of the WASN signal estimate is higher than the output gain of a low input-SNR single sensor transmitting at full bitrate [8,9,10,11].
In addition, we present the advantages of incorporating the exact quantization noise models within the optimization framework. In order to focus on the effect of the postfilter on quantization noise, we apply the proposed method to the output of a codec [15] that is specifically designed to address multi-device coding. To the best of our knowledge, this is the first time a complete WASN system has been evaluated with competitive performance also in a single-channel codec mode. Although we have not yet included models of the spatial configuration of sensors, room impulse responses or multiple sources, we show that the proposed method already yields large output gains.

Methodology
To focus on the novel aspects of the approach, we consider a simple WASN consisting of two devices with microphones: (1) a low-resource device A with high input SNR and (2) a high-resource device B with low input SNR, as illustrated in Fig. 1. An example application is a smartwatch that collaborates with a distant smart speaker. Let $x(k,t)$ and $n(k,t)$ be the perceptual domain representations of the speech and noise signals, respectively, at frequency bin $k$ and time frame $t$ [16]; the perceptual domain representations are computed by dividing the frequency domain signals by the perceptual envelope obtained from the codec [16]. These signals can be approximated by zero-mean Gaussian distributions with variances $\sigma_x^2$ and $\sigma_n^2$, whereby the random variables are correspondingly $X \sim \mathcal{N}(0, \sigma_x^2)$ and $N \sim \mathcal{N}(0, \sigma_n^2)$ [17]. Under the assumption of uncorrelated, additive background noise, the noisy signal $y(k,t) = x(k,t) + n(k,t)$ is Gaussian distributed with $Y \sim \mathcal{N}(0, \sigma_y^2)$ and variance $\sigma_y^2 = \sigma_x^2 + \sigma_n^2$ [17]. Our goal is to estimate the distribution of clean speech conditioned on the noisy observation, $P(X \mid Y)$, in other words, the posterior distribution [18]. We obtain estimates for every time-frequency bin, and shall omit the time and frequency indices in the rest of the section to aid readability. According to Bayes' rule, the posterior distribution can be written as

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}, \tag{1}$$

where $P(X)$ and $P(Y)$ are the prior distributions of the speech and observed signals, and $P(Y \mid X)$ is the conditional likelihood. However, our quantized observation $y_q(k,t)$ of the noisy signal gives more evidence about $X$: the true value of the noisy signal $Y$ lies within the quantization bin limits, $y(k,t) \in [l(k,t), u(k,t)]$, where the lower and upper bin limits for the quantization levels in a frame, $\{l, u\} \in \mathbb{R}^{K \times 1}$, are obtained from the observed quantized spectrum of the frame, $y_q \in \mathbb{R}^{K \times 1}$ [19]. Since the true noisy signal lies in the bounded interval $l(k,t) \le Y \le u(k,t)$, we integrate the likelihood over the quantization bin limits to obtain the posterior distribution of speech,

$$P(X)_{(l \le Y \le u)} \propto P(X) \int_{l}^{u} P(Y \mid X)\,dY, \tag{2}$$

where $\propto$ signifies equality up to a scaling factor. Eq. 2 can be rewritten as the difference between cumulative distributions, $P(X)_{(l \le Y \le u)} \propto P(X)\,(F(u) - F(l))$. The conditional likelihood can be represented as $P(Y \mid X) \sim \mathcal{N}(x, \sigma_n^2)$, thus resulting in the final equation for the posterior distribution,

$$P(X)_{(l \le Y \le u)} \propto P(X)\left[\operatorname{erf}\!\left(\frac{u - x}{\sqrt{2}\,\sigma_n}\right) - \operatorname{erf}\!\left(\frac{l - x}{\sqrt{2}\,\sigma_n}\right)\right], \tag{3}$$

where $\operatorname{erf}(\cdot)$ is the error function. Note that due to the use of the exact quantization bin limits, $P(X)_{(l \le Y \le u)}$ corresponds to a truncated Gaussian [20]. This is in contrast to past works, where the quantization noise is approximated by an additive Gaussian distribution, which is an accurate approximation only at high bitrates [13].
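As an illustration of the single-channel posterior in Eq. 3, the following minimal Python sketch evaluates the unnormalized truncated Gaussian on a few candidate speech values. The function names and all numeric values (variances, bin limits) are hypothetical and chosen only for demonstration; the sketch assumes the zero-mean Gaussian speech prior and zero-mean Gaussian noise stated above.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def truncated_posterior(x, lower, upper, sigma_x, sigma_n):
    """Unnormalized posterior of clean speech x given l <= Y <= u (cf. Eq. 3).

    The prior is N(0, sigma_x^2); the erf difference is proportional to the
    probability that the noisy value x + N falls inside the quantization bin.
    """
    prior = gaussian_pdf(x, 0.0, sigma_x)
    scale = math.sqrt(2.0) * sigma_n
    bin_mass = math.erf((upper - x) / scale) - math.erf((lower - x) / scale)
    return prior * bin_mass


# Hypothetical example values, not taken from the experiments.
sigma_x, sigma_n = 1.0, 0.5
lower, upper = 0.5, 1.5

for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}  unnormalized posterior = "
          f"{truncated_posterior(x, lower, upper, sigma_x, sigma_n):.6f}")
```

Candidate values far from the quantization bin receive almost no posterior mass, which is the truncation effect that an additive Gaussian quantization-noise model does not capture.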
From Eq. 3, the single-channel posterior probability density function (PDF) of the clean speech in spatial channel $i$ is $P(X)_{(l_i \le Y_i \le u_i)}$. Here we assume that the speech and noise energies at each channel are estimated in a pre-processing stage, for example, using voice activity detection and minimum statistics [21]. Additionally, in order to focus on the advantage of the proposed enhancement approach, we assume that the time delay between microphones with respect to the desired sources is known at the decoder, whereby the signals from the microphones are synchronized. We shall include time-delay estimation within the enhancement framework in future work. Based on our setup, the environmental degradation and the bitrate are different for the two channels. Hence, we can assume that $N_i \sim \mathcal{N}(\mu_{n_i}, \sigma_{n_i}^2)$ and that the quantization-bin offsets $\{l, u\}_i$ are uncorrelated and independent between the two channels. Therefore, when conditioned on $Y$, due to conditional independence between the channels, the joint posterior PDF of speech over the network can be represented as $\prod_{i=1}^{M} P(X)_{(l_i \le Y_i \le u_i)}$, where $M$ is the number of microphones in the WASN. The posterior PDF of speech in a two-microphone network is thus

$$P(X)_{\{l_i \le Y_i \le u_i\}_{i=1}^{2}} \propto \prod_{i=1}^{2} \exp\!\left(-\frac{1}{2}\left(\frac{x - \mu_{s_i}}{\sigma_{s_i}}\right)^{2}\right)\left[\operatorname{erf}\!\left(\frac{u_i - x - \mu_{n_i}}{\sqrt{2}\,\sigma_{n_i}}\right) - \operatorname{erf}\!\left(\frac{l_i - x - \mu_{n_i}}{\sqrt{2}\,\sigma_{n_i}}\right)\right], \tag{4}$$

where $\mu_{s_i}$ and $\sigma_{s_i}$ are the mean and standard deviation of the speech prior in channel $i$. We obtain the multidevice signal estimate $\hat{x}_{MC}$, optimal in the minimum mean squared error (MMSE) sense [18], by computing the expectation of the PDF obtained from Eq. 4. Due to the product of error functions in Eq. 4, the expectation does not have a known analytical formulation. Therefore, we approximate the expectation of the PDF via numerical integration [22]; computing the Riemann sum using the midpoint rule over $n = 200$ intervals provided an approximation with sufficient accuracy in our experiments. Hence, the final equation is

$$\hat{x}_{MC} \approx \frac{\sum_{j=1}^{n} x_j\,p(x_j)}{\sum_{j=1}^{n} p(x_j)}, \tag{5}$$

where $p(\cdot)$ is the unnormalized posterior PDF of Eq. 4 and $x_j$ are the midpoints of the $n$ integration intervals.
The system block diagram is depicted in Fig. 2 (a, b), where (a) is the overview of the entire system, from acoustic signal acquisition at the sensors to obtaining the time-domain estimate from multidevice signals. Note that the postfilter is placed at the fusion center, directly after the decoder, which provides the decoded perceptual domain signals to the postfilter. Fig. 2 (b) shows the internal structure of the postfilter. After receiving the quantization bin limits from the decoded signals, we compute the truncated Gaussian distribution for each channel, and then compute the joint posterior distribution as the product of the truncated distributions of the channels. The final point estimate, obtained as the expectation of the posterior distribution, yields the multidevice signal estimate.
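The postfilter steps described above (per-channel truncated Gaussian, product over channels, expectation by the midpoint rule) can be sketched for a single time-frequency bin as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the integration range `x_min`/`x_max`, the fallback for a vanishing posterior, and all numeric values are hypothetical, and the per-channel prior and noise means are an assumed parameterization.

```python
import math


def channel_posterior(x, lower, upper, mu_s, sigma_s, mu_n, sigma_n):
    """Unnormalized single-channel posterior: Gaussian speech prior times
    the probability that x + noise lies inside the quantization bin."""
    prior = math.exp(-0.5 * ((x - mu_s) / sigma_s) ** 2)
    scale = math.sqrt(2.0) * sigma_n
    bin_mass = (math.erf((upper - x - mu_n) / scale)
                - math.erf((lower - x - mu_n) / scale))
    return prior * bin_mass


def mmse_estimate(channels, x_min, x_max, n=200):
    """Expectation of the joint posterior via a midpoint Riemann sum.

    `channels` is a list of dicts holding the per-channel parameters.
    The integration range [x_min, x_max] is an assumption of this sketch.
    """
    width = (x_max - x_min) / n
    numerator = 0.0
    denominator = 0.0
    for j in range(n):
        x = x_min + (j + 0.5) * width  # midpoint of the j-th interval
        p = 1.0
        for ch in channels:
            p *= channel_posterior(x, **ch)  # product over channels
        numerator += x * p
        denominator += p
    if denominator <= 0.0:
        # Hypothetical fallback when the posterior vanishes numerically.
        return 0.5 * (x_min + x_max)
    return numerator / denominator


# Hypothetical two-device example: device A has little background noise but a
# wide quantization bin (low bitrate); device B has a narrow bin but more noise.
channels = [
    dict(lower=0.0, upper=2.0, mu_s=0.0, sigma_s=1.0, mu_n=0.0, sigma_n=0.1),
    dict(lower=0.9, upper=1.1, mu_s=0.0, sigma_s=1.0, mu_n=0.0, sigma_n=1.0),
]
estimate = mmse_estimate(channels, x_min=-4.0, x_max=4.0)
print(f"multidevice MMSE estimate: {estimate:.4f}")
```

Passing a single-element list to `mmse_estimate` gives the corresponding single-channel estimate, which is how the single-channel baselines could be obtained in this sketch.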

Experiments and Results
To evaluate the performance of the proposed postfiltering approach, we determined the perceptual SNR (PSNR) and PESQ scores [16], and conducted a subjective listening test using MUSHRA [23,24]. We considered two categories of degradation: (1) additive background noise and (2) background noise with reverberation. For the background noise, we extracted the cafeteria scenario with babble noise from the QUT dataset [25]. The clean speech samples were obtained from the test set of the TIMIT dataset [26]. We encoded the noisy samples and applied the proposed postfilter to the decoded samples. Hence, the output signal is corrupted by both coding and environmental artefacts. To generate noisy speech with reverberation, we considered a room of dimensions (7.5 × 5 × 2) m³, with one speech source at coordinates (1, 2.5, 0.5) m and three noise sources placed at (6.5, 2.85, 0.5) m, (3.5, 4.5, 0.5) m and (6, 0, 0.5) m. The near and distant microphones are located at (1.05, 2.55, 0.5) m and (2.25, 2.85, 0.5) m, respectively. An illustration of the setup is presented in Fig. 1. The signals at the microphones for the described acoustic scenario were simulated using Pyroomacoustics [27].
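To make the simulated geometry concrete, the short sketch below checks that all listed positions lie within the room and computes the distance from the speech source to each microphone (about 0.07 m for the near microphone and about 1.30 m for the distant one). The free-field level difference at the end assumes simple 1/r attenuation, which is an assumption of this sketch and ignores reverberation; the variable and function names are hypothetical.

```python
import math

# Coordinates in metres, as listed in the experimental setup.
ROOM = (7.5, 5.0, 2.0)
SPEECH = (1.0, 2.5, 0.5)
NOISE_SOURCES = [(6.5, 2.85, 0.5), (3.5, 4.5, 0.5), (6.0, 0.0, 0.5)]
MIC_NEAR = (1.05, 2.55, 0.5)
MIC_FAR = (2.25, 2.85, 0.5)


def inside(point, room):
    """True if the point lies within (or on the boundary of) the room."""
    return all(0.0 <= c <= d for c, d in zip(point, room))


all_points = [SPEECH, MIC_NEAR, MIC_FAR] + NOISE_SOURCES
print("all positions inside room:", all(inside(p, ROOM) for p in all_points))

d_near = math.dist(SPEECH, MIC_NEAR)
d_far = math.dist(SPEECH, MIC_FAR)
print(f"speech to near microphone:    {d_near:.3f} m")
print(f"speech to distant microphone: {d_far:.3f} m")

# Direct-path level difference under an assumed free-field 1/r law.
level_diff_db = 20.0 * math.log10(d_far / d_near)
print(f"direct-path level difference: {level_diff_db:.1f} dB")
```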
Let $\rho$ and $\gamma$ represent the PSNR and PESQ scores, respectively, and let $R$ be the total bitrate. The postfilter is applied to the output of a codec that is specifically suited to multi-device coding [15]. For a fair evaluation, the single-channel enhancement estimates from Eq. 5 are used as baselines. Furthermore, we employ the conventional multichannel Wiener filter (MWF) with a diagonalized covariance matrix to evaluate the advantage of the proposed method with respect to a conventionally accepted baseline [28]. The notations and their definitions are as follows:
1. $\hat{x}_{MC}$ is the multidevice estimate using device A at the bitrate $\tfrac{1}{4}R$ and device B at the bitrate $\tfrac{3}{4}R$; the PSNR and PESQ scores of the estimate are $\rho_{MC}$ and $\gamma_{MC}$, respectively.
2. $\hat{x}_{BL_B}$ is the baseline posterior estimate (from Eq. 5) at the distant device B, encoding at the full bitrate $R$; $\rho_{BL_B}$ and $\gamma_{BL_B}$ are the objective measures.
3. $\hat{x}_{BL_A}$ is the baseline posterior estimate (from Eq. 5) at device A using the bitrate $\tfrac{1}{4}R$; $\rho_{BL_A}$ and $\gamma_{BL_A}$ are the objective measures.
4. $\hat{x}_{MWF}$ is the multichannel Wiener filter estimate using noisy signals from device A and device B; $\rho_{MWF}$ and $\gamma_{MWF}$ are the objective measures.
We show the advantage of the proposed postfilter over the baseline methods using differential PSNR and PESQ scores, defined as the score of the multidevice estimate minus the score of the respective baseline:
1. $\rho_{(MC-BL_A)} = \rho_{MC} - \rho_{BL_A}$ and $\gamma_{(MC-BL_A)} = \gamma_{MC} - \gamma_{BL_A}$.
2. $\rho_{(MC-BL_B)} = \rho_{MC} - \rho_{BL_B}$ and $\gamma_{(MC-BL_B)} = \gamma_{MC} - \gamma_{BL_B}$.
3. $\rho_{(MC-MWF)} = \rho_{MC} - \rho_{MWF}$ and $\gamma_{(MC-MWF)} = \gamma_{MC} - \gamma_{MWF}$.
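The bitrate allocation and the differential scores can be written down in a few lines. The sketch below is illustrative only: the helper names are hypothetical, and the PSNR values in the example are made-up placeholders rather than measured results.

```python
def split_bitrate(total_kbps, share_a=0.25):
    """Split the total bitrate R between device A (R/4) and device B (3R/4)."""
    rate_a = total_kbps * share_a
    rate_b = total_kbps - rate_a
    return rate_a, rate_b


def differential_score(score_mc, score_baseline):
    """Differential score: multidevice estimate minus the given baseline."""
    return score_mc - score_baseline


# Bitrate split for the two total bitrates shown in the objective results.
for total in (16, 32):
    rate_a, rate_b = split_bitrate(total)
    print(f"R = {total} kbps -> device A: {rate_a} kbps, device B: {rate_b} kbps")

# Placeholder PSNR values in dB (hypothetical, for illustration only).
example_psnr = {"MC": 20.0, "BL_A": 17.0, "BL_B": 9.0, "MWF": 12.0}
for baseline in ("BL_A", "BL_B", "MWF"):
    diff = differential_score(example_psnr["MC"], example_psnr[baseline])
    print(f"rho_(MC-{baseline}) = {diff:+.1f} dB")
```

A positive differential score means that the multidevice estimate scored higher than the baseline it is compared against.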
The input SNR at device A was fixed to 40 dB, and at device B we used a range of input SNRs ∈ {−5, 0, 5, ..., 30} dB. From the test set of the TIMIT dataset, we randomly selected 100 speech samples (50 male and 50 female) and tested the postfilter over all combinations of the bitrates, $R$ ∈ {16, 24, 32, 48, 64, 80, 96} kbps, and the input SNRs for each speech sample. The objective results for the additive noise scenario are presented in Fig. 3 (a, c). $\rho_{(MC-BL_A)}$, $\rho_{(MC-BL_B)}$ and $\rho_{(MC-MWF)}$ are shown in Fig. 3 (a) for the listed SNRs and the total bitrates $R$ ∈ {16, 32} kbps. We found that the PSNR of the proposed method was better than all three baselines over all SNRs and bitrates. For $\hat{x}_{MC}$ relative to the single-channel estimate $\hat{x}_{BL_B}$, the highest differential PSNR is $\rho_{(MC-BL_B)} \approx 22.5$ dB. With respect to $\hat{x}_{BL_A}$, the highest $\rho_{(MC-BL_A)} \approx 6$ dB is obtained at 30 dB input SNR and 16 kbps. In addition, we observe that $\rho_{(MC-BL_B)}$ decreases as the input SNR at device B increases; it also increases with the total bitrate due to lower degradation from coding noise, specifically at device A. In contrast, $\rho_{(MC-BL_A)}$ increases with the input SNR at device B but decreases as the total bitrate increases. In terms of PESQ, the largest differential PESQ for $\hat{x}_{MC}$ relative to $\hat{x}_{BL_B}$ is $\gamma_{(MC-BL_B)} \approx 1.8$ MOS, attained at −5 dB and 32 kbps. However, at 16 kbps and above 15 dB, the negative differential MOS implies a decrease in quality. With respect to $\hat{x}_{BL_A}$, the largest value is $\gamma_{(MC-BL_A)} \approx 1.1$ MOS at 30 dB input SNR at device B. Furthermore, the variations of $\gamma_{(MC-BL_A)}$ and $\gamma_{(MC-BL_B)}$ relative to the input SNR and bitrate follow similar trends as the differential PSNR. Without exception, we observed similar trends for all the listed bitrates. The inverse variations of the differential scores with respect to $\hat{x}_{BL_A}$ and $\hat{x}_{BL_B}$ support our expectation that the proposed postfilter optimally merges information from the two channels to obtain an enhanced multidevice estimate.
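For orientation, the evaluation grid described above can be enumerated as follows, together with one common way of scaling a noise signal to reach a target input SNR. The mixing routine is an assumption of this sketch (the text does not specify how the noisy signals were mixed), the function names are hypothetical, and the sinusoidal test signals are placeholders for real speech and noise.

```python
import math
from itertools import product

BITRATES_KBPS = (16, 24, 32, 48, 64, 80, 96)
INPUT_SNRS_DB = tuple(range(-5, 35, 5))  # -5, 0, 5, ..., 30 dB at device B
NUM_SPEECH_SAMPLES = 100

conditions = list(product(BITRATES_KBPS, INPUT_SNRS_DB))
total_runs = NUM_SPEECH_SAMPLES * len(conditions)
print(f"{len(conditions)} bitrate/SNR combinations, {total_runs} runs in total")


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db."""
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR of a mixture, given access to the clean speech."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10.0 * math.log10(mean_power(speech) / mean_power(residual))


# Placeholder signals (hypothetical); real experiments use speech and babble noise.
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [math.cos(0.37 * t) for t in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
print(f"achieved input SNR: {measured_snr_db(speech, noisy):.2f} dB")
```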
The test was repeated to include reverberation over a range of absorption coefficients, α ∈ {0.1, 0.3, ..., 0.9}. The results for $R$ ∈ {16, 32} kbps are illustrated in Fig. 3 (b, d). While $\rho_{(MC-BL_B)}$ is positive for both bitrates over all the listed absorption coefficients, $\rho_{(MC-BL_A)}$ is consistently negative. One reason for this could be that while the postfilter reduces environmental noise, as reflected in the improvement with respect to $\hat{x}_{BL_B}$, it may introduce some speech distortion, or it may be unable to completely remove reverberation due to the lack of a reverberation model, which shows as a drop in the PSNR with respect to $\hat{x}_{BL_A}$. Nevertheless, both $\gamma_{(MC-BL_A)}$ and $\gamma_{(MC-BL_B)}$ are positive over both bitrates and all α, and follow similar variation trends as in the additive noise scenario. Lastly, the positive differential objective scores for both noise types with respect to the MWF indicate that the PSNR and PESQ gains of the proposed postfilter are larger than the gains obtained using the multichannel Wiener filter. This supports our informal observation that Wiener filtering is inefficient in capturing the essential features of speech signals.
The subjective MUSHRA listening test contained eight test items (4 male and 4 female), four of which included background noise with reverberation at α = 0.3, while the remaining items comprised background noise only at SNR = 15 dB. Each test item consisted of five test conditions and the reference clean speech signal; a hidden reference, a lower anchor (the 3.5 kHz low-pass version of the reference signal), $\hat{x}_{MC}$, $\hat{x}_{BL_B}$ and $\hat{x}_{BL_A}$ were presented as the test conditions. The total bitrate was $R = 32$ kbps. As post-screening, we retained the responses from only those subjects who rated the hidden reference at more than 90 MUSHRA points for all items. Fig. 4 presents the consolidated differential MUSHRA scores, denoted by η, from the 13 participants who passed the post-screening; the boxplots show the median and interquartile range of η. The items with background noise and reverberation are items {1, 2, 3, 4}, and the background-noise-only items are {5, 6, 7, 8}. Items {1, 2, 5, 6} are female and the rest are male. $\eta_{(MC-BL_A)}$ was positive for all items, indicating that most subjects preferred $\hat{x}_{MC}$ over $\hat{x}_{BL_A}$. With respect to $\hat{x}_{BL_B}$, the variations were found to be gender dependent. While the median $\eta_{(MC-BL_B)}$ points were positive for most male items (mean-M), they were negative for female items (mean-F). Further analysis of the samples revealed that while background noise was attenuated in $\hat{x}_{MC}$, speech distortions were introduced into the estimate, and those distortions were more prominent in the female samples. This problem could potentially be addressed by using more informative speech priors, and by modifying the signal model to incorporate the effects of reverberation.
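The post-screening rule and the differential MUSHRA scores can be sketched as below. The data layout (one list of per-item rating dictionaries per subject), the function names and all ratings are hypothetical and serve only to illustrate the procedure described above.

```python
from statistics import median


def passed_post_screening(subject_ratings, threshold=90):
    """Keep a subject only if the hidden reference is rated above the
    threshold for every item."""
    return all(item["reference"] > threshold for item in subject_ratings)


def differential_mushra(all_ratings, baseline):
    """Per-item lists of (MC - baseline) scores over the retained subjects."""
    kept = [r for r in all_ratings if passed_post_screening(r)]
    if not kept:
        return []
    num_items = len(kept[0])
    return [[subject[i]["MC"] - subject[i][baseline] for subject in kept]
            for i in range(num_items)]


# Hypothetical ratings: three subjects, two items each.
ratings = [
    [{"reference": 100, "MC": 70, "BL_A": 50, "BL_B": 60},
     {"reference": 95, "MC": 65, "BL_A": 55, "BL_B": 70}],
    [{"reference": 98, "MC": 80, "BL_A": 60, "BL_B": 70},
     {"reference": 92, "MC": 60, "BL_A": 40, "BL_B": 66}],
    [{"reference": 85, "MC": 50, "BL_A": 50, "BL_B": 50},  # fails screening
     {"reference": 99, "MC": 55, "BL_A": 45, "BL_B": 60}],
]

eta_a = differential_mushra(ratings, "BL_A")
eta_b = differential_mushra(ratings, "BL_B")
for i, (a, b) in enumerate(zip(eta_a, eta_b), start=1):
    print(f"item {i}: median eta_(MC-BL_A) = {median(a)}, "
          f"median eta_(MC-BL_B) = {median(b)}")
```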
To study the region of optimal performance of the postfilter, we analyzed the average $\gamma_{(MC-BL_B)}$ as a function of bitrate, input SNR and absorption coefficient α; the resulting contour plots are depicted in Fig. 5. For the additive background noise scenario, positive gains are obtained at higher bitrates and low input SNRs. Furthermore, the negative $\gamma_{(MC-BL_B)}$ above 20 dB input SNR and below 32 kbps implies that the postfilter performs sub-optimally in this region; in other words, we gain from a multidevice signal estimate when the input SNR is below 20 dB and the total bitrate is greater than 32 kbps. In the presence of reverberation, we observed that while the total bitrate had an impact on $\gamma_{(MC-BL_B)}$, the improvement was fairly constant over the range of α at any given bitrate, and the improvement was positive over the considered input SNR range. This implies that the proposed postfilter can also be used to enhance signals degraded by reverberation and is not especially sensitive to the amount of reverberation, despite the fact that the signal model did not explicitly account for distortions from reverberation.

Figure 1: Distribution of microphones in the ad-hoc acoustic sensor network.

Figure 2: Block diagrams showing (a) the overall system structure with the location of the postfilter, and (b) overview of the postfilter.

Figure 3: Illustration of differential PSNR and PESQ scores between the proposed multidevice estimate and the single-channel baselines and multichannel Wiener filter at $R$ ∈ {16, 32} kbps, with 95% confidence intervals. $\rho_{(MC-BL_B)}$ and $\gamma_{(MC-BL_B)}$ are the differential PSNR and PESQ of the proposed multidevice estimate with respect to the single-channel estimate of device B; $\rho_{(MC-BL_A)}$ and $\gamma_{(MC-BL_A)}$ are the differential scores of the multidevice estimate with respect to the single-channel estimate of device A; $\rho_{(MC-MWF)}$ and $\gamma_{(MC-MWF)}$ are the differential scores of the multidevice estimate with respect to the multichannel Wiener filter.

Figure 4: Distribution of ∆MUSHRA points from the subjective listening test. $\eta_{(MC-BL_B)}$ and $\eta_{(MC-BL_A)}$ are the differential MUSHRA scores of the multidevice estimate with respect to the single-channel estimates at device B and device A, respectively. Mean-F and Mean-M are the average differential scores over the female and male items, respectively.