Sound field reconstruction in rooms: inpainting meets superresolution

In this paper, a deep-learning-based method for sound field reconstruction is proposed. It is shown that the magnitude of the sound pressure can be reconstructed in the frequency band 30-300 Hz for an entire room by using a very low number of irregularly distributed microphones. In particular, the presented approach uses a limited number of arbitrary discrete measurements of the magnitude of the sound field pressure in order to extrapolate this field to a higher-resolution grid of discrete points in space with low computational complexity. The method is based on a U-net-like neural network with partial convolutions trained solely on simulated data, i.e. the dataset is constructed from numerical simulations of the Green's function across thousands of common rectangular rooms. Although extensible to three dimensions, the method focuses on reconstructing a two-dimensional plane of the room from measurements of the three-dimensional sound field. Experiments using simulated data together with an experimental validation in a real listening room are shown. The results suggest a performance---in terms of mean squared error and structural similarity---that may exceed that of conventional reconstruction techniques for a low number of microphones and computational requirements.


I. INTRODUCTION
The functions describing sound propagation, such as sound pressure or particle velocity, take scalar and vector values, respectively, which vary across the temporal and spatial dimensions. There exist many applications where knowledge of the spatial variation of the sound field is of paramount interest, for example, sound field navigation for virtual reality environments 1,2 , accurate spatial sound field reproduction over predefined regions of space 3-5 , or sound field control in reverberant environments 6,7 .
The different reconstruction scenarios are determined by the type of information gathered from the sound field. Depending on the type of acquisition, several techniques are used, ranging for example, from acoustic holography 8 , acousto-optic methods 9,10 , or traditional discrete sets of spatial samples 11 . The latter is particularly convenient in practice since it requires simple microphones.
In the case of sound field reconstruction in rooms, there exist several methods in the literature. In particular, model-based approaches based on samples of the sound pressure at a discrete set of locations tend to dominate the area. Results using classical sampling 11 -i.e. based on bandwidth analysis- build upon the image source method to characterize the sound field in a room in order to derive bounds on the aliasing error for a given sampling density. This leads to an impractically high density of microphones for an acceptable reconstruction error. Another approach that simplifies the model and reduces the number of measurements is based on parameterizing the room impulse response as a pole-zero system 12 .
Compressive sensing approaches have been effective in reducing the number of measurements compared to these previous methods. They inherently require an underlying assumption of the sparsity of the chosen room acoustics model. Utilizing the modal theory, it is possible to consider a plane wave approximation of the sound field 13 in a room in order to describe it spatially as a sparse linear combination of damped complex exponentials 14-16 . Dictionaries tend to be large, performance degrades at high frequencies, and the interpolated location should be, in general, in the far field with respect to the source. Under the image source method, estimation of the early part of the room impulse response is also possible assuming a few dominant image sources 17 . These techniques are in general sensitive to the choice of sampling scheme used in order to guarantee meaningful solutions and well-conditioned problems. Empirical methods for the latter are commonly adopted, leading to some restrictions in the arrangement of microphones. Exploiting information about the modal frequencies may allow a more general microphone arrangement 18 at the expense of sensitivity to source location, modal density, and accurate modal frequency estimation. Additionally, finding solutions to these sparse inverse problems is typically computationally demanding 19 .
In this paper, we adopt a data-driven approach to the problem of sound field sampling and reconstruction, which, for the present application, appears to be unexplored. For clarity of exposition, we focus on a two-dimensional horizontal plane of three-dimensional rectangular rooms. We consider a very low number of irregularly and arbitrarily distributed measurements to recover the magnitude of the sound pressure in a room across the spatial dimension for the frequency range 30-300 Hz. The paper aims in three directions: using a very low number of microphones, accommodating irregular microphone distributions, and achieving efficient inference.
We first view this field as a two-dimensional discrete signal. It can be interpreted that the acquisition step produces a low-resolution signal with missing samples. Then, the recovery step consists of filling in the missing data of a high-resolution two-dimensional signal. We show how this process can be viewed as jointly performing inpainting 20,21 and superresolution 22,23 , both well-known techniques in image processing in which deep learning methods perform well. In particular, we use a U-net neural network 24 with partial convolutions 21 trained on simulated data that simultaneously performs inpainting and superresolution. Under this framework, we show how it is possible to recover a high-resolution field from a very low number of microphones with low computational complexity in the inference process.
The paper is organized as follows: Section II establishes the conceptual framework under which the reconstruction problem is addressed, i.e. as a learning algorithm drawing upon inpainting and superresolution techniques. The details about the neural network architecture and the training procedure used for recovery are explained in Section III. Section IV presents results concerning the reconstruction accuracy of the proposed algorithm both in simulated and experimental settings-i.e. in real rooms.

II. PROBLEM DESCRIPTION
We frame the problem of sound field reconstruction within a data-driven approach, i.e. we aim at developing a recovery algorithm that directly and progressively learns from raw sound field data. The machine learning methods that have been particularly successful in this regard fall under deep learning systems. These have significantly outperformed model-based approaches in tasks such as, but not limited to, image classification, analysis, and restoration 25-27 ; or speech recognition and synthesis 27,28 .
The novelty of the present approach lies in the observation that the magnitude of the sound pressure in a room can be interpreted as a two-dimensional discrete function defined on a rectangular grid of points in space-i.e. in the same way a raster image is represented by a rectangular grid of pixels. This allows us to exploit the effectiveness of deep learning techniques in image processing. Although the principles governing the proposed algorithm can, in principle, be extended to three-dimensional regions, we focus on reconstructing the three-dimensional field in a two-dimensional plane for the sake of simplicity. We further assume that the enclosures of interest consist of rectangular rooms corresponding to domestic standards 29 .
In particular, the function that we sample and reconstruct is a discrete version of the magnitude of the Fourier transform of the sound field in a given frequency band. We show in the following how reconstructing this function is connected to the well-known concepts of image inpainting and superresolution. Let us first denote the spatio-temporal sound field in a three-dimensional rectangular room R = (0, a) × (0, b) × (0, c), for some a, b, c > 0, as p(r, t) with r ∈ R. The magnitude of its Fourier transform is given by

s(r, ω) := | ∫ p(r, t) e^{-jωt} dt |

for ω ∈ ℝ and r ∈ R. Initially, given a room, we can define the following rectangular grid as a set on an arbitrary two-dimensional plane, i.e.

D_o := { (x_i, y_j, z_o) ∈ R : x_i = i·a/I, y_j = j·b/J }
for z_o ∈ (0, c), i = 0, ..., I-1, j = 0, ..., J-1, and some integers I, J ≥ 2. Then, the available spatial sample points, denoted as S_o, consist of a subset of D_o. It is important to observe that there is no constraint whatsoever with regard to the pattern that S_o has to form within D_o. This allows us to have, for example, irregularly distributed spatial sample points within the room. For a given excitation frequency ω, the available samples can then be expressed as

{ s(r, ω) : r ∈ S_o }.

Note that the problem of interpolating s(r, ω) to the entire domain D_o from known values in S_o can be viewed as image inpainting-i.e. filling in the missing holes of a raster image. This is motivated by the irregular nature of the sampling pattern. However, we are interested in reconstruction on an even finer rectangular grid in order to capture the small-scale spatial variations of the sound field. In order to do so, we eventually interpolate the sound field to a grid of points corresponding to an upsampled version of the set D_o, i.e.

D_o^{L,P} := { (x_i, y_j, z_o) ∈ R : x_i = i·a/(IL), y_j = j·b/(JP) }
where i = 0, ..., IL-1, j = 0, ..., JP-1, and some integers L, P ≥ 1. In the signal processing community, reconstructing a function on the domain D_o^{L,P} -the high-resolution signal- from knowledge of the function on D_o -the low-resolution signal- is known as superresolution. The reconstruction problem can thus be placed under the inpainting and superresolution framework.
In summary, we aim at designing an estimator g_w with the structure of a neural network whose parameters are real-valued weights w learned from simulated data. In particular, for a given set of frequencies of interest {ω_k}_{k=1}^{K}, the estimator maps the available samples to the field on the fine grid, i.e.

g_w : { s(r, ω_k) }_{r ∈ S_o, k} → { ŝ(r, ω_k) }_{r ∈ D_o^{L,P}, k}.

The goal is then that the error between ŝ(r, ω_k) and s(r, ω_k) is reduced for each frequency point. It is important to note that the available input values to the estimator are of dimension |S_o| × K; thus the network should also be provided with information about the available sample values and their relative locations within D_o. We address this in the following sections by means of a mask on the original grid. Interestingly, the relative distances among the microphone locations will not be needed as an input to the algorithm. Irrespective of the room dimensions, we assume that our algorithm accepts measurements from a rectangular grid-whose absolute size depends on the room size-in the same way an image reconstruction algorithm would learn to recover zoomed-in or zoomed-out input images.
We occasionally use tensors further below in order to represent function values on discrete spatial and frequency domains and as the data structure for the neural network operations. In particular, tensors, irrespective of their order, are denoted by bold uppercase letters, e.g. matrices can be denoted by A ∈ R^{n_1 × n_2} for n_1, n_2 ∈ N. Regarding function values, we interchangeably use the tensor representation. For example, consider {s(r, ω_k)}_{r ∈ D_o^{L,P}, k}; then it is possible to arrange its values into a tensor S ∈ R^{IL × JP × K} whose elements are given by

(S)_{i,j,k} = s((x_i, y_j, z_o), ω_k).
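As a concrete sketch of this tensor arrangement, the following uses the grid parameters adopted later in the paper; the field values here are synthetic placeholders, not simulated pressures:

```python
import numpy as np

I, J, K = 8, 8, 40   # coarse-grid divisions and number of frequency points
L, P = 4, 4          # upsampling factors

# Placeholder field values on the fine grid, indexed as
# S[i, j, k] = s(r_ij, omega_k) with r_ij on the 32-by-32 plane.
rng = np.random.default_rng(0)
S = rng.random((I * L, J * P, K))
```

A single frequency slice `S[..., k]` is then handled exactly like a grayscale raster image.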

III. APPROACH
We propose a learning algorithm capable of estimating the magnitude of the spatial sound field, for a given frequency range, at a predefined number of locations based on very few measurements from irregularly distributed microphones. From the concepts introduced in the previous section, a practical scenario could be described as in Fig. 2. The microphones are assumed to provide the room transfer functions (RTFs) at those particular locations for a given frequency range. It is assumed that these microphones are located in a rectangular grid with a predefined number of points irrespective of the room size. Then, the prediction algorithm provides an estimate of the corresponding RTFs at the desired locations, e.g. as illustrated in Fig. 2. The approach is to train an artificial neural network that learns the structure of these sound fields from thousands of different examples of common domestic rectangular rooms.
The main parts of the algorithm, which we describe in detail in the following sections and illustrate in Fig. 3, can be briefly summarized as follows:
• Dataset: we simulate three-dimensional sound fields, in the frequency band [30, 300] Hz, for thousands of common rectangular rooms. The magnitude of the pressure in the available spatial sample points S_o serves as input to the network after a preprocessing step. The magnitude of the pressure in the finer rectangular grid, i.e. {s(r, ω_k)}_{r ∈ D_o^{L,P}, k}, is then used to train the network in a supervised manner.
• Data Preprocessing: from {s(r, ω_k)}_{r ∈ S_o, k}, we generate a grid version, defined on D_o^{L,P}, consisting of the observed samples, and a mask that encodes the information about the locations of these measurements. This preprocessing step involves completion, scaling, and upsampling operations.
• Neural Network: the architecture learns to predict a scaled version of the two-dimensional function {s(r, ω_k)}_{r ∈ D_o^{L,P}, k} from the preprocessed observed sample values {s(r, ω_k)}_{r ∈ S_o, k} and the mask.
• Data Postprocessing: Estimates the appropriate scaling in order to restore the predicted values to the range of the source data.
The data and code of the proposed algorithm are freely available online 30 .

A. Dataset
We use the Green's function 31 to simulate point source radiation in 5 000 rectangular rooms. Room sizes and proportions are randomly generated following the standards specified in ITU-R BS.1116-3 29 . For each room, the source is placed on the floor at a random location, i.e. at (x_o, y_o, 0) with x_o ∈ (0, a) and y_o ∈ (0, b). The magnitude of the sound field pressure is acquired in the finer rectangular grid D_o^{L,P} with L = P = 4, I = J = 8, and z_o = 0. This essentially divides the room into a grid of 32 by 32 uniformly-spaced points. Given the domestic standards 29 , this corresponds to an average distance between grid samples of 14 cm in the x-axis and 26 cm in the y-axis. We analyze the results with 1/12th octave frequency resolution in the range [30, 300] Hz. This gives K = 40 frequency points. The sound fields generated using this technique are referred to as ground truth sound fields, i.e. s_GT(r, ω_k) := s(r, ω_k) for r ∈ D_o^{L,P} and k = 1, ..., K. A subset of s_GT(r, ω_k) containing the observed samples captured by the microphones, {s_GT(r, ω_k)}_{r ∈ S_o, k}, is used in the preprocessing part.
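As an illustrative sketch of such a simulation, the following evaluates a lossy modal-sum approximation of the Green's function in a rigid-walled rectangular room. The damping parameter `zeta`, the mode truncation `n_modes`, and the omission of modal normalization constants are simplifying assumptions for illustration, not the exact simulation setup of the paper:

```python
import numpy as np
from itertools import product

def greens_magnitude(r, r0, dims, f, c=343.0, zeta=0.05, n_modes=10):
    """Magnitude of a lossy modal-sum Green's function between source r0
    and receiver r in a rigid-walled rectangular room of size dims.
    Illustrative only: damping model and truncation are assumptions."""
    a, b, h = dims
    k = 2.0 * np.pi * f / c                       # wavenumber at frequency f
    g = 0j
    for nx, ny, nz in product(range(n_modes), repeat=3):
        # Modal wavenumber and cosine mode shapes at receiver and source
        kn = np.pi * np.sqrt((nx / a) ** 2 + (ny / b) ** 2 + (nz / h) ** 2)
        psi = (np.cos(nx * np.pi * r[0] / a) * np.cos(ny * np.pi * r[1] / b)
               * np.cos(nz * np.pi * r[2] / h))
        psi0 = (np.cos(nx * np.pi * r0[0] / a) * np.cos(ny * np.pi * r0[1] / b)
                * np.cos(nz * np.pi * r0[2] / h))
        g += psi * psi0 / (kn ** 2 - k ** 2 - 1j * zeta * k)  # lossy denominator
    return abs(g)
```

Evaluating this function on the 32-by-32 fine grid for each of the K frequencies would yield one ground-truth tensor per room.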

B. Preprocessing
This part addresses the processing stage necessary to handle the arbitrary nature of the sampling distribution. In particular, the raw input data is allowed to be variable in size and sampling location. In order to address this, we complete the input data to take values on D_o. This is followed by a scaling operation in order to generalize the predictions for arbitrary sources and receivers. The actual information of where the samples are located within D_o^{L,P} is encoded into a mask-like function. An upsampled version of this processed input data together with this mask comprises the final input to the network.

Completion
We assume that the possible observed pressure values correspond to locations within the coarser grid D_o, which also covers the whole room area. In this paper, the choice of parameters results in D_o being a grid of 8 by 8 points with an average distance between samples of 56 cm in the x-axis and 104 cm in the y-axis. The samples observed are then given by {s_GT(r, ω_k)}_{r ∈ S_o, k}. Irrespective of the structure of S_o -i.e. the number and pattern of observed samples-, the neural network is designed so that the size of the input data is fixed. In order to address this, we introduce a function defined on D_o that, in a sense, completes the acquired data, i.e.

s_c(r, ω_k) := s_GT(r, ω_k) if r ∈ S_o, and s_c(r, ω_k) := max_{r' ∈ S_o} s_GT(r', ω_k) if r ∈ D_o \ S_o,

for each ω_k. In other words, for the locations where no samples are provided-i.e. no microphone is present-, s_c is chosen to take the maximum observed value.
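The completion step can be sketched as follows, assuming the coarse grid and observed locations are represented as NumPy arrays (function and variable names are illustrative):

```python
import numpy as np

def complete(observed, mask):
    """Fill unobserved coarse-grid points with the per-frequency maximum
    of the observed samples.

    observed: (8, 8, K) array; values at unobserved points are ignored
    mask:     (8, 8) boolean array, True where a microphone is present
    """
    s_c = observed.copy()
    for k in range(observed.shape[-1]):
        # Maximum over the observed locations only, per frequency slice
        s_c[..., k][~mask] = observed[..., k][mask].max()
    return s_c
```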

Scaling
We want the proposed method to be independent of the gain in the measurement equipment and the reproduction system. Thus, we introduce a scaling of the sample values s_c in such a way that the range is restricted to [0, 1], i.e.

s_s(r, ω_k) := ( s_c(r, ω_k) − min_{r' ∈ S_o} s_c(r', ω_k) ) / ( max_{r' ∈ S_o} s_c(r', ω_k) − min_{r' ∈ S_o} s_c(r', ω_k) )

for each ω_k. Consequently, the neural network will learn to predict the sound field values in [0, 1]. A postprocessing stage will be added so that the predictions are restored to the original range.
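The scaling can be sketched per frequency slice as follows (names are illustrative; after the completion step, all grid values lie within the observed range, so the result stays in [0, 1]):

```python
import numpy as np

def minmax_scale(s_c, mask):
    """Scale each frequency slice to [0, 1] using the min/max taken over
    the observed sample locations (mask == True)."""
    s_s = np.empty_like(s_c, dtype=float)
    for k in range(s_c.shape[-1]):
        vals = s_c[..., k][mask]
        lo, hi = vals.min(), vals.max()
        s_s[..., k] = (s_c[..., k] - lo) / (hi - lo)
    return s_s
```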

Upsampling
Since we are interested in predicting values in the finer rectangular grid D_o^{L,P}, we transform s_s ∈ R^{8×8×40} into a function s_irr ∈ R^{32×32×40} by means of an upsampling operation applied independently for each ω_k. The original measurements are incorporated into s_c; however, the actual input values to the network are given by s_irr.
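A minimal sketch of the upsampling step; nearest-neighbor replication is an assumption made here for illustration, since the text only specifies that an upsampling operation maps the 8 × 8 grid to the 32 × 32 grid:

```python
import numpy as np

def upsample(s_s, L=4, P=4):
    """Upsample an (I, J, K) grid to (I*L, J*P, K) per frequency slice.
    Nearest-neighbor replication is an illustrative assumption."""
    return np.repeat(np.repeat(s_s, L, axis=0), P, axis=1)
```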

Mask generator
The function s_irr does not provide any information about which values have been originally observed. Thus, we simultaneously generate a mask, defined on the finer grid D_o^{L,P}, that indicates the observed locations, i.e.

m(r, ω_k) := 1 if r ∈ S_o, and m(r, ω_k) := 0 otherwise,

for all ω_k.
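The mask generation can be sketched as follows (indices and names are illustrative; the mask is replicated along the frequency dimension):

```python
import numpy as np

def make_mask(mic_rows, mic_cols, shape=(32, 32, 40)):
    """Binary mask on the fine grid: 1 at the observed microphone
    locations (for all K frequency channels), 0 elsewhere."""
    m = np.zeros(shape, dtype=float)
    m[mic_rows, mic_cols, :] = 1.0
    return m
```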

Input
The input data to the network consists of third-order tensors representing the frequency dimension and the two spatial dimensions, i.e. M ∈ [0, 1]^{32×32×40} and S_irr ∈ [0, 1]^{32×32×40}. It is important to emphasize that the network performs convolutions considering the three dimensions in order to learn the relationships within and between frequency and space.
C. Neural Network

Architecture
We propose a U-net-like deep neural network 32 with partial convolutions 33 in order to predict the magnitude of the sound field pressure in a room. The U-net encoder-decoder structure can learn multi-resolution features of the sound field in the frequency-space domain, i.e. it can capture the sound field variations at different scales in both domains. This is carried out by the encoder downsampling by 2 the feature map at each layer and doubling the filter size. The decoder then reverses this procedure by upsampling the feature map and downsampling by 2 the filter size. Moreover, the decoder, via concatenation, incorporates at the same hierarchical level the feature maps and masks computed by the encoder. In other words, the features from different resolutions in the frequency-space domain are also utilized as an input in the upsampling layers of the decoder. Finally, a 1×1 convolution projects the last feature map to generate the predicted sound field Ŝ_p. Fig. 4 shows a schematic diagram of the architecture.

Partial Convolutions
Unlike traditional convolutions, partial convolutions 33 allow us to compute the output feature maps based solely on the available spatial sample points from the input feature maps. This provides the necessary flexibility to use any number of microphones at irregularly distributed locations. Let w be the sliding convolutional window of size k_h × k_w. Consider further I_w ∈ R^{k_h × k_w × C} and M_w ∈ [0, 1]^{k_h × k_w × C} as corresponding to the C-channel input feature maps and mask within w, respectively. The tensor W ∈ R^{k_h × k_w × C × C'} represents the filter weights and b ∈ R^{C'} is the bias. Partial convolution computes each spatial location value o ∈ R^{C'} in the C'-channel output feature maps as

o := W · (I_w ⊙ M_w) · sum(1)/sum(M_w) + b if sum(M_w) > 0, and o := 0 otherwise,

where sum(·) receives a tensor as an argument and provides the summation of its elements, ⊙ is the Hadamard product, and · is a combination-in different dimensions-of matrix dot products and elementwise summations 21 . The scaling factor sum(1)/sum(M_w), with 1 ∈ R^{k_h × k_w × C} having all entries equal to 1, can be interpreted as a measure of the amount of known information in the input feature maps. Then, the mask M_w is updated at each spatial location m ∈ R^{C'} as follows:

m := 1 if sum(M_w) > 0, and m := 0 otherwise.
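A sketch of a partial convolution at a single window position, following the renormalization and mask-update rules described above (names are illustrative; a real implementation slides this window over all spatial locations):

```python
import numpy as np

def partial_conv_window(I_w, M_w, W, b):
    """Partial convolution at one sliding-window position.

    I_w, M_w: (kh, kw, C) input features and binary mask within the window
    W:        (kh, kw, C, Cout) filter weights
    b:        (Cout,) bias
    Returns the (Cout,) output vector and the updated mask value.
    """
    valid = M_w.sum()
    if valid == 0:
        return np.zeros(W.shape[-1]), 0.0        # no known inputs: output 0
    scale = M_w.size / valid                     # sum(1) / sum(M_w)
    masked = (I_w * M_w)[..., None]              # Hadamard product
    o = (W * masked).sum(axis=(0, 1, 2)) * scale + b
    return o, 1.0                                # mask updates to 1
```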

Loss Function
In order to train the model in a supervised manner, we also use a scaled version of the ground truth so as to be consistent with the output data before postprocessing. The assumption is that this process may also assist the learning process. The scaling is given by

s̄_GT(r, ω_k) := ( s_GT(r, ω_k) − min_{r' ∈ D_o^{L,P}} s_GT(r', ω_k) ) / ( max_{r' ∈ D_o^{L,P}} s_GT(r', ω_k) − min_{r' ∈ D_o^{L,P}} s_GT(r', ω_k) )

for r ∈ D_o^{L,P} and k = 1, ..., K. It is clear then that s̄_GT(r, ω_k) ∈ [0, 1].
As a loss function, we use two terms in order to distinguish between predicted values in the available spatial sample points S_o and its complement under D_o^{L,P}. We first define

L_{S_o} := sum( | M ⊙ (Ŝ_p − S̄_GT) | )

and then

L_{S_o^c} := sum( | (1 − M) ⊙ (Ŝ_p − S̄_GT) | ),

where 1 ∈ R^{32×32×40} with all entries equal to 1, and sum(|·|) acting on a tensor is the summation of the absolute value of its elements. The combined loss function finally takes the form

L := λ_{S_o} L_{S_o} + λ_{S_o^c} L_{S_o^c}.

The best performing loss weights were found after a hyperparameter search on 1 000 validation rooms.
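A sketch of the combined L1 loss split between observed and unobserved points; the weight values below are placeholders, since the best-performing loss weights are found by hyperparameter search:

```python
import numpy as np

def combined_loss(S_pred, S_gt, M, w_obs=1.0, w_unobs=6.0):
    """L1 loss split between observed (mask == 1) and unobserved points.
    The weights w_obs and w_unobs are placeholder values."""
    l_obs = np.abs(M * (S_pred - S_gt)).sum()        # error at microphones
    l_unobs = np.abs((1 - M) * (S_pred - S_gt)).sum()  # error elsewhere
    return w_obs * l_obs + w_unobs * l_unobs
```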

Training Procedure
The model is trained in two different stages using supervised learning. We use 75% of the dataset for training purposes and the remaining 25% for validation. In the first stage, the learning rate is set to 2·10^{-4} and batch normalization is enabled in all layers. In the second stage, the learning rate is set to 5·10^{-5} with batch normalization disabled in all encoding layers. Training the model in multiple stages helps to overcome the error generated when batch normalization computes, in the first stage, the mean and variance over all input values, corresponding to both known and unknown locations. In addition, faster convergence is achieved.

D. Postprocessing
We use linear regression to restore the output of the neural network, ŝ_p, to its original range. Thus, the rescaled version takes the form

ŝ(r, ω_k) = a_k · ŝ_p(r, ω_k) + b_k

for all r ∈ D_o^{L,P} and k = 1, ..., K, where the values a_k, b_k ∈ R are determined through the following least-squares problem

(a_k, b_k) := argmin_{a,b} Σ_{r ∈ S_o} ( a · ŝ_p(r, ω_k) + b − s_GT(r, ω_k) )²

for each k = 1, ..., K. Note that the rescaling operation could be implemented as another neural network that learns the mapping function. However, experiments showed that linear regression provided reasonable performance.
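A sketch of the per-frequency rescaling, fitting a_k and b_k by least squares on the observed microphone positions (function and variable names are illustrative):

```python
import numpy as np

def rescale(s_pred_obs, s_meas_obs, s_pred_full):
    """Fit a_k, b_k by least squares on the observed points, then
    rescale the full prediction for that frequency."""
    a_k, b_k = np.polyfit(s_pred_obs, s_meas_obs, deg=1)
    return a_k * s_pred_full + b_k
```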

IV. RESULTS

A. Evaluation Metrics
We use two different measures of performance for the proposed method. First, we consider the normalized mean square error (NMSE) computed for each frequency point, i.e.

NMSE(ω_k) := ( Σ_{r ∈ D_o^{L,P}} | ŝ(r, ω_k) − s_GT(r, ω_k) |² ) / ( Σ_{r ∈ D_o^{L,P}} | s_GT(r, ω_k) |² ).
The NMSE mainly provides an average absolute squared error over all locations between the reconstructed and the original signals. As a consequence, a high NMSE value may result from a poor local performance even when the reconstruction is accurate in the remaining spatial locations. Therefore, we also use the concept of mean structural similarity 34 (MSSIM) from image processing. This evaluates how well the model predicts the overall shape of the pressure distribution for each frequency point. Moreover, it also provides a measure of performance that is independent of the scaling chosen. Let us first introduce the structural similarity index (SSIM) between two matrices A, B ∈ R^{n×n} as follows

SSIM(A, B) := ( (2 μ_A μ_B + c_1)(2 σ_AB + c_2) ) / ( (μ_A² + μ_B² + c_1)(σ_A² + σ_B² + c_2) ),

where μ is the mean of the corresponding matrix entries, σ² the estimate of the variance of the entries, and σ_AB is the covariance estimate between the entries of A and B. The constants c_1 = (h_1 R)² and c_2 = (h_2 R)², where R is the dynamic range of the entry values, are meant to stabilize the division with a weak denominator. We set h_1 and h_2 to 0.01 and 0.03, respectively. In our scenario, we consider the individual matrices S_k ∈ R^{IL×JP}, i.e. the k-th matrix of tensor S ∈ R^{IL×JP×K}. Now, let {S_k^n(η)}_{n=1}^{N} denote the set of all possible windowed versions of S_k of size η × η. The mean structural similarity is then given by

MSSIM(S_k, Ŝ_k) := (1/N) Σ_{n=1}^{N} SSIM(S_k^n(η), Ŝ_k^n(η))

for each frequency point. In the results presented, we have used η = 7.
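The NMSE computation can be sketched as follows; the MSSIM would be computed analogously by averaging the SSIM over all η × η windows (names are illustrative):

```python
import numpy as np

def nmse(S_hat, S_gt):
    """Normalized mean square error per frequency slice."""
    num = ((S_hat - S_gt) ** 2).sum(axis=(0, 1))
    den = (S_gt ** 2).sum(axis=(0, 1))
    return num / den

def nmse_db(S_hat, S_gt):
    """NMSE expressed in decibels, as reported in the figures."""
    return 10 * np.log10(nmse(S_hat, S_gt))
```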

B. Simulated Data
We assess the reconstruction performance of the proposed method-i.e. the generalization error-by using sound fields simulated in 30 different rooms. We are interested in evaluating the performance with regard to the number of irregularly placed microphones, denoted by n_mic. Thus, given n_mic, we analyze the reconstruction in each room placing the microphones in 10 000 different arrangements, i.e. each realization corresponds to a different S_o. Figures 5 and 6 show, as a function of frequency, the average NMSE in dB and the MSSIM for all rooms and locations tested and different numbers of available microphones.
Results show a general improvement in sound field reconstruction as the number of available microphones is increased. At the same time, performance degrades as the frequency increases. This is in agreement with theoretical results that, given a maximum frequency content, require a higher sampling density for a more robust reconstruction and, given a reconstruction error, impose increasingly strict sampling density constraints whenever higher frequency content is available 11,35 . This suggests that the neural network is subject to the same physical limitations as classical methods when learning the spatial variations of the pressure distribution, i.e. at high frequencies it is hindered by undersampling, and it also requires more observations to improve robustness. For example, the relative improvement as the number of microphones increases is higher at lower frequencies, as opposed to the high-frequency range where more observations have little impact on performance. However, the requirements in terms of sampling density for a particular performance seem to be less stringent than for other methods present in the literature. For example, only n_mic = 5 microphones are able to provide an NMSE below −5 dB for the frequency range considered in common domestic rooms.
It is also important to observe that the L1 loss used during the training stage is suitable for prediction at low frequencies but it underperforms at high frequencies. This could be interpreted as the L1 loss resulting in predictions emphasizing the median value in order to reduce the overall error. This can explain, in the frequency range 100-300 Hz, the more abrupt changes in performance of the MSSIM as opposed to the NMSE.

C. Experimental Data
We test the model optimized for simulated data in a real listening room. The RTFs are estimated for two different source locations on a two-dimensional grid consisting of 32 by 32 points uniformly spaced along the corresponding dimensions. In particular, impulse response measurements were conducted from two 10" loudspeakers on a grid one meter above the floor in a rectangular room of dimensions 4.16 × 6.46 × 2.3 m. The measurements were performed using 4-second exponential sweeps from 0.1 Hz to 24 kHz at a sampling frequency of 48 kHz 36 . These measurements were performed with two microphones, each covering roughly half of the grid. The microphones were a Brüel & Kjaer (B&K) 4192 and a B&K 4133 1/2" condenser microphone connected to a B&K Nexus conditioning amplifier and recorded with an RME Fireface UFX+ sound card. Both microphones were level calibrated at 1 kHz using a B&K 4231 calibrator prior to the measurements. The reverberation time of the room, specified as the arithmetic average of the 1/3 octave T_20 estimates 37 in the range of 32 Hz to 316 Hz, was 0.46 s. Similar to the previous scenario, we investigate the performance of the model with regard to the number of microphones placed in the room. We are particularly interested in assessing the performance when using very few observations. Thus, for each predefined source location, we also use here 5, 15, 35, and 55 microphones in 10 000 different arrangements and analyze the mean performance with a 95% confidence interval. These results are reported in Figures 7 and 8.
It is important to emphasize that the model to be evaluated with experimental data was trained using simulated data. Moreover, the horizontal plane, relative to the floor, in the experimental data does not correspond to its counterpart in the training data. It can be observed that, given n mic , the NMSE improves for decreasing frequencies as a general trend although there exist inconsistencies at a local level, i.e. adjacent frequencies may present abrupt changes in performance. The same interpretation applies to the MSSIM. In particular, there are two specific frequencies acting as outliers, i.e. 82 Hz and 157 Hz for the two different source locations. In this case, this is likely to be caused by the sources being positioned at nulls of the room modes. Fig. 9 depicts a representation of the magnitude of the sound field when the reconstruction is performed using only 5 microphones.
Despite the mismatch between the training and test scenarios, the network shows promising results on unseen data. Note that the simulation model used to construct the dataset is an oversimplified description of the sound field behavior in a real room. It has been shown that the structure of convolutional neural networks already represents a prior conditioning the network to perform well on image-like signals 38 . The magnitude of the spatial sound field naturally fits the latter. Further, this can also be interpreted as a transfer learning 39 approach where the architecture itself helps to generalize well in the experimental scenario from weights learned only with simulated data.

D. Computational Complexity
Apart from the reduced number of microphones used, another advantage of the proposed method is the computational complexity regarding the inference operation. The training stage is usually time consuming, but it can often be run offline. The model size is relatively small with 3.9 million parameters resulting in a deterministic inference time of approximately 0.05 s on a Nvidia GeForce GTX 1080 Ti GPU-value estimated from 100 different room predictions.

E. Microphone Distribution
In our analysis, we have mainly focused on the performance based on the number of observations. However, we are also interested in studying the impact that particular microphone distributions have on the performance. Fig. 10 shows an illustration of the best and worst performing microphone distributions in terms of the NMSE. It can be observed that a better reconstruction at a specific frequency is achieved when the microphones capture the maximum variation of the pressure values. On the contrary, if the observations consist solely of the dip-like part of the room modes, the reconstruction degrades significantly. Evidently, this effect is frequency dependent, thus there is not a microphone setup that performs well across all frequencies. However, this also suggests that an unstructured microphone arrangement may be more likely to avoid these sampling issues caused by the modal structure.

V. CONCLUSIONS
In this paper, a deep-learning-based method for sound field reconstruction in rectangular rooms has been proposed and examined. The method jointly performs inpainting and superresolution in order to reconstruct the magnitude of the sound pressure in a two-dimensional plane of a three-dimensional room. The focus of this work is threefold: using a very low number of microphones, accommodating irregular microphone distributions, and achieving efficient inference. The results suggest a performance which offers advantages in these three directions, e.g. even with 5 arbitrarily placed microphones, the method provides an acceptable reconstruction error with a low inference time.
Regarding future work, the study of generative adversarial networks as discriminators may help to increase the performance at high frequencies. In addition, using more complex acoustic simulation models during the training stage could overcome performance inconsistencies at a local level as well as providing a lower generalization error when using experimental data.

ACKNOWLEDGMENTS
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 812719.