Multiple scattering ambisonics: three-dimensional sound field estimation using interacting spheres

Rigid spherical microphone arrays (RSMAs) have been widely used in ambisonics sound field recording. While it is desirable to combine the information captured by a grid of densely arranged RSMAs to expand the area of accurate reconstruction, or sweet-spot, this is not trivial due to inter-array interference. Here we propose multiple scattering ambisonics, a method for three-dimensional ambisonics sound field recording using multiple acoustically interacting RSMAs. Numerical experiments demonstrate the sweet-spot expansion realized by the proposed method. The proposed method can use existing RSMAs as building blocks and opens up possibilities including spatial audio with higher degrees of freedom.


Introduction
Audio is indispensable in immersive technologies such as mixed reality (MR) and virtual reality (VR), which are receiving much attention. For these applications, it is essential to develop technologies to capture, process, and render spatial sound fields with high precision in order to present truly realistic and immersive MR/VR experiences. Ambisonics [1] and higher-order ambisonics (HOA) [2] are established spatial audio frameworks that capture, process, and reproduce spatial sound fields based on their representation in the spherical harmonics domain. They are attracting growing interest due to the popularization of MR/VR platforms [3,4] and their high compatibility with first-person-view MR/VR. Ambisonics capturing and processing consists of a microphone array and signal processing that encodes the raw microphone array signal into a spatial description format in the spherical harmonics domain, which is referred to as the ambisonics signal. This ambisonics signal is decoded into the signal that is fed to loudspeaker arrays to render the spatial sound field. Such loudspeaker arrays are often virtualized by means of binaural technologies [5,6,7] and played back via headphones, which explains the high compatibility of ambisonics with MR/VR applications that usually use headphones for audio playback. Because ambisonics is formulated in the spherical harmonics domain, a typical ambisonics recording device employs a spherical microphone array (SMA) [1,8,2,9,10]. SMAs are often mounted on sound-hard spherical scatterers, both to avoid the instability that arises in encoding filters for hollow microphone arrays due to singularities originating from the roots of the spherical Bessel function [2] and for the mechanical stability of the hardware. This form of SMA is referred to as a rigid SMA (RSMA).
Despite its success in first-person immersive audio with only three degrees of freedom (DoF), which are associated with the rotation of the listener, ambisonics suffers from a shrinking area of accurate reconstruction, referred to as the sweet-spot, as the frequency increases. This limits its efficacy in higher-DoF spatial audio reproduction that allows translation of the listener. This is visualized in Fig. 1 (left), which shows the reconstruction sweet-spots for incident plane waves of various frequencies. Here, the sweet-spot is defined as the region where the signal-to-distortion ratio (SDR) of the estimated field with respect to the ground truth incident field is above 30 dB. The simplest way to expand the sweet-spot of ambisonics reproduction is to develop RSMAs with a larger number of microphones. Although this approach is effective, it comes with significant development and device costs. An alternative approach is to combine multiple existing RSMAs and integrate the captured information. However, this is not a trivial task due to inter-array interference. Here, we propose multiple scattering higher-order ambisonics (MS-HOA), a three-dimensional (3D) sound field capturing scheme that uses multiple RSMAs and fully accounts for the inter-array interaction due to multiple scattering [11]. Numerical experiments show that MS-HOA successfully creates sound field representations with expanded sweet-spots even when the RSMAs are densely arranged with small spacing, which is not achieved without considering inter-array interaction. An example sound field recording and reproduction setup allowing translation of the listener is illustrated in Fig. 1 (right).

Conventional ambisonics encoding using a single RSMA
The conventional framework of ambisonics encoding using a single RSMA is briefly reviewed. Ambisonics encoding and decoding can be performed either by solving a linear system using least squares [2] or by spherical harmonic transformation using numerical integration [12]. Since the first approach allows more flexibility in the microphone array configuration, it is adopted here. In the present work, all formulations are presented in the frequency domain; they can be converted to time-domain representations by the inverse Fourier transform. All individual microphone capsules are assumed to be omnidirectional. The spherical harmonics used are defined as

$$Y_n^m(\theta, \phi) = (-1)^m \sqrt{\frac{2n+1}{4\pi} \frac{(n-|m|)!}{(n+|m|)!}} \, P_n^{|m|}(\cos\theta) \, e^{im\phi}, \tag{1}$$

with $\theta$ and $\phi$ the polar and azimuthal angle, respectively, and $P_n^m(x)$ and $P_n(x)$ respectively the associated and regular Legendre polynomials:

$$P_n^m(x) = (-1)^m (1 - x^2)^{m/2} \frac{d^m}{dx^m} P_n(x), \qquad P_n(x) = \frac{1}{2^n n!} \frac{d^n}{dx^n} (x^2 - 1)^n. \tag{2}$$

The above definition of spherical harmonics provides an orthonormal basis:

$$\int_0^{2\pi} \int_0^{\pi} Y_n^m(\theta, \phi) \, Y_{n'}^{m'}(\theta, \phi)^* \sin\theta \, d\theta \, d\phi = \delta_{nn'} \delta_{mm'}, \tag{3}$$

with $\delta_{ij}$ the Kronecker delta. The process of obtaining the ambisonics signal $A_n^m(k)$, i.e. the weights of the spherical basis functions of the three-dimensional sound field representing an arbitrary incident field of wavenumber $k$, from the signal captured by the microphone array is referred to as ambisonics encoding. An arbitrary incident field can be expanded in terms of the regular spherical basis functions $j_n(kr) Y_n^m(\theta, \phi)$ of the three-dimensional Helmholtz equation in the spherical coordinate system $(r, \theta, \phi)$:

$$p_{\mathrm{in}}(r, \theta, \phi) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} A_n^m(k) \, j_n(kr) \, Y_n^m(\theta, \phi), \tag{4}$$

with $j_n(x)$ the spherical Bessel function of degree $n$. The total field $p_{\mathrm{tot}}$, which is the sum of the incident field and the field scattered by a rigid sphere with radius $R$ located at the origin $O$, is given by:

$$p_{\mathrm{tot}}(r, \theta, \phi) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} A_n^m(k) \left[ j_n(kr) - \frac{j_n'(kR)}{h_n'(kR)} h_n(kr) \right] Y_n^m(\theta, \phi), \tag{5}$$

with $h_n(x)$ the spherical Hankel function of the first kind of degree $n$ [13] and the prime denoting differentiation with respect to the argument.
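The rigid-sphere radial factor $j_n(kr) - \frac{j_n'(kR)}{h_n'(kR)} h_n(kr)$ of the total field can be checked numerically. The following is a minimal sketch, assuming SciPy is available; the helper names, the sound speed of 343 m/s, and the 1 kHz test frequency are illustrative choices and not taken from the paper. It evaluates the factor on the sphere surface and compares it with the closed form $i / ((kR)^2 h_n'(kR))$ that follows from the Wronskian of the spherical Bessel functions.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn


def spherical_hn1(n, x, derivative=False):
    """Spherical Hankel function of the first kind, h_n = j_n + i*y_n."""
    return (spherical_jn(n, x, derivative=derivative)
            + 1j * spherical_yn(n, x, derivative=derivative))


def rigid_sphere_radial_factor(n, k, r, R):
    """Radial factor j_n(kr) - j_n'(kR)/h_n'(kR) * h_n(kr) of the total field."""
    djn = spherical_jn(n, k * R, derivative=True)
    dhn = spherical_hn1(n, k * R, derivative=True)
    return spherical_jn(n, k * r) - djn / dhn * spherical_hn1(n, k * r)


# Illustrative values (assumptions): 1 kHz, c = 343 m/s, 8 cm sphere radius.
k = 2 * np.pi * 1000.0 / 343.0
R = 0.08
degrees = np.arange(0, 6)

# Factor evaluated on the sphere surface, r = R.
on_surface = rigid_sphere_radial_factor(degrees, k, R, R)

# Closed form obtained from the Wronskian j_n h_n' - j_n' h_n = i / x^2.
closed_form = 1j / ((k * R) ** 2 * spherical_hn1(degrees, k * R, derivative=True))

print(np.max(np.abs(on_surface - closed_form)))
```

The two expressions should agree to within floating-point rounding for every degree, which is what allows the surface pressure to be written with a single term per degree.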
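On the surface of the rigid sphere, i.e. $r = R$, the Wronskian relation $j_n(x) h_n'(x) - j_n'(x) h_n(x) = i/x^2$ allows this total field to be evaluated as:

$$p_{\mathrm{tot}}(R, \theta, \phi) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} A_n^m(k) \, \frac{i}{(kR)^2 h_n'(kR)} \, Y_n^m(\theta, \phi). \tag{6}$$

The total field captured by the $q$-th microphone on the surface of the RSMA located at $(R, \theta_q, \phi_q)$ is therefore given by:

$$p_{\mathrm{tot}}^{(q)} = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} A_n^m(k) \, \frac{i}{(kR)^2 h_n'(kR)} \, Y_n^m(\theta_q, \phi_q). \tag{7}$$

By truncating the infinite series at $n = N_c$, this result can be represented in the following vector form:

$$\mathbf{p}_{\mathrm{tot}} = \Lambda \mathbf{A}, \tag{8}$$

where $\mathbf{p}_{\mathrm{tot}}$ is a vector holding $p_{\mathrm{tot}}^{(q)}$ in its $q$-th entry, $\mathbf{A}$ is a vector holding $A_n^m(k)$ in its $(n^2 + n + m + 1)$-th entry, and $\Lambda$ is the matrix holding $\frac{i}{(kR)^2 h_n'(kR)} Y_n^m(\theta_q, \phi_q)$ in its $(q, n^2 + n + m + 1)$ entry. The goal of ambisonics encoding is to obtain $A_n^m(k)$ for all $n$ and $m$ up to the truncation degree $n = N_c^{(\mathrm{in})}$, i.e. $0 \le n \le N_c$ and $|m| \le n$, from the observation $\mathbf{p}_{\mathrm{tot}}$. This problem can be solved by regularized least squares with the minimization objective:

$$\min_{\mathbf{A}} \; \left\| \mathbf{p}_{\mathrm{tot}} - \Lambda \mathbf{A} \right\|_2^2 + \sigma \left\| \mathbf{A} \right\|_2^2, \tag{9}$$

with $\sigma$ a regularization parameter, and the solution given by:

$$\mathbf{A} = E \, \mathbf{p}_{\mathrm{tot}}, \tag{10}$$

where $E \equiv (\Lambda^H \Lambda + \sigma I)^{-1} \Lambda^H$ is the regularized encoding matrix.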
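The regularized least-squares encoding step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and the matrix `Lam` is filled with random complex numbers as a stand-in for the physical matrix $\Lambda$, whose entries would require spherical harmonics and Hankel functions. The helper `coeff_index` gives the zero-based counterpart of the one-based position $n^2 + n + m + 1$ used in the text.

```python
import numpy as np


def coeff_index(n, m):
    """Zero-based position of (n, m); the text's n^2 + n + m + 1 is one-based."""
    return n * n + n + m


def regularized_encoder(Lam, sigma):
    """E = (Lam^H Lam + sigma I)^{-1} Lam^H, computed with a linear solve."""
    gram = Lam.conj().T @ Lam
    return np.linalg.solve(gram + sigma * np.eye(Lam.shape[1]), Lam.conj().T)


rng = np.random.default_rng(0)
n_mics, N_c = 64, 3
n_coeffs = (N_c + 1) ** 2  # number of (n, m) pairs with 0 <= n <= N_c

# Placeholder for Lambda: random complex entries, NOT a physical array model.
Lam = (rng.standard_normal((n_mics, n_coeffs))
       + 1j * rng.standard_normal((n_mics, n_coeffs)))

# Synthetic "true" coefficients and the resulting noiseless observation.
A_true = rng.standard_normal(n_coeffs) + 1j * rng.standard_normal(n_coeffs)
p_tot = Lam @ A_true

# Encode with a very small regularization parameter.
A_est = regularized_encoder(Lam, sigma=1e-10) @ p_tot
print(np.max(np.abs(A_est - A_true)))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice; with noisy observations a larger `sigma` trades bias for robustness.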

Proposed method
In the proposed method, a grid of multiple RSMAs is used to estimate $p_{\mathrm{in}}$ in (4). The goal of ambisonics encoding in MS-HOA is to estimate the coefficients $A_n^m(k)$ in (4), for $0 \le n \le N_c$ and $|m| \le n$, from observations of the sound pressure at discrete microphone capsule positions mounted on the surfaces of multiple RSMAs. In the following, a system of $N_S \ge 2$ RSMAs is considered, where the RSMA with index $s$ has a radius $a_s$. Hereafter, the argument $k$ is omitted from $A_n^m(k)$.

The forward problem: multiple scattering due to an arbitrary incident field
It is known that the problem of multiple scattering in a system of multiple spherical scatterers, i.e. computing the scattered field $p_{\mathrm{scat}}$ given $p_{\mathrm{in}}$ and the configuration of the scattering spheres, can be solved analytically [14,11]. This problem is referred to as the forward problem. The procedure for solving the forward problem is briefly described here. First, $A_n^m$, the expansion coefficients at $O$ in (4) truncated at degree $n = N$ … in its $(n^2 + n + m + 1)$-th entry. $S$ is referred to as the system matrix, which is a block matrix holding the inter-sphere translation operator $T_{S|R}^{(s,t)}$ from the $t$-th sphere to the $s$-th sphere in its off-diagonal $(s,t)$-block and the "single scattering matrix" $\Lambda^{(s)}$ in its diagonal blocks:

$$S = \begin{pmatrix} \Lambda^{(1)} & T_{S|R}^{(1,2)} & \cdots & T_{S|R}^{(1,N_S)} \\ T_{S|R}^{(2,1)} & \Lambda^{(2)} & \cdots & T_{S|R}^{(2,N_S)} \\ \vdots & \vdots & \ddots & \vdots \\ T_{S|R}^{(N_S,1)} & T_{S|R}^{(N_S,2)} & \cdots & \Lambda^{(N_S)} \end{pmatrix},$$

where $\Lambda^{(s)}$ is a diagonal matrix holding $-\frac{h_n'(ka_s)}{j_n'(ka_s)}$ in its $(l, l)$ entry with $l = n^2 + n + m + 1$. The translation operators $T_{S|R}$ can be computed by various methods, including explicit expressions based on Clebsch-Gordan coefficients or Wigner 3-j symbols [15], or methods based on recurrence relations [16]. The total field $p_{\mathrm{tot}}$ evaluated at $\mathbf{r}_q^{(s)}$, the $q$-th microphone position belonging to the $s$-th RSMA, is the sum of the scattered field contributions from all the RSMAs and the incident field $p_{\mathrm{in}}$: … where $R_n^{m(s)}(\mathbf{r}$ … Alternatively, $p_{\mathrm{in}}(\mathbf{r}_q^{(s)})$ could be evaluated directly using $A_n^m$ instead of the translated $A_n^{m(s)}$ coefficients. The whole procedure of the forward problem can be expressed by a linear operator $T_F$, which is referred to as the forward operator:

$$\mathbf{p}_{\mathrm{tot}} = T_F \mathbf{A},$$

where $\mathbf{p}_{\mathrm{tot}}$ is a vector holding the values of $p_{\mathrm{tot}}(\mathbf{r}_q^{(s)})$.
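The block layout of the system matrix can be sketched as follows. This is only a structural illustration under stated assumptions: the diagonal entries are taken to be $-h_n'(ka_s)/j_n'(ka_s)$ with primes denoting derivatives, the function names are hypothetical, and the translation blocks are random placeholders because computing actual translation operators (via Clebsch-Gordan coefficients or recurrences) is beyond the scope of a short example.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn


def single_scattering_diagonal(N, ka):
    """Diagonal of Lambda^(s): -h_n'(ka)/j_n'(ka), repeated 2n+1 times per degree n.

    Assumes j_n'(ka) is nonzero for all degrees up to N.
    """
    n = np.arange(N + 1)
    djn = spherical_jn(n, ka, derivative=True)
    dhn = djn + 1j * spherical_yn(n, ka, derivative=True)
    return np.repeat(-dhn / djn, 2 * n + 1)


def assemble_system_matrix(diagonals, translations):
    """Block matrix: diag(diagonals[s]) on the diagonal, translations[(s, t)] elsewhere."""
    n_spheres = len(diagonals)
    blocks = [[np.diag(diagonals[s]) if s == t else translations[(s, t)]
               for t in range(n_spheres)]
              for s in range(n_spheres)]
    return np.block(blocks)


rng = np.random.default_rng(0)
N, n_spheres, ka = 2, 3, 1.4655   # illustrative values (assumptions)
size = (N + 1) ** 2               # coefficients per sphere

diagonals = [single_scattering_diagonal(N, ka) for _ in range(n_spheres)]

# Random placeholders standing in for the translation operators T^(s,t).
translations = {
    (s, t): rng.standard_normal((size, size)) + 1j * rng.standard_normal((size, size))
    for s in range(n_spheres) for t in range(n_spheres) if s != t
}

S = assemble_system_matrix(diagonals, translations)
print(S.shape)
```

Each sphere contributes one block row and one block column of size $(N+1)^2$, so the dense system grows quadratically with the number of spheres.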

The inverse problem: MS-HOA encoding
The matrix representing $T_F$ can be constructed by applying the operator to all basis functions up to degree $n \le N_c^{(\mathrm{in})}$. The estimate of the incident field can then be obtained via regularized least squares:

$$\mathbf{A}^{(\mathrm{est})} = T_I \, \mathbf{p}_{\mathrm{tot}},$$

where $\mathbf{A}^{(\mathrm{est})}$ is a vector holding the estimated coefficients $A_n^{m(\mathrm{est})}$ in its $(n^2 + n + m + 1)$-th entry up to $n \le N_c^{(\mathrm{in})}$ and $T_I \equiv (T_F^H T_F + \sigma I)^{-1} T_F^H$ is the encoding matrix for MS-HOA with $\sigma$ a regularization parameter. The scheme of the forward and inverse problem is summarized in Fig. 2.
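Constructing the matrix of a linear operator by applying it to each basis vector, followed by the regularized inversion, might look like the sketch below. It assumes a callable `apply_forward` that maps a coefficient vector to microphone pressures; here that callable is a hypothetical stand-in backed by a random matrix, whereas a real one would solve the multiple scattering forward problem. All names, sizes, and the noise level are illustrative.

```python
import numpy as np


def operator_to_matrix(apply_forward, n_coeffs):
    """Build the matrix of a linear operator column by column from unit vectors."""
    columns = []
    for l in range(n_coeffs):
        e = np.zeros(n_coeffs, dtype=complex)
        e[l] = 1.0                      # l-th basis: a single unit coefficient
        columns.append(apply_forward(e))
    return np.stack(columns, axis=1)


def encoding_matrix(T_F, sigma):
    """T_I = (T_F^H T_F + sigma I)^{-1} T_F^H."""
    gram = T_F.conj().T @ T_F
    return np.linalg.solve(gram + sigma * np.eye(T_F.shape[1]), T_F.conj().T)


rng = np.random.default_rng(1)
n_mics, n_coeffs = 120, 25

# Hypothetical stand-in for the forward solver (NOT a physical model).
hidden = (rng.standard_normal((n_mics, n_coeffs))
          + 1j * rng.standard_normal((n_mics, n_coeffs)))


def apply_forward(A):
    return hidden @ A


T_F = operator_to_matrix(apply_forward, n_coeffs)

# Synthetic observation with a small amount of additive noise.
A_true = rng.standard_normal(n_coeffs) + 1j * rng.standard_normal(n_coeffs)
noise = 1e-3 * (rng.standard_normal(n_mics) + 1j * rng.standard_normal(n_mics))
p_tot = apply_forward(A_true) + noise

A_est = encoding_matrix(T_F, sigma=1e-3) @ p_tot
rel_err = np.linalg.norm(A_est - A_true) / np.linalg.norm(A_true)
print(rel_err)
```

Because the forward operator depends only on the array geometry and frequency, the matrix and its regularized inverse can be computed once per frequency and reused for every recording.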

Numerical experiments
MS-HOA recording and encoding into HOA coefficients were validated by numerical experiments. Grids of RSMAs were considered in which each individual RSMA is a 252-channel SMA mounted on a rigid spherical scatterer with a radius of 8 cm. The spherical Fibonacci grid [17,10] of 252 points was used for the microphone capsule positions. A real-world implementation of a 252-channel RSMA of similar size has been demonstrated in the past [9]. Two RSMA grids were used in the experiments: a linear grid of 6 RSMAs and a regular Cartesian grid of 9 RSMAs. The spacing between nearest-neighbour RSMAs was set to 25 cm. The sound field generated by a monopole source located at $\mathbf{r}_s = (10\,\mathrm{m}, 10\,\mathrm{m}, 10\,\mathrm{m})$ was used as the incident field. The signal captured by the grid of RSMAs was encoded into the HOA coefficients $\mathbf{A}^{(\mathrm{est})}$ with the proposed method (MS-HOA). While prior works on the forward problem report heuristics for choosing the parameter $N_c^{(\mathrm{fwd})}$, e.g. $N_c^{(\mathrm{fwd})} = eka$ [14], here $N_c^{(\mathrm{fwd})}$ was treated as a free hyper-parameter. Two baselines were also computed: the case where inter-sphere interaction is switched off, i.e. the method which only considers single scattering (Single), and conventional HOA encoding using only one building-block RSMA (HOA). The analytical reconstruction of the estimated incident field was computed by (4) and compared to the ground truth incident field $p_{\mathrm{in}}$ in terms of the SDR and the size of the reconstruction sweet-spot area (SSA), measured in the xy-plane or the yz-plane depending on the configuration of the RSMA grid. The SSA is defined here as the total area where the SDR exceeds 30 dB, measured on the regular Cartesian grid points of a plane, which correspond to the pixels in Figs. 3 and 4. The regularization hyperparameter was optimized by grid search independently for the Single baseline and the proposed MS-HOA.
Regularization was not applied to the single-sphere HOA baseline because of its minor effect in this case, while the truncation number $N_c$ was chosen as the one providing the largest SSA for the given RSMA. The results for the linear 6-sphere RSMA grid with an incident field frequency of 4 kHz are shown in Fig. 3. The sweet-spot of reconstruction is successfully expanded with MS-HOA, while the SSA and SDR are significantly degraded if only single scattering is considered. The results for the regular Cartesian 9-sphere grid are shown in Fig. 4, demonstrating planar expansion of the sweet-spot.
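The SDR and SSA metrics can be computed on a pixel grid along the following lines. This sketch assumes a pointwise SDR of $10 \log_{10}(|p_{\mathrm{in}}|^2 / |p_{\mathrm{est}} - p_{\mathrm{in}}|^2)$, which the text does not spell out, and it uses a synthetic error model (relative error growing as the squared distance from the origin) purely for illustration; none of the numbers below are results from the experiments.

```python
import numpy as np


def sdr_db(p_true, p_est, floor=1e-30):
    """Pointwise SDR in dB: 10*log10(|p_true|^2 / |p_est - p_true|^2)."""
    sig = np.abs(p_true) ** 2
    err = np.abs(p_est - p_true) ** 2
    # The floor avoids division by zero and log of zero at error-free pixels.
    return 10.0 * np.log10(np.maximum(sig, floor) / np.maximum(err, floor))


def sweet_spot_area(sdr_map, pixel_area, threshold_db=30.0):
    """Total area of the pixels whose SDR exceeds the threshold."""
    return np.count_nonzero(sdr_map > threshold_db) * pixel_area


# Regular Cartesian grid on a 1 m x 1 m plane with 1 cm pixels.
x = np.linspace(-0.5, 0.5, 101)
X, Y = np.meshgrid(x, x)

# Synthetic fields (assumptions): a 4 kHz plane wave along x with c = 343 m/s,
# and an "estimate" whose relative error equals the squared distance in metres.
k = 2 * np.pi * 4000.0 / 343.0
p_true = np.exp(1j * k * X)
p_est = p_true * (1.0 + X ** 2 + Y ** 2)

sdr_map = sdr_db(p_true, p_est)
ssa = sweet_spot_area(sdr_map, pixel_area=(x[1] - x[0]) ** 2)
print(ssa)
```

For this synthetic error model the 30 dB contour is a circle of radius $10^{-0.75}$ m, so the pixel count should approximate an area of $\pi \cdot 10^{-1.5}\ \mathrm{m}^2$; a finer grid reduces the discretization error of the area estimate.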

Related work and discussion
Multiple scattering ambisonics, a method to capture 3D sound fields using multiple acoustically interacting RSMAs, was proposed. MS-HOA integrates the information captured by multiple densely arranged RSMAs and can be used to expand the reconstruction sweet-spots in 3D sound field reproduction. The numerical experiments demonstrated that the proposed method successfully captures spatial sound fields with expanded reconstruction sweet-spots, which was not possible without considering the inter-array interaction due to multiple scattering.
A related method using the translation of multipoles was introduced in [18]. This method was based on the assumption that the SMAs do not physically interact with each other, i.e. the SMAs do not cause scattering that affects other SMAs. This assumption is violated if the SMAs are densely arranged RSMAs, which scatter the incident field and interact with each other by multiple scattering. As shown in the numerical experiments, the approach without inter-array interaction becomes inaccurate if the RSMAs are arranged with small spacing. Recently, accounting for inter-array multiple scattering has been demonstrated to improve the reconstruction accuracy in a two-dimensional sound field reconstruction problem using multiple cylindrical microphone arrays [19]. Two-dimensional modeling, however, is insufficient for modern spatial audio applications such as MR/VR, where 3D audio representation and rendering are essential. Our work enables the use of interacting rigid microphone arrays for 3D spatial audio.
The expanded reconstruction sweet-spots with linear or planar spreads realized by the proposed method could be useful in applications including sound field reproduction in theaters or meeting rooms, where the sweet-spot should cover multiple listeners sitting next to each other, or higher-DoF MR/VR, where the translation of the listener needs to be supported. Developing techniques to reduce the cost of MS-HOA recording in terms of hardware, computation, and bandwidth is important for practical applications and is a subject of future research.