EFFECT OF MULTIPOLE DICTIONARY IN SPARSE SOUND FIELD DECOMPOSITION FOR SUPER-RESOLUTION IN RECORDING AND REPRODUCTION

A sound field decomposition method using a multipole dictionary for super-resolution in recording and reproduction is proposed. Conventional methods are based on plane-wave decomposition; however, this procedure leads to severe spatial aliasing artifacts at frequencies above the spatial Nyquist frequency, which is determined by the intervals between microphones. We previously proposed a sparse sound field decomposition method based on a model consisting of near-field monopole source and far-field plane-wave components. Although this method enables super-resolution in recording and reproduction, a dictionary of monopole sources alone is not appropriate in several cases because the region of the near-field source components is discretized as grid points. We propose the use of a multipole dictionary to accurately represent acoustically compact sources with complex directivity. We show that this multipole dictionary is also effective for representing source components off the grid. Numerical simulations are performed to evaluate the proposed method.


Introduction
Sound field decomposition is a fundamental problem in sound field analysis, reconstruction, and visualization. The objective of sound field decomposition is to represent a sound field as a linear combination of fundamental solutions of the Helmholtz equation from pressure measurements. In sound field recording and reproduction, this makes it possible to calculate the driving signals of loudspeakers required for reproduction from the signals received by microphones, which is referred to as sound-pressure-to-driving-signal (SP-DS) conversion. Plane-wave decomposition, which corresponds to spatial Fourier analysis of a sound field [1], has been commonly used because of its computational efficiency. In recent years, the sparse representation of a sound field has been proved to be effective in several applications, such as acoustic holography [2], source localization [3], and sound field recording and reproduction [4], owing to the recent development of sparse decomposition algorithms in the context of compressed sensing [5]. We here consider a sparse sound field decomposition problem for recording and reproduction.
Sound field recording and reproduction is targeted at high-fidelity audio systems. Wave field synthesis (WFS) [6] is a sound field synthesis method based on the Kirchhoff-Helmholtz or Rayleigh integrals, which is generally applied to a planar or linear array of loudspeakers. Since it is necessary to obtain the distribution of the sound pressure gradient of the desired sound field on the secondary sources, WFS cannot be directly applied to SP-DS conversion. However, the wave field reconstruction (WFR) filtering method enables efficient SP-DS conversion [7], and it has been extended to several array geometries [8,9]. Higher-order ambisonics (HOA) is based on a sound field representation in the spherical or circular harmonic domain [10,11]. In HOA, a sound field is generally analyzed and synthesized using spherical or circular arrays of microphones and loudspeakers, and it can be applied to SP-DS conversion through encoding and decoding processes [10]. These methods can be classified as analytical approaches to recording and reproduction.
Another type of sound field reproduction method involves a numerical approach, which is typically based on controlling the sound pressures at discrete positions in a target area [12]. We define this method as the sound pressure control (SPC) method. The inverse of a known transfer function matrix between the loudspeakers and the control points is numerically calculated. Since only the desired sound pressures at the control points are required to obtain the driving signals of the loudspeakers, these methods can also be applied to SP-DS conversion.
As described above, the analytical approaches to recording and reproduction are generally based on spatial Fourier analysis of the captured sound field. Although this procedure makes stable and efficient signal conversion possible, artifacts originating from spatial aliasing notably occur, depending on the interelement spacing of the microphones and loudspeakers. These artifacts cannot be avoided even when a numerical approach is applied. Owing to the significant effect of the spatial aliasing artifacts, listeners cannot clearly localize the reproduced sound images. Furthermore, the frequency characteristics of the reproduced direct sound are greatly affected.
We previously proposed a sound field decomposition method based on a model consisting of near-field monopole source and far-field plane-wave components [4,13]. Under the assumption that the source distribution inside the predefined near-field region of the microphones is spatially sparse, the captured sound field is decomposed into these two components using a sparse decomposition algorithm. This method makes it possible to improve the reproduction accuracy at frequencies above the spatial Nyquist frequency when the number of microphones is smaller than that of loudspeakers, which can be regarded as super-resolution in recording and reproduction. However, the use of only monopole source components as a dictionary matrix is not sufficient to represent acoustically compact sources with complex directivity because the near-field region must be discretized as grid points. We propose the use of a multipole dictionary for sparse sound field decomposition and show that it is also effective for representing source components off the grid. Numerical simulations are conducted to compare the methods using the monopole and multipole dictionaries.

Generative model of sound field and its sparse decomposition

Sound field consisting of source and plane-wave components
We briefly revisit the generative model of the sound field proposed in Ref. [4]. As shown in Fig. 1, we divide the sound field in the recording area into internal and external regions, i.e., the near-field and far-field regions of a closed surface. The internal region is denoted as Ω. Sound pressures are measured inside Ω using multiple microphones. It is assumed that source components exist only inside Ω. When the sound pressure of temporal frequency ω at position r is denoted as p(r, ω), the following equation should be satisfied:

(∇² + k²) p(r, ω) = −Q(r, ω)  (r ∈ Ω),  (∇² + k²) p(r, ω) = 0  (r ∉ Ω),  (1)

where Q(r, ω) is the distribution of the source components inside Ω and k = ω/c is the wave number obtained by setting the sound speed as c. Hereafter, ω is omitted for notational simplicity. Equation (1) indicates that p(r) satisfies the inhomogeneous and homogeneous Helmholtz equations at r ∈ Ω and r ∉ Ω, respectively. The solution of Eq. (1) can be represented as the sum of the inhomogeneous and homogeneous terms, p_i(r) and p_h(r), respectively. Here, p_i(r) is represented as the convolution of Q(r) and the three-dimensional free-field Green's function G(r|r′) as [1]

p_i(r) = ∫_Ω G(r|r′) Q(r′) dr′,  (2)

where

G(r|r′) = e^{jk|r−r′|} / (4π|r−r′|).  (3)

Here, G(r|r′) corresponds to the transfer function between the monopole source at r′ and the position r. Since it is assumed that sound sources do not exist outside Ω, the homogeneous term p_h(r) can be represented as the sum of plane waves [1]. Therefore, we define these two terms p_i(r) and p_h(r) as the source and plane-wave components, respectively.
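As a numerical illustration, the monopole transfer function of Eq. (3) can be evaluated directly. This is a minimal sketch; the e^{jkr} sign convention and the sound speed c = 343 m/s are assumptions, not values stated in the text.

```python
import numpy as np

def greens_function(r, r_src, k):
    """Free-field Green's function G(r|r') = e^{jk|r-r'|} / (4*pi*|r-r'|),
    i.e., the transfer function from a monopole source at r_src to position r."""
    d = np.linalg.norm(np.asarray(r, dtype=float) - np.asarray(r_src, dtype=float))
    return np.exp(1j * k * d) / (4 * np.pi * d)

# wavenumber k = omega / c at 1 kHz, assuming c = 343 m/s
k = 2 * np.pi * 1000.0 / 343.0
# transfer function from a source 2 m in front of the origin
g = greens_function([0.0, 0.0, 0.0], [0.0, -2.0, 0.0], k)
```

The magnitude of g decays as 1/(4π|r − r′|), so here |g| = 1/(8π), independent of frequency; only the phase depends on k.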
If the captured sound field can be represented as in Eq. (2), the driving signals of the secondary sources used to reproduce the sound field can be uniquely obtained [4,13]. The source component p_i(r) is reproduced with model-based sound field synthesis methods [14,15], and the WFR filtering method can be applied to reproduce the plane-wave component p_h(r) [7,8,9]. As an example, we here consider planar distributions of receivers and secondary sources. The sound pressure distribution is obtained on the x-z plane at y = 0. The plane-wave component p_h(r) can be described using its spatial frequency spectrum on the receiving plane, P_h(k_x, k_z), where k_x and k_z respectively denote the spatial frequencies with respect to x and z. All the source and plane-wave components are assumed to be in the region y < 0. When the secondary sources are also arranged on a plane and are assumed to be monopole sources, the driving signals of the secondary sources can be obtained as in Ref. [4]. The resulting driving signal d(r) in Eq. (5) can be regarded as a combination of the synthesis of the source components Q(r) by WFS [14] and the WFR filtering of p_h(r) [7]. If the decomposition of the captured sound field into source and plane-wave components is achieved, the dominant component will lie in p_i(r); therefore, d(r) obtained by Eq. (5) will be more accurate than that obtained on the basis of plane-wave decomposition, particularly at high frequencies [4].

Sparse decomposition based on discrete model
A major problem is how to decompose the observed sound pressures into the source and plane-wave components as in Eq. (2). We apply sparse decomposition algorithms under the assumption that the distribution of the source components inside Ω is spatially sparse.
To address the sound field decomposition problem (Eq. (2)) as a sparse representation problem, the region Ω is discretized as a set of grid points. Omnidirectional microphones are discretely arranged inside Ω to capture the sound pressures. The numbers of microphones and grid points are denoted as M and N, respectively. We assume N ≫ M because the grid points should entirely and densely cover the region Ω. The discrete form of Eq. (2) can be represented as

y = Dx + z,  (6)

where y ∈ ℂ^M and x ∈ ℂ^N respectively denote the signals received by the microphones and the distribution of the source components at the grid points, z ∈ ℂ^M is the plane-wave component, and D ∈ ℂ^{M×N} is the dictionary matrix of the source components whose elements consist of the Green's functions (Eq. (3)) between the grid points and the microphones. We assume that x is the dominant component of y and z is the residual. Since it can be assumed that only a few source components exist in Ω, a small number of elements of x have nonzero values. Therefore, the decomposition of y into x and z can be achieved by solving the following optimization problem:

min_x ∥y − Dx∥₂² + λ∥x∥_p^p,  (7)

where ∥x∥_p (0 < p ≤ 1) is the ℓ_p-norm of x, which corresponds to the penalty term used to induce sparsity of the solution x, and λ is a parameter that balances the approximation error and the sparsity-inducing penalty. Several algorithms for solving Eq. (7) have been proposed [16]. Although Eq. (6) represents the signal model of a single frequency bin and a single time frame, it is possible to exploit several group-sparse models arising from the physical properties of the sound field [13].
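As a concrete, simplified illustration of this decomposition, the convex case p = 1 can be handled with the iterative soft-thresholding algorithm (ISTA). This is a minimal sketch under that assumption, not the algorithm of Ref. [16]; D is an arbitrary complex dictionary and the plane-wave component is taken as the residual.

```python
import numpy as np

def ista(D, y, lam, n_iter=300):
    """Minimize ||y - D x||_2^2 + lam * ||x||_1 by iterative soft-thresholding.
    The plane-wave component is then taken as the residual z = y - D x."""
    L = 2 * np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1], dtype=complex)
    for _ in range(n_iter):
        grad = 2 * D.conj().T @ (D @ x - y)    # gradient of the data-fit term
        u = x - grad / L                       # gradient step
        # complex soft-thresholding: shrink the magnitude, keep the phase
        x = np.exp(1j * np.angle(u)) * np.maximum(np.abs(u) - lam / L, 0.0)
    return x
```

With the step size 1/L, each iteration is guaranteed not to increase the objective, so the data-fit residual of the returned x is never worse than that of x = 0.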

Multipole dictionary to reduce discretization error

Theoretical interpretation of the use of multipole dictionary
When Ω is discretized into grid points, a dictionary of monopole sources is not sufficient to represent acoustically compact sources with complex directivity, particularly when the true source location is not exactly on the grid. We show that the use of a discrete monopole dictionary can be regarded as the approximation of the Green's function by a zeroth-order Taylor expansion. Therefore, the approximation error of the discretization of Ω can be reduced by using higher orders. We propose a sparse sound field decomposition method using a dictionary matrix consisting of multipole sources, which corresponds to the use of higher orders of the Taylor expansion.
We separate the region Ω into a set of small regions Ω_n (n ∈ {1, …, N}) as in Fig. 2. The location of a representative point of Ω_n is denoted as r_n. Then, the source component p_i(r) is given by

p_i(r) = Σ_{n=1}^{N} ∫_{Ω_n} G(r|r′) Q(r′) dr′.  (8)

We here consider the approximation of G(r|r′) by a Taylor expansion with respect to r_n as follows:

G(r|r′) = Σ_{l_x=0}^{∞} Σ_{l_y=0}^{∞} Σ_{l_z=0}^{∞} [(x′ − x_n)^{l_x} (y′ − y_n)^{l_y} (z′ − z_n)^{l_z} / (l_x! l_y! l_z!)] ∂^{l_x+l_y+l_z} G(r|r′) / ∂x′^{l_x} ∂y′^{l_y} ∂z′^{l_z} |_{r′=r_n},  (9)

where l_x, l_y, and l_z are referred to as the expansion orders. This equation corresponds to the multipole expansion of the Green's function. By truncating the expansion of G(r|r′) at the orders L_x, L_y, and L_z, Eq. (8) can be approximated as

p_i(r) ≈ Σ_{n=1}^{N} Σ_{l_x=0}^{L_x} Σ_{l_y=0}^{L_y} Σ_{l_z=0}^{L_z} x_{n,l_x,l_y,l_z} ∂^{l_x+l_y+l_z} G(r|r′) / ∂x′^{l_x} ∂y′^{l_y} ∂z′^{l_z} |_{r′=r_n},  (10)

where x_{n,l_x,l_y,l_z} denotes the expansion coefficients. The expansion coefficients will have nonzero values only for the Ω_n including the sources. Since Q(r) is assumed to be sparse, this leads to a group-sparse structure in the discrete model as described in the next section. When L_x, L_y, and L_z are set as 0, p_i(r) becomes a linear combination of monopole sources, as in Eq. (6), by using the grid points at r_n.
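A minimal sketch of how such a dictionary might be assembled for the configuration used later (monopole plus x- and y-dipoles per grid point, i.e., the zeroth- and first-order Taylor terms). The source-position derivatives are approximated here by central finite differences rather than the analytic expressions; the step size h is an arbitrary assumption.

```python
import numpy as np

def greens(r, r_src, k):
    # free-field Green's function e^{jk|r-r'|} / (4*pi*|r-r'|)
    d = np.linalg.norm(r - r_src)
    return np.exp(1j * k * d) / (4 * np.pi * d)

def multipole_dictionary(mics, grid, k, h=1e-4):
    """For each grid point r_n, stack a monopole column and x-/y-dipole columns
    (the first-order Taylor terms of the Green's function); the derivatives with
    respect to the source position are approximated by central finite differences."""
    D = np.zeros((len(mics), 3 * len(grid)), dtype=complex)
    ex, ey = np.array([h, 0.0, 0.0]), np.array([0.0, h, 0.0])
    for n, rn in enumerate(grid):
        for m, rm in enumerate(mics):
            D[m, 3 * n] = greens(rm, rn, k)                                            # monopole
            D[m, 3 * n + 1] = (greens(rm, rn + ex, k) - greens(rm, rn - ex, k)) / (2 * h)  # x-dipole
            D[m, 3 * n + 2] = (greens(rm, rn + ey, k) - greens(rm, rn - ey, k)) / (2 * h)  # y-dipole
    return D
```

For a microphone broadside to a grid point, the x-dipole column vanishes by symmetry while the y-dipole (along the line of sight) does not, which reflects the directivity the extra columns add.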

Sparse decomposition with multipole dictionary
We describe a discrete model based on Eq. (10) including group sparsity with respect to multiple time frames and frequencies as in Ref. [13]. Using subscripts indicating the indexes of the time frame t ∈ {1, …, T} and frequency f ∈ {1, …, F}, the following equation can be obtained:

y_{t,f} = Σ_{l=1}^{L} D_{f,l} x_{t,f,l} + z_{t,f},  (11)

where l (∈ {1, …, L}) is the index of the multipole components combining l_x, l_y, and l_z. The dictionary matrix D_{f,l} consists of the lth pole of the Green's function between r_m and r_n of the fth frequency, where G_f(·) is the Green's function (Eq. (3)) of the fth frequency. By stacking the signals over time frames, frequencies, and multipole indexes and redefining the variables as y ∈ ℂ^{MTF}, x ∈ ℂ^{NTFL}, and z ∈ ℂ^{MTF}, they can still be related by a linear equation as in Eq. (6). Each x_{t,f,l} should have the same sparse structure. Decomposition exploiting this group-sparse structure can be achieved by solving the following optimization problem:

min_x ∥y − Dx∥₂² + λ J_{p,2}(x),  (13)

where J_{p,2}(x) is a groupwise diversity measure of x defined as

J_{p,2}(x) = Σ_{n=1}^{N} ( Σ_{t,f,l} |x_{n,t,f,l}|² )^{p/2},  (14)

where x_n ∈ ℂ^{TFL} is the nth vector component of each group and x_{n,t,f,l} is an element of x_n. The optimization problem in Eq. (13) can be solved by using the iteratively reweighted least-squares algorithm proposed in [13], which can be regarded as an extension of the FOCUSS algorithm [16,17]. The source and plane-wave components are separately converted into the driving signals of the loudspeakers as in Eq. (5). The sum of these signals is the final output of the loudspeakers used for reproduction.
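One concrete reweighting scheme of this kind can be sketched as follows. This is an illustrative IRLS loop, not the exact algorithm of Ref. [13]: each group shares one weight derived from its current ℓ2 norm, and the resulting weighted ridge problem is solved in closed form via the matrix inversion lemma.

```python
import numpy as np

def group_irls(D, y, lam, groups, p=1.0, n_iter=30, eps=1e-8):
    """Approximately minimize ||y - D x||_2^2 + lam * J_{p,2}(x) by iteratively
    reweighted least squares: all coefficients in a group share one weight based
    on the group's current l2 norm, and each weighted ridge subproblem is solved
    in closed form."""
    M, N = D.shape
    x = D.conj().T @ y                         # crude initialization
    for _ in range(n_iter):
        winv = np.empty(N)
        for g in groups:                       # inverse weight per group;
            winv[g] = (np.linalg.norm(x[g]) ** 2 + eps) ** (1 - p / 2)
        Dw = D * winv                          # D @ diag(winv)
        a = np.linalg.solve(Dw @ D.conj().T + lam * np.eye(M), y)
        x = winv * (D.conj().T @ a)            # x = diag(winv) D^H a
    return x
```

The small eps keeps inactive groups from producing exact zeros in the weights, which would otherwise make the reweighted system singular; groups whose norm shrinks across iterations are progressively suppressed, giving the group-sparse solution.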

Experiments
Numerical simulations were performed using linear arrays of microphones and loudspeakers. We compared the proposed method, the method using only the monopole dictionary, and the WFR filtering method, which are denoted as "Proposed", "Mono", and "WFR", respectively.
Linear microphone and loudspeaker arrays were set along the x-axis, centered at the origin, in the recording and target areas, respectively. The number of microphones M was 32 and the interval between the microphones was 0.06 m. The number of loudspeakers was 48 and their interval was 0.04 m. Therefore, the spatial Nyquist frequency of the microphone array was about 2.8 kHz and that of the loudspeaker array was about 4.3 kHz. The array lengths of both the microphone and loudspeaker arrays were 1.92 m. The directivity of the microphones and loudspeakers was assumed to be omnidirectional.
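As a quick check, the quoted spatial Nyquist frequencies follow from f = c/(2d) for an inter-element spacing d; the sound speed c = 343 m/s assumed below is not stated in the text.

```python
# Spatial Nyquist frequency of a uniform linear array: f_nyq = c / (2 * d),
# where d is the inter-element spacing. c = 343 m/s is an assumed value.
c = 343.0
f_mic = c / (2 * 0.06)   # microphone array, 0.06 m spacing -> about 2.8 kHz
f_ls = c / (2 * 0.04)    # loudspeaker array, 0.04 m spacing -> about 4.3 kHz
```

This is why Proposed can remain accurate up to roughly 4.3 kHz: the denser loudspeaker array, not the microphone array, then sets the limiting spacing.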
In Proposed and Mono, the two-dimensional region Ω was set to be a square of 2.4 × 2.4 m² centered at (0.0, −2.0, 0.0) m on the x-y plane at z = 0. The number of grid points was 25 × 13, and they were set at intervals of 0.1 m in the x direction and 0.2 m in the y direction. The number of time frames T was 8. The number of positive-frequency bins F was set as 128 and the sampling frequency was 16 kHz. For the multipole components, a monopole and dipoles in the x and y directions were used for the dictionary D; therefore, L was set to 3. In Mono, only the monopole dictionary was used.
The sound pressure distributions in the target area were simulated in a 2.1 × 2.4 m² region on the x-y plane at z = 0 with the center at (0.0, 1.0, 0.0) m. The intervals between the simulated positions were 0.015 m. The amplitudes were normalized using the average squared amplitude in the region y ≥ 0.5 m. The overall reproduction accuracy was evaluated using the signal-to-distortion ratio of reproduction (SDRR), defined as

SDRR = 10 log₁₀ [ Σ_{i,j,k} |p_org(x_i, y_j, t_k)|² / Σ_{i,j,k} |p_rep(x_i, y_j, t_k) − p_org(x_i, y_j, t_k)|² ],

where p_org(·) and p_rep(·) are the original and reproduced sound pressure distributions in the time domain, respectively, (x_i, y_j) denotes the discrete positions in the simulated region, and t_k denotes the discrete time. The total number of time samples was set to 160, i.e., the duration was 10 ms.

Figure 3 shows the relationship between the frequency of the source signal and the SDRR when a point source was located at (−0.32, −0.84, 0.0) m. Note that the primary source was not exactly on a grid point. To choose λ for Proposed and Mono, we calculated the SDRR for λ from 0.001 to 0.01 at intervals of 0.001 at 4.0 kHz; the λ giving the highest SDRR was then chosen for each method. The SDRRs of the three methods remained high up to the spatial Nyquist frequency of the microphone array. For WFR, the SDRR decreased sharply above this frequency. The SDRR of Mono deteriorated above the spatial Nyquist frequency but was maintained to some extent. On the other hand, the SDRR of Proposed remained high up to 4.3 kHz, which is approximately the spatial Nyquist frequency of the loudspeaker array.
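The evaluation metric can be sketched as a short function, assuming the conventional form of a signal-to-distortion ratio (energy of the original field over the energy of the reproduction error, in dB):

```python
import numpy as np

def sdrr(p_org, p_rep):
    """Signal-to-distortion ratio of reproduction in dB: energy of the original
    sound pressure over the energy of the reproduction error, summed over all
    evaluation positions and time samples."""
    err = np.asarray(p_rep) - np.asarray(p_org)
    return 10.0 * np.log10(np.sum(np.abs(p_org) ** 2) / np.sum(np.abs(err) ** 2))
```

A perfectly reproduced field gives an infinite SDRR, and a uniform 10% amplitude error gives 20 dB, which roughly calibrates the values reported in Figs. 3 and 4.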
To demonstrate the effect of the multipole components in the off-grid case, the SDRR when a single primary source was located inside a square region of 0.4 × 0.4 m² centered at (−0.2, −1.0, 0.0) m is shown in Fig. 4. The SDRRs were calculated by the three methods at intervals of 0.01 m inside the region, and the frequency of the source signal was 4.0 kHz. The cross marks indicate the locations of the grid points. The SDRRs of WFR were low regardless of the location of the primary source. Although the SDRRs of Mono were high when the primary source was located around a grid point, they decreased between the grid points. High SDRRs were maintained over the entire region for Proposed; therefore, the multipole dictionary is effective in off-grid cases.

Conclusion
A sparse sound field decomposition method using a multipole dictionary was proposed. In the method previously proposed by the authors, the near-field region is discretized as grid points and a sparse decomposition algorithm is applied, where the dictionary of the source components consists of the transfer functions of monopole sources between the grid points and the microphones. However, the monopole dictionary is not sufficient to represent acoustically compact sources with complex directivity. We proposed the use of a multipole dictionary for sparse sound field decomposition. The multipole dictionary is effective for representing source components off the grid, which can be interpreted as the use of higher orders of the Taylor expansion of the Green's function. Numerical experiments were conducted to show the effectiveness of the multipole dictionary.

Figure 1 :
Figure 1: Generative model of sound field consisting of near-field source and far-field plane-wave components.

Figure 2 :
Figure 2: Discretization of Ω to treat sound field decomposition as sparse representation problem.

Figure 3 :
Figure 3: Relationship between frequency of source signal and SDRR when a point source was located at (−0.32, −0.84, 0.0) m.

Figure 4 :
Figure 4: Relationship between source location and SDRR. Cross marks indicate grid points.