Compressible flow simulations of voiced speech using rigid vocal tract geometries acquired by MRI

Powered by TCPDF (www.tcpdf.org) This material is protected by copyright and other intellectual property rights, and duplication or sale of all or part of any of the repository collections is not permitted, except that material may be duplicated by you for your research use or educational purposes in electronic or print form. You must obtain permission for any other use. Electronic or print copies may not be offered, whether for sale or otherwise to anyone who is not an authorised user. Schickhofer, Lukas; Malinen, Jarmo; Mihaescu, Mihai


I. INTRODUCTION
Voice production and the expression of sounds via speech are crucial human traits and essential for our communication. The primary source of this process is the oscillation of the vocal folds, leading to a pulsation of the sub-and supraglottal pressure. This causes periodic pressure fluctuations in the vocal tract, which are propagated into the far field through the mouth opening.
The human voice production is therefore often successfully modeled as a source-filter arrangement with the glottis acting as an acoustic source and the vocal tract as a filter (Fant, 1971;Stevens, 2000;Titze and Alipour, 2006). As a result of reflections, constructive and destructive interference of acoustic waves, the signal exiting the upper airways is modulated in such a way that certain distinct tonalities appear in the spectral envelope of acoustic pressure fluctuations. Various parts of the upper airways produce resonant patterns at different frequencies, leading to pronounced peaks in the far-field acoustic spectrum. These peaks are referred to as formants, and their position in the Fourier spectrum characterizes the produced sound. In the case of vowels, particularly the first two formants, F 1 and F 2 , identify the spoken tone. During speech, the air volume of the vocal tract cavities is continuously manipulated through muscular flexion and extension, allowing the vocalization of various sounds through articulation. Through phonation, the harmonic component of the spoken sound, as a result of the periodic vocal fold modulation of the glottal airflow, is the dominant soundgenerating mechanism at lower frequencies (Klatt and Klatt, 1990). In this range it overshadows any inharmonic noise component that occurs due to flow turbulence. However, application of the quasisteady approximation for aerodynamic sound shows that the acoustic contribution of flow structures and turbulence becomes important at frequencies higher than 2 kHz (Howe and McGowan, 2007;Zhang et al., 2002). The acoustic contribution in the high-frequency range of several kilohertz is due to the fact that the vortical structures present in the supraglottal region are small and of quadrupole source type that becomes negligible compared to the dominant sources of monopole-and dipole-types at lower frequencies (Zhao et al., 2002). Zhang and Neubauer (2010) therefore hypothesized that variations in the supraglottal flow field barely impact low-frequency acoustics. Although the unique tonal signature of a spoken sound is primarily determined by the first two formants, the higher harmonics and frequencies are responsible for the personal quality of the voice (Sundberg, 2005). For a complete understanding of voice production covering the full range of audible frequencies and dominant modes between 20 Hz and 20 kHz, both harmonic and inharmonic factors of broadband type must therefore be taken into account. Disturbances introduced by the flow instabilities need to be accurately computed if the high-frequency part of the voice is to be considered. Moreover, it is unclear to what extent both intra-and supraglottal flow structures contribute to the overall acoustic signal and, thereby, to phonation in general (Zhang, 2016).
Computational fluid dynamics (CFD) of phonation has emerged as a valuable tool to treat such questions. Indeed there exists no method to spatially and temporally resolve the flow and pressure fields in the vocal tract during human phonation in vivo. Experiments both on elastic, selfoscillating glottal physical models (Murray et al., 2014;Neubauer et al., 2007;Syndergaard et al., 2017;Triep and Br€ ucker, 2010), as well as on ex vivo canine larynx models (Khosla et al., 2007;Oren et al., 2014) give clear indications on the complex, unsteady, three-dimensional (3D) flow field arising in the supraglottal region. These studies show effects such as secondary flows, recirculation regions, asymmetric attachment of the glottal jet to one of the airway walls, flow instabilities, and transition to turbulence. The shedding of vortices and downstream convection of instabilities was observed to increase during the closing phase of the glottal cycle or while the vocal folds take on a divergent shape. Haigermoser (2009) and Koschatzky et al. (2011) used near-field flow data from particle image velocimetry (PIV) to estimate the contribution of the acoustic source terms for rectangular cavity flows. Lodermeyer et al. (2018) consequently applied a similar approach on PIV data of the downstream region of a self-oscillating, physical vocal fold model and calculated the acoustic source terms of Lighthill's equation. All these observations support the conclusion that any accurate description of flow and acoustics in the vocal tract needs to be sufficiently well-resolved in time and space to capture these phenomena that are excluded from traditional phonation models based on the Bernoulli equation.
CFD has been applied within the biomechanics of phonation to fill that gap by computing the unsteady flow field and correctly predicting the flow characteristics (de Luzan et al., 2015;Mihaescu et al., 2010). Simplified geometries of the glottal region have been used in these studies to investigate the frequency content of vorticity generated downstream of the glottis. Sidlof et al. (2015) and Z€ orner et al. (2016) used CFD simulations for the calculation of nearfield aerodynamic fluctuations, which were then imposed on the acoustic far-field through Lighthill's analogy and acoustic perturbation equations. Thus, they were able to link the aerodynamic sources in the glottal area to the acoustic region.
Recently, magnetic resonance imaging (MRI) measurements can be used for the analysis and setup of anatomically realistic computational geometries for numerical simulations of the flow inside the human upper airways. Story et al. (1996) used MRI data for the extraction of cross-sectional area functions of the vocal tract during phonation of steady vowel sounds and thereby measured the internal resonant air volume. Blandin et al. (2015) and Arnela et al. (2016) could show that for frequencies below 4-5 kHz vocal tract simplifications show only weak influence on the acoustic response due to the predominant plane wave propagation in the lowfrequency range. However, for higher-order modes, they found a strong dependence on the geometrical representation of the vocal tract. It was found that only detailed, 3D geometries of the vocal tract are able to correctly emulate the high-frequency behavior. Realistic geometries can properly replicate the physical and physiological conditions inside a human vocal tract and mimic its behavior as an acoustic filter to give accurate low-and high-frequency tonalities in the far-field sound spectrum and their associated sound pressure levels (SPLs). Furthermore, it has been shown that numerical simulations in the upper airways provide valuable insight into the physical consequences a surgical intervention might have on a patient (Mylavarapu et al., 2013). CFD has become a promising method to assess the effectiveness of invasive measures to alleviate medical conditions such as obstructive sleep apnea through computation of parameters such as the airway resistance and the pressure drop along the pharyngeal domain (Mihaescu et al., 2011). In speech production, numerical flow simulations based on MRI data have the potential to motivate clinical decision-making for the therapy of vocal disorders. An accepted theory with voiceless sounds, whispered vowels, and fricatives is that the noise generated through the turbulent glottal jet and its interaction with the surrounding airway constitutes the source for sound (Stevens, 2000). For patients whose sustained vocal fold oscillation is inhibited due to paralysis, paresis, or cancer, one can use this information to enable whisper-like sounds via treatments. A major motivation for this work and research on the biomechanical and acoustical aspects of the human upper airways is therefore not only the advancement of knowledge about flow-induced voice generation, but also the large prevalence of vocal disorders. Voice disorders account for up to 25% of all occupational diseases (Sliwinska-Kowalska et al., 2006). About 30% of the common population are estimated to be affected by a voice production disorder at some point in their life with an increased incidence of more than 60% among professions with extensive voice use, such as teachers and professors (Mittal et al., 2013).
The goal of this study is an accurate computation of the whole audible frequency range of voiced speech under articulation of natural vowels. This requires resolution of both the harmonic component due to the regular pressure fluctuations by glottal modulation, as well as the inharmonic broadband component due to flow instabilities and turbulence, which is typically neglected by traditional models of voice acoustics based on wave equations. In order to capture possible high-frequency cross-modes, 3D MRI geometries are applied.

A. Geometry and mesh
The computational geometries for the Finnish vowels and [u] were acquired by MRI from a 30-yrold, healthy male subject following the experimental arrangements described by Aalto et al. (2011). The imaging was carried out using Siemens Magnetom Avanto 1.5T scanner (Siemens Medical Solutions, Erlangen, Germany) running 3D VIBE (volumetric interpolated breath-hold examination) MRI sequence (Rofsky et al., 1999).
The subject was firmly aligned by soft material inside the head and neck MRI coil in order to allow only movement of the muscles involved in the voice generation process (cf. Fig. 1). Using the scanner settings detailed in Aalto et al. (2014, Sec. 3.2), the imaging of a single vowel geometry took ca. 8 s using 1.8 mm isotropic voxels. Due to restrictions of the MRI technology, the computational geometries do not include teeth. The subject produced prolonged vowels at the fundamental frequency f o ¼ 104 Hz for ca. 12 s in supine position. In addition to the reference pitch signal, the subject could also hear his own, analogically denoised utterance with a delay of approximately 90 ms through the standard Siemens MRI headphones (Siemens Medical Solutions, Erlangen, Germany). During MRI, the acoustic noise level within the scanner is ca. 90 dB (SPL), and the headphones provide 26 dB nominal damping of this external noise. The vowel production time exceeds the time required for MRI in order to guarantee the stability of phonation during imaging and obtain a speech sample right before and after the imaging that is free of acoustic scanner noise.
The surface model of the air-tissue interface was extracted from 3D MRI voxel data by an earlier version (Aalto et al., 2013) of the automatic vocal tract segmentation software published by Ojalammi and Malinen (2017). The 3D numerical mesh of the fluid domain is unstructured and consists of polyhedral cells ranging from the airway channels to the far-field domain. The geometry in the intraglottal region is the result of averaging through the MRI recording, as the vocal fold motion is too fast to be resolved in time. Nevertheless, the whole domain is anatomically accurate and no geometrical modifications have been applied. Additional local mesh refinement has been applied in the intra-and supraglottal region, where velocity and pressure gradients in the flow are expected to be largest close to the shear layers of the glottal jet (cf. Fig. 2). Three consecutive refinement stages are present in the supraglottal region with a decrease of the base cell size by 20% at each stage. The grids for the different vowel articulations used in this study have a resolution with an approximate cell count of 7.5 Â 10 6 based on the grid convergence study detailed in Sec. II F. The chosen mesh sizes are significantly above the minimum requirement of the solver of 40 cells per wavelength for the highest relevant frequencies of phonation, and are therefore suitable for extracting the important acoustic information. The airflow is considered as compressible and modeled in the simulations by the ideal gas law p ¼ qRT, relating gas pressure p and density q via the specific gas constant of air R ¼ 287.06 J/(kgK) at room temperature T % 300 K.

B. Vowel formant and Helmholtz resonance data
Two kinds of comparison data were used to validate the results of the CFD computations: formant frequency data extracted from the speech sample recorded during the MRI and data obtained through solution of the Helmholtz eigenvalue problem based on the same computational geometry (cf. Sec. II F). Following the notation introduced by Titze et al. (2015), the formants are generally denoted by F i , with index i ¼ 1,…,n. Unless explicitly stated otherwise, the formant data F i in this study has been extracted through the unsteady flow simulations. As explained by Aalto et al. (2011), noise free samples were recorded inside the scanner right before and after the imaging using a two-channel acoustic sound collector and waveguides. The frequency response of the instrumentation was numerically compensated from the speech samples, and the formant analysis was carried out by linear predictive coding (LPC) in MATLAB (MathWorks, Natick, MA). The peaks of the resulting spectral envelope were inspected manually, and values for F 1 and F 2 were determined by comparisons, when necessary, with corresponding speech data that were recorded from the same subject in an anechoic chamber (Aalto et al., 2012, Table II). The lowest eigenvalues from the Helmholtz problem were computed by the finite element method (FEM) with piecewise linear shape functions and tetrahedral meshes. A Dirichlet boundary condition is used at the mouth, leading to discrepancy with the values of F 1 and F 2 measured from speech as discussed in Aalto et al. (2014, Sec. 5.2). The resonance frequencies corresponding to F 1 and F 2 can be found in Aalto et al. (2012, Table I).

C. Glottal flow and boundary conditions
Various waveforms have been suggested as a realistic model of the glottal source (Fant et al., 1985;Fujisaki and Ljungqvist, 1986;Rosenberg, 1971). For a comparative analysis of several different waveform models the reader is referred to the work by Fujisaki and Ljungqvist (1986). In the current study, the Rosenberg waveform f R (t) ¼ f p (t) þ f n (t) of the glottal pulse is applied to account for the dynamic behavior of the vocal folds (Rosenberg, 1971). It models the volume velocity through the glottal constriction and consists of the function for its positive slope and the function for its negative slope with the pulse amplitude A 0 . The relative duration of the glottal opening t p with respect to the glottal period t 0 is referred to in the following as glottal opening quotient. The glottal pulse with the parameter values used for this study is shown in Fig. 3. These are the fundamental frequency of f o ¼ 120 Hz, characteristic for an average male voice, and the glottal opening quotient of t p /t 0 ¼ 0.5525. The ratio of the time in which the glottis is opening to the total pulse length is t p /(t p þ t n ) ¼ 0.85. The resulting open quotient (t p þ t n )/t 0 ¼ 0.65 of this study lies in the range of typical values for voiced speech (Mittal et al., 2013). The pressure amplitude at the glottal entrance is chosen to be the mean subglottal pressure P 0 ¼ P s % 300 Pa. This value is at the lower end of typical physiological subglottal pressures of voiced speech and coincides with a relatively low glottal frequency of the waveform and a volume velocity amplitude of A 0 ¼ 2.2 Â 10 À4 m 3 /s (0.22 l/s) in Eqs.
(1) and (2) (Titze and Alipour, 2006). The vocal tract is assumed static and rigid. The interface between airway walls and airflow is modeled as solid walls with a Dirichlet boundary condition, u i ¼ 0, referred to as no-slip boundary condition, for the velocity components of the fluid. Free stream and acoustically non-reflecting boundary conditions are imposed at the inlet of the tubular vocal tract (i.e., at the subglottal end) and at the outlet surfaces of the computational domain located about 15 cm away from the mouth in the acoustic region, where the fluid characteristics are specified by the Mach number, atmospheric pressure, and room temperature.

D. Numerical method
For a direct computation of both the velocity field in the vocal tract and pressure fluctuations in the near and far field, compressible flow simulations using the finite volume method (FVM) are performed in this work. Thus, the behavior of the fluid is described by the Navier-Stokes equations in compressible form. Body forces like gravity are not considered, as their effect on airflow in the upper airways can be neglected. With the current study, handling turbulence by means of the large eddy simulation (LES) approach, the large, energy-containing scales are resolved down to the Taylor microscale in the inertial subrange. The resolution down to scales in the inertial subrange is an important feature, in particular when acoustic pressure fluctuations are of interest. LES applies a filtering operation to the flow variablẽ wðr; tÞ Ð V wðr 0 ; tÞGðr;r 0 Þdr 0 by use of the filter function Gðr;r 0 Þ ¼ Gðr Àr 0 Þ that acts as a convolution kernel. Thus, the quantity can be decomposed into a filtered part and a subfiltered one. The cutoff value of resolved scales is dependent on the numerical cell size with the subgrid scales of the flow being low-pass filtered by the computational grid. Therefore, the filter width is directly related to the spatial resolution of the mesh. The unresolved scales of the flow below that value are computed by the wall-adapting local eddy-viscosity (WALE) subgrid scale model after Nicoud and Ducros (1999), which relates the subgrid scale viscosity to the grid filter width, density, and resolved velocity field. For the time-dependent computations of this study, the spatially filtered Navier-Stokes equations are calculated for a numerical mesh size small enough to accurately handle the complex geometry of the upper airways and resolve the most important energy-containing flow structures.
Following the approximation by Pope (2001), the Taylor microscale is estimated as with a kinematic viscosity of air of % 1.79 Â 10 À5 m 2 /s, a glottal width of d % 2.5 Â 10 À3 m, and a maximum axial velocity of w max % 21.98 m/s. The integral length scale l ¼ 0.1d is assumed to be one order of magnitude smaller than the characteristic length scale of the geometry in the glottal region, which is the glottal width. Furthermore, the turbulence intensity is estimated at 10%. With the established value of the Taylor microscale in Table I, the spatial dimensions of the grid are chosen to meet its order and are consequently refined toward the supraglottal region in steps of 60%, 40%, and 20% of the maximum volume size of 9.62 Â 10 À4 m in the far field. This way we ensure that the criterion of the Taylor microscale is met by the grid resolution in the whole domain. Additional characteristic parameters of the flow are given in Table I. Their values lie in the range of a typical male voice (Stevens, 2000). Grid spacing Dx and time step Dt are chosen to satisfy the Courant-Friedrichs-Levy (CFL) criterion. The convective Courant number CFL c ¼ u i Dt/Dx and the acoustic one CFL a ¼ c 0 Dt/Dx are related via the Mach number CFL c / CFL a ¼ u i /c 0 ¼ M. Since the Mach number of the considered flow regime is small (M ( 1), the convective Courant number becomes the critical parameter for the choice of grid and time step sizes to capture all the aerodynamic fluctuations in the source region (CFL c ( CFL a ). The maximum time step of the computations is chosen as Dt max ¼ 20 ls, sufficiently small to resolve the full Fourier spectrum of frequencies relevant for phonation. The near-wall regions are resolved accordingly such that a non-dimensional, perpendicular wall distance of y þ ¼ ðyu Ã Þ= ¼ ðy ffiffiffiffiffiffiffiffiffiffi s w =q p Þ= 1, depending on the wall shear stress s w and the kinematic viscosity , is obtained. The fluctuations in velocity and pressure are computed in a coupled manner. The convective flux is computed by spatial discretization via a second-order bounded centraldifference scheme. Integration in time is performed via an implicit scheme and with a temporal discretization of second order consisting of an average of ten inner iterations to ensure convergence within each time step. The mean flow fields are extracted with a statistically significant sample size of more than 50 full glottal cycles and after any initial spurious reflections in the numerical domain have vanished. The high number of cycles is chosen to resolve also the lowfrequency component of the acoustic spectrum below the source frequency. The fluctuations in the flow variables are then computed by subtracting the mean values from the total instantaneous values, w 0 ¼ w À w.

E. Frequency-wavenumber spectrum
For an analysis of disturbances and their propagation velocities, two-dimensional Fourier transformation of the fluctuating pressure is performed along the centerline in the near-and far-field domains [see, e.g., Nussbaumer (2012)]. The time-space Fourier transform of data is computed aŝ with k 2 K and x 2 [0,1]. Furthermore, the phase velocity of disturbances is defined as c u ¼ fk ¼ f2p/k. Since the wavenumber k is subsequently plotted in the range [Àp,p], we use c u ¼ fp/k. Thus, the group velocity of disturbances can be computed via c g ¼ p@f/@k, with the slope of a frequencywavenumber diagram @f/@k.

F. Validation and verification
The LES approach has been initially validated for a simplified geometry of the intraglottal domain based on the M5 model introduced by Scherer et al. (2001), which is well defined through surface design equations. The results for the intraglottal pressure computed numerically are shown in Fig.  4 for both the flow and non-flow wall depending on the asymmetric deflection of the glottal jet toward one side. There is an overall very good agreement between the predicted and experimental pressures for all three transglottal pressures considered. The definition of flow and non-flow wall, respectively, has been the same as in the original study by Scherer et al. (2001). The largest deviations of the numerical results from the experimental reference data are observed for the flow wall at a transglottal pressure of 10 cmH 2 O with 0.009%, 1.1%, and 7.8% for the relative minimum, mean, and maximum values, respectively. The smallest deviations occur for the non-flow wall at 5 cmH 2 O with 0.001%, 0.8%, and 2.4% for the relative minimum, mean, and maximum values, respectively. Acoustic validation of the applied method with respect to the resolved formants is performed as well. For this purpose, using the numerical approach and the boundary conditions stated in Secs. II C and II D, flow and acoustics data corresponding to the vocal tract geometries relevant for the three vowels [A], [i], and [u] are calculated. Fouriertransformed signals of predicted far-field pressure fluctuations in a monitoring point P outside of the vocal tract (as depicted in Fig. 1) are compared with (i) resonance frequencies computed from the Helmholtz eigenvalue problem, based on the 3D wave equation solved in the same vowel geometries, (ii) experimental formant data extracted from speech recordings carried out simultaneously with the MRI recordings. Fig. 5. The dominant spectral envelope peaks from the simulations of the unsteady flow match with the frequency values obtained from the Helmholtz equation and speech recordings. Particularly the first formant F 1 is captured well for all vowel geometries. The second formant F 2 shows good agreement with the experimentally obtained formants, except for the vowel [i] in Fig. 5(b), where the formant does not appear as a pronounced peak in the farfield spectrum. There is a narrow constriction caused by the tongue position in the high front vowel [i], and it is likely to produce turbulence and large fluctuations in the flow, leading to high broadband noise that overshadows the peak at F 2 . However, a peak of the curve of local maxima is found at a frequency of $1900 Hz, which most likely corresponds to the formant F 2 and is plotted together with the formant data of all other vowels in Fig. 11. That the Helmholtz resonance data and the experimental formant corresponding to F 2 of [A] do not match is due to the overly simple boundary condition in the Helmholtz equation as discussed in Aalto et al. (2014, Sec. 5.2). Overall, the acoustic information is well captured through the direct compressible flow simulations and agrees with the experimental reference data. The solver used for this study has also been applied to study aeroacoustics related to fluidstructure interaction at low Reynolds numbers (Schickhofer et al., 2016).

The result can be seen for the three vowels [A], [i], and [u] in
Furthermore, the numerical solutions to the field variables are verified by a grid sensitivity study based on five sizes varying from a coarse mesh of 10 5 cells to a fine one of approximately 7.5 Â 10 6 cells for the geometry of vowel [A]. The grid convergence index (GCI) and discretization errors are calculated for the chosen grid sizes according to the procedure described by Celik et al. (2008), which has been shown to be equally valid for non-structured grids as the ones used for this study. The averaged relative GCI along a centerline through the glottis and supraglottal regions is  and [u] in (c). The Fourier spectrum of the acoustic pressure fluctuations obtained from the LES simulations is shown with the peak envelopes, the discrete values of the formant frequencies F 1 and F 2 obtained through experiments, and the Helmholtz resonance frequencies. Uncertainty of the experimental data is indicated by the grey regions. GCI p ¼ 1:7% for the mean pressure and GCI u ¼ 9:7% for the mean velocity of the finest grid. For the current study, the highest grid resolution of the sensitivity study was chosen for all analyzed vocal tract configurations.

A. Scales of flow fluctuations
The resolved frequency range for velocity and pressure fluctuations throughout the vocal tract is investigated by Fourier transformations of the time history data obtained at discrete points as indicated in Fig. 6. The pressure fluctuations, p 0 ¼ p À p, and the velocity fluctuations, u 0 i ¼ u i À u i , are computed by subtracting the mean values from the total instantaneous values. The locations of the probe points are chosen such that P1 is positioned close to the glottal constriction, while P2 and P3 are located in the laryngopharynx, or supraglottal domain, respectively, where flow instabilities close to the glottal jet and transition to turbulence in its shear layers are likely to occur. The probe P4 is well inside the oral cavity, while the far-field probe is located at a distance of 8.5 cm from the mouth opening.
The scaling of velocity and pressure distributions depends on the physical nature of the flow and whether there are free-stream conditions or boundary and shear layers present (George et al., 1984). Therefore, a change in the underlying scaling law is observed for spectra at locations away from the aerodynamic source region (cf. Fig. 7). Probe signals of the pressure fluctuations in the near field show a steeper slope of roughly f À11/3 , whereas spectra at probe positions in the supraglottal region and toward the acoustic far field follow f À7/3 . The f À11/3 component arises from shear interaction in the flow, while the f À7/3 component dominates where the approximation of free-stream conditions and turbulence-turbulence interaction holds. The velocity fluctuations in the near field follow roughly the scaling of f À5/3 , where the approximation of isotropic turbulence holds. Moving away from the source (i.e., glottis), one can observe deviations from this scaling law as can be noticed, e.g., for the velocity spectrum at location P3, located in the neighborhood of a larger rotating flow structure of low velocity.
Additionally, constrictions and expansions of the narrow airway, as well as the shear layers of the glottal jet are likely inducing anisotropy to the flow, which shows in deviations of the velocity spectra at locations P2 and P3 from the scaling of f À5/3 for the case of isotropic turbulence in the inertial subrange. Furthermore, the spectra tend to exhibit clear peaks of the source frequency and its higher harmonics up to roughly 1000 Hz at all considered probe positions. This can be seen by comparing the investigated spectra at the various positions in Fig. 7. The amplitude of the glottal frequency is damped out throughout the upper airways as the distance from the glottis is increasing, and becomes less pronounced outside of the vocal tract in the acoustic domain (cf. Fig. 12).

B. Flow instabilities
The resulting flow through the glottis under realistic conditions of the surrounding airway shows secondary flow patterns and strong recirculation regions in the supraglottal domain. This behavior is visualized in Fig. 8  Mm. 2. Animation of the axial velocity component in the sagittal plane section during a full glottal cycle. This is a file of type "gif" (18.1 MB).
Mm. 3. Animation of perpendicular vorticity component and isosurfaces in the coronal plane section. This is a file of type "gif" (18.6 MB).
Mm. 4. Animation of perpendicular vorticity component and isosurfaces in the sagittal plane section. This is a file of type "gif" (23.8 MB).
In order to establish the energy content and spectral features of disturbances in the near field and outside of the vocal tract in the far field, frequency-wavenumber spectra are shown in Figs. 10(a) and 10(b). Dashed lines are indicating the acoustic cone, which consists of lines at a slope corresponding to the speed of sound. Thus, any pressure disturbances lying on this cone can be considered purely acoustic. While the highest energy contribution in the far field is contained inside the acoustic domain of the spectrum at low wavenumbers, the near field shows the presence of disturbances propagating at approximately uniform velocity downstream of the vocal tract. The group velocity of these disturbances in the source region, indicated by their relative slope in the frequency-wavenumber spectra, is computed from the simulation data as c g ¼ 9.8 m/s. This agrees well with the theoretical result from linear stability analysis of the flow subjected to perturbations growing in time [see, e.g., Panton (2006)]. For a first approximation, we consider nearly parallel inviscid mean flows in the x-direction withũ ¼ ð u þ u 0 ; v þ v 0 ; w 0 Þ and p ¼ p þ p 0 and subjected to arbitrary disturbances composed of normal modes, and p 0 ¼pðyÞe iðaxþbzÀactÞ ; with the wavenumbers a and b in x-and z-direction, respectively, and the complex wave speed c ¼ c R þ ic I . The normal-mode disturbances are thus traveling waves of amplitudes that depend on the y-direction, perpendicular to the mean flow direction. Under these assumptions, the condition for stability is c I < 0. Additionally, the dispersion relation of the system can be solved for the angular frequency x and the exponent of the temporal growth rate e ac I t becomes Furthermore, the phase velocity of the fastest moving disturbances caused by Kelvin-Helmholtz instabilities is predicted as half of the mean velocity of the jet, and thus $11 m/s for the considered conditions of this study with the mean velocity u of the glottal jet in streamwise direction. Following Squire (1933), any unstable 3D disturbance has a corresponding two-dimensional disturbance that is more unstable. Thus, the above approximation for propagating instabilities holds also for the 3D cases considered in this study.
Cuts through the frequency-wavenumber diagrams reveal the energy content at discrete frequencies [cf. Figs. 10(c) and 10(d)]. The broadband noise component is significantly larger in the near field than in the far field at all considered tonalities f 0 , F 1 , and F 2 . Moreover, the inharmonic fluctuations become more prominent toward higher frequencies of 1000 Hz and above.

A. Formants
The considered vocal tract geometries for the production of the vowels [A], [e], [i], [o], and [u] are investigated in terms of their resonance frequencies (i.e., formants), by computing the spectra of pressure fluctuations in the acoustic domain. Figure 11 gives an overview of the results for F 1 and F 2 . The overall trend of formant frequencies from high to low vowels, as described by Stevens (2000), is well captured by the current direct approach and agrees with the experimental data. From the high vowel [i], over the intermediate vowel [e], to the low vowels [A], [o], [u], the formant F 2 shows a steady decrease. An exception is the frequency F 2 for vowel [e], which is slightly higher than for vowel [i] and possibly due to overestimation from the numerical spectrum. Formant F 1 on the other hand shows an increase toward the intermediate and then further decrease toward the low vowel range. Figure 12 displays the resulting power spectral density for the near-and far-field pressure fluctuations of the vowel [A] at various locations throughout the vocal tract. It can be recognized from Fig. 12 that the critical region of sound amplification in cases of the first formants F 1 and F 2 is the lower domain of the airways, and they are equally dominant in all spectra due to standing waves resonating in the vocal tract volume. The third formant F 3 , however, is less pronounced relative to F 1 and F 2 at locations P1 to P3. Toward P4 and outside the airway tract, a clearer and more significant mode develops close to the approximate location of F 3 . It is possible that the accumulation of flow perturbations over the length of the airways contributes to this effect. As such, unsteady flow might not only cause measurable increase in broadband noise, but also significantly add to the energy content of higher formant modes. The sudden increase of the relative energy content at frequencies in the range of formant F 3 in the far-field domain is probably due to sound radiation from the mouth. According to Stevens (2000), the mouth opening can be considered as a simple source radiating uniformly in all directions as good approximation for frequencies up to 4 kHz. This leads to a radiation characteristic of magnitude jRðf Þj ¼ f q=ð2rÞ at a distance r. Thus the sound radiation rises with frequency and introduces a positive increasing spectral tilt of 6 dB per octave under $4 kHz, which would explain the energy shift toward higher frequencies in the far field of Fig. 12. The frequencies from near-field turbulent flow can be distinguished in the broadband region. Although smaller in amplitude due to dissipation, the noise component of the spectrum, which is induced by the turbulence and its associated quadrupole sources, is detected outside the internal airways. As expected, the overall amplitude of power spectral density continuously decreases throughout the upper airway channels toward the mouth opening.
While the harmonic, low-frequency component of the spectrum up to about 1000 Hz shows a fairly small decrease from position P1 to P4 inside the upper airways, the magnitude of fluctuations in the inharmonic, high-frequency range above 1000 Hz is reduced significantly. Thus, the broadband sound component in Fig. 12 is decreased with increased distance from its source in the supraglottal region, while the formant modes can be detected as peaks of nearly constant amplitude throughout the vocal tract.

B. Fourier surface spectra
Fourier analysis of the surface pressure fluctuations at the vocal tract walls highlights the areas of amplification through resonance, as well as acoustic surface sources generated by the unsteady pressure loads on the airway walls caused by the intermittent flow interacting with the walls. The surface acoustic spectra shown in Fig. 13 and [u] in order of tongue position from high to low vowels. The SPL at the surface is normalized by its maximum value for each of the vowel geometries to allow for better comparison. This is a way to visualize the impact of the various parts of the airway geometry on the production and filtering of the vowel formants. Because it is based on total fluctuating pressure at the walls, the surface Fourier transformation takes into account the acoustic fluctuations through resonance effects and standing wave patterns, as well as the fluctuations caused by near-field flow motion.
As a general trend it can be seen that lower formant frequencies are amplified predominantly in the laryngopharynx, oropharynx, or back cavities. The oral cavity seems to further contribute to the amplification of formant F 1 in the configurations of the lower vowels  [u].
The mode at F 2 tends to be related to higher sound pressures in domains of smaller spatial dimensions such as the oral and nasal cavities (i.e., the front cavities) in the airway  Fig. 6). The dominant frequencies F 1 , F 2 , and F 3 are indicated.
of vowel [A]. In the case of the vowel [u], the formant F 2 is related predominantly to the back cavity of the pharynx.

V. CONCLUSIONS
In this study, the mechanisms of sound pressure propagation and amplification through realistic upper airway geometries of several natural vowels were investigated through means of LES computations. By applying the Rosenberg glottal pulse model for the pulsating flow and pressure at the level of the vocal folds, the acoustic source signal was propagated throughout the upper airways and subsequently modified by the airway volume. Both the harmonic, low-frequency component, as well as the inharmonic, high-frequency component of pressure fluctuations arising during phonation are thus computed directly in the near and far fields of the computational domain.
The particular case of vowel [A] was considered with respect to the production and convection of Kelvin-Helmholtz instabilities in the supraglottal region, which shows a sharp increase during the glottal closing phase and propagates through the vocal tract at a maximum phase velocity equal to half the glottal jet velocity (cf. Sec. III B). The Reynolds number in the flow reaches a maximum at the peak of the waveform with fast acceleration of the flow through the narrow glottis short before the closing phase starts. A rise of vortical structures shed from the glottal constriction is observed during the closing phase. Flow structures develop particularly in the supraglottal jet shear layers due to flow instabilities caused by large velocity gradients. The pressure fluctuations outside the acoustic cone are larger in amplitude for higher frequencies and their slope in the frequency-wavenumber spectrum follows the estimated group velocity from linear stability analysis.
Flow separation and vortex shedding are known to take place in the intraglottal region, especially for a divergent glottal shape (Mihaescu et al., 2010). Vortices and instabilities might therefore propagate from the intraglottal region into the supraglottal region and contribute to the near-field pressure fluctuations. This effect, which is due to the viscous airflow and particular divergent shape of the vocal folds during closing, is similarly reproduced in our time-averaged glottal geometry (cf. Figs. 6, 8, and 9).
The frequency contribution and energy content of the supraglottal flow structures at higher frequencies of 1000 Hz and above might bleed into higher formant tonalities of F 3 and above (cf. Sec. IV A) and thus impact not only the broadband noise component at very high frequency but also the position and intensity of higher formants relative to lower ones. In the current study, the acoustic contribution of broadband sound to the voice spectrum is resolved to the upper end of the audible range. The high-frequency component of the sound spectrum is important for voice quality and the personal characteristics of a voice, referred to as timbre (Sundberg, 1994). It is also of relevance for professional singers, whose voices commonly exhibit a frequency peak of high energy in the spectral envelope at about 3000 Hz, referred to as singer's formant (Sundberg, 2001). Furthermore, Kreiman and Gerratt (2012) pointed out the perceptual importance of the inharmonic broadband component to voice quality. The noise-toharmonic ratio (NHR) was found to be an important indicator of general intelligibility and breathiness of a voice (de Krom, 1995;Hillenbrand, 1988).
It could be further shown that both spectra of near-and far-field pressure perturbations have significant peaks at the dominant frequencies F 1 and F 2 (cf. Figs. 5 and 12 Localized resonances of individual formants have been studied using predominantly one-dimensional acoustic models. Fant (1971) was the first to find cavity affiliations of vowel formants by simplifying the vocal tract as a system of tubes or Helmholtz resonators. Thus, the formants could be interpreted as resonances of the various cavities of the vocal tract. Consequently, Fant and Pauli (1974) used the onedimensional Webster's horn equation to extract the spatial distribution of kinetic and potential energies at particular resonance modes inside the vocal tract configurations of various vowels. In the current study, the specific areas of amplification of the formants are identified by means of Fourier transformation of the near-field pressure fluctuations at the airway surfaces (cf. Sec. IV B). They give 3D information on the relevant areas of sound generation in the upper airways that cannot be obtained through one-dimensional models. Furthermore, they entail both sound due to resonance and surface pressure fluctuations caused by impacting flow, which is not captured by models based on wave equations.
Two factors are essential for the acoustic characteristics of a healthy voice: the voice source, which requires functioning vocal folds, on the one hand, and the vocal tract with parts of the nasal tract acting as resonators on the other hand (Sundberg, 2005). If parts of the vocal tract are affected by paralysis (e.g., of the tongue), tumors, or tissue inflammation, its ability to produce acoustic modes in the far-field spectrum might be compromised due to changes of the related resonance volume. As an example, inflammation might affect the pharynx through acute epiglottitis, in which case swelling of the airway can lead to stenosis. This potentially lethal condition inhibits regular vocal articulation as well as breathing (Omori, 2011). Another condition is incomplete velopharyngeal closure, which leads to increased nasal airflow and resonance causing nasality (or hyperrhinophony) of the speech sounds (Hirschberg et al., 1995). Due to dysfunction of the oral motor mechanism, as it may occur in paresis, the velopharyngeal port in the nasopharynx does not close completely, which disturbs the pharyngo-oral airflow necessary for the pronunciation of oral sounds. In all these cases, regular speech is inhibited due to deviations of the airflow and the associated resonance volume.
A complete, geometrically detailed mapping of vocal tract resonances at the formant frequencies as presented in Sec. IV B gives information on the 3D cavity-association of vowel sounds in a healthy state. Spatially and temporally resolved velocity and pressure fields show the impact of geometrical features as well as of the time-varying source signal. This information can be of use as reference data for clinical purposes and comparison with similar data from patients with impaired vocal function, making it more important to resolve the complete frequency range relevant for human phonation through high-order LES computations.