Perceptual equivalence of the Liljencrants–Fant andlinear-filter glottal flow models

Speech glottal flow has been predominantly described in the time-domain in past decades, the Liljencrants–Fant (LF) model being the most widely used in speech analysis and synthesis, despite its computational complexity. The causal/ anti-causal linear model (LFCALM) was later introduced as a digital filter implementation of LF, a mixed-phase spectral model including both anti-causal and causal filters to model the vocal-fold open and closed phases, respectively. To further simplify computation, a causal linear model (LFLM) describes the glottal flow with a fully causal set of filters. After expressing these three models under a single analytic formulation, we assessed here their perceptual consistency, when driven by a single parameter Rd related to voice quality. All possible paired combinations of signals generated using six Rd levels for each model were presented to subjects who were asked whether the two signals in each pair differed. Model pairs LFLM–LFCALM were judged similar when sharing the same Rd value, and LF was considered the same as LFLM and LFCALM given a consistent shift in Rd. Overall, the similarity between these models encourages the use of the simpler and more computationally efficient models LFCALM and LFLM in speech synthesis applications. VC 2021 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). https://doi.org/10.1121/10.0005879 (Received 28 April 2021; revised 2 July 2021; accepted 22 July 2021; published online 19 August 2021) [Editor: Paavo Alku] Pages: 1273–1285


I. INTRODUCTION
The acoustic theory of speech production formalised by Fant (1960) assumes independence and linearity between the airflow modulated in the glottis by the vibration of the vocal folds, called glottal flow, and the resonance effect of the vocal tract that shapes the glottal flow into a speech signal. The linear acoustic theory offers a somewhat simplified view of the physics of speech production, but it is still a very effective and widely used representation of voice signals for speech processing applications (e.g., speech coding, synthesis, parameterization) and acoustic phonetics analyses. In this theory, vocal tract resonances introduce spectral formants and anti-formants (maxima and minima of the spectral envelope) that characterise speech sounds. Vocal tract formants are themselves often associated with linear filters: series or parallel branches of second order resonant sections in formant synthesisers; auto-regressive filter models in linear prediction. In early applications, the voice source component was also considered as a low-pass filter, the so-called glottal formant. The transmission line analog proposed by Fant (1960) used a four-pole model subsequently simplified in a two-pole model in linear prediction of speech by Markel and Gray (1982). Note that this glottal formant is not related to a physical resonance but describes the spectrum of the glottal pulse, modelled as the impulse response of the low-pass filter. However, glottal filter impulse responses poorly match glottal flow waveforms obtained by inverse filtering or by indirect measurements like electroglottography. This has led to the proposition of a multiplicity of glottal flow models (GFMs) defined in the time-domain by analytic and parametric formulations of the glottal flow waveform and its derivative: Rosenberg (1971) (Rosenberg model); Hedelin (1984), Fujisaki and Ljungqvist (1986), and Klatt and Klatt (1990) (KLGLOTT88 model); Fant et al. (1985) [Liljencrants-Fant (LF) model]; Veldhuis (1998) (Rþþ model). These widely used models adopt various mathematical functions to describe the glottal flow oscillation, yet Doval et al. (2006) showed that the Rosenberg, KLGLOTT88, LF, and Rþþ models can be grouped under one general expression that is parameterized by a common set of five parameters. Variations of these parameters are closely related to voice quality perception (e.g., breathiness, tenseness, vocal force), that strongly motivates the use of GFM in expressive speech related research. This includes analysis of emotion in speech [Gobl and N ı Chasaide (2003), Patel et al. (2011), andN ı Chasaide et al. (2013): LF model; Burkhardt and Sendlmeier (2000): KLSYNTH88 model]; analysis-resynthesis schemes for voice modification [Childers (1995), Cabral et al. (2014), and Degottex et al. (2013): LF model]; or expressive textto-speech synthesis [Raitio et al. (2013), Airaksinen et al. (2016), and Juvela et al. (2019): LF model]. This list is not exhaustive; however, LF has been the most widely adopted model for analysis and synthesis of speech signals.
The main limitation of the LF model is its computational complexity. It requires solving implicit equations that can only be performed with numerical approaches. This model is not suitable for applications where computational complexity is a constraint, such as real-time speech or singing synthesis. Also, spectral glottal flow models are desirable because voice quality is often described in spectral terms (e.g., voice spectral tilt, brightness, tenseness): spectral parameters are closer to perception than time-domain parameters. It is therefore interesting to investigate the apparent discrepancy between GFM like LF and filter impulse-response models. Along this line, Doval et al. (2006) highlighted that LF and the other time-domain models under study have a simple magnitude representation in the frequency-domain that can be modelled with a third order filter, as also noted by Childers and Lee (1991). This has led to the proposal of new models: the causal/anti-causal linear model (LF CALM ) by Doval et al. (2003), followed by an all-causal linear model (LF LM ) used in the Cantor Digitalis singing synthesiser (Feugère et al., 2017), which both gradually simplify the computation of the glottal flow by using digital filters instead of analytic functions, thus enabling a precision-complexity trade-off, LF being the most precise and LF LM the simplest. While we will show in Sec. II that the simplification operating on LF CALM and LF LM can substantially modify the glottal flow waveform, it is not clear if this affects their auditory perception. The aim of this paper is threefold. Section II studies the three models LF, LF CALM , and LF LM in terms of linear filters. Formulations for impulse responses are derived, and differences between the models are investigated. After this objective and analytic comparison, subjective experiments are conducted in Sec. III for assessing the perceptual equivalence of the three models. Armed with analytic formulations and perceptual analyses, the discussion in Sec. IV summarises the results obtained: linear-filter formulations equivalent to the LF model are able to account for both the observed glottal formant and glottal flow waveforms.

II. LINEAR-FILTER FORMULATION OF GLOTTAL FLOW MODELS
A. Glottal flow model parameters: LF and R d All GFMs attempt to describe a vocal-fold vibration period in time-domain (see Fig. 1). Three phases are considered: the opening phase (lung pressure forces the vocal folds to spread, and an increasing air flow passes through the glottis); the closing phase (the elasticity of the vocal folds takes over, closing the air passage); the closed phase (the airflow is blocked). Then the lung pressure increases again, and a new opening phase follows. This cycle can be represented by five parameters : the cycle period T 0 or fundamental frequency F 0 ¼ 1=T 0 ; the cycle amplitude, generally represented by E, the maximum of the absolute value of the glottal flow derivative (GFD) (i.e., the negative peak at the glottal closure instant has amplitude ÀE); the open quotient O q , the ratio of the open phase duration T e over the period T 0 ; the asymmetry coefficient a m , the ratio of the opening phase duration T p over the open phase duration; and T a , the closing time duration (Fig. 1). Period T 0 and amplitude E change the time and amplitude scales of the glottal flow. The three other parameters change the shape of the glottal flow and account for the voice timbre or quality. Empirically, Fant et al. (1994) established that the perceptual effect of the shape parameters O q , a m , and T a can be gathered into a unique high-level parameter called R e initially, R d afterward (Fant, 1995)  (long open phase, symmetry of the glottal flow providing a relaxed voice). Note that the time-domain NAQ-coefficient proposed later (Alku et al., 2002) is proportional to R d . R d will be used as a control parameter below.

B. Glottal formant, LF CALM , and LF LM
Radiated sound pressure outside the vocal tract can be approximated by the derivative of the speech flow measured at the lips. For this reason, the glottal flow derivative is often preferred to the glottal flow for analysis purposes. The spectrum of the glottal flow derivative shows a marked spectral peak, the glottal formant. Figure 1 displays on the right the magnitude spectrum of the glottal flow derivative computed with the LF model, superimposed on a third order filter approximation. Two poles form the glottal formant, a lowfrequency resonance with centre frequency F g that is directly related to the oscillation of the open phase of the vocal folds. The remaining pole is an extra attenuation with cut-off frequency F ST , called spectral tilt, that is responsible for the smoothness of the closed phase of the vocal folds. Phase analysis has shown that this third order approximation is a mixed-phase model (Gardner and Rao, 1997), allowing it to represent the open phase of the LF model as a second order filter response (damped sinusoid) that evolves toward negative time, while the closed phase resembles the response of a first order filter (decreasing exponential) that evolves toward positive time (bottom-left of Fig. 1). Following this analysis, the causal/anti-causal linear model of the glottal flow (LF CALM ) has been proposed by Doval et al. (2003) to generate a glottal derivative waveform by filtering a pulse train with the mixed-phase third order filter. The LF CALM is a simple formulation reproducing the dual relations between timedomain parameters and spectral shape (Gobl et al., 2018;Henrich et al., 2001). A real-time implementation of the model called RT-CALM was then derived by d' Alessandro et al. (2006). The mixed-phase characteristic of the glottal flow has been exploited for the estimation of the glottal flow from speech signals (Bozkurt et al., 2005;Drugman et al., 2011;H ezard et al., 2013). The glottal formant can also be represented by causal filters, following Klatt (1980) and Holmes (1983), but at the expense of some distortion in the phase spectrum compared to the LF model. A formulation of this causal linear voice source model LF LM has been proposed and used for real-time voice synthesis and voice source analysis (Feugère et al., 2017;McLoughlin et al., 2020;Perrotin andMcLoughlin, 2019, 2020). The perceptual effect of this phase difference is studied in Sec. III.
To summarise, the LF model that is widely accepted as a precise time-domain GFM has been simplified by a frequency-domain representation that uses a mixed-phase third order filter called LF CALM . To go further in reducing computation complexity, an all-causal linear model (LF LM ) has been recently formulated.
All three GFMs are defined in terms of their open and closed phase, described separately in Secs. II C and II D. For this reason, we define glottal opening instants (GOIs) that mark the beginning of each open phase and are spaced by a duration of T 0 and glottal closure instants (GCIs) marking the beginning of each closed phase. GOIs and GCIs are spaced by a duration of O q T 0 . Let us define the impulse response of a truncated second order filter, whose generic formulation is h T ðtÞ ¼ G n e a n t sin ðb n t þ / n Þ if t 2 D h T ðtÞ ¼ 0 elsewhere: (1) If h T is anti-causal, T < 0 is the instant of truncation, D ¼ ½T; 0, and a n > 0. Its causal counterpart is defined for T > 0, D ¼ ½0; T, and a n < 0. It can be shown that the open phase definitions of the three GFMs under study can be formulated with respect to Eq. (1) by setting appropriately the G n , a n , b n , / n , and T parameters (index n is subsequently replaced by the name of the model in consideration: LF, CALM, or LM). In their original formulations, LF is defined as a continuous timedomain function, while LF CALM and LF LM are defined as digital filters (Z-domain). For the sake of generalisation, all expressions are given below as equivalent continuous representations (time and Laplace domains), and derivation details from the original papers' formulations are given in Appendixes B, C, and D.
a. LF. The LF model (Fant et al., 1985) is defined by an analytic function in the time-domain relative to the GOI and can be interpreted as an unstable, divergent, and truncated causal filter. However, re-parameterization with O q and a m and setting the time origin at the GCI (see Appendix B) allow us to express LF as an anti-causal filter truncated at T LF ¼ ÀO q T 0 , matching Eq. (1). The equations below give the resulting waveform analytic expression and its Laplace transform: One can now identify from the top equation the values of parameters G LF ; a LF ; b LF ; / LF , and T LF that are summarised in Table I. a LF is the open phase damping coefficient. It is set so that the airflow of a period is zero and results from an implicit equation (see Appendix B).
b. LF CALM . The causal/anti-causal linear model uses a second order anti-causal and truncated bandpass filter to model the open phase of the glottis (Doval et al., 2003), whose equation and parameters are derived in Appendix C.
The time-domain response of LF CALM , truncated at T CALM ¼ ÀO q T 0 , and the frequency-domain response are given by computing the inverse Z-transform and Laplace transform of the filter, respectively, We find again the general formulation of Eq. (1), and the LF CALM parameters are summarised in Table I. c. LF LM . The LF LM model (Feugère et al., 2017) is the causal version of LF CALM with the difference that the filter is not truncated, since it converges (see Appendix D). The time and frequency responses of LFLM, whose parameters are given in Table I (4) This results in 620 dB/decade asymptotes. In particular, all Laplace transforms simplify to E/s at high frequencies, resulting in similar asymptotes for the three GFMs. At low frequencies, the asymptotes are shifted between models but only from a few dB.

Comparison between the GFM open phases
LF and LF CALM display two more similarities. First, their anti-causality causes the GFD phase to increase (bottom-right of Fig. 2); second, they are both truncated at t ¼ ÀO q T 0 . The thin dashed curves in the left panels show what would be non-truncated versions of LF and LF CALM . A direct effect of the truncation is the computation of their Laplace transform on the interval ½ÀO q T 0 ; 0, which results in the appearance of the term e ÀsT in H LF open and H CALM open . This causes the ripples observed in the LF and LF CALM spectra. The main difference between LF and LF CALM is that the former is parameterized to be class C 1 i.e., with a continuous GFD at the GOI (ÀO q T 0 ). This parameterization results in a generic second order filter that is neither lowpass nor bandpass, as shown by the numerator of H LF open . A consequence is the large lobe around the resonance frequency of the GFD magnitude spectrum. Conversely, LF CALM is parameterized to be a bandpass filter, which allows a reduction of the resonance's lobewidth but cannot suppress it completely because of the effect of truncation. The consequence of the bandpass parameterization is a discontinuous GFD at the glottal opening instant.
Two differences between LF CALM and LF LM are also highlighted. The difference of causality is well-displayed by a vertical symmetry in the time-domain and a horizontal Open phase symmetry of the phase spectrum. Also, because LF LM converges, it is not truncated at O q T 0 . This implies a leak of the period to the next one but also greatly simplifies its implementation. As a result, its spectrum is the exact frequency response of a bandpass filter, with no ripples and no lobe around the resonance centre frequency. Note that the vertical symmetry of the GFD between LF CALM and LF LM implies a sign inversion of the glottal flow, but one that the ear is not sensitive to.
D. Modeling the closed phase

Formulation of the closed phases
Definitions of the GFM closed phase fall within two categories : it is either described in the timedomain by an analytic formulation, as LF, or defined in the frequency-domain with a first order filter, as LF CALM or LF LM . a. LF: Analytic expression. The closed phase of the LF model, after shifting the glottal closing instant at t ¼ 0, is expressed as (5) where is the closed phase coefficient. It satisfies the continuity of the open and closed phase expressions at the GCI and is obtained from an implicit equation (see Appendix B). Note that because a LF is computed from , the shape of the open phase depends on the closed phase, although both phases are defined by distinct analytical expressions.
b. LF CALM and LF LM : Filtering. With LF CALM and LF LM , the closed phase is modelled by a first order low-pass filter attenuating high frequencies above its cut-off frequency F a ¼ 1=ð2pT a Þ and called spectral tilt (Doval et al., 2003;Feugère et al., 2017). Filter formulation is given in Appendix C. In these cases, the spectral tilt filter is applied on the full signal and therefore changes the open phase shape. The top-right panel shows high similarity between the three GFMs' spectrum magnitudes. The closed phase adds a supplementary À20 dB/decade attenuation to all open phase spectra, resulting in a À40 dB/decade attenuation at high frequencies. We can also observe an increase in gain in low frequencies for the LF model. This is directly linked to the change of the a LF parameter. A consequence is the largest amplitudes of the glottal flow and glottal flow derivative for LF.

Comparison between the GFM closed phases
Looking at the phase spectrum (bottom-right panel), LF and LF CALM almost overlap, showing a similar effect of the closed phases on their respective phase spectra: it adds a supplementary Àp=2 offset at high frequencies to all phase spectra of the open phases. The spectral tilt filter is displayed in black. This offset introduces an asymmetry between LF and LF CALM on one side and LF LM on the other. The addition Àp=2 at high frequencies for all models reduces the phase of LF and LF CALM from 3p=2 to p but also reduces the phase of LF LM from À3p=2 to À2p. This asymmetry is reflected in the shapes of the glottal flow derivatives (bottom-left panel). One can see that LF LM and LF CALM are not symmetrical anymore and that the filtering attenuates more the GFD peak near the glottal closure instant for LF LM than for LF CALM . Finally, it is important to mention that the spectral tilt filter is not truncated for LF CALM and LF LM , and its application results in an infinite response that may overlap with the next period. This appears for high values of R d , as shown in Sec. II F.

E. Assessment of computational costs
To evaluate the computational efficiency of each GFM, we measured the average time necessary to compute one period of a 1-s stationary signal for each model. The ratio of computation time over the period duration gives the realtime factor. A real-time factor below 1 means that the signal is faster to compute than to play back, so we can listen the signal while it is generated. Inversely, a real-time factor higher than 1 indicates that the signal takes longer to compute than to play back. This experiment was made in the condition of a fine-grain control of the GFM: parameters are calculated for each period. To assess the dependency of the real-time factor on F 0 and R d , we generated 564 stationary signals using a combination of the six R d values described in Sec. III and 94 F 0 values, from 70 to 1000 Hz with steps of 10 Hz. All signals were generated on an iMac Intel Core i9, with a 3.6 GHz processor. Figure 4 displays the real-time factors for the three GFMs depending on F 0 . For each model and F 0 value, we computed the mean and standard deviation of the real-time factor across the six R d values. The means for each model are represented by the thick coloured lines, and the shading around each mean value highlights the 6 standard deviation range around the mean. LF CALM and LF LM are more than 10-100 times faster than LF. This is a direct consequence of the resolution of the implicit equation for the LF model, which is costly. Also, the efficiency of LF decreases with higher F 0 because the resolution of the implicit equation requires a constant duration. Therefore, when the period duration decreases, the real-time factor increases, and this dependency between computation efficiency and input parameter is not desirable. Finally, R d has no effect on the computation time for all three GFMs.
F. Summary of the model implementation and effect of R d Table I  represent the GOIs, while the dotted lines show the GCIs. In the second and third row, the vertical line indicates the cutoff frequency of the spectral tilt filter. Globally, R d has a similar effect on the three GFMs. Looking at the spectrum magnitude, low values of R d lead to higher centre frequency and bandwidth of the glottal formant and a higher spectral tilt cut-off frequency. These combined effects favour the presence of numerous harmonics that give a sharp GFD closure, close to the shape of an impulse. This is typical for tensed and loud voice, when the vocal folds open and close abruptly. Inversely, high values of R d lower the centre frequency and bandwidth of the glottal formant as well as the spectral tilt cut-off frequency. It thus emphasises more the first and second harmonics, leading to a more sine-like GFD shape. This is lax/soft voice, when the vocal folds oscillate more symmetrically.
In the first column of Fig. 5, the three GFMs appear very similar for two reasons. First, a low value of R d leads to a high attenuation coefficient a n that allows LF and LF CALM to have almost horizontal tangents at GOI. The truncation thus does not introduce an abrupt change of slope on the GFD, which results in a reduction of ripples on the LF and LF CALM spectra. Second, the effect of spectral tilt that introduces an asymmetry between LF CALM and LF LM is small (high cut-off frequency), leading to almost symmetrical LF CALM and LF LM GFDs. Inversely, the three GFM shapes diverge with increasing values of R d . Truncation has stronger effects on LF and LF CALM , increasing ripples in their spectrum, and the spectral tilt whose cut-off frequency is closed to the glottal formant position has a strong effect on the GFD shapes. In particular, one can note that the minimum values of LF CALM and LF LM diverge from ÀE when R d increases. Moreover, the last column illustrates well the effect of absence of truncation of the spectral tilt filter on LF CALM and LF LM . The GFD computed for one period overlaps on the next one, leading to negative (respectively, positive) value of the GFD at the GOI for LF CALM (respectively, LF LM ).
We have shown that the difference of construction between the three GFMs (formulation, causality, truncation) leads to clear visible differences in the GFD waveforms and spectra. However, their effect on auditory perception is unclear and is assessed in Sec. III.

III. PERCEPTUAL COMPARISON OF VOICE SOURCE MODELS
A. Experiment

Protocol and task
The aim of the experiment was to assess any perceptual difference between the three GFMs for different values of the R d parameter. We used for this purpose a two-alternative forced-choice (2AFC) protocol (Kingdom and Prins, 2016), where each subject's task was to listen to paired sounds and to say if they were the same or different, with respect to any distinctive features, whatever their nature (e.g., timbre, level, pitch, etc.). The experiment was divided into three blocks. The first block used synthesised sounds from the GFMs only. The second and third blocks used additional /a/ and /i/ vocal tract models convolved with the GFMs. These two vowels were chosen for their lowest (/i/) and highest (/a/) first formant frequency in order to test a more natural vocal sound than the GFM alone.
For each GFM and following Degottex et al. (2013), six values of R d were chosen equally spaced on a logarithmic scale, leading to three GFMs Â 6 R d ¼ 18 stimuli per block (the one displayed on Fig. 5). Then for each block, every combination of pairs of different GMFs was tested (LF LM Â LF CALM ; LF LM Â LF; LF CALM Â LF; LF CALM Â LF LM ; LF Â LF LM ; LF Â LF CALM ). Finally, 3 vowels Â 6 pair combinations Â 6 R d values for the first element of the pair Â 6 R d values for the second element of the pair led to a total of 648 pairs of stimuli to compare.
A computer interface was specially designed for this experiment and programmed in MAX 6. 1 The protocol was identical for all the paired stimuli. To proceed, the subject clicked a button, which launched the playback of two sounds, A and B, separated by 500 ms. The test sounds were ordered randomly and played for each subject only once to keep sessions as short as possible and identical among subjects. The subject had to choose whether the two sounds were identical or different, without any other choice. Each block lasted approximately 10 min, and subjects were especially encouraged to stop and rest between the three blocks with a message displayed automatically. The entire experiment took place in an acoustically insulated and treated room designed for perceptual experiments. Sound was played using a Focusrite (High Wycombe, UK) Scarlett 2i2 audio interface on a Mac OSX and AKG (Los Angeles, CA) K271 headphones. Before the experiment, subjects were trained with a subset of the sound-pair list (GFM convolved to /a/ vocal tract or without vocal tract, three R d values spread over the full range of possible values).
A group of 18 subjects took part in the experiment (median age of 28 years, from 21 to 54 years old). Among them, 12 subjects worked in the field of sound technologies, and six others had a regular musical practice. An audiogram test was performed for each of the subjects, and none of them reported any known auditory impairment except one who was single-side deaf, but stereo listening was not needed to perform the task. Fourteen subjects were members of the laboratory and participated in the experiment on a voluntary basis without being paid. The four remaining subjects were paid for the experiment.

Stimuli specification
Stimuli were synthesised at a sampling rate of F s ¼ 96 kHz. A constant fundamental frequency of F 0 ¼ 110 Hz and a peak amplitude E ¼ 0.2 were chosen. The LF GFDs were generated by using the analytic formulations of Eqs. (2) and (5) and by solving the implicit Eqs. (B3) and (B4). The LF CALM and LF LM GFDs were generated by filtering a pulse train with their respective open and closed phase filters (Appendix E). All signals lasted 0.3 s, a duration longer than a standard spoken syllable but short enough to facilitate recall of the two stimuli for comparison. Fade-in and fade-out amplitudes were applied using half Hanning windows of length 10T 0 ¼ 0:09 s. Vowels were invariant in time and were applied by filtering the GFM with a bank of five parallel resonant filters corresponding to vowels /i/ and /a/, whose transfer functions are given in Feugère et al. (2017). Finally, all stimuli were normalised in dBA. 2

B. Results
Results report the proportion of pairs that were judged similar depending on the factors in consideration. In particular, we factorised the six different model pairs into two factors: the Model/factor (three levels: LF Â LF CALM ; LF LM Â LF; LF CALM Â LF LM ) and the Order factor that codes the order of presentation of each pair (two levels). The additional factors are Vowel (three levels: source only; /a/; /i/) and R d (36 levels for all combinations of the six selected values). In the following, we used a single generalised linear model following a binomial distribution to assess the significance of each factor and their interactions for the perception results. The obtained model was subsequently simplified by iteratively removing non-significant interactions between factors provided that, at each simplification step, the current and the simplified models do not significantly differ (p > 0.05) (Crawley, 2013). Post hoc Pearson's chisquared tests were run to assess whether proportions obtained for single conditions significantly differ from chance. Figure 6 shows perceptual experiment responses for all factors and interactions except Order (results for both presentation orders of each pair are merged). The top-left panel shows results relative to the R d factor only. Each square corresponds to the proportion of pairs judged similar for a given couple of R d values, all models and Vowels combined. Pairs in black and white were judged similar by 100% and 0% of the subjects, respectively. Scores that fall within the red rectangle on the colour bar do not significantly differ from chance according to the post hoc Pearson's chi-squared tests (p > 0.05). On the left-hand side of the figure, the top row (columns 2-4) shows the model Â R d interaction, with all levels of Vowel and Order combined; the left column (rows 2-4) shows the Vowel Â R d interaction, with all levels of model and Order combined; the remaining panels show the Vowel Â model Â R d interaction for each level of Vowel and model, indicated in the top and left margins of the figure. Panels with yellow and green contours are replicated on the right side, with LF put in the abscissa. On top (respectively, bottom) for each R d (LF) value (each column), the distribution of perceived similar R d (LF CALM ) [respectively, R d (LF LM )] values was obtained and superimposed on the figure, the circles being the medians and the error bars corresponding to 90% of the values around the median. Smaller circles indicate scores below the level of significance (Pearson's chi-squared test). The shaded area links all error bars and represents the space of perceptual equivalence between R d (LF) and R d (LF CALM ) (respectively, R d (LF LM )).

Effect of R d and order
The R d factor has the strongest effect on results [R d : v 2 ¼ 3620, degrees of freedom (df) ¼ 35, p < 0.001]. The top-left panel of Fig. 6 clearly shows that, over all other factors, pairs with similar values of R d are strongly perceived as similar, and vice versa. This confirms that R d has a strong perceptual effect on the synthesis of glottal flow. Presentation order had no influence on similarity judgment (Order: v 2 ¼ 0, df ¼ 1, p ¼ 0.90). Therefore, all results displayed in Fig. 6 and detailed below combine the scores of both presentation orders.

Effect of model
The model factor alone has a small and marginally significant effect on subjects' scores (model: v 2 ¼ 8:8, df ¼ 2, p ¼ 0.012) and therefore demonstrates that the three models are perceptually close to each other. LF CALM and LF LM are judged the most similar models, and LF and LF LM are judged the least similar when all answers are averaged. The subjects' perception seems to reflect the differences between models' construction that are summarised in Table I Fig. 6 (columns 2-4). The first observation is that stimuli with similar values of R d are judged extremely similar (close to 100% similarity), while stimuli with different values of R d are judged different (0% similarity). One can then note a diagonal asymmetry in the LF Â LF CALM and LF LM Â LF panels for R d values higher than 0.86, i.e., when the models start to differ the most (Fig. 5). In particular, subjects judged LF and LF CALM similar mostly when R d (LF CALM ) was greater than or equal to R d (LF). Similarly, LF and LF LM were mostly judged similar when R d (LF LM ) was greater than to or equal to R d (LF). Conversely, LF CALM and LF LM were judged the most similar when they shared the same R d value, picturing more symmetric results (top-right panel of the left side of Fig. 6).
The right side of Fig. 6 summarises this asymmetry between LF and the other models. Recall that these panels are replicates of the one with yellow and green contours from the left-hand side, but with LF put in the abscissa for both plots. For each R d (LF), medians of corresponding distributions of perceived similar R d (LF CALM ) [respectively, R d (LF LM )] are all on or above the diagonal. Also, the spread of each distribution represented by the error bars (90% of the values around the median) and emphasised by the shaded areas clearly displays asymmetrical spaces of perceptual equivalence between R d (LF) and R d (LF CALM ) [respectively, R d (LF LM )] that are again above the diagonal, with R d (LF CALM ) [respectively, R d (LF LM )] mostly equal to or greater than corresponding R d (LF).

Effect of vowel
The effect of vowels (Vowel: v 2 ¼ 17:5, df ¼ 2, p < 0.001) supports that GFDs presented alone were significantly judged less similar than when they were passed through a vowel, the vowel /i/ giving the highest similarity results. Therefore, the introduction of resonances in the signal mitigates the perception of the glottal source timbre. Moreover, the glottal formant F g evolves within the range [64,121] Hz for the chosen values of R d for all models. The vowel /i/, having its first formant resonance the closest to F g , could mask the effect of R d variation, leading to sources judged more similar with /i/ rather than vowel /a/.
Also, a significant two-way interaction with R d is present (Vowel Â R d : v 2 ¼ 302, df ¼ 70, p < 0.001) as shown in the left column of Fig. 6, rows 2-4. Stimuli presented with the source only show similarity concentrated around the diagonal. When presented with the vowel /i/, the similarity spreads across adjacent R d values for high R d . This corresponds to F g and F ST values that are around 100 Hz, close to the first formant frequency of vowel /i/ (215 Hz). Conversely, for the /a/ vowel, it seems that stimuli with high R d value were neither clearly perceived as similar nor dissimilar. In this case, the first formant frequency (700 Hz) is far above the F g and F ST ranges. A possibility is that subjects either focused on the low or high frequency parts of the signal, the former hearing the source differences and the latter focusing on the /a/ resonance.

Remaining interactions
No significant three-way interaction between Vowel and model and R d was detected. It can be seen in Fig. 6 that the trend previously observed in the top row and left column (two-way interactions) applies to the remaining plots. Statistical analysis did not reveal a significant Vowel Â model interaction, showing that the perception of differences between models is relatively independent from the addition of a vocal tract. Although it would be necessary to cover a larger number of vocal tract configurations, this finding encourages the hypothesis that the choice of the glottal flow can be made independently from the behavior of the vocal tract. Finally, two-way interactions Order Â R d and Order Â model result from the asymmetry of the model levels (v 2 ¼ 98, df ¼ 35, p < 0.001; v 2 ¼ 7:6, df ¼ 2, p ¼ 0.022, respectively). The top row of Fig. 6 showed an asymmetry between LF and LF CALM and between LF LM and LF. When considering the order of presentation as a factor, e.g., distinguishing LF Â LF CALM vs LF CALM Â LF, the asymmetry of LF CALM Â LF results is reversed compared to LF Â LF CALM , hence the two-way interaction.

IV. DISCUSSION AND CONCLUSION
In this study, the LF model is reformulated in terms of linear filters. This formulation reconciles the apparent discrepancy between time-domain GFM and spectral voice source models. It allows for quantitative spectral interpretation of the LF model parameters because the correspondence between time-domain and spectral parameters can be analytically computed. This unifies Fant's views on the voice source: the key point is the interpretation of the LF GFM [in Fant et al. (1985)] as a mixed phase system and not as a simple resonant filter [as in Fant (1960)]. The joint variation of the waveform and glottal formant as a function of R d can be computed for voice quality analysis and synthesis. As a rule of thumb, increasing R d corresponds to lowering the glottal formant centre frequency (often referred to as the "voicing bar" in wideband spectrogram reading) and increasing the spectral tilt toward lower frequencies (the right-hand "skirt" of the glottal formant).
Following the proposal of glottal flow models that attempt to reduce the computational complexity of LF, namely LF CALM and LF LM , we sought to assess the perceptual consistency of these models. We first showed that even though LF is defined from an analytic expression and LF CALM and LF LM from digital filters, they can all be expressed by the same analytic function, with their own set of parameters. In terms of construction, LF and LF CALM have anti-causal and truncated open phases, while LF LM has a causal and non-truncated open phase. The three GFM closed phases are causal.
Perceptual pairwise-comparison of these models parameterized with various levels of R d using a same-different forced-choice paradigm on short stationary signals shows that all models are perceived similarly, in that they share the same R d parameterization with a possible offset. In particular, LF LM and LF CALM are perceived similarly with the same R d , while LF is perceived similarly as LF CALM and LF LM when LF has a smaller R d value. Investigation seems to show that this shift in perception relates more to the truncation of the glottal flow open phase than to a difference of causality. Nevertheless, this needs to be confirmed in further experiments. Finally, we showed that the addition of vocal tract effect with low vocalic formants increases the perception of similar waveforms when R d varies slightly between two waveforms. If the high dissimilarity between waveforms (Fig. 3) has favoured the use of LF for precise analysis of the glottal flow (i.e., time-domain analyses), the perceptual consistency between models encourages the use of LF CALM and LF LM as simpler models than LF for speech synthesis applications and for spectral analyses of the voice source and voice quality.
APPENDIX A: HIGH-TO LOW-LEVEL GLOTTAL PARAMETERS Fant (1995) derived a unique high-level parameter R d to control all low-level parameters O q , a m , and T a . He first defined intermediate parameters R a , R k , and R g from which are derived the low-level parameters, R a ¼ ðÀ1 þ 4:8R d Þ=100 R k ¼ ð22:4 þ 11:8R d Þ=100 R g ¼ R k ð0:5 þ 1:2R k Þ 0:44R d À 4R a ð0:5 þ 1:2R k Þ 8 > > > > < > > > > : ) For each period, the impulse response is truncated at the previous GOI. Then the full signal is filtered by the causal spectral tilt filter H ST [Eq. (C5)], leading to the recursion equation In the case of LF LM , both glottal formant and spectral tilt filters are applied in their causal form. We define a pulse train d goi whose impulses are placed on the GOIs. The pulse train is then filtered successively by the causal version of the glottal formant filter H LM open [Eq. (D1)], leading to the recursion equation x LM open n ½ ¼ b 1 d goi n À 1 ½ þ b 2 d goi n À 2 ½ Àa 1 x LM open n À 1 ½ À a 2 x LM open n À 2 ½ ; (E3) and the spectral tilt filter H ST [Eq. (C5)], leading to the recursion equation