Virtual reality head-mounted displays affect sidetone perception

The purpose of this study was to investigate whether head-mounted displays (HMDs) change the sidetone to an auditorily perceivable extent. Impulse responses (IRs) were recorded using a dummy head wearing a HMD (IRtest) and compared to IRs measured without HMD (IRref). Ten naive listeners were tested on their ability to discriminate between the IRtest and IRref using convolved speech signals. The spectral analysis showed that the HMDs decreased the spectral energy of the sidetone around 2000–4500 Hz. Most listeners were able to discriminate between the IRs. It is concluded that HMDs change the sidetone to a small but perceivable extent. © 2022 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Immersive, three-dimensional (3D) virtual reality (VR) environments are becoming more and more common as tools within medical rehabilitation, psychological treatment, and education. Immersive virtual environments can be displayed through head-mounted-displays (HMDs), providing the user with the experience of actually "being" in the simulated environment, including a sense of disconnection from the real world.
HMDs add acoustic reflective surfaces around the head and face of the user that are not typically present. These surfaces create acoustic distortions that previous research has shown to affect the perception of spatial and timbre quality 1 and accuracy of sound localization. 2 Gupta et al. 1 investigated alterations to the head-related transfer functions (HRTFs) between a sound source at 1 m distance to ear canal microphones of a dummy head wearing three different types of HMDs. Spectral differences were analyzed between different azimuth positions of the dummy head (0°–180°) for the ipsilateral and contralateral HRTF impulse responses, as well as the different HMDs. The results showed that the HMDs changed the spectral content of the HRTFs, with more prominent changes in the contralateral than the ipsilateral ear. Listening tests confirmed the objective results, showing perceptually noticeable timbre and spatial differences for most HMDs and azimuth positions, including 0°.
Previous research has focused on acoustic perturbations induced by the HMDs with the sound source located in the far field. However, there are several applications for immersive VR where the sound source will be located at the user's mouth, i.e., when speaking or singing. For example, immersive VR has been implemented for public speaking training for university students, 3 people with public speaking anxiety, 4,5 and adults who stutter. 6 Immersive VR systems have also been developed for group singing activities, making it possible to virtually take part in choir singing from home. [7][8][9][10] A talker/singer receives sensorimotor and auditory feedback on his own voice and speech production in real time. The auditory feedback, or sidetone, consists of a bone-conducted and an airborne part. The latter can be further divided into the direct pathway of the sound from the mouth to the talker's own ears and the sound reflected by surrounding surfaces before reaching the talker's eardrums. 11 Changes to the sidetone have been shown to induce compensatory changes in vocal behavior, for instance, the Lombard effect stating that a talker will raise his vocal loudness when background noise increases 12,13 and, on the contrary, decrease vocal loudness when sidetone loudness increases. [14][15][16][17][18][19] Compensatory vocal behaviors have also been found for sidetone pitch alterations 20-24 and formant shifts. [24][25][26][27][28][29][30][31] Therefore, the sidetone has a direct impact on voice and speech regulation.
Since HMDs have been found to change HRTFs measured with an external sound source, they are likely to also affect HRTFs when the sound source is the user's own voice. If so, the use of HMDs could risk inducing compensatory vocal behaviors of the user as a consequence of sidetone changes. The purpose of this study was therefore to investigate how sound reflections from a commonly used HMD change the frequency spectrum of the sidetone and to investigate whether or not these spectral differences are perceivable to naive listeners.

Impulse response measurements with and without head-mounted display
To investigate changes in sidetone signal content due to HMDs, impulse responses (IRs) were measured between the mouth loudspeaker and ear canal microphones of a head-and-torso simulator (HATS) (model 4128; Brüel & Kjær Sound & Vibration Measurement A/S, Nærum, Denmark) while it was wearing a HMD (model HTC VIVE; HTC Corporation, New Taipei City, Taiwan) and compared to reference IRs measured in the same way without the HMD (Fig. 1). All measurements took place in an anechoic chamber using 10.9 s long sinusoidal sweeps excited by the DIRAC software (version 7841, Brüel & Kjær Sound & Vibration Measurement A/S). The sampling frequency was 48 kHz. Six experimental IRs (IRtest) were measured with the HATS wearing the HMD. The HMD was removed from and put back on the dummy head between each measurement to mimic typical usage. Ten reference IRs (IRref) were measured without the HMD. Data analyses were performed using MATLAB (MathWorks, Natick, MA). After inspection of the energy-time curve, all IRs were truncated in the time domain at 0.25 s so as to reduce the influence of noise possibly contaminating the spectrum. The energy and frequency content of the truncated IRref and IRtest were further analyzed and compared in the time and frequency domain, respectively, and used for convolving speech signals for a listening experiment.
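The truncation and spectral comparison described above can be illustrated with a minimal Python sketch. This is not the MATLAB analysis used in the study. The function names are hypothetical, and averaging the dB spectra across repeated measurements is an assumption, since the text does not state how the six IRtest and ten IRref were combined. Only the 48 kHz sampling frequency and the 0.25 s truncation point are taken from the text.

```python
import numpy as np

FS = 48_000        # sampling frequency stated in the text (Hz)
TRUNCATE_S = 0.25  # truncation point stated in the text (s)


def truncate_ir(ir, fs=FS, t_max=TRUNCATE_S):
    """Keep only the first t_max seconds of an impulse response."""
    return np.asarray(ir, dtype=float)[: int(round(t_max * fs))]


def magnitude_db(ir, n_fft):
    """Magnitude spectrum in dB; a small floor avoids taking log10 of zero."""
    spectrum = np.fft.rfft(ir, n=n_fft)
    return 20.0 * np.log10(np.maximum(np.abs(spectrum), 1e-12))


def mean_level_difference_db(ir_tests, ir_refs, fs=FS):
    """Mean IRtest spectrum minus mean IRref spectrum, in dB per frequency bin.

    Assumption: repeated measurements are combined by averaging their dB spectra.
    """
    tests = [truncate_ir(ir, fs) for ir in ir_tests]
    refs = [truncate_ir(ir, fs) for ir in ir_refs]
    n_fft = max(len(ir) for ir in tests + refs)
    test_db = np.mean([magnitude_db(ir, n_fft) for ir in tests], axis=0)
    ref_db = np.mean([magnitude_db(ir, n_fft) for ir in refs], axis=0)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return freqs, test_db - ref_db
```

Negative values in the returned difference indicate frequency bins where the magnitude with the HMD is lower than without it.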

Equipment and test procedure
A listening experiment was performed to investigate whether naive listeners could discriminate between the IRref and IRtest. Two speech samples from the TIMIT speech corpora 32 were used. The samples consisted of a male and a female voice saying the phrase "she had your dark suit in greasy wash water all year" in New England American English. Both speech samples were approximately 3 s long and had a sampling frequency of 16 kHz. Two yes/no auditory perception tasks were carried out, one with the female voice stimuli and one with the male voice stimuli, using a custom-made MATLAB application. The participants were instructed to listen to two consecutive signals and determine if they differed from each other (YES-option) or if they were identical (NO-option). For each signal pair, a reference signal consisting of the speech sample convolved with a randomly chosen IRref was played first, followed by the second signal, which was either identical to the first one (NO-stimuli) or consisted of the speech sample convolved with a randomly chosen IRtest (YES-stimuli). The signals were separated by 1.5 s, and each test consisted of 20 pairs of stimuli. The experiment was carried out in a double-walled acoustically treated booth, and the sound stimuli were presented using a Fireface UCX soundcard (RME, Haimhausen, Germany) and HD650 open headphones (Sennheiser, Wedemark, Germany). The signals were presented to the participants at an equivalent sound level of 57 dBA, adjusted by connecting the headphones to an artificial ear while measuring the sound pressure level using a Norsonic NOR139 sound level meter (Norsonic AS, Tranby, Norway).
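The pairing logic of the yes/no task can be sketched as follows. This is an illustrative Python sketch, not the custom MATLAB application used in the study. The function name and the trial dictionary layout are hypothetical, and the equal split of 10 YES-stimuli and 10 NO-stimuli is an assumption, as the text only states that each test consisted of 20 pairs. The counts of 10 IRref and 6 IRtest follow the measurement description.

```python
import random


def build_trials(n_pairs=20, n_ref=10, n_test=6, n_yes=10, seed=None):
    """Build a randomized list of stimulus pairs for one yes/no test.

    Each pair starts with a randomly chosen IRref. The second signal is the
    same IRref (NO-stimulus) or a randomly chosen IRtest (YES-stimulus).
    Assumption: n_yes of the n_pairs are YES-stimuli (split not given in the text).
    """
    rng = random.Random(seed)
    labels = ["YES"] * n_yes + ["NO"] * (n_pairs - n_yes)
    rng.shuffle(labels)

    trials = []
    for label in labels:
        ref_idx = rng.randrange(n_ref)
        if label == "YES":
            second = ("test", rng.randrange(n_test))
        else:
            second = ("ref", ref_idx)  # identical to the first signal
        trials.append({"first": ("ref", ref_idx), "second": second, "label": label})
    return trials
```

Each index would then select the IR with which the speech sample is convolved before playback. Any sample-rate conversion between the 16 kHz speech and the 48 kHz IRs is outside the scope of this sketch.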

Participants
Ten volunteers were recruited for the listening experiment, seven males and three females. The participants were university students and staff members in acoustics and/or hearing sciences, and all except two reported previous experience of participating in listening tests. The median age was 32 years (range 27-52 years). All participants reported normal hearing, and five reported to have had musical training for at least a few years. The participants were naive to the purpose of the study. All participants provided informed consent before participation, and ethical approval was obtained from the Science-Ethics Committee for the Capital Region of Denmark (reference H-16036391).

Data analysis
Data from the listening experiment were analyzed using signal detection theory (SDT). SDT separates sensitivity of signal discrimination ability from response bias (for an overview of the measures, please see, e.g., Stanislaw and Todorov 33 and See et al. 34 ). The SDT calculations are based on the HIT-rate and false alarm rate. The HIT-rate is defined as the ratio of the number of correctly identified IRref and IRtest stimulus pairs over the total number of IRref and IRtest stimulus pairs (i.e., the ratio of the number of correct YES-replies over the total number of YES-stimuli).
The false alarm rate is defined as the ratio of the number of incorrectly identified IRref and IRtest stimulus pairs over the total number of IRref only stimulus pairs (i.e., the ratio of the number of incorrect YES-replies over the total number of NO-stimuli). The sensitivity of signal discrimination was calculated using the non-parametric discriminability index, A′, first described by Pollack and Norman 35 and calculated from H and F, 33,36 where H = HIT-rate and F = false alarm rate. An A′ value close to 1 indicates good discriminability between the signals, whereas an A′ value of 0.5 indicates chance performance. An A′ value below 0.5 indicates sampling error or response confusion. 33 The response bias, i.e., the participant's underlying criterion for replying "Yes," was analyzed using the non-parametric B″_D, 37 which in turn is calculated from H and F. 34 The B″_D value ranges between −1 and 1, where a value of 0 indicates no bias, a positive value indicates a tendency to reply "No" (conservative bias; in this case, "the signals are identical"), and a negative value indicates a tendency to reply "Yes" (liberal bias; in this case, "the signals are different"). 37 All SDT calculations were performed using MATLAB and plotted using R. 38 Possible differences in rating performance due to musical background as well as due to speech stimulus gender were analyzed using Wilcoxon signed rank tests in R.
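As an illustration of these measures, the following Python sketch computes the HIT-rate, false alarm rate, sensitivity index, and bias index from response counts. It is not the MATLAB code used in the study. The formulas are the forms commonly given in the SDT literature for the non-parametric sensitivity and bias indices and are assumed here to match those used by the authors. All function names are hypothetical.

```python
def sdt_rates(n_hits, n_yes_stimuli, n_false_alarms, n_no_stimuli):
    """Return (H, F): the HIT-rate and the false alarm rate."""
    return n_hits / n_yes_stimuli, n_false_alarms / n_no_stimuli


def a_prime(h, f):
    """Non-parametric sensitivity index (assumed form).

    A' = 0.5 + sign(H - F) * ((H - F)^2 + |H - F|) / (4 * max(H, F) - 4 * H * F)
    Returns None when the denominator is zero, e.g. when H = F = 0.
    """
    denom = 4 * max(h, f) - 4 * h * f
    if denom == 0:
        return None
    diff = h - f
    sign = (diff > 0) - (diff < 0)
    return 0.5 + sign * (diff ** 2 + abs(diff)) / denom


def b_double_prime_d(h, f):
    """Non-parametric bias index (assumed form).

    B''_D = ((1 - H)(1 - F) - H * F) / ((1 - H)(1 - F) + H * F)
    Positive values indicate a conservative ("No") bias.
    """
    miss_cr = (1 - h) * (1 - f)
    hit_fa = h * f
    denom = miss_cr + hit_fa
    if denom == 0:
        return None
    return (miss_cr - hit_fa) / denom
```

With hypothetical counts of 8 correct YES-replies out of 10 YES-stimuli and 2 incorrect YES-replies out of 10 NO-stimuli, H = 0.8 and F = 0.2, giving a sensitivity of approximately 0.875 and a bias of approximately 0. Under the assumed formula, H = F = 0 makes the sensitivity index undefined, while the bias index equals 1.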

Spectral differences with and without head-mounted display
The differences between the IRtest and IRref can be seen in the time and frequency domain, respectively, in Fig. 2. The magnitude of the IRtest was up to approximately 2 dB lower compared to the IRref in the frequency bands ca.

Listening experiment
Medians and data distribution results from the SDT analysis are found in Fig. 3. The HIT-rate varied between Min = 0.00 and Max = 1.00 for both the female and male voice stimuli, with a Med = 0.65 for the female voice stimuli, and  33 One participant presented HIT- and false alarm rates of 0, and therefore the A′ index could not be calculated, giving an n = 9 for the A′ index results for the male voice stimuli. The bias measure B″_D varied between Min = −1.00 and Max = 1.00 for both the female and male voice stimuli, with a Med = 0.34 for the female voice stimuli and Med = 0.28 for the male voice stimuli. The analyses of differences in rating performance due to musical background and speech stimulus gender both presented non-significant results.

Discussion
The purpose of this study was to investigate how a HMD affects the sidetone and whether these possible alterations are auditorily perceivable. The results showed a decrease in spectral energy over several frequency bands when the HMD was worn (IRtest) compared to the no-HMD condition (IRref). The results of the listening test showed that naive listeners were mostly able to discriminate between the IRs when they were presented as convolved speech signals. Previous research has shown that HMDs affect the spectral energy levels of the HRTFs 1 as well as the accuracy of sound localization 2 when using a sound source in the far field. When investigating sidetone, the sound source (i.e., the mouth) is located at equal distance from the receivers (i.e., the ear canals) at all times, most equivalent to 0° azimuth in the far field. Our results are in line with those reported by Gupta et al., 1 who found perceptually detectable differences in timbre quality at 0° azimuth when using a similar HMD (HTC Vive pro) as in the current study. They, however, found the most prominent spectral differences (4 dB lower with HMD compared to without HMD) in the high frequency region (5–16 kHz and at 0° azimuth), whereas we found the largest decreases in magnitude in the frequency band 2100–4700 Hz. The results of the SDT analyses of the listening test showed that the ability to distinguish between the IRtest and the IRref varied between the participants. The A′ value for some participants approached 1, indicating almost perfect performance. These participants tended to comment afterward that they had noticed a small shift in timbre for some of the stimuli but still experienced the task as difficult and the signals overall to be quite similar. A few participants demonstrated no ability to distinguish between the IRs at all, which confirms that the task was indeed difficult and the differences between the IRs minor.
The bias measure B″_D varied between the extremes −1 and 1, indicating a large variation in the participants' underlying criterion when deciding whether they perceived a difference or not.
However, as some participants demonstrated a clear ability to discriminate between the IRs, we find that the result supports our hypothesis that the HMD changes the sidetone to a perceivable extent. This, in turn, raises the question of whether the sidetone alteration is big enough to affect voice or speech production. There is substantial evidence that shifts in sidetone loudness or pitch induce compensatory behaviors in vocal loudness and pitch regulation, respectively. [12][13][14][15][16][17][18][19][20][21][22][23][24] The same is true for sidetone alterations of formants, as several studies have demonstrated compensatory behaviors in vowel articulation when the first and/or second formant frequency have been shifted. [24][25][26][27][28][29][30][31] We have demonstrated small changes in spectral energy levels over frequency bands overlapping typical first and second formant frequencies and more prominent changes in frequency bands overlapping typical third and fourth formant frequencies (for an overview of formant frequencies in American English, see Kent and Vorperian 39 ). This suggests that the HMD could induce formant shifts that, in turn, could provoke compensatory behaviors.
Real-time perceptual differences and speech characteristic changes due to wearing a HMD need to be addressed in further studies to take possible compensatory strategies into account if HMDs are to be used within fields in which small changes in vocal behavior could affect the outcome of interest. Such areas are, for instance, medical voice therapy or singing training. Although immersive VR already has been implemented for different speaking and singing activities, such as public speaking training 3-6 and choir singing, 7-10 the focus of this research has been on aspects other than the individual speaker's/singer's voice use. Compensatory vocal and speech behaviors could be investigated through real-time auralization, in which the participant's speech signal would be convolved with the IRtest and IRref, respectively, and played back to the participant during continuous speech. This would also allow for subjective analysis of sidetone perception, using the speaker's own voice instead of pre-recorded speech stimuli.
There are different types of HMDs on the market, and a limitation of the current study is that only one type of HMD was investigated. Therefore, the results are not transferable to models other than the one tested. Gupta et al. 1 analyzed three different HMDs from different manufacturers and found the biggest spectral differences as well as perceptually detectable timbre changes when using the HTC Vive pro compared to the other two tested (Oculus Rift and HoloLens). The authors suggest that this could be due to the slightly larger size of the HTC Vive pro compared to the other two HMDs. The HMD used in the current study, HTC Vive, is similar in size and shape to the one used by Gupta et al., 1 and it is possible that other types of HMDs would alter the sidetone differently and to a lesser extent. It would therefore be important to investigate sidetone changes using HMDs from different manufacturers.

Conclusions
The HTC Vive head-mounted VR display changes the spectral energy levels of the sidetone, particularly in the frequency band 2100-4700 Hz. These spectral differences are perceivable to naive listeners. There is a need to investigate whether these changes affect vocal behavior during speaking or singing with a HMD.