The Lombard effect observed in speech produced by cochlear implant users in noisy environments: A naturalistic study

The Lombard effect is an involuntary response speakers experience in the presence of noise during voice communication. This phenomenon is known to cause changes in speech production, such as increased intensity and altered pitch structure and formant characteristics, that enhance audibility in noisy environments. Although well studied for normal hearing listeners, the Lombard effect has received little, if any, attention in the field of cochlear implants (CIs). The objective of this study is to analyze the speech production of CI users who are postlingually deafened adults with respect to environmental context. A total of six adult CI users were recruited to produce spontaneous speech in various realistic environments. Acoustic-phonetic analysis was then carried out to characterize their speech production in these environments. The Lombard effect was observed in adverse listening environments in the speech production of all CI users who participated in this study. The results indicate that both suprasegmental (e.g., F0, glottal spectral tilt, and vocal intensity) and segmental (e.g., F1 for /i/ and /u/) features were altered in such environments. The analysis from this study suggests that the modification of speech production by CI users under the Lombard effect may contribute, to some degree, to intelligible communication in adverse noisy environments.


I. INTRODUCTION

A. Speech produced in noisy environments
Changes in speech production resulting from auditory feedback are an important research domain for improving interpersonal as well as human-to-machine communications. For example, in the presence of noise, a speaker experiences a natural phenomenon known as the Lombard effect (Lombard, 1911; Lane and Tranel, 1971; Hansen, 1988; Junqua, 1996; Lu and Cooke, 2008; Garnier et al., 2010). This phenomenon is physiologically realized through altered vocal effort, such as an increase in vocal intensity or fundamental frequency, or changes in glottal spectral slope or formant structure. These modifications of speech production help to maintain speech intelligibility during a conversation in challenging listening environments.
The Lombard effect has been widely studied in automatic speech systems, where it is known to degrade the performance of automatic speech recognition (ASR) and speaker identification (SID) systems (Junqua, 1992; Hansen, 1996; Hansen and Varadarajan, 2009; Bořil and Hansen, 2010). Since there are fundamental differences between Lombard speech and neutral speech, Lombard speech contributes greatly to the breakdown in speech system performance when systems are modeled with neutral speech. A range of signal processing techniques have been proposed to compensate for the Lombard effect in speech and so improve the robustness of speech systems (Hansen, 1988, 1996; Bou-Ghazale and Hansen, 2000; Bořil and Hansen, 2010). Although well documented for normal hearing listeners as well as automatic speech systems, the Lombard effect has received little, if any, attention for assistive hearing devices, such as cochlear implants (CIs).

B. Background and motivation
A CI is an electronic device that is surgically implanted in the inner ear and directly stimulates the auditory nerve fibers to provide a sensation of sound (Wilson et al., 1991; Loizou, 1999; Zeng et al., 2008). CIs help individuals with profound hearing loss to communicate by providing perceptual benefit as well as by allowing auditory feedback in the speech production of CI users (Svirsky et al., 2000; Dorman and Spahr, 2006). The latter is important for maintaining communication in widely varying naturalistic environments.
Previous studies have considered how restoration of auditory function with a CI may lead to changes in speech production (Hochmair-Desoyer et al., 1981; Kirk and Edgerton, 1983; Svirsky and Tobey, 1991; Svirsky et al., 1992; Matthies et al., 1996; Vick et al., 2001; Lane et al., 2007). Kirk and Edgerton (1983), for example, examined the suprasegmental properties of four postlingually deafened adults who received a House single channel processor. They reported nearly normal fundamental frequencies in the speech production of two male and two female postlingually deafened adult recipients of implants. Hochmair-Desoyer et al. (1981) also suggested improved quality of speech production for adventitiously deaf adults who were implanted with the 3M/Vienna processor. In that study, changes in vowel production and fundamental frequency of speech were observed when compared to the subjects' speech characteristics before implantation.
In addition to longitudinal studies, some groups have investigated more precisely the roles that auditory feedback plays in speech production. Svirsky and Tobey (1991), for example, suggested that auditory feedback plays a calibration role in the control of speech production. According to their research, rapid changes in vowel formant frequencies were observed within a few seconds of turning the speech processor on or off. It was also argued by Svirsky et al. (1992) that many suprasegmental features, including pitch period and vowel duration, demonstrated an instantaneous response to the short-time deprivation of auditory feedback. More recently, a number of studies have confirmed the dual role of auditory feedback in the speech production of postlingually deafened CI users (Matthies et al., 1996; Vick et al., 2001; Lane et al., 2007). These findings suggest that auditory feedback is not only used to calibrate the segmental features of speech in the long term (e.g., pre- and post-implantation), but is also used to regulate suprasegmental features in the short-time domain (e.g., from a few seconds to minutes).
The studies cited above establish that speech production is heavily influenced by any form of auditory feedback. However, changes in speech production when auditory feedback is artificially distorted (e.g., turning the processor on/off) do not necessarily provide an equivalent scientific understanding of speech production in real communication conditions, such as noisy environments. To our knowledge, no study has examined the effect of background noise on speech production in CI individuals, or demonstrated the Lombard effect in the voice communication of CI users.

C. Objectives and methods
The objective of this study was to analyze and model the speech production of CI users with respect to environmental noise conditions. In addition, the study aimed to investigate the effect of auditory feedback on speech production in naturalistic acoustic environments. We observe this effect using mobile personal audio recordings from continuous single-session audio streams collected over an individual's daily life. Prior advancements in this domain include the "Prof-Life-Log" longitudinal study at UT-Dallas (Sangwan et al., 2012; Ziaei et al., 2012, 2013), which explored speech communication in naturalistic daily life.
In order to analyze the effect of naturalistic noise on speech production, a total of six postlingually deafened adult CI users participated in this study. They were asked to produce spontaneous speech in various listening environments on a college campus. Analysis of speech production was accomplished using: (i) characteristics of the acoustic environments, (ii) evaluation of each subject's listening environment, and (iii) acoustic and phonetic properties of speech production in relation to the listening environment. In the first analysis, various approaches were used to characterize real-world environments, e.g., long-term average spectra, modulation spectra, and noise sound-pressure level (SPL). The second part of the analysis focused on objective metrics to predict speech quality, namely, signal-to-noise ratios with and without the Lombard effect. Last, the parametric variations in vowel, consonant, and individual phoneme production were investigated as a function of varying environments. This involved speech SPL, fundamental frequency, glottal spectral tilt, phoneme duration, and formant frequencies. The analyses outlined here explore how the speech production features of CI users relate to varying noise/environment types.

A. UTD-CI-LENA corpus development
In order to investigate the influence of auditory feedback on speech production, a corpus was developed. This corpus included audio streams of CI participants from their daily lives. The LENA (for "Language ENvironment Analysis") device was used for collecting naturalistic audio from CI users (Gilkerson and Richards, 2008; LENA Foundation, 2014). The LENA device is a lightweight, compact digital audio recorder that is capable of capturing mono audio data continuously for up to 16 h. The device was worn by each subject, and captured the participant's daily acoustics, including voice communication and interaction with other people, as well as environments (e.g., noise level and type). Figure 1(a) demonstrates how the LENA unit was positioned for collecting the audio data. A cross pack was used to hold the device inside a pocket made of meshed material for secure and consistent placement. The device was positioned at the center of the chest, where it was held stationary with respect to the subject's mouth (approximately 15-20 cm). Such a set-up made it possible for the unit's microphone to detect the acoustic signal more robustly and consistently (across subjects) against environmental noise/reverberation during data collection.
A total of six CI speakers (mean age: 65 yr) who were fitted with Nucleus devices from Cochlear Ltd. participated in the study. All participants were postlingually deaf (lost hearing after the age of 18) and had been regularly using their CI devices for at least four years. Among the participants, five were bilateral CI users and one was a unilateral CI user. The unilateral CI user wore a hearing aid in the contralateral ear. It should be noted that bilateral listeners are expected to take advantage of the head-shadow effect, which offers improved sound localization versus a single implant. This, in general, provides better speech perception in noise when the target signal is spatially separated from the masker. Detailed biographical information of the CI participants is presented in Table I. In addition, the same number of normal hearing (NH) speakers (mean age: 37 yr) participated in the study as pair-wise conversational partners. The CI speakers in this research acted as the primary speakers, while the NH listeners served as the secondary speakers/listeners. Note that the objective of this study was to analyze the speech production of CI users in different noisy environments. Both CI and NH subjects were native speakers of American English, and each group included an equal number of male and female participants.
Naturalistic audio recordings were obtained in six environments on the UT-Dallas college campus. These included (i) office/lab, (ii) hallway, (iii) lobby, (iv) outside on campus, (v) college cafeteria, and (vi) college gameroom, as shown in Fig. 1(b). The locations were chosen to provide a diverse range of noise conditions; for example, the type, mixture, and level of noise varied greatly across environments. Table II summarizes the six naturalistic environments employed for data collection. It lists (i) general room size, (ii) the number of people typically present during the day, (iii) average SPL, (iv) reverberation time (RT60), and (v) room description. The "people" listing here refers to the typical number of subjects within that room/space. The "average SPL" was determined by calibrating the average noise intensity measured in PRAAT software (Boersma, 2002) and converting it to the dB SPL scale. The "reverberation time (RT60)" refers to the length of time for the sound to decay by 60 dB from its initial level of impulsive excitation. The impulsive sound used in the calculation was created by a balloon burst recorded during off-hours in the respective locations.
All audio recordings were collected when the on-campus population was expected to be consistent with daily conditions. This included data collection on weekdays (Monday to Friday) during normal working hours (10 am to 1 pm or 1 pm to 4 pm). For data collection, 3 min of background noise was first recorded prior to the subject's speech production in each location. These noise-only audio segments were then used to assess the subject's listening environments in subsequent data analyses. Following the background noise recording, subjects were asked to hold a free conversation with each other for 5 min in each location. A list of topics was provided to participants as a suggestion before the test, which included general topics such as sports, news, weather, and movies. The overall data collection period for each subject was about 2 h. All subjects were informed that they always had the option to pause the audio recording at any time if privacy or confidentiality concerns arose during the recording session. However, no interruptions were experienced by any participants.

B. Post-processing
A set of acoustic and orthographic transcription labels was assigned to the collected audio data. The audio streams consisted of two acoustic categories, namely, silence and spontaneous speech in each location. Labeling tasks were first performed by a human annotator based on events in that space. For example, sound events in the office space were different from those outside in the public area. In order to produce the orthographic transcription, every single isolated utterance (e.g., sentence, phrase, word, and syllable) was first identified manually. Sentence-level transcripts were then applied to each identified utterance based on listening to each individual audio file. Additional acoustic labels such as environments (office, hallway, outside, etc.) and speech types (silence, spontaneous) were applied manually to all recordings. In order to reduce inter-labeler variability, a single annotator, a native speaker of American English, transcribed all data collected from the different speakers.
Following the manual labeling tasks, phoneme-level transcription labels were assigned. This task was done automatically by forced speech recognition alignment. Forced alignment is the process of taking the audio file and its orthographic transcription as input to produce word and phone boundary labels. Several other recent studies have successfully used forced alignment as a tool in phonetic research (Lu and Cooke, 2008, 2009; Yu et al., 2014). In this study, the open-source software P2FA was employed for this procedure (Yuan and Liberman, 2008). Following the automatic alignment process, words and phonemes shorter than 40 ms in duration, and those with fewer than 200 instances, were excluded from analysis. A total of approximately 36 000 words, including 38 000 vowel and 54 000 consonant nuclei, were identified for analysis. It is important to note that, due to the limited number of contexts, individual phoneme instances should not be regarded as prototypical.
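The exclusion step applied after forced alignment can be illustrated with a minimal sketch. The token layout `(label, start_s, end_s)` and the function name `filter_tokens` are hypothetical and are not part of P2FA. The sketch also assumes that the two criteria act independently (a token is dropped if it is too short, and a label is dropped if too few tokens remain) and that instances are counted after the duration filter; the text does not specify either point.

```python
from collections import Counter

MIN_DURATION_S = 0.040  # 40 ms duration floor stated in the text
MIN_INSTANCES = 200     # minimum number of instances per label stated in the text


def filter_tokens(tokens):
    """Drop short tokens, then drop labels that are too rare.

    `tokens` is a list of (label, start_s, end_s) tuples; this layout is a
    hypothetical stand-in for the aligner output.
    """
    # Criterion 1: discard tokens shorter than the duration floor.
    long_enough = [t for t in tokens if (t[2] - t[1]) >= MIN_DURATION_S]
    # Criterion 2 (assumed to be counted after criterion 1): discard labels
    # with fewer than MIN_INSTANCES surviving tokens.
    counts = Counter(label for label, _, _ in long_enough)
    return [t for t in long_enough if counts[t[0]] >= MIN_INSTANCES]
```

Counting instances after the duration filter keeps the per-label sample size honest, since tokens that are discarded for being too short do not contribute to later feature statistics.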

C. Signal processing: Features and metrics
Acoustic characteristics of the background noise were considered by investigating the (i) long-term average spectrum, (ii) average modulation spectrum, (iii) noise SPL, (iv) spectral centroid, and (v) average modulation spectrum energy. It is important to mention here that the noise analysis was carried out on "noise-alone audio segments" which were collected in each environment prior to conversation. None of these samples contained any speech. The long-term average spectrum was obtained by averaging short-time power spectra estimated by Welch's method. The spectral centroid was calculated as the sum of the frequencies weighted by their amplitudes, divided by the sum of the amplitudes. The duration of the analysis window was set at 100 ms with a 50% overlap for each measurement.
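A minimal sketch of the long-term average spectrum and the spectral centroid is given here, using only the Python standard library. It averages Hann-windowed periodograms over 100 ms frames with 50% overlap, which is the core of Welch's method; the window shape, the lack of density scaling, and all function names are assumptions, since the text states only the window length and overlap. A direct DFT is used for self-containment; an FFT routine would replace it in practice.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def power_spectrum(frame):
    """One-sided power spectrum of a real frame via a direct DFT."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec


def long_term_average_spectrum(signal, fs, win_s=0.100, overlap=0.5):
    """Average windowed periodograms over overlapping frames (Welch-style)."""
    n = int(round(win_s * fs))
    hop = max(1, int(n * (1 - overlap)))
    window = hann(n)
    total = [0.0] * (n // 2 + 1)
    frames = 0
    for start in range(0, len(signal) - n + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(n)]
        for k, p in enumerate(power_spectrum(frame)):
            total[k] += p
        frames += 1
    if frames == 0:
        raise ValueError("signal is shorter than one analysis window")
    freqs = [k * fs / n for k in range(n // 2 + 1)]
    return freqs, [t / frames for t in total]


def spectral_centroid(freqs, power):
    """Amplitude-weighted mean frequency: sum(f * a) / sum(a)."""
    amps = [math.sqrt(p) for p in power]
    return sum(f * a for f, a in zip(freqs, amps)) / sum(amps)
```

For a pure tone the centroid sits at the tone frequency; for broadband noise concentrated below 2 kHz it would fall correspondingly low, which is the kind of contrast this measure is meant to capture across environments.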
We calculated the average modulation spectra to obtain a better understanding of overall room acoustics. The modulation spectrum represents the slowly varying temporal envelope components of a signal, thereby providing a measure of the degree of acoustic spectral stationarity. Noise samples from each environment were divided into 2-s segments at 1-s time intervals. The modulation spectrum was computed by taking the Fourier transform of the Hilbert envelope of each segment. The 0-20 Hz components were then added together across all segments. Finally, the average between 0 and 20 Hz was computed and considered as the average modulation spectrum energy.
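The modulation analysis described above can be sketched as follows, again with the standard library only. The Hilbert envelope is taken as the magnitude of the analytic signal, the 0-20 Hz magnitudes are summed across 2-s segments stepped by 1 s, and the result is averaged over the retained bins. Using spectral magnitude rather than power, and the function names themselves, are assumptions. The direct DFT is quadratic in segment length, so this is suitable only for short or low-rate illustrations.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(n^2) DFT, adequate for short illustrative segments."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * i / n)
                  for i, v in enumerate(x))
        out.append(acc / n if inverse else acc)
    return out


def hilbert_envelope(x):
    """Magnitude of the analytic signal (the Hilbert envelope)."""
    n = len(x)
    spec = dft(x)
    gain = [0.0] * n          # negative frequencies are zeroed
    gain[0] = 1.0             # keep DC
    if n % 2 == 0:
        gain[n // 2] = 1.0    # keep the Nyquist bin
    for k in range(1, (n + 1) // 2):
        gain[k] = 2.0         # double the positive frequencies
    analytic = dft([s * g for s, g in zip(spec, gain)], inverse=True)
    return [abs(a) for a in analytic]


def modulation_spectrum(segment, fs, max_hz=20.0):
    """Magnitude spectrum of the envelope, restricted to 0..max_hz."""
    env = hilbert_envelope(segment)
    n = len(env)
    n_bins = min(int(max_hz * n / fs) + 1, n // 2 + 1)
    mags = []
    for k in range(n_bins):
        acc = sum(e * cmath.exp(-2j * math.pi * k * i / n)
                  for i, e in enumerate(env))
        mags.append(abs(acc))
    freqs = [k * fs / n for k in range(n_bins)]
    return freqs, mags


def average_modulation_energy(signal, fs, seg_s=2.0, step_s=1.0, max_hz=20.0):
    """Sum 0..max_hz components across segments, then average over bins."""
    seg, step = int(seg_s * fs), int(step_s * fs)
    summed = None
    for start in range(0, len(signal) - seg + 1, step):
        _, mags = modulation_spectrum(signal[start:start + seg], fs, max_hz)
        summed = mags if summed is None else [a + b for a, b in zip(summed, mags)]
    if summed is None:
        raise ValueError("signal is shorter than one segment")
    return sum(summed) / len(summed)
```

A stationary noise has a nearly constant envelope, so its modulation spectrum is dominated by the 0 Hz component; fluctuating noise such as multi-talker babble puts more energy in the 2-20 Hz range.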
Following the noise characteristics, the individual's listening environments were characterized by estimating signal-to-noise ratios (SNRs). In this study, two SNR approaches were employed: SNR (i) with neutral speech (SNRN) and (ii) with Lombard speech (SNRL). In these measurements, the assumption is that the office environment is a quiet baseline, and speech produced in this location will be neutral. SNRN was defined as the energy ratio of neutral speech to noise energy in each environment, which is assumed to be without the Lombard effect, as follows:
$$\mathrm{SNRN} = 10 \log_{10}\left(\frac{E_{\text{neutral speech}}}{E_{\text{noise}}}\right),$$
and SNRL was calculated from the energy ratio of Lombard speech to the corresponding background noise for each environment as follows:
$$\mathrm{SNRL} = 10 \log_{10}\left(\frac{E_{\text{Lombard speech}}}{E_{\text{noise}}}\right),$$
where E is the average energy. For these calculations, acoustic boundary detection was employed for separating speech from background noise. The leading and trailing silent intervals derived from each audio stream served as noise segments in each location for computing SNRs. These metrics show (i) whether there is a change in SNR due to noise (SNRN) and (ii) the level to which the decreased SNRs recover due to the Lombard effect (SNRL).
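As a minimal sketch, the two SNR measures and the resulting Lombard benefit can be computed from sample sequences as shown here. The decibel form 10 log10 of the energy ratio is an assumption consistent with the dB values reported in this paper, and the function names are hypothetical. The sketch also treats the speech segments as if they were noise-free, which real recordings are not.

```python
import math


def average_energy(samples):
    """Mean squared amplitude of a segment (E in the text)."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Energy ratio of a speech segment to a noise segment, in dB (assumed form)."""
    return 10.0 * math.log10(average_energy(speech) / average_energy(noise))


def lombard_benefit_db(neutral_speech, lombard_speech, noise):
    """SNRL minus SNRN for one environment.

    neutral_speech: speech from the quiet baseline (office) location.
    lombard_speech: speech produced in the noisy environment itself.
    noise: noise-only segment from that same noisy environment.
    """
    return snr_db(lombard_speech, noise) - snr_db(neutral_speech, noise)
```

Because both ratios share the same noise energy, the benefit reduces to the level difference between Lombard and neutral speech, so doubling the speech amplitude yields a benefit of about 6 dB.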
In addition to noise/environment characteristics, various acoustic and acoustic-phonetic features of speech production were analyzed. These include (i) average speech SPL, (ii) fundamental frequency (F0), (iii) overall spectral tilt, (iv) phoneme duration, (v) first formant frequency (F1) location, and (vi) second formant frequency (F2) location. All measurements except phoneme duration and overall spectral tilt were computed using PRAAT software (Boersma, 2002). The average SPL measurements used here were similar to the metrics employed in the noise analysis, to ensure consistent scales between the measurements when reporting intensity. Phoneme duration was obtained from analysis of the phoneme-level transcripts. Spectral tilt was calculated from the difference between the magnitude of the first spectral harmonic (H1) and that of the third formant peak (A3), i.e., H1-A3 (Hanson, 1997; Iseli et al., 2007), via PRAAT capability. It should be noted here that the focus was to compute an overall spectral slope. A related study by Hansen (1988) demonstrated changes in glottal spectral slope for various types of speech under stress by averaging individual frame spectral slopes of voiced speech over multiple utterances. Since the focus here is on overall speech content, that approach was not employed.
Acoustic-phonetic features were extracted based on phoneme nuclei boundaries marked by a forced phoneme alignment process. The beginning and ending markers of each phoneme were reduced by 20% to eliminate any transitional effects across phoneme classes. The duration of the analysis window used here for both speech and noise measurements was set to 20 ms with a 10 ms skip rate. After the feature extraction, normalization procedures were applied to reduce speaker-specific effects in the data (e.g., baseline F0 differences between male and female talkers). This was achieved by scaling each feature to have the same overall level across speakers. Last, a repeated-measures analysis of variance (ANOVA) was performed to assess the effect of environment type on noise/speech features, and to determine the statistical significance of differences between speech produced in neutral and Lombard conditions. Subjects were considered as a random (blocked) factor, while environment conditions were used as the main analysis factors. Following the ANOVA, a post hoc pairwise comparison test was performed to determine if the noisy conditions were significantly different from the quiet baseline. Bonferroni adjustment was used to control for family-wise error in the pairwise test. In this study, a difference in means between two or more groups was considered significant if the significance level fell below 5.0% (p < 0.05).

A. Noise/environment analysis
Prior to any analysis, it is important to understand the characteristics of the acoustic/listening environments. This section offers some level of baseline knowledge regarding each environment's acoustic characteristics as well as how this may relate to speech perception by CI users in these environments.

Noise characteristics
Figure 2(a) shows the long-term averaged spectra of various maskers. It can be seen that the office environment has the least spectral impact in terms of overall spectral energy as compared to other noisy environments, which is why it was chosen as a baseline in this study. Spectral energies in general were highly concentrated in the lower frequency range (<2 kHz) for all environments. When compared to the office baseline, a progressive increase in spectral energies (from hallway to gameroom) was observed in all noisy environments based on the increasing complexity of acoustic space.
This can be better visualized from Figs. Figure 2(b) shows the change of the average modulation spectrum as a function of environment. The modulation spectrum energy between the modulation frequencies of 0 and 20 Hz is presented. All the noise signals have a distinct modulation spectrum with a peak at 0 Hz. All conditions, with the exception of the office environment, had a similar modulation pattern. The modulation spectrum energy between 2 and 20 Hz increased as the complexity of the noise increased (i.e., with an increase in the number of people and noise SPL). This effect can be better visualized from Fig. 3(c), which shows the average modulation spectrum energy of the tested environments. This figure confirms the increasing trend explicitly. This suggests that such modulation analysis provides an objective measure of stationarity.
This phenomenon is a key characteristic of the Lombard effect. For example, consider the gameroom environment, where SNRN was 3 dB. The Lombard effect here helped to increase the overall SNR up to 11 dB; thus the corresponding benefit of the Lombard effect in this environment (gameroom) was +7.9 dB. The Lombard effect demonstrated here could boost the perceived SNR levels, and thereby facilitate auditory decoding in the two-way conversations of CI users in noisy conditions.

B. Speech production analysis
In this section, we consider methods for analyzing speech production characteristics as a function of varying environment. Note that in this section, we again established the office environment as the quiet baseline (<45 dB SPL), assuming speech production in this location to be neutral. vowel phonemes. Note that the asterisks marked at the respective data points indicate statistical significance as compared to the office baseline (p < 0.05). The results indicate that both vowel SPL and F0 varied across conditions. Average values of both features increased significantly for the outside, cafeteria, and gameroom environments (p < 0.05). However, only small changes occurred in vowel SPL and F0 for the hallway and lobby environments (p > 0.05). Two groups emerged for both features: a high-value group which was significantly different from the office baseline (p < 0.05) and a low-value group with no statistically significant difference from the baseline (p > 0.05). The low-value group comprised the hallway and lobby conditions, while the high-value group included the outside, cafeteria, and gameroom environments. No significant differences were found between conditions which belonged to the same group.
Figures 5(c) and 5(d) present the variation of spectral tilt and vowel duration, respectively, with respect to each environment. Spectral tilt was found to progressively reduce with increasing noise SPL. From the baseline office to the gameroom environment, the mean spectral tilt fell from 19 to 14 dB, with the other environments falling within this range. A significant effect of environment type on spectral tilt was observed between the gameroom and the office baseline (p < 0.05). However, no statistically significant differences were found for any combinations of the remaining four conditions (hallway, lobby, outside, cafeteria) (p > 0.05). For vowel duration, variations in the mean were found in the presence of noise. As shown in Fig. 5(d), average vowel duration decreased progressively with noise complexity; however, it was significantly different from the baseline condition only for the gameroom environment (p < 0.05). The hallway, lobby, outside, and cafeteria environments did not result in a statistically significant change in vowel duration as compared to the office baseline.
Figures 5(e) and 5(f) display the variation in consonant SPL and duration, respectively, with respect to noise SPL. In general, both features were altered by speakers in noisy environments. For consonant SPL, the results followed a pattern similar to that of vowel SPL. Two distinct groups were found: the high-value group (outside, cafeteria, and gameroom) and the low-value group (hallway and lobby). The high-value group increased significantly from the office baseline (p < 0.05), while the low-value group showed little or no change in either environment (p > 0.05). No significant pairwise differences were found between conditions which belonged to the same group. For consonant duration, mean values decreased monotonically for most noisy conditions. However, there was no statistically significant difference in mean between the quiet and any of the noisy conditions (p > 0.05). Speech produced in all noise environments had slightly shorter consonant duration than that of the baseline office environment.

Vowel-consonant ratios
Additional analyses were conducted on global shifts in acoustic features between individual phoneme classes. It has been previously demonstrated with NH listeners that a talker could maintain overall intensity, yet emphasize the consonant phoneme class with respect to the vowel class (House et al., 1965; Hansen, 1988). Hansen (1988, 1996) also suggested that consonant duration increases at the expense of vowel duration in an effort to increase speech intelligibility under noise. In the present analysis, two different ratios were considered: (i) a consonant versus vowel intensity ratio (CVIR) and (ii) a consonant versus vowel duration ratio (CVDR) (House et al., 1965; Hansen, 1996). Vowel and consonant intensities were computed using PRAAT software (Boersma, 2002), and phoneme duration was obtained from analysis of the phoneme-level transcripts. These ratios indicate how the energy or duration balance between the vowel and consonant speech classes changes under noisy conditions.
A pictorial representation of global shifts between individual phoneme classes is presented in Figs. 6(a) and 6(b). For each figure, shaded regions within each bar graph indicate average SPL/duration values for the vowel and consonant phoneme classes. The asterisk indicates statistically significant shifts in mean based on the measures of CVIR and CVDR. Consider the CVIR first, where increased CVIRs resulted for most noisy conditions. The outside, cafeteria, and gameroom conditions were significantly different from the office environment (p < 0.05). These particular changes in CVIR demonstrate increased consonant intensity as compared to vowel intensity in noisy conditions. No significant differences were found for the hallway and lobby as compared to the office location (p > 0.05). For CVDR, no statistically significant shift in duration between the vowel and consonant phoneme classes was observed in any location (p > 0.05). It should be noted that consonant duration with respect to vowel duration is a crucial factor in a listener's ability to perceive speech in the presence of noise (Hansen, 1988, 1996).
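One way to tabulate the two ratios per environment is sketched here. The exact definitions are an assumption: each ratio is taken as the mean consonant value divided by the mean vowel value within an environment, applied to whatever intensity scale the tokens carry. The token dictionary keys (`env`, `cls`, `intensity`, `duration`) and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def consonant_vowel_ratios(tokens):
    """Return {env: {"CVIR": ..., "CVDR": ...}} from phoneme tokens.

    Each token is a dict with hypothetical keys:
      env       - environment label, e.g. "office"
      cls       - "C" for consonant or "V" for vowel
      intensity - intensity value for the token
      duration  - duration in seconds
    """
    grouped = defaultdict(lambda: {"C": [], "V": []})
    for t in tokens:
        grouped[t["env"]][t["cls"]].append((t["intensity"], t["duration"]))

    ratios = {}
    for env, classes in grouped.items():
        if not classes["C"] or not classes["V"]:
            continue  # a ratio needs both phoneme classes
        c_int = mean(i for i, _ in classes["C"])
        v_int = mean(i for i, _ in classes["V"])
        c_dur = mean(d for _, d in classes["C"])
        v_dur = mean(d for _, d in classes["V"])
        # Assumed definitions: consonant mean over vowel mean.
        ratios[env] = {"CVIR": c_int / v_int, "CVDR": c_dur / v_dur}
    return ratios
```

Under these definitions, a CVIR that rises from the quiet baseline to a noisy environment indicates that consonant intensity grew relative to vowel intensity, while an unchanged CVDR indicates that both classes shortened or lengthened in proportion.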

Vocal tract characteristics
Thus far, the investigation of the Lombard effect has focused on the analysis of source excitation features. Speech SPLs, pitch, spectral tilt, and duration are all controlled in some manner by the supra-glottal and glottal systems. It is reasonable to hypothesize that noise also affects the articulators that configure vocal tract shape. In order to investigate vocal tract shape, we considered only the vowel space, since this restricts the analysis to fixed articulator positioning rather than the more complex time-varying requirements of liquids, glides, and diphthongs. For statistical reliability, only phonemes with sufficiently large sample sets were considered. Four cardinal vowel nuclei, /a/, /ae/, /i/, and /u/, were chosen for this analysis. Phoneme-level transcription labels identified the boundary information of each phoneme. On average, more than 3000 instances of each phoneme were employed.
Figure 7 illustrates the vowel space for the phoneme nuclei /a/, /ae/, /i/, and /u/ under various noisy conditions. Figure 7(a) includes the hallway and lobby, while Fig. 7(b) contains the outside and cafeteria. The gameroom condition is shown in Fig. 7(c). Office baseline results are also provided in each figure for comparison. The abscissa in each figure denotes F1 formant locations, while the ordinate denotes F2 formant locations. Overall, F1 formant locations changed with respect to environment, but F2 formant locations did not. Consider the front vowel /i/ first. F1 formant frequencies for the /i/ phoneme increased significantly for outside, cafeteria, and gameroom speech (p < 0.05). The pattern of the results was consistent with many other acoustic features (e.g., vowel/consonant SPLs and F0). However, this was not the case for the F2 feature. For the phoneme /i/, no significant changes occurred in F2 formant locations across all environments for all speakers (p > 0.05).
Similar changes were observed for the remaining three vowel phonemes, /a/, /ae/, and /u/. In these phonemes, shifts in formant locations generally followed the trend observed for the /i/ phoneme. For example, significant increases in F1 formant location for /u/ were found for the outside, cafeteria, and gameroom conditions versus the baseline office condition (p < 0.05). F1 formant locations for the /a/ and /ae/ phonemes changed significantly for the gameroom condition (p < 0.05). However, similar to the /i/ phoneme, F2 formant locations showed only relatively small changes for the /a/, /ae/, and /u/ phonemes across all environments (p > 0.05). Thus, F2 locations were not a major factor in changes in speech production in noisy environments. It should be noted that F2 formant frequency plays a critical role in the speech comprehension of NH listeners as well as CI users in noisy environments (Loizou, 2013).

C. Summary of Lombard speech features
In this study, an analysis of speech production in six naturalistic environments was presented along with the Lombard effect observed in speech produced by CI users. Statistical analysis techniques were employed to determine whether any changes in speech features are reliable Lombard relayers. The results indicate that many speech features were used by the CI users in demonstrating the Lombard effect stress condition. However, due to the limited dataset and environments, it may be difficult to identify which specific features are sufficiently sensitive and statistically reliable indicators of the Lombard effect. In order to identify these features, we grouped similar listening conditions together. Here, the environmental conditions were grouped into two areas: low and high noise groups. The low noise group includes the office, hallway, and lobby, and the high noise group includes the outside, cafeteria, and gameroom. The decision was made based on average noise SPL [Fig. 3(a)] and average speech SPLs [Figs. 5(a) and 5(e)]. Next, ANOVA was performed on the two groups. Fixing two conditions in this way allowed for a manageable summary of the important Lombard relaying features from naturalistic audio recordings.
IV. DISCUSSION

A. The Lombard effect in CI users
Many speech production features identified in this study were observed to change in the presence of background noise. It was found that postlingually deafened CI users modify both segmental and suprasegmental properties of their speech in different listening environments. These features serve to calibrate speech production (i.e., the speaker monitors the relation between his/her own phonemic intention and the resulting acoustic output in the presence of noise). Moreover, this monitoring also influences speech production on an instantaneous basis, so that speakers modulate at least some suprasegmental features of their ongoing speech gestures. The results from this study indicate that CI users exhibit the Lombard effect in adverse listening environments.
Many investigators have suggested that auditory information for NH listeners may be used to modulate at least some suprasegmental features of speech production under noise (Pisoni et al., 1985; Hansen, 1988; Summers et al., 1988; Junqua, 1992; Hansen, 1996; Lu and Cooke, 2008; Garnier et al., 2010). In these studies, results showed an increase in overall amplitude of vocalic sections, increased duration, increased average F0, and a decreased spectral tilt. This was presumed to be the result of increased subglottal pressure and vocal-fold tension as a response to the reduced auditory feedback due to noise. Existing studies suggest that the most widely considered area of the Lombard effect involves vocal intensity and F0 (Hansen, 1996; Lu and Cooke, 2009; Garnier et al., 2010). Spectral balance of vowels was affected by the higher vocal effort for Lombard speech, resulting in relatively greater intensity in the higher frequency bands of the spectrum (Hansen, 1988; Junqua, 1992; Lu and Cooke, 2008).
Interestingly, the results showed that vowel and consonant duration [Figs. 5(d) and 5(f)] decreased in most noisy environments, in contrast to earlier studies. It is well established in NH studies that Lombard speech generally has increased phoneme duration in comparison to normal speech, and that speech intelligibility in noise is associated with lengthened phoneme duration (Junqua, 1996; Lu and Cooke, 2008; Garnier et al., 2010). We suggest that the main contribution to this difference is the conversational speaking style used in the data analysis. Participants in this study produced conversational speech in realistic scenarios, while the previous studies primarily focused on read speech with given sentences. Another possibility is that increasing phoneme duration is not necessary for maintaining high intelligibility. Several studies confirmed that other inherent temporal properties, such as temporal amplitude modulations and vowel-consonant duration ratios, may contribute to enhanced intelligibility more directly than phoneme durations (Payton et al., 1994; Hansen, 1996; Krause and Braida, 2004).
In addition to suprasegmental variables, there has been a general consensus concerning the control of segmental features (Pisoni et al., 1985; Hansen, 1988; Summers et al., 1988). The rise in subglottal pressure needed to increase vocal effort leads to an increase in formant locations. For example, the wider jaw opening used to increase sound amplitude causes an increase in F1 frequency (Huber and Chandrasekaran, 2006). It has also been suggested that under noisy conditions, speakers vary their speech characteristics so that speech segments rich in information are emphasized, while those less important to intelligibility are deemphasized (House et al., 1965; Hansen, 1988; Sodersten et al., 2005). For example, consonant energies increased at the expense of vowel energy under noisy conditions in an effort to increase speech intelligibility (House et al., 1965; Hansen, 1988). This is a useful characteristic, as consonants carry more speech information in the presence of noise.
The consistency between the two speaker groups (CI versus NH) indicated above could be mainly due to the presence of auditory feedback provided by the CI device. Long-term absence of auditory feedback could potentially result in poor regulation of acoustic and phonetic features, such as F0, intensity, and duration, in adventitiously deafened adults (Leder et al., 1987; Lane and Webster, 1991). CI users, however, may demonstrate useful Lombard perturbation for regulating speech production features in noisy environments, which thereby assists in the development of more nearly neutral/typical acoustic, phonetic, and temporal patterns under noise (Hochmair-Desoyer et al., 1981; Kirk and Edgerton, 1983; Svirsky and Tobey, 1991; Svirsky et al., 1992; Perkell, 2012).
The modification of speech production features under the Lombard effect may help to ensure intelligible communication in adverse noisy environments. The data from this study indicate that CI users respond to varying background noise types and change their speech production accordingly. This articulatory modification helps speakers keep their speech from being masked by the acoustic noise and compensates for the decreased SNR. Previous studies have reported that Lombard effect speech is more intelligible than speech under normal conditions (Summers et al., 1988; Junqua, 1996; Lu and Cooke, 2008; Garnier et al., 2010). In these studies, the intelligibility gain increased with increasing vocal effort (e.g., intensity, fundamental frequency, and spectral tilt). The data presented in this study suggest a potential perceptual benefit of the Lombard effect for CI users.

B. Future direction
The present study focused on the speech production of CI users in varying environment types. However, it does not address the nature and extent of the Lombard effect as compared to NH listeners. As part of our future work, we suggest repeating the same data collection with NH individuals in the same environments to establish a one-to-one comparison of CI-NH and NH-NH pairs. Moreover, while the present study focused on the role of auditory feedback provided in the context of CI systems, there has been no study of CI signal processing features that may play a role in auditory feedback. Other CI sound processing factors, such as automatic gain control (AGC), adaptive dynamic range optimization (ADRO), frequency band sampling, or virtual channels, represent a wider range of issues that are beyond the scope of this study. A supplementary study in the future could explore these factors within the context of the Lombard effect.
Furthermore, the specific variations in speech production features due to the Lombard effect investigated here can be used to formulate new algorithms for improved intelligibility in noisy conditions. For example, such strategies could exploit the impact of particular acoustic features of speech associated with the Lombard speaking style (Zorila et al., 2012; Godoy and Stylianou, 2013). It is known that different environments have specific noise types and levels. Traditional front-end processing for hearing aids and CIs, for example, has focused on noise suppression to minimize the impact of noise. Algorithmic advancements that modify neutral speech based on Lombard effect properties offer a unique opportunity to improve the listening/decoding experiences of CI users.

V. CONCLUSIONS
In this study, we analyzed the speech production of CI users with respect to environmental context. Naturalistic human-to-human voice interactions were captured using mobile personal audio recordings from continuous single-session audio streams collected over various realistic environments. An analysis of speech produced in noise and the Lombard effect observed in the speech of CI users was presented. The results indicated that speakers demonstrated increased vocal effort, including F0 and speech SPL, as well as altered glottal spectral slope and phoneme duration in response to challenging noisy environments. Segmental articulatory movements, for example, F1 for specific phonemes such as /a/, /ae/, /i/, and /u/, also appeared to play an important role in relaying Lombard perturbation for speech produced in the presence of noise. The significance of the results is that the Lombard effect could potentially be helping CI users to maintain intelligible communication by compensating for the reduced SNR. The specific variations due to the Lombard effect can be leveraged for new algorithm development and further applications of speech technology to benefit CI users.

FIG. 1. (Color online) Naturalistic data collection from CI subjects: (a) set-up for data acquisition using the LENA device and (b) naturalistic environments on UT-Dallas college campus for data collection.
3(a) and 3(b) which present the distribution of average noise SPL and spectral centroid for each environment. The results indicate that both features increased monotonically when switched from office to gameroom. The range of average noise SPL extended from approximately 42 dB (for office) to 67 dB (for gameroom), and all noisy environments had mean values which were significantly different from the office baseline (p < 0.05). Spectral centroid was almost always under 500 Hz for all conditions. With the exception of the hallway condition, all noisy environments had mean SPL and spectral centroid values significantly different from the office baseline (p < 0.05). Hallway remained almost constant in terms of the mean of the spectral centroid (p > 0.05).
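
As a rough illustration of the spectral centroid discussed above, the sketch below computes a magnitude-weighted mean frequency for a single frame. It assumes magnitude (not power) weighting and uses a naive DFT; the study does not specify its exact definition, frame length, or windowing. The function name `spectral_centroid` and the synthetic test tone are hypothetical.

```python
import cmath
import math


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of one frame, using a naive DFT."""
    n = len(frame)
    weighted_sum = 0.0
    magnitude_sum = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitude = abs(coeff)
        weighted_sum += (k * sample_rate / n) * magnitude
        magnitude_sum += magnitude
    # A silent frame has no defined centroid; return 0.0 by convention here.
    return weighted_sum / magnitude_sum if magnitude_sum > 0 else 0.0


# Synthetic 500 Hz tone (illustration only, not a recording from the study).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(160)]
print(round(spectral_centroid(frame, sample_rate), 1))
```

A noise recording with most of its energy at low frequencies gives a low centroid under this definition, in line with the sub-500 Hz values reported for these environments.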

Figure 4 illustrates the average SNR levels with respect to each environment. For each environment type, the bar on

FIG. 3. (Color online) Acoustic characteristics of background noises: average (a) noise SPL, (b) spectral centroid, and (c) average modulation spectrum energy with respect to different environments. While average SPL shows changes in signal strength in the time domain, spectral centroid represents where spectral energy was concentrated in the frequency domain. Average modulation spectrum energy estimates the relative degree of stationarity of the noise signal.

FIG. 6. (Color online) Pictorial representations of global shift in (a) SPL and (b) duration between vowel and consonant phoneme class. The speech class percentage is shown for each environment. Asterisks indicate significant shift in intensity/duration based on phoneme ratios.

FIG. 7. (Color online) Spectral characteristics of vocal-tract: plots of first formant frequency F1 versus second formant frequency F2 for vowel phonemes /a/, /ae/, /i/, and /u/ with respect to (a) hallway and lobby, (b) outside and cafeteria, and (c) gameroom environments. The office result is given in each plot for comparison.

TABLE I. Characteristic information of CI subjects who participated in UTD-CI-LENA corpus development.

TABLE II. Summary of naturalistic environments used in this study on UT-Dallas campus.