A corpus of audio-visual Lombard speech with frontal and profile views

: This paper presents a bi-view (front and side) audiovisual

a) Author to whom correspondence should be addressed.

The Lombard effect (Lombard, 1911) is a reflexive adaptation to speech production which occurs when communicating in adverse conditions. Lombard speech is characterized by [...] communication modality (Fitzpatrick et al., 2015), variables which typically vary from one study to the next.

This paper aims to provide a more detailed characterisation of the across-speaker variation in the Lombard effect by collecting and analysing a corpus of plain and Lombard speech from a total of 54 speakers uttering a total of 5400 utterances. The amount of data collected significantly exceeds that used in previous controlled Lombard studies. It is also the first collection that has been designed with precise video analysis in mind. In particular, the collection uses head-mounted cameras that allow highly accurate measurement of the visual Lombard effect from both a frontal and profile view.

The data are being made publicly available for the benefit of other researchers. In particular, the dataset is an extension of the audio-visual Grid corpus (Cooke et al., 2006) that has been widely used in the study of speech intelligibility in noise and the perception of simultaneous speech signals. The data are also suitable for development of novel speech processing algorithms. In particular, the Lombard effect has major implications for the design of automatic audio/audiovisual speech recognition systems. Such systems are typically trained on clean speech datasets or on datasets to which noise has been artificially added.

The performance of these systems can then deteriorate under real Lombard conditions that have not been observed during training. Although there are audio-video speech datasets that have been recorded in noise, e.g., AVICAR (Lee et al., 2004), these datasets lack controlled non-Lombard reference signals against which to make accurate measurements of the adaptation.

The sentences in the corpus conform to the Grid corpus syntax (Cooke et al., 2006). These are six-word sentences, for example 'bin blue at A 2 please', with the following structure: <command: bin, lay, place, set> <color: blue, green, red, white> <preposition: at, by, in, with> <letter: A-Z (excluding W)> <digit: 0-9> <adverb: again, now, please, soon>.
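The six-slot structure above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' tooling: the names `GRID_SLOTS`, `is_grid_sentence`, `random_grid_sentence`, and `count_grid_sentences` are hypothetical, and the word lists are copied from the structure given in the text.

```python
import random
import string

# Word classes in sentence order, copied from the Grid syntax in the text.
# The slot names and this data layout are assumptions made for illustration.
GRID_SLOTS = [
    ("command", ["bin", "lay", "place", "set"]),
    ("color", ["blue", "green", "red", "white"]),
    ("preposition", ["at", "by", "in", "with"]),
    ("letter", [c for c in string.ascii_uppercase if c != "W"]),  # A-Z excluding W
    ("digit", [str(d) for d in range(10)]),
    ("adverb", ["again", "now", "please", "soon"]),
]


def is_grid_sentence(sentence):
    """Return True if every word is a legal choice for its slot."""
    words = sentence.split()
    if len(words) != len(GRID_SLOTS):
        return False
    return all(word in choices for word, (_, choices) in zip(words, GRID_SLOTS))


def random_grid_sentence(rng):
    """Draw one sentence by picking a word independently for each slot."""
    return " ".join(rng.choice(choices) for _, choices in GRID_SLOTS)


def count_grid_sentences():
    """Number of distinct sentences the grammar can generate."""
    total = 1
    for _, choices in GRID_SLOTS:
        total *= len(choices)
    return total
```

With these word lists the grammar admits 4 × 4 × 4 × 25 × 10 × 4 = 64,000 distinct sentences, and the example sentence 'bin blue at A 2 please' passes `is_grid_sentence`.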

Three of these words (color, letter, and digit) are considered to be "keywords," while the

In addition to the audio recordings, simultaneous audiovisual recordings were made using a custom-made helmet rig system that was worn by the talkers. The system consisted [...] order. Each block of 10 utterances was preceded by 5 'warm-up' utterances that were used to allow talkers to attune to the change in condition (i.e., from noise present to noise absent and vice versa). These initial utterances were discarded after recording. The Lombard-inducing noise was controlled by a computer (using a MATLAB routine as previously described) and was present throughout the Lombard blocks and turned off during the non-Lombard blocks.
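The block structure described above (5 discarded warm-up utterances followed by 10 retained utterances, with noise on for Lombard blocks and off for non-Lombard blocks) can be sketched as follows. This is a hypothetical illustration, not the MATLAB routine used in the study. The strict alternation of conditions, the default of 10 blocks per talker (inferred from 5400 utterances over 54 speakers), and the choice of starting condition are all assumptions, as are the names `Trial`, `build_session`, and `kept_utterances`.

```python
from dataclasses import dataclass

WARMUP_PER_BLOCK = 5   # from the text: warm-up utterances, discarded after recording
KEPT_PER_BLOCK = 10    # from the text: utterances retained per block


@dataclass
class Trial:
    block: int       # zero-based block index
    condition: str   # "lombard" (noise on) or "plain" (noise off)
    warmup: bool     # True for utterances discarded after recording


def build_session(n_blocks=10, first_condition="lombard"):
    """Lay out one talker's session as an ordered list of trials.

    Assumptions (not stated in the text): conditions strictly alternate
    from block to block, and n_blocks and first_condition are free choices.
    """
    conditions = ["lombard", "plain"]
    if first_condition not in conditions:
        raise ValueError("first_condition must be 'lombard' or 'plain'")
    start = conditions.index(first_condition)
    trials = []
    for block in range(n_blocks):
        condition = conditions[(start + block) % 2]
        for i in range(WARMUP_PER_BLOCK + KEPT_PER_BLOCK):
            trials.append(Trial(block, condition, warmup=i < WARMUP_PER_BLOCK))
    return trials


def kept_utterances(trials):
    """Drop the warm-up utterances, as done after recording."""
    return [t for t in trials if not t.warmup]
```

Under these assumptions a talker produces 150 utterances, of which 100 are retained, split evenly between the two conditions.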

The talkers read the sentences to the researcher, who acted as a listener. Having [...]

Table 1 shows across-talker means and standard deviations (SDs). Paired-samples t-tests were employed to determine the significance of differences between the across-talker means, across-female-talker means, and across-male-talker means in plain and Lombard conditions. Table 1 also summarizes the results of the statistical analysis.

[...] (Junqua, 1993). Junqua (1993) also found that Lombard speech produced in multi-talker noise by female talkers is more intelligible than that produced by male talkers. Gender differences have also been reported when the auditory feedback is delayed (Howell and Archer, 1984). This could suggest that male and female talkers may differ in their strategic responses to the auditory feedback that mediates the Lombard effect.

[...] it should be acknowledged that the effect may be partly due to changes in alpha-ratio rather than changes to the actual formants.

[Table 1 notes: [...] ↑, decrease: ↓. All tests were significant (p < 0.001) except those marked with ⋆ (p > 0.5). Columns: F0 (semitones, 0 → 27.5 Hz); Vowels F1 (Hz); Vowels F2 (Hz).]
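The F0 scale in the Table 1 header and the paired-samples comparison can be illustrated with a short sketch. It assumes that "semitones 0 → 27.5 Hz" means a semitone scale whose zero point is 27.5 Hz, and it computes only the paired t statistic and its degrees of freedom. The p-value would come from a t distribution, for example via a statistics package. The function names are hypothetical and this is not the authors' analysis code.

```python
import math
import statistics

F0_REF_HZ = 27.5  # assumed reference: semitone 0 corresponds to 27.5 Hz


def hz_to_semitones(f_hz, ref_hz=F0_REF_HZ):
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


def paired_t(plain, lombard):
    """Paired-samples t statistic for per-talker values in two conditions.

    plain[i] and lombard[i] must belong to the same talker.
    Returns (t, degrees_of_freedom).
    """
    if len(plain) != len(lombard) or len(plain) < 2:
        raise ValueError("need two equal-length sequences with at least 2 talkers")
    diffs = [l - p for p, l in zip(plain, lombard)]
    sd_d = statistics.stdev(diffs)  # sample SD of the per-talker differences
    if sd_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd_d / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

On this scale one octave is 12 semitones, so 55 Hz maps to 12 and 440 Hz maps to 48. A positive t from `paired_t` indicates that the Lombard values are higher on average than the plain values for the same talkers.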