Submitted: 21 October 2021; Accepted: 20 June 2022; Published Online: 07 July 2022
The Journal of the Acoustical Society of America 152, 266 (2022); https://doi.org/10.1121/10.0012350
  • Diogo Pessoa
  • Lorena Petrella
  • Pedro Martins
  • Miguel Castelo-Branco
  • César Teixeira

This paper addresses the development of a system for classifying mouse ultrasonic vocalizations (USVs) present in audio recordings. The automatic labeling process for USVs is usually divided into two main steps: USV segmentation followed by classification of the detected segments. Three main contributions can be highlighted: (i) a new segmentation algorithm, (ii) a new set of features, and (iii) the discrimination of a higher number of classes than in similar studies. The developed segmentation algorithm is based on spectral entropy analysis. This novel segmentation approach can detect USVs with 94% recall and 74% precision and achieves a higher recall than other methods/software. Regarding the classification phase, besides traditional features from the time, frequency, and time-frequency domains, a new set of contour-based features was extracted and used as input to shallow machine learning classification models. The contour-based features were obtained from the time-frequency ridge representation of the USVs. The classification methods can differentiate among ten different syllable types with 81.1% accuracy and an 80.5% weighted F1-score. The algorithms were developed and evaluated on a large dataset acquired under diverse social interaction conditions between the animals in order to stimulate a varied vocal repertoire.
Ultrasonic vocalizations (USVs), also known as syllables, produced by rodents have gained increasing attention as potential biomarkers in social behavior studies (Refs. 1 and 2). Specifically, in mouse models of autism spectrum disorders, the study of USVs allows the characterization of associated phenotypes: communication deficits and reduced social interaction (Ref. 2). Mice can produce a wide repertoire of sounds, and several USV types have been identified in the literature (Refs. 3-5). However, the relation between the USV types and the associated social/emotional state remains largely unknown (Ref. 2).
Mice emit a wide and complex multi-syllabic repertoire (Refs. 6 and 7). They produce both low- and high-frequency vocalizations, ranging from audible sounds (below 20 kHz) up to frequencies of about 200 kHz (Ref. 8). The way USVs are defined varies among authors (Refs. 1, 5, 7, and 9), and even for the same USV type, the time-frequency (TF) characteristics can differ. For instance, in Ref. 1, the authors define an USV as upwardly modulated in frequency when there is a frequency variation equal to or greater than 12.5 kHz, whereas in Ref. 5, the same class is defined for an upward frequency variation equal to or greater than 6 kHz. Moreover, USV types may, at least in part, differ with rodent strain, age, and sex. In laboratory behavioral tests, rodents may vocalize hundreds of syllables per minute, making it impractical to manually review the recordings and annotate all of them. All these factors make USV detection and classification a complex task that requires powerful automatic algorithms, which is the motivation for the present study. Automatic processing of these sounds is therefore also crucial to make large-scale studies viable.
In this work, we present a complete pipeline for the analysis of USVs, namely, algorithms for their detection/isolation and classification. The pipeline is based on commonly used machine learning methods. Three main outcomes can be highlighted: (i) a new segmentation algorithm based on spectral entropy analysis; (ii) the introduction of a new set of features in this area, the contour-based features; and (iii) the ability to classify a greater number of classes than similar state-of-the-art studies.
In bioacoustics analysis, the first step is usually the segmentation of individual syllables. One of the major hurdles in automatically detecting USVs is the relatively low signal-to-noise ratio (SNR), as USV signals are often contaminated by broadband interference or ambient noise, which can partially mask them. Several algorithms have nevertheless been proposed in the literature to address this task, as listed in Table I.
TABLE I. Summary table of the different audio segmentation methods.
Method | Authors
Object detection neural networks | Coffey et al. (Ref. 7)
VocalMat (computer vision) | Fonseca et al. (Ref. 10)
Varying parameters | Van Segbroeck et al. (Ref. 11)
USVSEG (multitaper spectrogram generation) | Tachibana et al. (Ref. 12)
Time-varying parameters | Holy and Guo (Ref. 9)
Van Segbroeck et al. (Ref. 11) proposed an USV segmentation method based on several parameters, namely, noise reduction, minimum and maximum syllable duration, minimum total and peak syllable energy, and the minimum inter-syllable interval needed to separate rapidly successive notes into distinct syllables. Their method can be summarized in the following steps: (1) high-pass filtering the recordings to the ultrasonic range (25–125 kHz); (2) using spectral subtraction to remove stationary noise originating from the background and from recording equipment distortions; and (3) computing the power of the spectral energy in the ultrasonic range that exceeds a noise floor threshold. The authors do not provide any metrics regarding the performance of their segmentation algorithm. In the DeepSqueak software (Ref. 7), the segmentation process relies on object detection neural networks (Faster-RCNN): the audio samples are converted into sonograms, which are then passed to the object detector network. VocalMat (Ref. 10) uses multiple steps to analyze USVs in audio files: the recordings are first converted into high-resolution spectrograms through a short-time Fourier transform (STFT), and then image processing techniques such as binarization and morphological operations are used to isolate the USVs. Last, Tachibana et al. (Ref. 12) proposed a five-step method to isolate USVs: multitaper spectrogram generation, flattening, thresholding, detection of syllable onset/offset, and spectral peak tracking.
Even though several approaches are available to study and analyze USVs, most of them take an unsupervised route, mostly based on clustering methods. This means that these tools do not attribute a specific label to each USV. Instead, clustering groups the USVs into different clusters based on the similarity of a certain set of descriptors, trying to maximize inter-cluster separability while minimizing intra-cluster variability.
Despite being mainly a clustering tool, the DeepSqueak software developed by Coffey et al. (Ref. 7) combines clustering capabilities with supervised classifiers, namely, neural networks. These networks can distinguish between five different types of vocalizations (split, inverted U, short rise, wave, and step), and up to 20 different syllable types could be recognized with the clustering methods. Holy and Guo (Ref. 9) were also able to categorize the same five syllable types identified with DeepSqueak. However, neither work presented classification metrics for algorithm performance assessment, and, as the authors themselves acknowledge, five classes are far from comprehensive enough to describe the full spectrum of sounds produced by mice.
More recently, in Ref. 13, a supervised classification framework was presented that can discriminate among nine syllable types. Two classification models were used, support vector machines (SVM) and random forest (RF), and the features were extracted with Avisoft SASLab Pro (commercial software provided by Avisoft Bioacoustics e.K., Glienicke/Nordbahn, Germany). The best classification accuracy was obtained with the SVM model (88 ± 6.4%). However, one of the biggest limitations of that study was the reduced number of USVs analyzed (only 25 samples per syllable class, 225 in total). Also, in Ref. 10, the authors proposed a convolutional neural network (CNN) model to perform USV classification, using a transfer learning approach with an AlexNet model pre-trained on the ImageNet dataset; the last three layers of the network were replaced to handle a 12-category classification task (11 USV types + noise). Most of the USV classes used in that work match the ones used in our work. However, while we define two classes for step-type USVs (one frequency step and multiple frequency steps), Fonseca et al. (Ref. 10) subdivided them into more categories, such as step up and step down, among others. They reported a classification accuracy of 95.28%.
A. Signal acquisition setup
A multi-channel recording system (Avisoft-UltraSoundGate 416H, Avisoft Bioacoustics) with two microphones (CM16/CMPA, Avisoft Bioacoustics) was used for the USV acquisitions (see Fig. 1). Tests were conducted in a sound-attenuating cabin with 1.5 cm thick acrylic walls covered by sound-absorbing foam, measuring 55 cm × 50 cm × 70 cm (height × depth × width). For the recordings, mice were placed in an acrylic cage divided in half by an acrylic plate with holes in its lower part, thus allowing interaction. During the recordings, two animals were placed in the setup, one in each compartment, with one microphone per compartment. The tests were video recorded with a webcam positioned at the top of the test cabin, using a red light that enables video recording while minimizing effects on mouse activity. Moreover, a playback system, consisting of a speaker (Ultrasonic Speaker Vifa, Avisoft Bioacoustics) and a digital/analog (D/A) converter (UltraSoundGate Player 116 single channel, Avisoft Bioacoustics), was used to reproduce acoustic stimuli. The signals were digitized at a sampling frequency of 500 kHz. The associated software (Avisoft-RECORDER) enables system configuration, real-time signal/spectrum visualization, and signal recording.
B. Dataset
Vocalizations from two mouse strains were recorded for analysis: C57BL/6 and the transgenic mouse model of neurofibromatosis type I (Nf1+/− mice). Nf1+/− mice manifest some phenotypic characteristics of autism spectrum disorder. Mice were bred and housed at the Coimbra Institute for Biomedical Imaging and Translational Research, University of Coimbra, where the USV recordings were also conducted. Mice were housed with their siblings, with two to four animals per cage. They were kept under a 12-h light/dark cycle, and tests were conducted during the light cycle. Only one test was conducted per day, and all tests were conducted at the same time of day.
USV recordings were conducted at two different ages: in young mice (approximately 20 days old) and in adult mice (approximately 60 days old). Both males and females were included. Before the recording days, the animals underwent an adaptation period (5 days, about 1 h/day) for habituation to the recording room, the cage/cabin, and the operator; this procedure was repeated at both ages. Diverse acquisition paradigms were applied in an attempt to induce diversified social behaviors and enrich the vocal repertoire. A brief description of each paradigm is presented below.
Play (young mice): A pair of mice (separated into individual cages about 5 h before the test) is introduced into the recording cage without physical separation for free interaction. Recordings are conducted for 10 min.
Male/female interaction (adult mice): A male is introduced into one side of the cage, separated by an acrylic plate but connected through holes, and left for 5 min for habituation. Then a female is introduced into the other cage compartment. Recordings were conducted 2 min before and 8 min after introducing the female.
Aversive smells (young and adult mice): A pair of mice is introduced into the recording cage, separated by the acrylic plate, for about 10 min of habituation. Then the recording is started, and 2 min later, an aversive smell is introduced (a piece of cotton soaked in benzaldehyde; Ref. 14). The total recording time was 10 min.
Resident/intruder (adult mice): One adult male is placed in the test cage and left for about 30 min for habituation (resident). Then an unfamiliar adult male is introduced inside a small steel cage (intruder). This test is expected to induce aggressive behavior in the resident and a fear response in the intruder.
Anticipation: In this test, food is withdrawn from the cage approximately 3 h before the test. At the test moment, a pair of familiar mice is placed in the cage. After 120 s of recording, the acoustic stimulus is reproduced, and after a 15 s interval, a piece of feed is introduced into each compartment. Mice are trained for 3 days, and anticipation is assessed on the fourth day, following the same procedure.
C. Syllable classes
In this study, we have used USV definitions from two different works (Refs. 1 and 5). Based on a preliminary observation of our own recordings, we found that the USV types defined by Scattoni et al. (Ref. 1) and Grimsley et al. (Ref. 5) offered good coverage of the observed USV repertoire. Thus, we adopted the definitions of some classes from these works to define the classes we worked with. The considered USV types are listed below, along with their defining (mostly TF-related) characteristics.
Complex: Monosyllabic vocalization with two or more directional changes in frequency >6 kHz (Ref. 5) [Fig. 2(a)];
One frequency step: Syllables with two components, in which the second one differs by ≥10 kHz from the preceding component, without separation in time (Ref. 5) [Fig. 2(b)];
Multiple frequency steps: Syllables where two or more instantaneous frequency changes appear as vertically discontinuous "steps" on the spectrogram, but with no interruption in time (adapted from Refs. 1 and 5) [Fig. 2(c)];
Upward: Syllables upwardly modulated in frequency, with a frequency change ≥6 kHz (Ref. 5) [Fig. 2(d)];
Downward: Syllables downwardly modulated in frequency, with a frequency change ≥6 kHz (Ref. 5) [Fig. 2(e)];
Flat: Syllables with constant frequency, with frequency modulation <6 kHz (Ref. 5) [Fig. 2(f)];
Short: Syllables that last <5 ms (Ref. 5) [Fig. 2(g)];
Chevron: Syllables shaped like an inverted U, where the highest frequency is at least 6 kHz greater than the starting and ending frequencies (Ref. 5) [Fig. 2(h)];
Reverse chevron: Syllables shaped like a U, where the lowest frequency is at least 6 kHz lower than the starting and ending frequencies (Ref. 5) [Fig. 2(i)];
Composite: Syllables formed by two harmonically independent components emitted simultaneously (Ref. 1) [Fig. 2(j)].
D. Dataset annotation
An automatic segmentation algorithm (described in Sec. IV A) was used as a pre-processing step to isolate individual USVs. Afterward, the isolated USVs were manually annotated for supervised machine learning model training. This segmentation algorithm was key to speeding up the construction of a larger USV dataset; without it, we would have had to go over all sound recordings and manually annotate every USV. However, it should be noted that, because the segmentation algorithm is not perfect (as discussed below), some USVs may have been lost when isolating the sounds of interest automatically.
For each segmented USV, we have manually determined its characteristics, such as duration, frequency variation, and number of frequency steps, among others. With those characteristics, we have assigned a class to each USV, based on the definitions presented in Sec. III C.
As previously mentioned in Sec. III A, our acquisition setup has two microphones. Hence, due to the physical communication between the recording chambers, a mouse vocalization may be captured simultaneously by the two microphones. For this reason, the signals from both channels were jointly analyzed, and whenever a vocalization was present in both recordings, it was attributed to a single channel according to the recording quality in each of them: if an USV was present in the sound of the two microphones, we chose the channel with the higher spectral power. This was done by inspecting the TF representations of both channels simultaneously.
From the annotation process, we have gathered a dataset comprising 4632 USVs, whose class distribution is presented in Table II.
TABLE II. Dataset class distribution.
Class | No. of vocalizations | Ratio
Complex | 46 | 0.010
One frequency step | 662 | 0.143
Multiple frequency steps | 154 | 0.033
Upward | 1780 | 0.384
Downward | 368 | 0.079
Flat | 947 | 0.204
Short | 390 | 0.084
Chevron | 135 | 0.029
Reverse chevron | 54 | 0.012
Composite | 96 | 0.021
In this section, we present the proposed methods, both for the segmentation and the classification of USVs. The proposed methods were developed with matlab R2018b using an Intel (Santa Clara, CA) i7-4700HQ processor and 8 GB of RAM. The methods for segmentation and classification are available online at https://github.com/DiogoMPessoa/Mice-USVs-segmentation-and-classification.git, as part of a matlab application. Figure 3 presents the overall methodological pipeline.
A. Segmentation
The first processing step consists of the segmentation of individual USVs. The segmentation algorithm proposed in this paper is based on spectral entropy (SE) analysis (Refs. 15 and 16). The spectral entropy, H(t), of a signal treats its normalized power distribution in the frequency domain as a probability distribution P(m) (Ref. 17). Thus, H(t) corresponds to the Shannon entropy of the spectral distribution of the signal at a given instant t, and it is defined as
H(t) = -\sum_{m=1}^{N} P(t,m)\,\log_2 P(t,m),  (1)
with
P(t,m) = \frac{S(t,m)}{\sum_{f} S(t,f)},  (2)
where N is the total number of frequency bins, S(t, m) is the power spectrum at time t and frequency bin m, and S(t, f) is the TF spectrogram. Information entropy has also been used previously in a bioacoustics study by Erbe and King (Ref. 18).
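To make Eqs. (1) and (2) concrete, the following minimal matlab sketch computes a spectral entropy curve from an STFT power spectrogram; the signal x and sampling rate fs are assumed inputs, and the window/DFT parameters are illustrative rather than the exact pentropy defaults described below.

```matlab
% Minimal sketch of Eqs. (1)-(2): spectral entropy from an STFT power
% spectrogram. x and fs are assumed; window/DFT settings are illustrative.
[S, ~, t] = spectrogram(x, hamming(256), 0, 256, fs);   % complex STFT
P = abs(S).^2;                      % power spectrum S(t, m)
P = P ./ sum(P, 1);                 % Eq. (2): normalize each time frame
H = -sum(P .* log2(P + eps), 1);    % Eq. (1): Shannon entropy per frame
H = H / log2(size(P, 1));           % scale to [0, 1] (pentropy scales similarly)
plot(t, H), xlabel('Time (s)'), ylabel('Spectral entropy');
```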
The segmentation algorithm is composed of two parts: the pre-processing and the segmentation itself. For pre-processing, the signal is filtered with a high-pass Butterworth filter of order 20, with a cut-off frequency of 30 kHz and 30 dB of stop band attenuation. The high-pass filter aims to reduce the interference of low-frequency noise generated by the mouse movements and by other external factors.
After the pre-processing phase, the signal was divided into multiple non-overlapping windows of 1 s, and the spectral entropy was computed for each window. To compute the spectral entropy, we used matlab's pentropy function (Ref. 17) with the default parameters: a Hamming window with 950 samples, no overlap, and 256 points to compute the discrete Fourier transform. This set of parameters results in a spectral entropy vector whose temporal samples are separated by 0.0019 s (1.9 ms); the resolution of the segmentation algorithm is therefore 1.9 ms. Afterward, the temporal instants where the spectral entropy drops below a certain threshold (threshold 1, orange line in Fig. 4) are stored as the initial and final points of a vocalization. In addition, to reduce the false positive rate of the algorithm, the spectral entropy must also fall below a second threshold (threshold 2, red line in Fig. 4) at least at some point within the interval. The first threshold was set as high as possible without reaching the upper baseline of the spectral entropy, while the second threshold was tested with multiple values in a grid search. For instance, for the 1-s windows we used, a first threshold of 0.98 would mostly overlap the baseline of the spectral entropy. Besides that, to account for USVs that may contain segments of lower spectral intensity, successive detections separated by 15 ms or less were merged and considered a single USV (Ref. 19).
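The sketch below summarizes this detection logic (high-pass pre-filtering, spectral entropy via pentropy, the two thresholds, and the 15 ms merging rule). It is a simplified illustration under stated assumptions, not the released code: the per-1-s-window processing is omitted, and x, fs, and the threshold values are assumed.

```matlab
% Sketch of the two-threshold spectral entropy segmentation described above.
hp = designfilt('highpassiir', 'FilterOrder', 20, ...
                'HalfPowerFrequency', 30e3, ...
                'DesignMethod', 'butter', 'SampleRate', fs);
xf = filtfilt(hp, x);                                   % high-pass pre-processing

[H, tH] = pentropy(xf, fs);                             % spectral entropy (default parameters)
th1 = 0.97;  th2 = 0.95;                                % thresholds 1 and 2 (example values)

below = H < th1;                                        % candidate vocalization regions
d = diff([0; below(:); 0]);
onsets  = find(d == 1);                                 % first sample of each region
offsets = find(d == -1) - 1;                            % last sample of each region

keep = false(size(onsets));                             % require a dip below threshold 2
for k = 1:numel(onsets)
    keep(k) = any(H(onsets(k):offsets(k)) < th2);
end
seg = [tH(onsets(keep)), tH(offsets(keep))];            % [start, end] times (s)

merged = zeros(0, 2);                                   % merge gaps of 15 ms or less
for k = 1:size(seg, 1)
    if ~isempty(merged) && seg(k, 1) - merged(end, 2) <= 0.015
        merged(end, 2) = seg(k, 2);
    else
        merged(end + 1, :) = seg(k, :); %#ok<AGROW>
    end
end
```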
To further reduce the number of false positives, the segmented USVs are passed through an image classification system, to be classified as vocalizations or noise, using a bag of visual words (BOVW) classification model (Refs. 20 and 21).
The concept of image classification with BOVW was adapted from a methodology originally developed for text classification and natural language processing (NLP) (Ref. 21). In text classification, the bag of words (BOW) approach counts the number of times each word appears in a document and then uses these frequencies to build a histogram that represents the document. The concept is similar in image classification, differing only in the information used to represent the objects to classify: objects are described by frequency histograms of visual words, which are descriptors that characterize the images (Ref. 20).
To use the BOVW, the detected vocalizations are represented in the TF domain using the spectrogram obtained with the STFT. To compute the spectrogram, we used Hamming windows with 256 samples, 50% overlap, and 1024 points to compute the discrete Fourier transform. After the TF representations were obtained, they were converted to images, which were then resized to 250 × 250 pixel gray-level images to reduce the computational load of processing and feature extraction. To train the BOVW classifier, 1600 images were used: 800 images of real vocalizations and 800 noise images. KAZE features (Ref. 22) were then used to extract the key points of each image, using a dense representation and extraction at multiple scales (×1.6 and ×3.2). KAZE features detect and describe image features in a nonlinear scale space built through nonlinear diffusion filtering (Ref. 22). To extract the KAZE features, we used the matlab function detectKAZEFeatures (Ref. 23). The default scales at which features are extracted are ×1.6, ×3.2, ×4.8, and ×6.4; we kept only the first two, since we empirically observed that features extracted at higher scales did not provide relevant information.
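As an illustration, a BOVW filter of this kind can be assembled with the Computer Vision Toolbox as sketched below. The folder layout, the example image file, and the kazeBagExtractor helper are assumptions for this sketch; by default, bagOfFeatures extracts SURF descriptors on a grid, so a custom extractor is needed to use KAZE features, and the dense multi-scale extraction described above is simplified here to standard KAZE detection.

```matlab
% Hedged sketch of the BOVW noise/vocalization post-classifier.
% 'bovw_training' (with 'noise' and 'vocalization' subfolders) and
% 'candidate.png' are assumed example paths.
imds = imageDatastore('bovw_training', 'IncludeSubfolders', true, ...
                      'LabelSource', 'foldernames');

bag  = bagOfFeatures(imds, 'CustomExtractor', @kazeBagExtractor);  % visual vocabulary
bovw = trainImageCategoryClassifier(imds, bag);                    % noise vs. vocalization

% Classify one newly segmented USV image (gray level, resized to 250 x 250)
I = imresize(imread('candidate.png'), [250 250]);
[labelIdx, ~] = predict(bovw, I);
predictedLabel = bovw.Labels(labelIdx);

function [features, featureMetrics] = kazeBagExtractor(I)
    % Custom extractor returning KAZE descriptors for bagOfFeatures
    if size(I, 3) == 3, I = rgb2gray(I); end
    points = detectKAZEFeatures(I);
    [features, validPoints] = extractFeatures(I, points);
    featureMetrics = validPoints.Metric;
end
```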
An example of the final output of the segmentation algorithm is presented in Fig. 4.
B. Feature extraction
After USV segmentation, multiple descriptors were extracted from the time, frequency, and TF domains. The extracted features are listed in Table III; some of them are used for the first time for USV analysis, namely, the number of frequency steps, the number of directional changes, and the trend. We added these features because they are directly linked to the intrinsic definitions of the USV classes.
TABLE III. Feature summary table.
Domain | Features
Time | Time-amplitude based features; zero crossing rate; tonal power ratio; short-time energy
Frequency | PSD-based features; signal power; tonality; spectral centroid; spectral spread; spectral slope; spectral rolloff; spectral flux; spectral decrease; spectral crest factor; spectral flatness; spectral skewness; spectral kurtosis; spectral entropy; spectral pitch chroma; spectral edge frequency; harmonic components
Time-frequency | Local binary patterns; contour-based features; number of frequency steps
Total number | 128 features
PSD, power spectral density.
The frequency-domain features were extracted using the audio content analysis toolbox (Ref. 24). The contour-based features were extracted from the time-frequency ridge (TFR) of each vocalization. TFR extraction is often used in bioacoustics studies (Refs. 25-28), since it allows the extraction of several characteristics related to the TF representation of the sounds. Indeed, extracting TF-domain features from USVs is challenging, not only because of the significant amount of noise present in the audio signals but also because of the typical irregularities and discontinuities in the USVs' TF spectra.
In this work, a method based on TFR extraction was developed to extract the TFR and compute several USV features. The TFR extraction method searches for the appropriate ridge curve, i.e., a sequence of amplitude peak positions (ridge points) on the TF map. This curve provides a measure of the USV's instantaneous frequency over time (Ref. 29). The first step in determining this curve is the computation of the TF map using the STFT. The TFR extraction method then iterates over each temporal instant of the spectrogram and determines the point of maximum spectral power, storing the corresponding frequency in a vector. Last, after the vector containing the frequency values for every temporal instant is obtained, a smoothing operation using a linear regression method [locally weighted scatterplot smoothing (LOWESS)] is applied (Ref. 30). An example of TFR estimation is presented in Fig. 5. After the TFR estimation, eight primary features and five secondary features were obtained. The features extracted from this curve are named the contour-based features. A brief description of each one is presented below, followed by a short illustrative sketch.
(1) Peak frequency: corresponds to the highest frequency value;
(2) Minimum frequency: corresponds to the lowest frequency value;
(3) Initial frequency: corresponds to the frequency at the initial temporal instant (beginning);
(4) Final frequency: corresponds to the frequency at the final temporal instant (ending);
(5) Frequency bandwidth: corresponds to the difference between the highest and the lowest frequency values;
(6) Number of directional changes: corresponds to the number of times the trend of the contour vector changes direction;
(7) Duration: corresponds to the time between the initial and final instants;
(8) Trend: categorical feature that captures the global behavior of the USV. It takes the value 1 if the final frequency is greater than the initial one, −1 in the opposite situation, and 0 if both frequencies are equal within a tolerance of ±0.1 kHz.
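A possible matlab sketch of the TFR extraction and of some of the contour-based features listed above follows; the STFT settings mirror those given earlier, while the smoothing call, the directional-change estimate, and the ±0.1 kHz tolerance handling are illustrative assumptions (usv denotes one segmented vocalization).

```matlab
% Sketch of TFR (ridge) extraction and a few contour-based features.
% usv (one segmented vocalization) and fs are assumed inputs.
[S, F, T] = spectrogram(usv, hamming(256), 128, 1024, fs);  % TF map (STFT)
[~, idx]  = max(abs(S), [], 1);                             % per-frame peak frequency bin
ridge     = smoothdata(F(idx), 'lowess');                   % LOWESS-smoothed contour
ridge     = ridge(:);

peakFreq   = max(ridge);                                    % (1) peak frequency
minFreq    = min(ridge);                                    % (2) minimum frequency
initFreq   = ridge(1);                                      % (3) initial frequency
finalFreq  = ridge(end);                                    % (4) final frequency
bandwidth  = peakFreq - minFreq;                            % (5) frequency bandwidth
dirChanges = sum(abs(diff(sign(diff(ridge)))) > 0);         % (6) simple estimate
duration   = T(end) - T(1);                                 % (7) duration

tol = 100;                                                  % (8) trend, 0.1 kHz tolerance
if finalFreq > initFreq + tol
    trend = 1;
elseif finalFreq < initFreq - tol
    trend = -1;
else
    trend = 0;
end
```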
C. Classification
Several shallow, supervised classifiers were tested, including SVM, k-nearest neighbor (k-NN), Fisher discriminant analysis (FDA), decision tree (DT), and ensemble of decision trees (EDT). Furthermore, feature selection was performed using two filter techniques, namely, Kruskal–Wallis and ReliefF.
SVM have been used in multiple bioacoustics studies (Refs. 13 and 31). SVM are binary classifiers that can also be combined to solve multi-class problems (Refs. 32 and 33). This type of classifier determines the decision hyperplane that maximizes the separation margin between classes and is therefore referred to as a maximum margin classifier (Ref. 32). SVM determine support vectors for each class, which lie at a distance 1/||w|| from the separation hyperplane; these support vectors therefore determine the margin of separation. SVM training involves the minimization of the criterion ψ(w), given by
\psi(\mathbf{w}) = \frac{1}{2}\lVert \mathbf{w}\rVert^{2} + C\sum_{i=1}^{N}\xi_{i}, \quad \text{subject to}  (3)
y_{i}(\mathbf{w}\cdot\mathbf{x}_{i} + b) \geq 1 - \xi_{i}, \quad i = 1, \ldots, N.  (4)
The parameter C (cost) allows misclassifications during training: the higher the C value, the smaller the classifier margin and the less permissive to misclassification the model is, and vice versa. Furthermore, in SVM, kernel functions can be used to handle data that are not linearly separable. The main idea is to project the data into a higher-dimensional space in which a separating hyperplane can be found. Some of the most common kernels are the radial basis function (RBF), polynomial, and Gaussian kernels.
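As an illustration, a multi-class SVM of the kind described above can be built in matlab by combining binary SVM learners with an RBF kernel through error-correcting output codes; Xtrain, ytrain, and Xtest are assumed variables, and the hyper-parameter values are placeholders to be tuned by cross-validation rather than the values used in this work.

```matlab
% Hedged sketch: multi-class SVM from binary RBF-kernel learners.
t = templateSVM('KernelFunction', 'rbf', ...
                'BoxConstraint', 1, ...       % C, the cost parameter in Eq. (3)
                'KernelScale', 'auto', ...
                'Standardize', true);
svmModel = fitcecoc(Xtrain, ytrain, 'Learners', t);
yhat = predict(svmModel, Xtest);              % predictions on held-out data
```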
The k-NN model, also commonly used in bioacoustics studies (Refs. 31 and 34), is a non-parametric method, since no assumptions regarding the data distribution are made. k-NN assigns the class of a new pattern according to the majority class label among its k nearest neighbors.
FDA is part of a broader family of classification techniques called discriminant analysis. This classification method projects the data into a lower-dimensional space in which the separation between classes and the compactness within each class are maximized (Ref. 35). More precisely, the projection is guided by the maximization of the Fisher criterion, described by
J(\mathbf{w}) = \frac{\mathbf{w}^{T} S_{B}\mathbf{w}}{\mathbf{w}^{T} S_{W}\mathbf{w}},  (5)
where w are the projection weights, S_B is the between-class scatter matrix, which measures the separation between class samples, and S_W is the within-class scatter matrix, which assesses the compactness within each class. After data projection, linear decision hyperplanes are determined to separate the data from the multiple classes.
DTs are non-parametric methods used for both classification and regression problems. The goal of these algorithms is to create a model that predicts the value of a target variable from several input variables by following a tree-like structure built from "if... then... else..." blocks. DTs typically grow in a top-down manner: at each step, the algorithm decides which feature to use and which condition to split on, subject to predefined stopping criteria. Each decision rule is based on a single feature and a threshold, chosen to split the data into the purest possible subsets. The process is applied recursively until the maximum tree depth is reached or the data impurity can no longer be reduced. Several criteria can be used to measure impurity, such as the Gini impurity or the entropy (Ref. 36).
The EDT is an ensemble classification model that combines several decision trees into a single combined model. Ensembling usually yields better performance than a single decision tree. There are two main techniques used to build EDT classifiers, bagging and boosting (Ref. 37). In this work, we considered the bagging method.
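For completeness, the remaining shallow classifiers discussed above can be instantiated with their standard Statistics and Machine Learning Toolbox constructors, as sketched below; fitcdiscr is the closest built-in to FDA, Xtrain/ytrain are assumed, and all parameter values are illustrative rather than the tuned ones.

```matlab
% Hedged sketches of the other shallow classifiers discussed above.
knnModel = fitcknn(Xtrain, ytrain, 'NumNeighbors', 5, 'Standardize', true); % k-NN
fdaModel = fitcdiscr(Xtrain, ytrain, 'DiscrimType', 'linear');              % discriminant analysis
dtModel  = fitctree(Xtrain, ytrain);                                        % decision tree
edtModel = fitcensemble(Xtrain, ytrain, 'Method', 'Bag', ...                % bagged ensemble of trees
                        'NumLearningCycles', 100);
```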
As mentioned before, to perform feature selection, we have considered two different filter-based algorithms (Kruskal–Wallis and ReliefF).
Kruskal–Wallis is a non-parametric statistical test typically used to determine the discriminative power of features. The test sorts the feature values and assigns ordinal ranks to them; the sums of these ranks per class are then used to compute the H statistic, which reflects the differences between the rank sums. The H value is given by
H = \frac{12}{n(n+1)} \sum_{i=1}^{c} n_{i}\left(\bar{R}_{i} - \bar{R}\right)^{2},  (6)
where R̄_i is the average rank of the samples belonging to class i, R̄ is the average of all ranks across all classes, n_i is the number of patterns belonging to class i, c is the number of classes, and n is the total number of patterns.
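A minimal sketch of Kruskal–Wallis feature ranking based on Eq. (6) is shown below: one test is run per feature, and features are ordered by ascending p-value; X (feature matrix) and y (class labels) are assumed inputs.

```matlab
% Kruskal-Wallis ranking: one test per feature, smallest p-value first.
numFeat = size(X, 2);
pvals = zeros(1, numFeat);
for j = 1:numFeat
    pvals(j) = kruskalwallis(X(:, j), y, 'off');  % 'off' suppresses figures
end
[~, rankedIdx] = sort(pvals, 'ascend');           % most discriminative features first
```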
ReliefF is an attribute estimation algorithm that computes the weights of predictors for a single- or multi-class categorical response. This filter method is typically robust and noise tolerant (Ref. 38). The method determines a score for each feature, which is then used to rank and select the top-scoring features. The algorithm searches for the k nearest neighbors from the same class, called nearest hits (Hj), and the k nearest neighbors from each of the other classes, called nearest misses [Mj(C)]. The weight of each feature j is then updated based on the following expression:
W_{j} = \frac{W_{dydj}}{W_{dy}} - \frac{W_{dj} - W_{dydj}}{m - W_{dy}},  (7)
where W_dy is the weight of having different values for the response y, W_dj is the weight of having different values for the predictor F_j, W_dydj is the weight of having different response values and different values for the predictor F_j, and m is the number of sampled observations used in the weight updates.
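In matlab, the hit/miss weight updates described above are available through the relieff function, as sketched below; the choice of k = 10 neighbors and of 20 retained features is illustrative, and X/y are assumed as before.

```matlab
% ReliefF ranking of the features in X with respect to the labels y.
[rankedIdx, weights] = relieff(X, y, 10);   % 10 nearest hits/misses per class
topFeatures = rankedIdx(1:20);              % e.g., keep the 20 top-scoring features
```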
To avoid over-fitting, the annotated data (see Table II) were divided into training and testing datasets containing 70% and 30% of the data, respectively. The training dataset was used to perform feature selection as well as hyper-parameter optimization (for the SVM and k-NN classifiers). For this optimization, a stratified k-fold cross-validation methodology was applied. Last, the models with the best parameters and the feature subsets showing the best performance were applied to new data, i.e., to the testing dataset. It is worth noting that the test data were never used during the training phase to perform any type of optimization.
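The evaluation protocol just described can be set up with stratified partitions from cvpartition, as in the sketch below; the fold count matches the tenfold scheme used later for feature selection, and the variable names are illustrative.

```matlab
% Stratified 70/30 hold-out split plus stratified 10-fold CV for tuning.
rng(1);                                          % reproducibility
holdout = cvpartition(y, 'HoldOut', 0.3);        % stratified by class labels
Xtrain = X(training(holdout), :);  ytrain = y(training(holdout));
Xtest  = X(test(holdout), :);      ytest  = y(test(holdout));

cvFolds = cvpartition(ytrain, 'KFold', 10);      % stratified folds on training data only
for f = 1:cvFolds.NumTestSets
    trIdx = training(cvFolds, f);
    vaIdx = test(cvFolds, f);
    % ... train a candidate model on (Xtrain(trIdx, :), ytrain(trIdx)) and
    %     evaluate it on (Xtrain(vaIdx, :), ytrain(vaIdx)) ...
end
```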
D. Evaluation metrics
In this subsection, we describe the metrics used to assess the performance of both segmentation and classification methods.
1. Segmentation
To assess the segmentation algorithm, we used two different metrics, namely, recall and precision. The recall value reflects how many USVs the method is able to detect out of the total number of USVs, whereas the precision value reflects how many of the detections correspond to true USVs. The two metrics are defined as
\text{Recall} = \frac{DE}{DE + UE},  (8)
\text{Precision} = \frac{DE}{DE + FE},  (9)
where DE (detected events) is the number of correctly identified USVs, FE (false events) is the number of falsely identified USVs, and UE (undetected events) is the number of USVs that were not detected. It should be noted that a segmented USV was considered to be a DE only if the predicted segment had an overlap of at least 50% with the real USV annotation. Otherwise, it is considered as an undetected USV (UE).
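The sketch below illustrates how Eqs. (8) and (9) can be computed under the 50% overlap matching rule; detections and annotations are assumed N-by-2 matrices of [start, end] times in seconds, and the simple double loop ignores corner cases such as several detections matching the same annotation.

```matlab
% Matching detections to annotations with the 50% overlap rule.
matched = false(size(annotations, 1), 1);   % annotations covered by some detection
isDE    = false(size(detections, 1), 1);    % detections that count as DE
for i = 1:size(detections, 1)
    for j = 1:size(annotations, 1)
        ov = min(detections(i, 2), annotations(j, 2)) - ...
             max(detections(i, 1), annotations(j, 1));          % overlap duration (s)
        if ov >= 0.5 * (annotations(j, 2) - annotations(j, 1))
            isDE(i) = true;
            matched(j) = true;
        end
    end
end
DE = sum(isDE);  FE = sum(~isDE);  UE = sum(~matched);
recall    = DE / (DE + UE);                                     % Eq. (8)
precision = DE / (DE + FE);                                     % Eq. (9)
```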
2. Classification
For the assessment of machine learning classification methods, we have used five different metrics, namely, accuracy, precision, recall, specificity, and F1 score. Since we are working with a multi-class classification problem, we have calculated all metrics for every class in a one-vs-all fashion,
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},  (10)
\text{Precision} = \frac{TP}{TP + FP},  (11)
\text{Recall} = \frac{TP}{TP + FN},  (12)
\text{Specificity} = \frac{TN}{TN + FP},  (13)
\text{F1 Score (F1)} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},  (14)
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
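The per-class, one-vs-all computation of Eqs. (10)-(14) can be carried out from a confusion matrix as sketched below; ytest and yhat are assumed label vectors, and the last line shows one common way to obtain a class-frequency weighted F1-score.

```matlab
% One-vs-all metrics per class from the confusion matrix.
[C, classNames] = confusionmat(ytest, yhat);          % rows: true, columns: predicted
nC = numel(classNames);
[accuracy, precision, recall, specificity, f1] = deal(zeros(1, nC));
for k = 1:nC
    TP = C(k, k);
    FN = sum(C(k, :)) - TP;
    FP = sum(C(:, k)) - TP;
    TN = sum(C(:)) - TP - FN - FP;
    accuracy(k)    = (TP + TN) / (TP + TN + FP + FN); % Eq. (10)
    precision(k)   = TP / (TP + FP);                  % Eq. (11)
    recall(k)      = TP / (TP + FN);                  % Eq. (12)
    specificity(k) = TN / (TN + FP);                  % Eq. (13)
    f1(k) = 2 * precision(k) * recall(k) / ...        % Eq. (14)
            (precision(k) + recall(k));
end
weightedF1 = sum(f1(:) .* sum(C, 2) / sum(C(:)));     % class-frequency weighted F1
```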
A. Segmentation
To independently evaluate our segmentation algorithm, we randomly selected 16 audio files (45 s each) out of all our recordings. Then, by iterating through the TF representations of those 16 recordings, we manually annotated all USVs present; this process was carried out completely by hand. In total, 460 USVs were manually isolated from the 16 files. The intervals obtained for the beginning and ending instants of those vocalizations were used as the ground truth for segmentation assessment. The manually segmented USVs were checked by two different annotators.
To evaluate the algorithm's performance, the automatic segmentation results were compared to the manual annotations, with precision and recall as performance indicators. Furthermore, to test the robustness of the algorithm, different levels of Gaussian white noise were added to the signal before segmentation. We chose this type of noise because it adds broadband interference across all frequencies, partially masking and corrupting some USVs, which allowed us to assess whether the segmentation algorithm could still detect them. In total, six SNR levels were considered: 100, 90, 80, 70, 60, and 50 dB. Figure 6 shows examples of the resulting spectrograms at the different noise levels; as seen in the plots, this type of noise does indeed mask the USVs, adding broadband noise visible across all frequency bands of the spectrogram.
As mentioned before (see Sec. IV A), the segmentation algorithm has two different thresholds. The first threshold was set to 0.97, whereas the second threshold was tested with multiple values. The obtained results for precision and recall are illustrated in Fig. 7. It can be observed that the value chosen for the second threshold establishes a clear trade-off between precision and recall. The closer this value is to the first threshold (0.97), the more permissive the algorithm is (i.e., it recognizes USVs with lower spectral power) and the higher the number of recognized vocalizations; both true and false positives increase, which corresponds to higher recall and lower precision. Conversely, when the second threshold decreases, precision becomes higher than recall, i.e., most of the recognized signals are indeed vocalizations, but the algorithm misses vocalizations with lower spectral power.
After finding the range of values for the second threshold corresponding to the best performance, i.e., {0.94, 0.95, 0.96, 0.97}, the algorithm was tested in the presence of Gaussian white noise at several SNR levels. It should be noted that the power of the input signal before adding noise is considered to be 0 dB. For each SNR value, the segmentation algorithm was tested five times, and the mean and standard deviation of the precision and recall values were calculated (see Fig. 8). For SNR levels higher than 60 dB, both recall and precision remain relatively constant; below this value, a significant performance drop is observed. As seen in Fig. 6, at an SNR of 50 dB most USVs are corrupted, which is in line with the significant performance drop at this SNR. Therefore, we conclude that the algorithm presents some level of robustness to Gaussian white noise (introduced to simulate broadband noise and corrupted USVs).
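One way to reproduce this noise-robustness test is sketched below; awgn (Communications Toolbox) assumes an input signal power of 0 dBW by default, matching the convention stated above, while segmentUSVs is a hypothetical placeholder for the proposed segmentation procedure, and the noise-addition routine actually used in this work may differ.

```matlab
% Add white Gaussian noise at each SNR level and re-run the segmentation.
snrLevels = [100 90 80 70 60 50];              % dB, as in the text
for s = 1:numel(snrLevels)
    xNoisy = awgn(x, snrLevels(s));            % assumes 0 dBW input signal power
    % seg = segmentUSVs(xNoisy, fs);           % hypothetical placeholder
    % ... compute precision/recall against the manual annotations ...
end
```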
As mentioned before, another approach was taken to reduce the false positive rate of the segmentation process, by using a BOVW classifier as a post-processing classification technique. To assess the performance of this classification step, an independent group of 1600 images was selected, with 30% of these data used to train the BOVW and 70% used for testing. The resulting confusion matrix is presented in Table IV.
TABLE IV. Confusion matrix for BOVW method (training).
Known \ Predicted | Noise | Vocalization
Noise | 0.99 | 0.01
Vocalization | 0.08 | 0.92
With the use of the BOVW classifier after the segmentation algorithm, it was possible to reduce the false positive rate and, therefore, increase the precision. Using this post-processing classification, the obtained precision and recall values were 0.74 and 0.94, respectively: the precision increased by 11% and the recall decreased by 3% compared with the results obtained with the first and second thresholds both set to 0.97.
The automated segmentation algorithm based on spectral entropy was able to correctly detect up to 97% of the vocalizations when considering the most permissive set of thresholds. USV recordings commonly contain many noise events with power levels similar to the vocalizations, which tends to increase the false positive rate. In our method, however, the false positive rate can be reduced by fine-tuning the lower (second) threshold. With this tuning, recall and precision of 0.85 and 0.86, respectively, were obtained with the second threshold set to 0.94; changing the second threshold to 0.95 yielded recall and precision of 0.90 and 0.83, respectively.
The addition of the BOVW classifier as a post-processing technique reduced the rate of false positives, effectively increasing the precision by 11%. Notwithstanding, this came at the cost of a 3% decrease in recall. Although the classifier performed well in the training phase, recall worsened when it was used after the spectral entropy segmentation. This was expected, since BOVW classifiers commonly require large datasets; increasing the dataset used to train this post-processing classifier should therefore improve its performance.
1. Comparative analysis
To compare the performance of our segmentation method with other works, we used two available software programs, Mupet (Ref. 11) and DeepSqueak (Ref. 7). We tested their segmentation models, in their default configurations, on the same files used to test the spectral entropy segmentation algorithm, as described in Sec. V A. For the DeepSqueak software, we considered all three available detection models. Table V presents the results obtained with the different methods.
TABLE V. Comparison with other works from the literature with available software. Highest values for precision and recall are shown in boldface.
Method | Precision | Recall
DeepSqueak (long rat detector) (Ref. 7) | 0.40 | 0.01
DeepSqueak (mouse detector) (Ref. 7) | 0.87 | 0.95
DeepSqueak (rat detector) (Ref. 7) | 0.86 | 0.94
Mupet (Ref. 11) | 0.81 | 0.79
USVSEG (Ref. 12) | 0.92 | 0.78
Proposed (Th1^a − 0.97) + BOVW | 0.74 | 0.94
Proposed (Th1 − 0.97) | 0.63 | 0.97
Proposed (Th1 − 0.97; Th2^b − 0.95) | 0.83 | 0.90
Proposed (Th1 − 0.97; Th2 − 0.94) | 0.86 | 0.85
^a Threshold 1 (Th1).
^b Threshold 2 (Th2).
From the analysis of Table V, we can observe that, in its most permissive configuration [Proposed (Th1 − 0.97)], our method achieves a higher recall than the other methods, although its precision is lower. USVSEG achieved the highest precision, whereas our method achieved the highest recall. Overall, the performance of our segmentation algorithm is similar to that of the DeepSqueak software, with the latter performing better in terms of precision. However, whereas the DeepSqueak models require a resource-intensive training process to develop the underlying neural networks, our model requires no training and no USV labeling, and it is significantly faster when deployed. Our method also outperforms the Mupet software, both in terms of recall and precision, in several configurations. As future work, we aim to explore additional approaches to further reduce the rate of false positive detections and thereby increase the precision of our method.
B. Classification
As stated before, five different classifiers were trained and tested for the classification of the segmented syllables (Table II). Before assessing the models on the testing dataset, the full set of features presented in Table III was analyzed with feature selection techniques to improve classification performance. The feature selection methods used were Kruskal–Wallis and ReliefF. Performance was evaluated by testing the multiple classification methods with different numbers of features to determine the best subset for each classifier. This process was carried out using a tenfold cross-validation strategy, and the results are presented in Tables VI and VII.
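For concreteness, a minimal Python sketch of this selection-and-validation loop is given below, assuming a feature matrix X (one row per USV, 128 columns) and a label vector y. The Kruskal–Wallis H statistic is used to rank features, and a single decision tree stands in for the five classifiers that were actually swept; the ReliefF branch is omitted here because no equivalent ships with scikit-learn itself.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def kruskal_ranking(X, y):
    """Rank features by the Kruskal-Wallis H statistic computed across syllable classes."""
    classes = np.unique(y)
    h_values = [kruskal(*[X[y == c, j] for c in classes]).statistic
                for j in range(X.shape[1])]
    return np.argsort(h_values)[::-1]  # best-ranked features first

def sweep_subset_sizes(X, y, sizes=(5, 20, 35, 50, 65, 80, 95, 110, 125, 128)):
    """Tenfold cross-validated weighted F1 score for growing feature subsets."""
    order = kruskal_ranking(X, y)
    results = {}
    for k in sizes:
        scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, order[:k]], y, cv=10,
                                 scoring="f1_weighted")
        results[k] = (scores.mean(), scores.std())
    return results
```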
TABLE VI. Weighted mean of F1 score for feature selection with maximum TFR extraction and with Kruskal–Wallis (training dataset tenfold cross-validation).
No. of features | SVM | k-NN | DT | FDA | EDT
5 | 30.40 ± 8.30 | 54.90 ± 1.30 | 53.50 ± 1.90 | 50.10 ± 2.10 | 61.20 ± 2.40
20 | 62.70 ± 1.90 | 65.10 ± 3.40 | 64.90 ± 2.20 | 51.60 ± 3.00 | 73.60 ± 2.50
35 | 66.90 ± 2.20 | 62.70 ± 2.50 | 64.40 ± 3.10 | 52.80 ± 3.60 | 75.40 ± 2.40
50 | 70.30 ± 2.60 | 63.30 ± 1.70 | 64.60 ± 1.50 | 53.50 ± 3.90 | 75.50 ± 2.50
65 | 70.60 ± 1.80 | 62.20 ± 1.60 | 64.90 ± 2.10 | 54.00 ± 3.90 | 76.40 ± 2.20
80 | 71.80 ± 2.00 | 60.90 ± 2.30 | 66.40 ± 2.50 | 53.50 ± 4.70 | 75.80 ± 1.80
95 | 72.90 ± 2.00 | 60.60 ± 2.10 | 65.10 ± 2.70 | 56.00 ± 2.70 | 76.70 ± 1.50
110 | 73.80 ± 1.90 | 59.50 ± 2.20 | 66.20 ± 2.50 | 55.20 ± 3.80 | 76.30 ± 2.10
125 | 72.60 ± 2.20 | 55.30 ± 2.00 | 66.40 ± 2.40 | 55.20 ± 3.90 | 76.80 ± 2.30
128 | 72.20 ± 2.00 | 55.40 ± 2.50 | 65.50 ± 1.90 | 55.60 ± 3.50 | 76.30 ± 3.30
TABLE VII. Weighted mean of F1 score for feature selection with maximum TFR extraction and with ReliefF (training dataset tenfold cross-validation).
No. of features | SVM | k-NN | DT | FDA | EDT
5 | 35.50 ± 7.20 | 42.20 ± 2.60 | 50.40 ± 2.60 | 40.70 ± 3.80 | 50.50 ± 2.40
20 | 67.20 ± 2.00 | 66.00 ± 2.10 | 67.20 ± 2.50 | 49.00 ± 1.70 | 76.50 ± 3.00
35 | 70.30 ± 2.60 | 64.50 ± 1.40 | 66.80 ± 1.70 | 50.20 ± 3.90 | 76.80 ± 1.70
50 | 71.30 ± 2.60 | 65.30 ± 2.90 | 67.50 ± 3.00 | 51.20 ± 2.50 | 76.50 ± 2.60
65 | 72.10 ± 1.80 | 64.80 ± 2.50 | 67.40 ± 2.60 | 50.90 ± 2.00 | 77.10 ± 1.80
80 | 73.00 ± 2.40 | 64.80 ± 1.80 | 67.10 ± 2.90 | 52.70 ± 3.80 | 77.50 ± 1.70
95 | 74.00 ± 1.80 | 64.30 ± 2.70 | 66.90 ± 2.20 | 53.80 ± 4.50 | 76.60 ± 1.90
110 | 73.70 ± 2.40 | 60.00 ± 2.60 | 66.40 ± 2.30 | 54.90 ± 4.00 | 76.80 ± 2.00
125 | 72.90 ± 3.00 | 55.60 ± 3.00 | 65.40 ± 2.20 | 56.30 ± 3.10 | 76.20 ± 1.70
128 | 73.20 ± 2.50 | 54.60 ± 1.50 | 65.50 ± 2.50 | 56.00 ± 2.40 | 76.40 ± 1.40
After determining, for each classifier, the feature subset (and, in the SVM and k-NN cases, the hyper-parameters) that provided the best performance on the training dataset, the models were evaluated on the testing dataset. From the analysis of Tables VI and VII, we selected, for each model, the best subset of features obtained with the two feature selection algorithms. Using those feature sets, we trained our models and assessed them on the testing set. The results are presented in Table VIII, with all metrics computed as weighted means (weighted according to the number of examples of each class in the entire dataset).
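The weighted metrics reported in Table VIII can be reproduced with standard tooling; the sketch below assumes predictions from any of the fitted classifiers and weights each per-class metric by its support in the evaluation set (the paper weights by class frequency in the entire dataset, so the exact weights may differ slightly). Specificity is derived from the confusion matrix, since it is not a built-in scikit-learn score.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, specificity, and F1, plus accuracy."""
    precision, recall, f1, support = precision_recall_fscore_support(
        y_true, y_pred, average=None, zero_division=0)
    cm = confusion_matrix(y_true, y_pred)
    false_pos = cm.sum(axis=0) - np.diag(cm)          # per-class false positives
    true_neg = cm.sum() - cm.sum(axis=1) - false_pos  # per-class true negatives
    specificity = true_neg / (true_neg + false_pos)
    weights = support / support.sum()
    return {"precision": weights @ precision,
            "recall": weights @ recall,
            "specificity": weights @ specificity,
            "f1": weights @ f1,
            "accuracy": accuracy_score(y_true, y_pred)}
```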
TABLE VIII. Weighted mean values for all classifiers obtained in the test dataset (boldface underlined values correspond to the two best classifiers).
Classifier | Parameters | Feature selection | No. of features | Precision | Recall | Specificity | F1 score | Accuracy
SVM (linear, one-vs-one) | C = 2^1 | Kruskal | 110 | 76.11 | 75.16 | 92.69 | 75.25 | 75.2
SVM (linear, one-vs-one) | C = 2^1 | ReliefF | 95 | 75.66 | 74.37 | 92.87 | 74.53 | 74.4
SVM (RBF, one-vs-one) | C = 2^1; γ = 2^3 | Kruskal | 110 | 74.20 | 74.30 | 91.80 | 74.05 | 74.3
SVM (RBF, one-vs-one) | C = 2^1; γ = 2^3 | ReliefF | 95 | 75.27 | 75.95 | 91.79 | 75.38 | 76.0
k-NN | k = 110; metric = cityblock | Kruskal | 20 | 70.33 | 70.19 | 89.16 | 67.23 | 70.2
k-NN | k = 55; metric = cityblock | ReliefF | 20 | 74.18 | 72.64 | 91.04 | 71.12 | 72.6
DT | | Kruskal | 128 | 70.14 | 68.40 | 91.50 | 68.95 | 68.4
DT | | ReliefF | 50 | 70.11 | 68.52 | 91.21 | 68.98 | 68.3
FDA | Pseudo-linear discriminant | Kruskal | 95 | 57.64 | 49.17 | 85.00 | 49.83 | 49.2
FDA | Pseudo-linear discriminant | ReliefF | 125 | 60.53 | 52.12 | 86.45 | 53.24 | 52.1
EDT | 250 trees; GentleBoost | Kruskal | 125 | 79.33 | 80.20 | 93.00 | 79.50 | 80.2
EDT | 250 trees; GentleBoost | ReliefF | 80 | 80.31 | 81.14 | 93.37 | 80.5 | 81.1
The model with the best performance on the testing set was the EDT, using 80 features selected with the ReliefF method. Table IX presents the results obtained for this specific case, divided by class. For all the weighted evaluation metrics, values above 80% were obtained. Figure 9 depicts the percentage of features chosen from each feature domain. All of the contour-based features were selected, demonstrating their appropriateness for USV classification. As previously mentioned, the contour-based features are the set of features obtained from the USV curves extracted from the TFR (as described in Sec. IV B). Since many of those features relate directly to the definitions of the USV classes, we argue that they are particularly relevant in this context. The EDT method presented the highest performance metrics. This is, in part, expected, as USV classes are defined and manually annotated in a top-down process with several "if… then… else…" conditions, resembling the way this type of classification model works.
TABLE IX. EDT classifier with 80 features selected with ReliefF, 250 trees, and GentleBoost method (test dataset).
Class | Precision | Recall | Specificity | F1 score
Complex | 45.5 | 38.5 | 99.5 | 41.7
One frequency step | 63.8 | 57.1 | 94.1 | 60.3
Multiple frequency steps | 61.3 | 40.4 | 98.9 | 48.7
Upward | 84.5 | 92.7 | 87.4 | 88.4
Downward | 70.0 | 82.9 | 97.8 | 81.4
Flat | 91.4 | 86.3 | 97.5 | 88.8
Short | 84.7 | 94.1 | 98.1 | 89.2
Chevron | 75.0 | 75.0 | 99.1 | 75.0
Reverse chevron | 54.6 | 37.5 | 99.6 | 44.4
Composite | 57.9 | 39.3 | 99.3 | 46.8
Mean | 69.9 | 64.4 | 97.1 | 66.5
Weighted mean | 80.3 | 81.1 | 93.4 | 80.46
Analyzing in detail the performance of the best classification model (EDT using 80 features selected with ReliefF; Table IX and Fig. 10), one can observe that the algorithm discriminates the upward, downward, flat, and short classes satisfactorily. These classes have stricter spectral characteristics and are the most frequently vocalized syllables. On the other hand, the complex, reverse chevron, and composite classes presented the worst classification performance, which may be explained, at least in part, by the low number of vocalizations available to train the models. The confusion matrix in Fig. 10 shows that the two classes most often confused with each other were one frequency step and upward. Although the one-frequency-step class has a strict and unequivocal identifier, i.e., the number of frequency steps, this feature is difficult to extract reliably, which leads to several misclassification problems, as also observed for the multiple-frequency-steps class. Notwithstanding, for a multi-class problem with ten unevenly distributed classes, the performance obtained with the EDT method is satisfactory, with the weighted means of recall, precision, and F1 score, as well as the global accuracy, all around 80%. It is also worth reinforcing that all of the contour-related features were used by the EDT classifier (see Fig. 9), demonstrating their importance for the differentiation of USVs; the contour-based features extracted from the TFR curve are directly related to several characteristics used to define USV classes, such as the frequency variation.
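To illustrate what such contour-based features amount to in practice, the following sketch derives a crude contour from the per-frame spectral peak of a spectrogram and computes a few quantities of the kind these features capture (start/end frequency, bandwidth, mean slope). It is a simplified stand-in for the TFR ridge extraction of Sec. IV B; the sampling rate and frequency band shown are assumptions for illustration, not the recording parameters of this study.

```python
import numpy as np
from scipy.signal import spectrogram

def usv_contour_features(x, fs=250_000, fmin=30_000, fmax=120_000):
    """Rough contour features from the maximum-energy ridge of a spectrogram.

    Illustrative only: the per-frame spectral peak stands in for the
    time-frequency ridge used in the actual feature extraction.
    """
    f, t, S = spectrogram(x, fs=fs, nperseg=512, noverlap=384)
    band = (f >= fmin) & (f <= fmax)
    ridge = f[band][np.argmax(S[band, :], axis=0)]   # peak frequency per frame
    frame_step = t[1] - t[0]
    return {
        "start_freq": ridge[0],
        "end_freq": ridge[-1],
        "min_freq": ridge.min(),
        "max_freq": ridge.max(),
        "bandwidth": ridge.max() - ridge.min(),
        "mean_slope": np.mean(np.diff(ridge)) / frame_step,  # Hz per second
    }
```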
While the classification methods presented in Refs. 7 and 9 can only distinguish among five USV classes, the classifiers used in our work can classify ten USV classes. Since the classification performance was not evaluated in Refs. 7 and 9, and the USV class definitions differ to some extent, a quantitative comparison with our methods is not possible. On the other hand, the classification models developed in Ref. 12 can discriminate among nine different types of USVs, matching most of the USV classes used in our work. The authors of Ref. 13 obtained a similar classification performance, as mentioned in Sec. II; however, the number of USVs used in Ref. 13, i.e., 225 USVs, was far smaller, and some of the features were extracted using commercial software. The work presented in this paper therefore offers a more realistic performance analysis, since more than 4000 USVs, obtained under diversified experimental tests and circumstances, were considered.
In this paper, an automatic framework for mouse USV recognition and classification is presented. The first step of this framework is a segmentation algorithm, based on spectral entropy, that isolates individual USVs. The proposed algorithm demonstrated great potential, reaching recall values of up to 97% in its most permissive configuration. Classical machine learning methods were then employed for automatic syllable classification, showing good performance. Several well-known classifiers were evaluated, namely SVM, k-NN, Fisher-LDA, DT, and EDT, with the tree ensemble method presenting the best results. Multiple feature domains were tested to assess their effectiveness in the classification algorithms, confirming the appropriateness of the newly introduced contour-related features. The overall pipeline was designed to classify ten USV classes, a greater number than in similar studies in the area.
As future work, the feature extraction methods should be further refined, namely the method for TFR extraction. Deep learning approaches, such as CNNs and RNNs, based on TF representations of the USVs (spectrograms, continuous wavelet transforms, etc.), should also be explored. Moreover, further tests should be performed on both the segmentation and classification algorithms with data acquired from external systems, to assess their robustness and performance.
In conclusion, this work presents a complete pipeline for processing raw audio files and for detecting and classifying mouse ultrasonic vocalizations. Given the suitable performance obtained in both tasks, it provides a valuable tool for future mouse behavioral studies based on USVs.
This work is funded by the FCT - Foundation for Science and Technology, I.P./MCTES through national funds (PIDDAC), within the scope of CISUC R&D Unit (UIDB/00326/2020 and UIDP/00326/2020), and CIBIT R&D Unit (FCT/UIDB/4950 and FCT/UIDP/4950).
1. M. L. Scattoni, S. U. Gandhy, L. Ricceri, and J. N. Crawley, "Unusual repertoire of vocalizations in the BTBR T+tf/J mouse model of autism," PLoS One 3(8), e3067 (2008). https://doi.org/10.1371/journal.pone.0003067
2. M. Wöhr and R. K. W. Schwarting, "Affective communication in rodents: Ultrasonic vocalizations as a tool for research on emotion and motivation," Cell Tissue Res. 354(1), 81–97 (2013). https://doi.org/10.1007/s00441-013-1607-9
3. J. B. Panksepp, K. A. Jochman, J. U. Kim, J. J. Koy, E. D. Wilson, Q. Chen, C. R. Wilson, and G. P. Lahvis, "Affiliative behavior, ultrasonic communication and social reward are influenced by genetic variation in adolescent mice," PLoS One 2(4), e351 (2007). https://doi.org/10.1371/journal.pone.0000351
4. M. L. Scattoni, L. Ricceri, and J. N. Crawley, "Unusual repertoire of vocalizations in adult BTBR T+tf/J mice during three types of social encounters," Genes Brain Behav. 10(1), 44–56 (2011). https://doi.org/10.1111/j.1601-183X.2010.00623.x
5. J. M. S. Grimsley, J. J. M. Monaghan, and J. J. Wenstrup, "Development of social vocalizations in mice," PLoS One 6(3), e17460 (2011). https://doi.org/10.1371/journal.pone.0017460
6. J. Chabout, A. Sarkar, D. B. Dunson, and E. D. Jarvis, "Male mice song syntax depends on social contexts and influences female preferences," Front. Behav. Neurosci. 9, 76 (2015). https://doi.org/10.3389/fnbeh.2015.00076
7. K. R. Coffey, R. G. Marx, and J. F. Neumaier, "DeepSqueak: A deep learning-based system for detection and analysis of ultrasonic vocalizations," Neuropsychopharmacology 44(5), 859–868 (2019). https://doi.org/10.1038/s41386-018-0303-6
8. M. L. Dent, R. R. Fay, and A. N. Popper, Rodent Bioacoustics (Springer, New York, 2018). https://doi.org/10.1007/978-3-319-92495-3
9. T. E. Holy and Z. Guo, "Ultrasonic songs of male mice," PLOS Biol. 3(12), e386 (2005). https://doi.org/10.1371/journal.pbio.0030386
10. A. H. Fonseca, G. M. Santana, G. M. B. Ortiz, S. Bampi, and M. O. Dietrich, "Analysis of ultrasonic vocalizations from mice using computer vision and machine learning," Elife 10, e59161 (2021). https://doi.org/10.7554/eLife.59161
11. M. Van Segbroeck, A. T. Knoll, P. Levitt, and S. Narayanan, "Mupet–mouse ultrasonic profile extraction: A signal processing tool for rapid and unsupervised analysis of ultrasonic vocalizations," Neuron 94(3), 465–485 (2017). https://doi.org/10.1016/j.neuron.2017.04.005
12. R. O. Tachibana, K. Kanno, S. Okabe, K. I. Kobayasi, and K. Okanoya, "USVSEG: A robust method for segmentation of ultrasonic vocalizations in rodents," PLoS One 15(2), e0228907 (2020). https://doi.org/10.1371/journal.pone.0228907
13. A. P. Vogel, A. Tsanas, and M. L. Scattoni, "Quantifying ultrasonic mouse vocalizations using acoustic analysis in a supervised statistical machine learning framework," Sci. Rep. 9(1), 8100 (2019). https://doi.org/10.1038/s41598-019-44221-3
14. L. R. Saraiva, K. Kondoh, X. Ye, K.-H. Yoon, M. Hernandez, and L. B. Buck, "Combinatorial effects of odorants on mouse behavior," Proc. Natl. Acad. Sci. U.S.A. 113(23), E3300–E3306 (2016). https://doi.org/10.1073/pnas.1605973113
15. D. Pessoa, L. Petrella, M. Castelo-Branco, and C. Teixeira, "Automatic segmentation of ultrasonic vocalizations in rodents," in Proceedings of MEDICON 2019, Coimbra, Portugal (September 26–28, 2019), pp. 37–46.
16. S. Vajda, "The mathematical theory of communication. By Claude E. Shannon and Warren Weaver. Pp. 117. $2.50. 1949. (University of Illinois Press, Urbana)," Math. Gazette 34(310), 312–313 (1950). https://doi.org/10.2307/3611062
17. Mathworks, "Spectral entropy of signal—MATLAB pentropy," https://www.mathworks.com/help/signal/ref/pentropy.html (Last viewed May 6, 2022).
18. C. Erbe and A. R. King, "Automatic detection of marine mammals using information entropy," J. Acoust. Soc. Am. 124(5), 2833–2840 (2008). https://doi.org/10.1121/1.2982368
19. A. Ivanenko, P. Watkins, M. A. J. van Gerven, K. Hammerschmidt, and B. Englitz, "Classifying sex and strain from mouse ultrasonic vocalizations using deep learning," PLoS Comput. Biol. 16(6), e1007918 (2020). https://doi.org/10.1371/journal.pcbi.1007918
20. Y. Zhang, R. Jin, and Z.-H. Zhou, "Understanding bag-of-words model: A statistical framework," Int. J. Mach. Learn. Cybern. 1(1), 43–52 (2010). https://doi.org/10.1007/s13042-010-0001-0
21. G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Proceedings of the Workshop on Statistical Learning in Computer Vision, ECCV, Prague, Czech Republic (May 11–14, 2004), pp. 1–22.
22. P. F. Alcantarilla, A. Bartoli, and A. J. Davison, "KAZE features," in Computer Vision—ECCV 2012, edited by A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid (Springer, Berlin, 2012), pp. 214–227.
23. Mathworks, "detectKAZEFeatures," https://www.mathworks.com/help/vision/ref/detectkazefeatures.html (Last viewed May 6, 2022).
24. A. Lerch, An Introduction to Audio Content Analysis (Wiley, New York, 2012). https://doi.org/10.1002/9781118393550
25. J. Xie, M. Towsey, J. Zhang, X. Dong, and P. Roe, "Application of image processing techniques for frog call classification," in Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec, Canada (September 27–30, 2015), pp. 4190–4194.
26. M. A. Roch, T. S. Brandes, B. Patel, Y. Barkley, S. Baumann-Pickering, and M. S. Soldevilla, "Automated extraction of odontocete whistle contours," J. Acoust. Soc. Am. 130(4), 2212–2223 (2011). https://doi.org/10.1121/1.3624821
27. A. Mallawaarachchi, S. H. Ong, M. Chitre, and E. Taylor, "Spectrogram denoising and automated extraction of the fundamental frequency variation of dolphin whistles," J. Acoust. Soc. Am. 124(2), 1159–1170 (2008). https://doi.org/10.1121/1.2945711
28. H. Ou, W. W. L. Au, L. M. Zurk, and M. O. Lammers, "Automated extraction and classification of time-frequency contours in humpback vocalizations," J. Acoust. Soc. Am. 133(1), 301–310 (2013). https://doi.org/10.1121/1.4770251
29. D. Iatsenko, P. McClintock, and A. Stefanovska, "Extraction of instantaneous frequencies from ridges in time–frequency representations of signals," Signal Process. 125, 290–303 (2016). https://doi.org/10.1016/j.sigpro.2016.01.024
30. G. W. Moran, "Locally-weighted-regression scatter-plot smoothing (LOWESS): A graphical exploratory data analysis technique (1984-09)," http://hdl.handle.net/10945/19419 (Last viewed May 6, 2022).
31. J. Xie, M. Towsey, J. Zhang, and P. Roe, "Acoustic classification of Australian frogs based on enhanced features and machine learning algorithms," Appl. Acoust. 113, 193–201 (2016). https://doi.org/10.1016/j.apacoust.2016.06.029
32. C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer, Berlin, 2006).
33. C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Trans. Neural Netw. 13(2), 415–425 (2002). https://doi.org/10.1109/72.991427
34. S. Gunasekaran and K. Revathy, "Content-based classification and retrieval of wild animal sounds using feature selection algorithm," in Proceedings of the 2010 Second International Conference on Machine Learning and Computing, Bangalore, India (February 9–11, 2010), pp. 272–275.
35. R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. (Wiley, Nashville, TN, 2000).
36. J. R. Quinlan, "Induction of decision trees," Mach. Learn. 1(1), 81–106 (1986). https://doi.org/10.1007/BF00116251
37. A. Lemmens and C. Croux, "Bagging and boosting classification trees to predict churn," J. Mark. Res. 43(2), 276–286 (2006). https://doi.org/10.1509/jmkr.43.2.276
38. M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of ReliefF and RReliefF," Mach. Learn. 53(1), 23–69 (2003). https://doi.org/10.1023/A:1025667309714
© 2022 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).