1 Introduction
Sounds surround human beings everywhere and, owing to their physical properties, they can be perceived acoustically. Acoustic events refer to the many everyday sounds that are generated naturally or artificially (namely, the sounds found in everyday environments, excluding speech and music). The development of an acoustic event recognition system (AERS) contributes to the development of intelligent systems capable of understanding sound within a context. These systems are important for real-world applications such as activity monitoring [15, 39, 45], ambient assisted living [28, 40, 50], human-computer interaction [8, 41, 42], security surveillance [1, 2], and assisted robotics [21, 38, 48], among others.
Automatic recognition of acoustic events in real situations is not an easy task, because the audio captured by microphones contains a mixture of different sound sources. Recent research on AERS has focused on two types of classification problems: classification of acoustic events for a specific context and recognition of acoustic events into contextual classes [52, 53]. The former can be associated, for instance, with activity recognition in a home environment, where the acoustic events that occur in a specific dwelling space can offer useful information.
Audio-based scene understanding can be more assertive if the system exclusively recognizes the acoustic events that occur in a specific place. On the other hand, there is the recognition of human activity from sounds that occur in different places, for instance, distinguishing the contextual events of a home from those of an office. It would be difficult to say what kind of activity is being carried out if the contextual classes of the sounds are not clear; moreover, not limiting the sounds in the scene makes this task even more difficult.
Preliminary work on AERS adopted approaches used for processing speech and music; however, the non-stationary characteristics of acoustic events made recognition problematic for databases with a large number of sound sources [13]. For example, in speech recognition it is common to use a phonetic structure that can be seen as a basic component of voice: spoken words can be divided into elemental phonemes over which probabilistic models can be applied. Conversely, phoneme-based approaches cannot be applied to acoustic events such as the sound of a car crash or of water being poured into a glass. Even if it were possible to create a dictionary of basic units for these events, modelling the signal variation in time would be difficult. The same occurs with the attempt to compare music and acoustic events, because the latter do not exhibit significant stationary patterns such as melody and rhythm [13].
The recognition of acoustic events involves two phases: a feature extraction phase followed by a classification phase. The feature extraction phase plays two roles: a dimension reduction role and a representation role. An AERS uses stationary and non-stationary feature extraction techniques. Most feature extraction algorithms use a scheme called bag-of-frames. The bag-of-frames approach treats the signal in a blind way, using a systematic and general scheme in which the signal is divided into consecutive overlapping frames, from each of which a vector of features is determined. The features are supposed to capture the characteristic information of the signal for the problem at hand. These vectors are then aggregated (hence the "bag") and fed to the next phase of an audio recognition system [3].
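To illustrate the bag-of-frames scheme, the following sketch splits a signal into consecutive overlapping frames and computes one feature vector per frame; the frame length and hop are chosen to match the 30 ms frames with 50% overlap used later in Section 4.1, while the energy feature is only a placeholder, not a feature used in this paper.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into consecutive overlapping frames (one per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def bag_of_frames(x, frame_len, hop, feature_fn):
    """Compute one feature vector per frame and aggregate them into a 'bag'."""
    frames = frame_signal(x, frame_len, hop)
    return np.array([feature_fn(f) for f in frames])

# Example: 3 s of audio at 44.1 kHz, 30 ms frames with 50% overlap,
# using the frame energy as a placeholder feature.
fs = 44100
x = np.random.randn(3 * fs)
features = bag_of_frames(x, frame_len=1323, hop=661,
                         feature_fn=lambda f: np.array([np.sum(f ** 2)]))
print(features.shape)
```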
Audio signals have traditionally been characterized by Mel Frequency Cepstral Coefficients (MFCC). The methodology for computing MFCC involves a filter bank that approximates some important properties of the human auditory system. MFCC have been shown to work well for structured sounds such as speech and music [16, 23, 25, 26, 27, 37]. Since MFCC have been successfully used in speech and music applications, some works suggest using MFCC to characterize acoustic events, which comprise a large and diverse variety of sounds, including those with strong temporal structure [4, 35, 40]. In addition, MFCC are often used by researchers for benchmarking their work.
For the classification phase of an AERS there are different machine learning techniques, such as the Support Vector Machine (SVM). SVM is a classifier that discriminates the data by creating boundaries between classes rather than estimating class-conditional densities; in other words, SVM can achieve an accurate classification rate even when the sample size is small, a common scenario in acoustic event classification [14, 24, 51].
Artificial Neural Networks (ANN) are another machine learning technique widely used in audio recognition systems. ANN research deals with the study and construction of systems able to learn from data. ANN algorithms infer unknowns from known data, a characteristic well suited to acoustic events, where an event of interest needs to be differentiated from a mixture of sounds [7, 29, 36].
There are other techniques that can be used to identify acoustic events, such as audio signature recognition. In this technique the challenge is to find the acoustic events that sound similar to the audio that the system captures. The similarity is evaluated using a distance function. Determining an audio signature involves two fundamental processes: a feature extraction process and a modeling process. The latter refers to the most compact representation that can describe a signal while preserving the robustness of the model against typical audio degradations [40]. Audio signatures thus work very well in AERS, but the problem becomes complicated when an acoustic event present in a mixture of sounds must be identified. This problem usually leads to applying source separation techniques and machine learning algorithms to deal with the complexity of the signals.
In this work we consider the signals unprocessed. We also use no source separation technique, because our intention is to evaluate how robustly the characteristic information of an acoustic event is retained against background noise using two audio features: MFCC, which is the state-of-the-art benchmark, and the multiband spectral entropy signature (MSES), a technique that has been successfully used in audio fingerprinting, speech recognition and other audio applications [6, 9, 10, 11, 12, 30].
In addition, the MSES feature has never been studied for recognizing acoustic events exclusively in indoor domestic environments. For the reasons mentioned above, the audio signature approach is used; namely, it is assumed that there is only one instance per acoustic event (in the traditional audio signature approach there is only one version of each song) for the sound classes to be considered, plus noise-contaminated versions of that instance (similar to distorting each song with different types of degradation). Therefore, the aim is not to classify different instances of acoustic events into classes, but to evaluate the robustness of MFCC versus MSES at a low Signal to Noise Ratio (SNR) in the mixture of acoustic events.
The machine learning techniques used in this paper were selected according to the recognition and classification results reported in the related literature; their configuration follows the experimental ideas in that literature and, in some cases, optimization algorithms [7, 14, 21, 26].
Regarding classification, an optimization with a genetic algorithm and with particle swarm optimization was carried out in order to improve the performance of the best combination found between audio features and the studied classifiers.
The database we built is an additional contribution, since none of the reviewed works provides a database similar to the one proposed in this paper. It has the particularity of being complex in its construction, mixing sounds at a low SNR. Later on, we describe this database in detail and encourage readers to use it in their future work. In our case, it will be part of our testbed for exploring activity recognition for elders living alone, for instance, to identify acoustic events that might indicate whether the elder is using the blender, or to identify sounds of risk in a home environment.
2 Theoretical Background
The characterization of audio signals refers to the process of extracting the characteristics that abstractly describe a signal and reflect its most relevant perceptual aspects. To extract the characteristics of an audio signal, it is common to segment the signal into short, possibly overlapping frames that are sufficiently close to each other, in such a way that multiple perceptually distinguishable events are not covered by a single frame [3]. This process of splitting the signal into frames is a common part of computing both MFCC and MSES. The next subsections describe the process for determining both audio features, as well as the different classification techniques used in the experiments that support our results.
2.1 Mel Frequency Cepstral Coefficients
MFCC are short-term spectral features whose success is due to their ability to represent the amplitude spectrum in a compact form. MFCC are based on the non-linear frequency scale of human auditory perception, which uses two types of filters: linearly spaced filters and logarithmically spaced filters. The signal is expressed on the Mel frequency scale to capture the most important characteristics of an audio signal [46].
For computing MFCC, the audio signal is divided into short time frames and a feature vector of cepstral coefficients is extracted from each one. For every frame, the magnitude spectrum $|X(k)|$ is obtained through the Fast Fourier Transform (FFT). The process continues by scaling the magnitude spectrum in frequency with a Mel-spaced filter bank,

$$\tilde{S}(m) = \sum_{k=0}^{N-1} |X(k)|^{2} H_{m}(k), \quad \text{for } m = 1, \ldots, M,$$

where $H_{m}(k)$ is the frequency response of the $m$-th filter and $M$ is the number of filters. The cepstral coefficients are then obtained by applying the Discrete Cosine Transform to the logarithm of the filter bank energies,

$$c(n) = \sum_{m=1}^{M} \log\big(\tilde{S}(m)\big) \cos\!\left(\frac{\pi n}{M}\Big(m - \frac{1}{2}\Big)\right), \quad \text{for } n = 0, 1, \ldots, L-1,$$

where $L$ is the number of cepstral coefficients that are kept. In this work, we focus on the ISP implementation for computing MFCC [46]; this implementation considers a filter bank whose filters are spaced according to the Mel scale.
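As a rough sketch of this pipeline, the following fragment computes the Mel filter bank energies and applies the DCT to their logarithm for a single frame. The number of filters and cepstral coefficients, and the triangular filter construction, are illustrative assumptions of a textbook MFCC and do not necessarily match the ISP implementation [46].

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with center frequencies equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """MFCC of one frame: power spectrum -> Mel filter bank -> log -> DCT-II."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2
    energies = mel_filterbank(n_filters, n_fft, fs) @ spec
    log_e = np.log(energies + 1e-12)
    m = np.arange(1, n_filters + 1)
    # DCT-II of the log filter-bank energies gives the cepstral coefficients
    return np.array([np.sum(log_e * np.cos(np.pi * n * (m - 0.5) / n_filters))
                     for n in range(n_ceps)])

print(mfcc_frame(np.random.randn(1323), fs=44100))
```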
2.2 Shannon’s Entropy and Spectral Entropy
When audio signals are severely degraded, the features that describe them usually disappear; therefore, the problem becomes finding the features that are still present in the signal despite the level of degradation to which it was subjected. Authors focused on this problem have explored entropy to characterize audio signals as robustly as possible against different types of degradation. In this regard, we start by discussing Shannon's entropy and the concept of spectral entropy.
In information theory, Shannon's entropy is related to the uncertainty of a source of information [43]. For example, entropy is used to measure the predictability of a random signal and the "peakiness" of a probability distribution function. It is common to use (4) to measure, through entropy, the amount of information the signal carries,

$$H(X) = -\sum_{i=1}^{N} p(x_i) \log_{2} p(x_i). \quad (4)$$

Here, $p(x_i)$ is the probability of the $i$-th symbol $x_i$ of the source $X$.
Some estimate of the Probability Distribution Function (PDF) is needed to determine the entropy of a signal; for this purpose, parametric methods, non-parametric methods and histograms can be used. If histograms are chosen, care must be taken that the amount of data involved is high enough to avoid spurious peaks in the histogram.
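A minimal sketch of a histogram-based entropy estimate is given below; the number of bins and the example signal are illustrative assumptions.

```python
import numpy as np

def shannon_entropy_hist(x, n_bins=64):
    """Estimate Shannon's entropy (in bits) of a signal from a histogram-based PDF."""
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # empty bins contribute 0 * log 0 := 0
    return -np.sum(p * np.log2(p))

x = np.random.randn(44100)            # enough samples to obtain a smooth histogram
print(shannon_entropy_hist(x))
```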
When talking about spectral entropy it is necessary to review Shen's work [44], since the concept was introduced there for the first time as an additional feature for endpoint detection (voice activity detection). The idea of spectral entropy is to consider the spectrum of a signal as a PDF in order to capture the peaks of the spectrum and their locations. To convert the spectrum into a PDF, each frequency component of the spectrum is divided by the sum of all the components, namely,

$$p_k = \frac{X(k)}{\sum_{j=1}^{N} X(j)},$$

where $X(k)$ is the energy of the $k$-th frequency component and $p_k$ is the corresponding probability-like value.
The concept of multiband spectral entropy was introduced in [32]; it consists of dividing the spectrum into equal-sized sub-bands and computing the entropy of each one of them using (4), where the spectrum of each sub-band is assumed to be a PDF. Additionally, [33] showed that multiband spectral entropy works very well with additive wide-band noise and at low SNR levels.
2.3 Multiband Spectral Entropy Signature
Based on the idea presented by Misra et al. [32, 33], the spectral entropy concept can be used to obtain a robust signature that is useful in different audio recognition problems [6, 9, 10, 11, 12, 30]. Unlike Misra et al., this work computes the entropy at each sub-band by using the entropy of a random process [9].
Let $X$ be a random vector formed by the real and imaginary parts of the spectral coefficients within a sub-band.
Taking some precautions, the entropy of a Gaussian random vector can be determined using the continuous version of Shannon's entropy, which is given by (5):

$$h(X) = -\int_{-\infty}^{\infty} f_X(x) \ln f_X(x)\, dx, \quad (5)$$

where $f_X(x)$ is the PDF of the random vector $X$.
If it is assumed that the random vector follows a Gaussian distribution with a mean of zero and a covariance matrix $\Sigma$, its entropy is given by (6):

$$h(X) = \frac{1}{2} \ln\big((2\pi e)^{d} \det \Sigma\big), \quad (6)$$

where $d$ is the dimension of the random vector.
In order to compute MSES, the audio signal is divided into frames, and from each of them a vector with one entropy value per critical band is extracted. For this purpose, the spectrum of each frame is divided into the critical bands of the Bark scale, which are listed in Table 1.
Critical Band | Lower cut-off (Hz) | Central Frequency (Hz) | Higher cut-off (Hz) | Bandwidth (Hz) |
1 | 0 | 50 | 100 | 100 |
2 | 100 | 150 | 200 | 100 |
3 | 200 | 250 | 300 | 100 |
4 | 300 | 350 | 400 | 100 |
5 | 400 | 450 | 510 | 110 |
6 | 510 | 570 | 630 | 120 |
7 | 630 | 700 | 770 | 140 |
8 | 770 | 840 | 920 | 150 |
9 | 920 | 1000 | 1080 | 160 |
10 | 1080 | 1170 | 1270 | 190 |
11 | 1270 | 1370 | 1480 | 210 |
12 | 1480 | 1600 | 1720 | 240 |
13 | 1720 | 1850 | 2000 | 280 |
14 | 2000 | 2150 | 2320 | 320 |
15 | 2320 | 2500 | 2700 | 380 |
16 | 2700 | 2900 | 3150 | 450 |
17 | 3150 | 3400 | 3700 | 550 |
18 | 3700 | 4000 | 4400 | 700 |
19 | 4400 | 4800 | 5300 | 900 |
20 | 5300 | 5800 | 6400 | 1100 |
21 | 6400 | 7000 | 7700 | 1300 |
22 | 7700 | 8500 | 9500 | 1800 |
23 | 9500 | 10500 | 12000 | 2500 |
24 | 12000 | 13500 | 15500 | 3500 |
We use (7) to change Hertz to Barks,

$$z(f) = 13 \arctan(0.00076 f) + 3.5 \arctan\!\left(\left(\frac{f}{7500}\right)^{2}\right), \quad (7)$$

where $f$ is the frequency in Hertz and $z(f)$ is the corresponding value in Barks.
The process continues by computing the entropy of each critical band using (6). For each sub-band, the spectral coefficients are assumed to be normally distributed. This assumption is made because a good estimate of the PDF cannot be obtained with non-parametric methods, since the lowest bands of the spectrum have too few coefficients. For computing the entropy, a random process with two random variables is considered: the real and imaginary parts of the spectral coefficients are assumed to be zero-mean random variables with a normal distribution. Hence, for the two-dimensional case the entropy is determined by

$$h = \ln(2\pi e) + \frac{1}{2} \ln\big(\sigma_{r}^{2}\sigma_{i}^{2} - \sigma_{ri}^{2}\big),$$

where $\sigma_{r}^{2}$ and $\sigma_{i}^{2}$ are the variances of the real and imaginary parts and $\sigma_{ri}$ is their covariance.
This signature captures the level of information content for every critical band and frame position in time.
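A compact sketch of this computation is shown below: the spectrum of each frame is split into the 21 critical bands below 8000 Hz (Table 1) and the two-dimensional Gaussian entropy of the real and imaginary spectral coefficients is computed per band. The frame length, hop and numerical guard are illustrative assumptions.

```python
import numpy as np

# Lower/higher cut-offs (Hz) of the 21 critical bands below 8000 Hz (Table 1).
BAND_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700]

def mses_frame(frame, fs):
    """One entropy value per critical band, assuming the real and imaginary
    spectral coefficients in each band are zero-mean Gaussian variables."""
    spec = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    entropies = []
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        band = spec[(freqs >= lo) & (freqs < hi)]
        cov = np.cov(np.vstack([band.real, band.imag]))
        det = max(np.linalg.det(cov), 1e-20)   # guard against degenerate bands
        entropies.append(np.log(2 * np.pi * np.e) + 0.5 * np.log(det))
    return np.array(entropies)

def mses_signature(x, fs, frame_len, hop):
    """Stack the per-frame entropy vectors into a (bands x frames) signature."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    return np.column_stack([mses_frame(f, fs) for f in frames])

sig = mses_signature(np.random.randn(3 * 44100), 44100, frame_len=1323, hop=661)
print(sig.shape)   # approximately (21, 200)
```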
Figures 1 and 2 show the signatures of two acoustic events obtained with the MSES method. The time-domain signals of the acoustic events "Bread being sliced" (Fig. 1) and "Microwave On-Off" (Fig. 2) are shown in the upper panels, whereas the spectrograms of both signals appear in the middle panels. The bottom panel of each figure displays the MSES signature of the corresponding acoustic event.
2.4 Similarity Distance Functions
A measure of similarity indicates the strength of the relationship between two data points: the more the two data points resemble one another, the larger the similarity measure is. Let $x$ and $y$ be two data points represented as feature vectors. A similarity distance function refers to a function $d(x, y)$ that quantifies how far apart the two data points are; the smaller the distance, the more similar the points.
The idea of similarity is quite natural for the Hamming distance, since it defines the distance between two data points as the number of symbols or bits in which they differ. Another distance adopted to measure the similarity between two data points is the Cosine distance [17]. The Cosine distance measures the similarity between two vectors in an inner-product space by evaluating the cosine of the angle between them.
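The two distances can be sketched as follows; the example vectors are arbitrary placeholders.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of positions in which two binary vectors differ."""
    return int(np.count_nonzero(a != b))

def cosine_distance(a, b):
    """1 - cos(angle) between two real-valued vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 0, 1, 1], dtype=int)
b = np.array([1, 1, 1, 0], dtype=int)
print(hamming_distance(a, b))              # 2

u = np.array([0.2, 0.8, 0.4])
v = np.array([0.1, 0.9, 0.3])
print(cosine_distance(u, v))
```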
2.5 Artificial Neural Networks
An Artificial Neural Network (ANN) is a mathematical model that simulates the behavior of biological neurons. The ANN emulates the human learning process based on equation (9), in which the output of a neuron depends on the inputs $x_i$, the synaptic weights $w_i$, a bias $b$ and a transfer function $f$:

$$y = f\!\left(\sum_{i} w_i x_i + b\right). \quad (9)$$
The transfer function used in a neural network can be, for example, the sigmoid function, the linear function or the hyperbolic tangent sigmoid function. For training the neural network, the back-propagation method is used to update the weights in each epoch. The learning algorithm can be gradient descent with its variants (adaptive learning rate, momentum, or both), as well as the scaled conjugate gradient and the Fletcher-Reeves, Polak-Ribiére and Powell-Beale variants of the conjugate gradient.
2.6 Support Vector Machine
The Support Vector Machine (SVM) model is a supervised algorithm that creates a hyperplane separating the data into classes. The objective is to find an optimal hyperplane that maximizes the distance between the separating hyperplane and the closest points (the support vectors) of the training data set. If the data are not linearly separable, a modified version of SVM projects the original data into a high-dimensional space through kernel functions. Different kernels have been proposed in the literature, such as linear, Gaussian and polynomial kernels. In a multi-class scenario, the SVM model assigns the label +1 to one of the classes and -1 to all the remaining classes. This results in as many binary classifiers as there are classes (a one-versus-all scheme).
3 Database
The kitchen is one of the home's spaces where different sound sources can occur at the same time, especially when cooking. For this work we are interested in a kitchen environment where three different sound sources occur at the same moment. We believe that by mixing three sounds we can obtain a more realistic kitchen environment. The sound mixing process considers two of the three sound sources as the background disturbance (the noise), while the remaining sound is the acoustic event (the signal) to be recognized. Additionally, we add an extra component to the mixing process, which consists of making the identification of the signal within the noise perceptually more difficult. This is done by mixing at an SNR of 3 dB (decibels).
In the literature, it is common to find databases containing different kinds of acoustic events; however, it is difficult to find a database with a mixture of kitchen sounds. For this reason, our work included building a database using the scheme presented in Beltrán-Márquez et al. [5]. Sixteen audio files were collected, each one corresponding to a class of kitchen sound. The sounds were downloaded from http://www.soundsnap.com, http://www.freesfx.co.uk and http://www.sounddogs.com. The audio files are in WAV format, with a sampling frequency of 44100 Hz and 16-bit coding. No copyright infringement was intended. The downloaded sounds are presented in Table 2.
Sounds^a | Hamming Distance, MFCC | Hamming Distance, MSES | Cosine Distance, MFCC | Cosine Distance, MSES |
C1 | 43.80 | 51.42 | 49.52 | 95.23 |
C2 | 0 | 0 | 0.95 | 6.66 |
C3 | 100 | 100 | 90.47 | 94.28 |
C4 | 100 | 100 | 84.76 | 100 |
C5 | 99.04 | 100 | 80 | 80.95 |
C6 | 100 | 98.09 | 100 | 51.42 |
C7 | 41.90 | 40 | 44.76 | 97.14 |
C8 | 65.71 | 62.85 | 42.85 | 100 |
C9 | 27.61 | 40 | 36.19 | 45.71 |
C10 | 57.14 | 79.04 | 69.52 | 78.09 |
C11 | 14.28 | 38.09 | 28.57 | 17.14 |
C12 | 43.80 | 35.23 | 61.90 | 87.61 |
C13 | 100 | 100 | 94.28 | 96.19 |
C14 | 53.33 | 70.47 | 63.80 | 97.14 |
C15 | 98.09 | 84.76 | 85.71 | 100 |
C16 | 100 | 100 | 81.90 | 100 |
Average | 65.29 | 68.75 | 63.45 | 77.97 |
a The different acoustic events are: (C1) Bread being sliced, (C2) Chop food quickly and strongly, (C3) Pouring soda into a glass, (C4) Electric blender liquefying food, (C5) Frying chicken in a pan, (C6) Hot oil in a pan, (C7) Burner of a stove, (C8) Making popcorn in a microwave, (C9) Cooking fryer, (C10) Peeling potatoes, (C11) Making popcorn in a pot, (C12) Turning a microwave on and off, (C13) Pouring water into a glass, (C14) Slicing onions, (C15) Boiling teapot, and (C16) Boiling eggs.
The audio signature approach suggests the use of signatures between one and fifteen seconds long. All downloaded audio files have a length of three seconds (we consider three seconds of audio enough to identify a sound from the environment). As indicated above, the database starts from sixteen original sounds for mixing. First, the mixing process forms a dataset with the mixtures of all combinations of pairs of sounds. Second, all the elements of this dataset are combined with each one of the sixteen original sounds to obtain mixtures of three sounds. Repetitions of sounds in a single mixture are avoided. All mixtures are obtained using 3 dB of SNR; for this, the sixteen original sounds are considered the "signal" (the acoustic events to identify) and the elements of the dataset the "noise". Figure 3 shows an illustration of the sound mixing process. The equation $\mathrm{SNR_{dB}} = 10 \log_{10}(P_{signal}/P_{noise})$ is used to scale the noise so that the desired SNR is obtained, where $P_{signal}$ and $P_{noise}$ are the powers of the acoustic event and of the disturbance, respectively.
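A sketch of this mixing step is given below: the noise is scaled so that the mixture reaches the desired SNR and then added to the signal. The random placeholder signals merely stand in for the actual kitchen sounds.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that the signal-to-noise ratio of the mixture is `snr_db`,
    then return signal + scaled noise."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + scale * noise

# Example: the acoustic event to recognize is mixed with the sum of two
# disturbing sounds at 3 dB of SNR.
fs = 44100
event = np.random.randn(3 * fs)                       # placeholder acoustic event
disturbance = np.random.randn(3 * fs) + np.random.randn(3 * fs)
mixture = mix_at_snr(event, disturbance, snr_db=3.0)
```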
In the experiments, we used classifiers such as Similarity Distance, k-Nearest Neighbors (KNN), SVM and ANN. For the experiments with KNN, ANN and SVM, we generated a training dataset to train the classification models (this is because the elements of the database are used as test elements to assess the classification models). This training dataset is built using the original signal of each one of the sixteen kitchen sounds and two degraded versions of each one of them (this procedure guarantees having more data for training, since there are no more instances for each sound class). Degradation consists of distorting the signal by adding white Gaussian noise.
4 Experiments
In this work, we use similarity measures as a baseline experiment, in order to have a starting point with respect to the performance of the considered classifiers. The search by similarity identifies which candidate entities are most similar to one or more input entities. In the next section, we describe how to compute these entities from an audio signature approach.
On the other hand, as previously stated, the ANN and SVM algorithms have already been used in acoustic event classification tasks [7, 14, 24, 29, 36, 51]. Therefore, we consider it appropriate to include these models in our experiments, using a Bayesian optimization strategy for SVM and different architectures for ANN to identify their best parameters. Also, the KNN algorithm was included in our study as a baseline model because of its easy implementation and, for this particular case, to test different distance metrics and numbers of neighbors.
To understand the process to be followed in our experiments, Figure 4 shows the block diagram of the sequential activities carried out in this section.
4.1 MFCC and MSES Signatures
To extract both the MFCC and MSES signatures, the following procedure was implemented. a) First, stereo signals are converted to monaural by averaging both channels, and each audio is cut to three seconds in length. b) Frames of 30 ms are used to divide the monaural signal (i.e., 1323 samples per frame at a sampling frequency of 44100 Hz). c) Consecutive frames have an overlap of 50%; hence, there are 200 frames per audio signal. d) The FFT of each frame is computed to obtain its spectrum.
With the FFTs we are ready to compute the Mel Frequency Cepstral Coefficients (Section 2.1) and the Multiband Spectral Entropy Signature (Section 2.3). An additional point is that the MSES signatures are extracted considering a bandwidth from 0 Hz up to 8000 Hz; hence, only 21 critical bands are used. This entails that each feature vector is 21-dimensional, so each signature is a matrix of 21 coefficients by 200 frames.
4.2 Baseline Experiment with Similarity Distances
The baseline experiment consists of using similarity distances for the recognition of acoustic events from the database of kitchen sounds. It considers two different kinds of signatures, one using normalized values and the other binary values. To obtain the former, the values of each signature are normalized to a common range.
Haitsma's work presents a method to binarize audio signatures, which consists of taking the sign of the differences between consecutive values [22]. For the baseline experiment, the sign of each difference is encoded with one bit: 1 if the difference is positive and 0 otherwise.
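A sketch of this binarization is shown below; taking the differences along the time axis of the signature is an assumption made for the example.

```python
import numpy as np

def binarize_signature(sig):
    """Binary signature in the spirit of Haitsma's method [22]: 1 where the
    difference between consecutive values (along the time axis) is positive,
    0 otherwise."""
    diff = np.diff(sig, axis=-1)
    return (diff > 0).astype(np.uint8)

sig = np.random.rand(21, 200)        # a normalized (bands x frames) signature
bits = binarize_signature(sig)
print(bits.shape)                    # (21, 199)
```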
4.3 Experiment with Artificial Neural Networks
This experiment consists of training neural networks to classify the acoustic events that are considered the signal (not the noise) in the audios of the database. Two neural networks were considered, one trained with MFCC signatures and the other with MSES signatures. To train the neural networks, we used the normalized signatures extracted from each audio of the training dataset. Therefore, there are 48 signatures for training the MFCC network and another 48 signatures for training the MSES network.
For both MFCC and MSES, the neural networks consist of 2 hidden layers and 16 neurons in the output layer; the input layer has 4200 neurons (i.e., each signature of size 21 × 200 is arranged as a 4200-dimensional input vector). The first and second designs use gradient descent back-propagation with an adaptive learning rate, and gradient descent with momentum and an adaptive learning rate, respectively.
In the third design, scaled conjugate gradient back-propagation is applied with 79 neurons in the first hidden layer and 22 neurons in the second hidden layer. To set the number of neurons, a search for the best performance of the neural network in the learning stage was made, increasing the hidden layers from 10 up to 200 neurons, one neuron at a time. Finally, in every ANN the hyperbolic tangent sigmoid transfer function is applied in the first and second hidden layers, and the logarithmic-sigmoid transfer function in the output layer.
The classification process consists of assessing the neural networks with the normalized signature of each mixture of kitchen sounds in the database. If a neural network correctly classifies a given acoustic event in the entire database, then there will be 105 true positives for that class. The performance goal and number of epochs for all the neural networks are 1e-06 and 8000, respectively.
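The sketch below trains a comparable multilayer perceptron with scikit-learn as a rough analogue of this setup; scikit-learn does not provide the MATLAB-style training algorithms used here (gradient descent with momentum and adaptive learning rate, scaled conjugate gradient), so plain SGD with momentum is used instead, and the random arrays merely mimic the 48 training signatures of 4200 values described above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# 48 training signatures (16 classes x 3 versions), each flattened to 4200 values.
rng = np.random.default_rng(0)
X_train = rng.random((48, 4200))
y_train = np.repeat(np.arange(16), 3)

mlp = MLPClassifier(hidden_layer_sizes=(79, 22),   # layer sizes of the third design
                    activation='tanh',
                    solver='sgd',
                    learning_rate='adaptive',
                    momentum=0.9,
                    max_iter=8000,
                    tol=1e-6,
                    random_state=0)
mlp.fit(X_train, y_train)

X_test = rng.random((5, 4200))                     # signatures of mixed sounds
print(mlp.predict(X_test))
```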
4.4 Experiment with Support Vector Machines
The same training dataset used for the ANN is used for the experiments with SVM. In our implementation, the hyper-parameters of the SVM are selected through Bayesian optimization, as mentioned above.
4.5 Experiment with K-Nearest Neighbors
As with the SVM models, the hyper-parameter optimization function of MATLAB® is used to select the number of neighbors and the distance metric of the KNN classifier.
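Since the exact MATLAB optimization routine is not reproduced here, the sketch below uses a plain grid search over the number of neighbors and the distance metric as a stand-in; the candidate values and the random training data are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.random((48, 4200))                 # same training set as the ANN/SVM
y_train = np.repeat(np.arange(16), 3)

param_grid = {'n_neighbors': [1, 3, 5],
              'metric': ['euclidean', 'cityblock', 'cosine']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
```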
5 Results and Discussion
In this section, we compare the performance of MSES and MFCC using four types of classifiers: similarity distances, KNN, ANN and SVM. Results are reported in terms of True Positives (TP) and False Positives (FP) taken from the confusion matrices; the best experimental outcomes and the averages achieved with each classifier are summarized in Table 7.
Experiment | NNGDA MFCC | NNGDA MSES | NNGDX MFCC | NNGDX MSES | NNSCG MFCC | NNSCG MSES |
1 | 73.21 | 88.1 | 75 | 90.95 | 73.69 | 89.05 |
2 | 73.15 | 87.92 | 74.88 | 90.24 | 73.51 | 88.99 |
3 | 73.04 | 87.8 | 74.52 | 86.76 | 73.27 | 88.33 |
4 | 73.04 | 87.8 | 74.17 | 89.23 | 73.15 | 88.21 |
5 | 72.98 | 87.8 | 74.11 | 89.17 | 73.15 | 88.21 |
6 | 72.92 | 87.68 | 73.87 | 89.17 | 73.1 | 88.1 |
7 | 72.92 | 87.5 | 73.81 | 89.11 | 72.98 | 87.74 |
8 | 72.86 | 87.5 | 73.81 | 89.05 | 72.92 | 87.74 |
9 | 72.8 | 87.5 | 73.75 | 88.75 | 72.92 | 87.62 |
10 | 72.8 | 87.44 | 73.75 | 88.69 | 72.86 | 87.62 |
Best Result | 73.21 | 88.1 | 75.00 | 90.95 | 73.69 | 89.05 |
Average | 72.62 | 86.94 | 73.42 | 88.00 | 72.59 | 87.23 |
Sounds | MFCC TP | MFCC FP | MSES TP | MSES FP |
C1 | 91 | 32 | 98 | 5 |
C2 | 37 | 5 | 105 | 18 |
C3 | 97 | 3 | 104 | 0 |
C4 | 56 | 5 | 105 | 17 |
C5 | 104 | 176 | 105 | 1 |
C6 | 103 | 41 | 81 | 1 |
C7 | 72 | 19 | 63 | 0 |
C8 | 88 | 18 | 104 | 14 |
C9 | 38 | 0 | 77 | 1 |
C10 | 89 | 15 | 90 | 0 |
C11 | 46 | 4 | 75 | 1 |
C12 | 89 | 9 | 101 | 0 |
C13 | 96 | 26 | 105 | 9 |
C14 | 75 | 16 | 105 | 48 |
C15 | 90 | 37 | 105 | 31 |
C16 | 89 | 14 | 105 | 6 |
Sounds | MFCC TP | MFCC FP | MSES TP | MSES FP |
C1 | 68 | 24 | 103 | 26 |
C2 | 32 | 13 | 88 | 12 |
C3 | 93 | 3 | 104 | 0 |
C4 | 81 | 63 | 97 | 10 |
C5 | 105 | 191 | 59 | 0 |
C6 | 104 | 3 | 44 | 2 |
C7 | 28 | 10 | 90 | 40 |
C8 | 83 | 29 | 16 | 0 |
C9 | 43 | 4 | 89 | 9 |
C10 | 88 | 9 | 105 | 9 |
C11 | 40 | 3 | 93 | 27 |
C12 | 88 | 31 | 105 | 7 |
C13 | 94 | 72 | 105 | 31 |
C14 | 0 | 0 | 105 | 71 |
C15 | 91 | 72 | 103 | 9 |
C16 | 91 | 24 | 105 | 16 |
Sounds | MFCC TP | MFCC FP | MSES TP | MSES FP |
C1 | 58 | 11 | 104 | 18 |
C2 | 1 | 0 | 90 | 12 |
C3 | 104 | 1 | 103 | 0 |
C4 | 105 | 113 | 103 | 25 |
C5 | 104 | 134 | 73 | 1 |
C6 | 105 | 37 | 17 | 0 |
C7 | 60 | 57 | 100 | 29 |
C8 | 45 | 17 | 105 | 34 |
C9 | 39 | 3 | 85 | 3 |
C10 | 72 | 21 | 95 | 0 |
C11 | 19 | 6 | 72 | 5 |
C12 | 50 | 2 | 102 | 4 |
C13 | 105 | 77 | 105 | 11 |
C14 | 54 | 9 | 105 | 22 |
C15 | 94 | 70 | 104 | 29 |
C16 | 90 | 17 | 105 | 19 |
Method | MFCC (%) | MSES (%) |
Similarity Distance | 65.29 | 77.97 |
ANN | 73.42 | 88.00 |
SVM | 67.20 | 83.99 |
KNN | 65.77 | 87.38 |
5.1 Similarity Distance Results
Table 2 shows the results for each signature using the Hamming distance and the Cosine distance; the recall metric is used for comparison. Although it is common to use binary signatures in an audio signature-based approach, the results in Table 2 suggest that binary signatures are not convenient for representing acoustic events, especially when they have non-stationary characteristics.
The difference in recall between the two features is about 3.46%, so no clear advantage can be seen for either MFCC or MSES. An audio signature using normalized values seems to work better, making the difference in performance between the two feature extraction methods clearer, especially when working with low SNR levels.
The Hamming distance results in Table 2 show that C2 was the worst classified class, with a recall of zero, whereas C3, C4, C13 and C16 are the sound classes with the highest recall for both features (100% in all of them). The average recall is 65.29% for the MFCC features and 68.75% for the MSES features.
The results with the Cosine distance using the MSES feature show that C4, C8, C15 and C16 are the sound classes with the highest recall, whereas C2 was again the worst classified class for both features.
In this case, the average recall obtained is 63.45% for the MFCC features and 77.97% for the MSES features (i.e., the difference in recall between the two features is about 14.52%).
The results of this experiment set the baseline and the starting point for evaluating the contribution of the machine learning methods. An image-based representation of the results with the similarity distance, KNN, SVM and ANN methods for both MFCC and MSES is shown in Figure 5 using confusion matrices.
5.2 Artificial Neural Network Results
Table 3 shows the results obtained with the neural network architectures using back-propagation with gradient descent and adaptive learning rate (NNGDA), gradient descent with momentum and adaptive learning rate (NNGDX), and scaled conjugate gradient (NNSCG). The best recall achieved is 75% for the MFCC features and 90.95% for the MSES features, both with NNGDX. The average is computed over 30 experiments, of which only 10 are presented in Table 3. The best average recall was 73.42% for MFCC and 88% for MSES, both with the NNGDX method.
Table 4 shows the True Positives (TP) and False Positives (FP) from the confusion matrix obtained for the best performing artificial neural networks using MFCC and MSES, respectively. For MFCC, C5 and C6 are the classes with the highest scores, with 1 and 2 errors respectively. However, at least 14 samples of each of the other sound classes (except C6) are classified erroneously as C5.
For MSES, C7 is the class with the most errors, 42 in total, followed by C11 (30 errors) and C9 (28 errors). Unlike the experiment with similarity distances, here the sound class C2 is classified 100% correctly. The other classes with high scores are C3, C4, C5, C8, C13, C14, C15 and C16. Indeed, the experiments with ANN show an increase in the recall with which kitchen sounds are identified: comparing the average values achieved with similarity distances and with neural networks, recall increases by 8.13% for MFCC and 10.03% for MSES.
5.3 Support Vector Machine Results
Table 5 shows the TP and FP from the confusion matrix obtained with the SVM classifier using MFCC and MSES features, respectively. The recall obtained with the MFCC features is 67.2%. C5 and C6 are the classes with the highest recall, with zero and one errors respectively. For MFCC, at least 14 samples of each of the other sound classes (except C5 and C6) are classified erroneously as C5, all the sound samples of C14 are misclassified (105 errors), and C7 and C2 obtain 77 and 73 errors, respectively. For MSES, the recall achieved is 83.99%. C8 is the class with the most errors, 89 in total, followed by C6 (61 errors) and C5 (46 errors). Comparing the average values achieved with similarity distances and with SVM, recall increases by 1.91% for MFCC and 6.02% for MSES.
5.4 K-Nearest Neighbors Results
As previously mentioned, the number of neighbors and the distance metric of the KNN classifier were selected by hyper-parameter optimization. Table 6 shows the TP and FP from the confusion matrix obtained with the KNN classifier using MFCC and MSES features, respectively. The recall achieved is 65.77% for MFCC and 87.38% for MSES.
Comparing the average values achieved with similarity distances and with KNN, recall increases by 0.48% for MFCC and 9.41% for MSES.
Table 7 summarizes the results obtained for both MFCC and MSES features with all the classifiers: similarity distances, ANN, SVM and KNN. We can first observe that all the classifiers improve in the recall metric when working with the MSES feature.
Second, the ANN classifier has the highest performance for both MFCC and MSES (73.42% and 88%, respectively), followed by the MSES-KNN combination (87.38%), then the MSES-SVM combination (83.99%), and finally similarity distances with MSES, with a score of 77.97%. Regarding MFCC, the second best performance was achieved with SVM (67.2%).
The third and fourth best MFCC performances were achieved with KNN and with similarity distances (65.77% and 65.29%, respectively). We attribute the good performance of ANN to the fact that this machine learning technique can model the variations present in the signatures, which makes its learning more robust and effective than that of the other methods.
5.5 Test of Statistical Significance
To further analyze the differences between the MFCC and MSES methods, we applied a non-parametric Mann-Whitney test with a significance level of $\alpha = 0.05$. The test yielded a value of $p = 0.0003$; therefore, the null hypothesis that both features perform equally is rejected.
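For illustration, the sketch below applies the Mann-Whitney U test from SciPy to the ten NNGDA recall values reported in Table 3; which recall samples were actually compared in the original test is not restated here, so this pairing is only an example.

```python
from scipy.stats import mannwhitneyu

# Ten NNGDA recall values (%) from Table 3, used here purely as an example.
recall_mfcc = [73.21, 73.15, 73.04, 73.04, 72.98, 72.92, 72.92, 72.86, 72.80, 72.80]
recall_mses = [88.10, 87.92, 87.80, 87.80, 87.80, 87.68, 87.50, 87.50, 87.50, 87.44]

stat, p_value = mannwhitneyu(recall_mfcc, recall_mses, alternative='two-sided')
print(stat, p_value)   # reject H0 at alpha = 0.05 if p_value < 0.05
```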
5.6 Optimization of ANN with GA and PSO
The previous results showed that the MSES-ANN combination (audio feature plus classifier) achieved the best score among all the combinations. In this part, we perform an optimization looking for the best artificial neural network to use with MSES. This optimization is carried out with a genetic algorithm (GA) and with particle swarm optimization (PSO). GA and PSO were chosen because these algorithms have produced good results when optimizing the parameters of machine learning algorithms [19, 20].
The optimization searches for the following ANN values and parameters:
Number of neurons in the first hidden layer.
Number of neurons in the second hidden layer.
The transfer functions for the neurons in the first and second hidden layers, and for the neurons in the output layer. The candidate transfer functions are the following:
— Positive linear transfer function.
— Linear transfer function.
— Inverse transfer function.
— Log-sigmoid transfer function.
— Hyperbolic tangent sigmoid transfer function.
— Triangular basis transfer function.
— Hard-limit transfer function.
— Saturating linear transfer function.
— Elliot symmetric sigmoid transfer function.
— Symmetric saturating linear transfer function.
— Symmetric hard-limit transfer function.
— Elliot 2 symmetric sigmoid transfer function.
The learning algorithms implemented in the neural network:
— Levenberg-Marquardt backpropagation.
— One-step secant backpropagation.
— Gradient descent with adaptive learning rate backpropagation.
— Gradient descent with momentum and adaptive learning rate backpropagation.
— Scaled conjugate gradient backpropagation.
— Resilient backpropagation.
— Gradient descent backpropagation.
— Gradient descent with momentum back-propagation.
— Conjugate gradient backpropagation with Fletcher-Reeves updates.
— Conjugate gradient backpropagation with Polak-Ribiére updates.
— Conjugate gradient backpropagation with Powell-Beale restarts.
Table 8 shows the parameters used for GA, and Table 9 shows the parameters used for PSO.
Population | 100 Individuals |
Individual | 6 Genes (real) |
Generations | 100 |
Assign Fitness | Ranking |
Selection | Stochastic universal sampling |
Mutation | 16.67 % |
Crossover | Single Point (80%) |
Population | 100 Particles |
Particle | 6 Dimensions (real) |
Iterations | 100 |
Constriction Coefficient | 1 |
Inertia Weight | 0.1 |
R1, R2 | Random in the range [0, 1] |
C1 | Linear decrease (2–0.5) |
C2 | Linear increase (0.5–2) |
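The core PSO update, with the inertia weight of Table 9 and the linearly varying acceleration coefficients C1 and C2, can be sketched as follows; the fitness function is a placeholder, whereas in the actual optimization each particle encodes an MSES-ANN configuration that is trained and evaluated.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w, c1, c2, rng):
    """One PSO update with inertia weight w and acceleration coefficients c1, c2."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v

# Illustrative setup mirroring Table 9: 100 particles, 6 dimensions, 100 iterations,
# C1 decreasing linearly from 2 to 0.5 and C2 increasing from 0.5 to 2.
rng = np.random.default_rng(0)
n_particles, n_dims, n_iter, w = 100, 6, 100, 0.1
x = rng.random((n_particles, n_dims))
v = np.zeros_like(x)
# fitness() would train/evaluate the MSES-ANN encoded by each particle;
# here a placeholder quadratic is minimized instead.
fitness = lambda p: np.sum((p - 0.5) ** 2, axis=1)
pbest, pbest_val = x.copy(), fitness(x)
gbest = pbest[np.argmin(pbest_val)]
for t in range(n_iter):
    c1 = 2.0 - 1.5 * t / (n_iter - 1)
    c2 = 0.5 + 1.5 * t / (n_iter - 1)
    x, v = pso_step(x, v, pbest, gbest, w, c1, c2, rng)
    val = fitness(x)
    better = val < pbest_val
    pbest[better], pbest_val[better] = x[better], val[better]
    gbest = pbest[np.argmin(pbest_val)]
print(gbest)
```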
Table 10 shows the acoustic event recognition results for 10 experiments that combine MSES-ANN with the optimization techniques GA and PSO. The average recall was 91.46% for GA and 91.55% for PSO, in both cases above the 88% obtained by the non-optimized MSES-ANN combination.
Experiment | GA | PSO |
1 | 92.20 | 90.89 |
2 | 91.31 | 89.35 |
3 | 91.67 | 91.85 |
4 | 91.01 | 93.21 |
5 | 91.67 | 91.90 |
6 | 91.31 | 91.37 |
7 | 91.19 | 91.13 |
8 | 91.13 | 90.71 |
9 | 91.19 | 91.13 |
10 | 91.90 | 93.93 |
Best result | 92.20 | 93.93 |
Average | 91.46 | 91.55 |
The best recall for the optimization of the neural network was obtained with PSO, achieving 93.93% recognition of the kitchen sounds. The parameters of the best ANN architecture found with PSO are:
— 1st Hidden layer (1HL) with 186 neurons.
— 2nd Hidden layer (2HL) with 238 neurons.
— The transfer function in 1HL was saturating linear transfer function.
— The transfer function in 2HL was symmetric saturating linear transfer function.
— The transfer function in output Layer was symmetric saturating linear transfer function.
— The training learning algorithm was conjugate gradient backpropagation with Fletcher-Reeves updates.
Finally, 30 experiments were carried out using an ANN with the above configuration parameters, with the aim of testing the robustness of the optimization. Table 11 presents only the best 10 results, where one can observe that the average recall achieved by the optimized MSES-ANN combination is 93.46%, that is, an improvement of 15.49% over the average recall achieved with similarity distances and MSES.
Experiment | Recall |
1 | 94.52 |
2 | 93.51 |
3 | 94.11 |
4 | 94.70 |
5 | 93.21 |
6 | 93.39 |
7 | 93.15 |
8 | 93.81 |
9 | 93.04 |
10 | 93.10 |
Best result | 94.70 |
Average | 93.46 |
Table 12 shows the True Positives (TP) and False Positives (FP) from the confusion matrix obtained for the best performance of the MSES-ANN pair optimized with PSO. The results show that, except for the C7 sound, all classes have a success rate between 90% and 100% in the recognition of the acoustic events that define them. The optimization of the neural network helps to improve the recognition rate and to reduce the number of misclassified sounds (False Positives). An image-based representation of the confusion matrix of this experiment is shown in Figure 6; notice that the color of the diagonal indicates a high recognition rate for each of the classes.
Sounds | MSES TP | MSES FP |
C1 | 96 | 1 |
C2 | 99 | 6 |
C3 | 104 | 0 |
C4 | 103 | 14 |
C5 | 104 | 5 |
C6 | 100 | 1 |
C7 | 63 | 0 |
C8 | 104 | 3 |
C9 | 102 | 4 |
C10 | 95 | 0 |
C11 | 102 | 6 |
C12 | 103 | 2 |
C13 | 104 | 3 |
C14 | 104 | 16 |
C15 | 103 | 3 |
C16 | 105 | 26 |
6 Conclusions
In this work, we identified acoustic events using the audio signature approach in combination with machine learning algorithms. When different instances of a sound class are not available, the audio signature approach can be used, since it only requires the original sound and degraded versions of it.
Audio signatures helped us cope with the small database of kitchen sound sources, which in our case consisted of sixteen original sounds and degraded versions of them. In order to complement the audio signature approach, we studied the performance of machine learning algorithms when there is only one instance of each sound class plus degraded versions of it.
The two audio features considered in this work are MFCC and MSES. MFCC is one of the most cited audio features for audio-based activity recognition, which is the reason it was chosen as our benchmark feature. MSES is an interesting audio feature that is being adopted because of its robustness to noise.
The results showed that the representation of acoustic events based on MSES is more convenient when working with different classification methods. Although the comparison between MSES and MFCC is not conclusive, MSES appears to be a very robust audio feature for identifying acoustic events in a mixture of sounds.
One thing to note is that MSES captures the location of the energy peaks in each sub-band, which are less corrupted by noise even at low SNR levels, something that affects the performance of MFCC. Nevertheless, both MFCC and MSES represent the non-stationary characteristics of audio signals very well.
A database with a mixture of everyday kitchen sounds was created using 3 dB of SNR. The way in which this database is constructed should encourage readers to use it in future work, since it considers noisy contexts, something that to our understanding is not available in the literature. There are databases with sound sources from different and independent tasks, but never mixed, such as the one provided by DCASE2020.
The results presented here show a way of identifying acoustic events when they are immersed in a mixture of sounds and are not predominant, which is important for recognizing activities in real indoor environments. In the classification stage, four types of classifiers were used: similarity distances, k-Nearest Neighbors, Support Vector Machines and Artificial Neural Networks.
The results in Table 7 show that MSES combined with Artificial Neural Networks achieves a recall of 88%, which outperforms any other combination of classifiers with MSES or MFCC. In addition, a test of statistical significance was performed, yielding a value of p = 0.0003, which leads us to reject the null hypothesis and conclude that the MFCC and MSES features have different levels of robustness and that their relative performance does not depend on the type of classifier or on the sound to be recognized.
Furthermore, the use of a genetic algorithm and of particle swarm optimization improved the recognition performance of the audio features supported by machine learning classifiers, with the MSES-ANN combination achieving the best performance (93.46%). Table 10 shows that PSO performs better than GA, achieving an average recall of 91.55%.
Finally, the experiments presented in this work focused on evaluating the MSES and MFCC audio features, supported by machine learning algorithms, for the recognition of acoustic events in noisy environments. We considered the context of a kitchen where different sound sources are present, for instance, when a person is preparing meals. In an attempt to create a more realistic scenario, the sound sources were mixed at a low SNR level. This acoustic recognition approach can help to better understand the nature of human activity in the home setting.
The identification of all the sounds that are present in the environment might help to develop systems that can assist people or that can be aware of potential dangers.