Introduction
Human emotions influence day-to-day actions and decisions. They are important aspects of communication, and human emotional intelligence, that is, the capacity to understand and manage emotions, is crucial for the success of personal interactions[1]. The study of emotions opens opportunities for technological innovation, for example in affective computing, which aims to equip machines with emotional intelligence to improve human-computer interaction (HCI)[2]. Another area is human-robot interaction (HRI), which aims to make robots capable of interpreting and expressing emotions similar to those of human beings and thereby modulate relevant aspects of the interaction[3]. Thus, the possible applications of an interface capable of evaluating human emotional states are numerous, ranging from medical diagnoses, rehabilitation processes, and digital commerce to new teaching methods.
In HCI, non-invasive, reliable and accessible portable sensors play an important role in the study of emotions, because the use of modern technologies has increased considerably in many work environments with the objective of improving user-technology interaction. However, many of these systems impose high cognitive demands, which can arouse a person's negative emotions[4]. One of the most effective approaches to emotion recognition is based on physiological signals. Among these, brain signals have been reported to be directly related to human emotions[5] [6] [7]. Favorable results have been reported when evaluating different classification algorithms combined with diverse feature extraction techniques applied to electroencephalographic (EEG) signals[8] [9] [10].
The estimation of emotions in real time involves processing a continuous stream of biosignals with the lowest possible latency. Research on systems for emotional state detection is mainly focused on recognition methodology[11]. Meanwhile, the field of BCI systems using EEG signals is constantly evolving. For example, in[12], an algorithm for attention detection during mathematical reasoning is proposed. In[13], EEG signals are analyzed with diverse classification techniques, achieving significant results in motion detection. Another relevant contribution is presented in[14], which introduces a new neural network model designed for classification with a limited amount of motor imagery data. In[15], a methodology based on EEG signals is presented to detect the level of attention in children, applying a multilayer perceptron neural network model. Finally, in[16], a methodology based on motor imagery for a BCI system is presented, using convolutional neural networks. These works highlight the diversity of approaches in current research on emotion estimation and the development of BCI systems using brain signals.
However, segmentation, that is, the selection of the time window, plays an important role in achieving real-time or continuous monitoring of emotional states, yet it has received little attention and requires further research. The majority of reported works utilize different time windows (TW) as inputs for model training[6] [7]. Employing different TW could render the trained model unsuitable for real-time emotion recognition, because the knowledge learned by the model is tied to the features sampled for subsequent detection. In addition, combining different TW in the same analysis would make the trained model inconsistent, due to the changing characteristics of EEG signals in temporal sequences[17].
To address this problem, analyses of EEG signals at different TW lengths have been carried out. For example, Lin et al. report a TW of 1 second to calculate the spectrogram of an EEG, in order to investigate the relationship between emotional states and brain activities, with an accuracy of 82.29 % in their model[18]. Zheng et al. report a TW of 4 seconds without overlap to extract EEG features combined with eye tracking for emotion recognition tasks, with a model accuracy of 71.77 %[19]. Zhuang et al. used a TW of 5 seconds for feature extraction and emotion recognition based on empirical mode decomposition, with a model accuracy of 69 %[20]. Thus, different TW lengths have been used in EEG signal processing, but the appropriate length for emotion detection has not been established. Ouvan et al. studied TW size with the experiment-level batch normalization method in feature processing and report that the best performing TW length was 2 seconds[21]. Healey et al. present an emotion recognition study with windows of 60, 180 and 300 seconds, but do not report the performance of each TW[22]. Gioreski et al. report a laboratory study for stress detection with TW between 30 and 360 seconds, finding that the 300-second window performs best[23]. However, few studies have examined the effect of window length on the performance of classifier models during emotion recognition.
Among the most used parameters for measuring a classifier model's performance are: accuracy (ACC), the fraction of predictions that the model classifies correctly; precision, or positive predictive value (PPV), the percentage of the model's positive-emotion predictions that are correct; and completeness, or sensitivity (recall), the proportion of truly positive emotions that are correctly identified as such[24]. Other research employs the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) to evaluate the performance of classifier algorithms[5] [25]. Another parameter is specificity, which measures the number of subjects correctly identified as having a negative emotion over the total number of subjects who actually present a negative emotion. For balanced studies, with roughly the same amount of data in each category (emotion), the usual performance measures are ACC, AUC and Cohen's Kappa coefficient[8] [10] [12] [26] [27].
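As a concrete illustration of these metrics, the sketch below computes ACC, precision, recall, and Cohen's Kappa by hand on a small made-up set of binary labels (positive = 1, negative = 0); the labels and predictions are hypothetical, not taken from any dataset cited here:

```python
# Minimal sketch of the metrics described above, on illustrative data.

def accuracy(y_true, y_pred):
    # Fraction of predictions classified correctly
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred):
    # Correct positive predictions over all positive predictions
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp)

def recall(y_true, y_pred):
    # True positives over all actually positive cases
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

def cohen_kappa(y_true, y_pred):
    # kappa = (p_o - p_e) / (1 - p_e): observed vs. chance agreement
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    p_e = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n)
        for c in set(y_true) | set(y_pred)
    )
    return (p_o - p_e) / (1 - p_e)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))     # 0.75
print(precision(y_true, y_pred))    # 0.75
print(recall(y_true, y_pred))       # 0.75
print(cohen_kappa(y_true, y_pred))  # 0.5
```

Note how ACC and Kappa differ: both classes are equally frequent here, so chance agreement is 0.5 and Kappa penalizes it accordingly.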
Another potential drawback in the study of emotions arises when variables are analyzed and reported at the group level rather than being used to evaluate the emotions of an individual. That is, the associations between physiological variables and emotions found through a group-level analysis may not generalize to the evaluation of emotions in an individual, as they may not be sufficiently robust to reliably assess an individual's emotional state at a given time[28].
Therefore, there is a gap in defining the TW size that yields the best performance from the classification algorithm for recognizing emotions. Furthermore, it has not been clearly established whether the data should be studied at an individual or a group level. Accordingly, the aim of this research is to evaluate classification performance in detecting emotions via EEG signals at the group and individual levels by conducting a systematic comparison of TW values below 30 seconds. The performance metrics selected for this evaluation are ACC, AUC and Cohen's Kappa coefficient.
This article is structured as follows: first, the Introduction establishes the contextual framework of the research. The second section addresses the Materials and Methods, detailing the proposed approach. The third section presents the Results and Discussion, highlighting the observations and analysis derived from the research. Finally, the Conclusions summarize the key findings and their implications.
Materials and methods
To carry out this study of emotion recognition, the dataset from[29] is employed. For this dataset, controlled experiments were designed to induce positive, negative, and neutral emotions with video clips. The 25 participants (8 women, 17 men) ranged in age from 19 to 35 years (mean age 24.3).
To carry out the performance study of classification models with different TW the flow presented in Figure 1 is implemented.
The dataset contains the signals of 25 subjects from the 14 electrodes of the EEG device. Every signal was decomposed into the theta, alpha, low beta, high beta, and gamma frequency bands. In this study, MATLAB 2017a libraries were used for data preprocessing, feature extraction and the analysis of the classification algorithms. The workstation is a PC with an i7 processor, 8 GB of RAM and an Nvidia GeForce GPU with 4 GB of memory.
Acquisition of EEG Data
The EEG sensing device used was the Emotiv EPOC+, which has 14 channels: AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, AF4, plus two references: Common Mode Sense (CMS) and Driven Right Leg (DRL) in P3 and P4 (see Figure 2). This device has been widely used in different research related to emotions and the study of pathologies[30] [31] [32]. The data obtained directly from the file of every subject were the preprocessed theta (4-8 Hz), alpha (8-12 Hz), low beta (12-16 Hz), high beta (16-25 Hz) and gamma (25-45 Hz) band signals.
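The dataset already provides these band-filtered signals, but the decomposition itself is straightforward to reproduce. The sketch below splits one raw EEG channel into the five sub-bands with zero-phase Butterworth filters; the 128 Hz sampling rate, filter order, and random test signal are illustrative assumptions, not parameters reported by the study:

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # assumed sampling rate (Hz), for illustration only
BANDS = {"theta": (4, 8), "alpha": (8, 12), "low_beta": (12, 16),
         "high_beta": (16, 25), "gamma": (25, 45)}

def band_decompose(signal, fs=FS, order=4):
    """Split one EEG channel into the five sub-band signals."""
    out = {}
    for name, (lo, hi) in BANDS.items():
        # Normalize cutoffs by the Nyquist frequency for a bandpass design
        b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        # filtfilt applies the filter forward and backward (zero phase lag)
        out[name] = filtfilt(b, a, signal)
    return out

rng = np.random.default_rng(0)
bands = band_decompose(rng.standard_normal(FS * 10))  # 10 s of fake EEG
print(sorted(bands))  # the five band names
```

Zero-phase filtering is a common choice here because phase distortion would shift features in time between bands; the study itself used the device's preprocessed outputs.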
Data Segmentation
The main objective of this study was to evaluate and compare the performance of different classifiers for emotion recognition with different TW shorter than 30 seconds. Table 1 presents the TW lengths considered as well as the number of blocks obtained. The format of the features in each trial was defined as (14x5x300): 14 representing the number of electrodes, 5 the frequency bands and 300 the number of features extracted from the corresponding trial. In total, 10 different TW lengths were examined in order to investigate their effect on the performance of classifier models in the study of emotions employing EEG sensors at the between-subject and within-subject levels.
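A minimal sketch of this segmentation step, assuming a hypothetical sampling rate, shows why the number of blocks in Table 1 falls as the TW grows:

```python
import numpy as np

def segment(signal, fs, tw_seconds):
    """Split a 1-D signal into consecutive non-overlapping TW blocks.

    Trailing samples that do not fill a complete window are dropped,
    which is why longer TWs yield fewer blocks.
    """
    win = int(fs * tw_seconds)
    n_blocks = len(signal) // win
    return signal[: n_blocks * win].reshape(n_blocks, win)

fs = 128                         # assumed sampling rate (Hz)
x = np.arange(fs * 60)           # one minute of samples
print(segment(x, fs, 5).shape)   # (12, 640)
print(segment(x, fs, 30).shape)  # (2, 3840)
```

Each block then yields one feature vector per electrode-band pair, so the TW length directly trades off the number of training examples against the amount of signal each example summarizes.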
Feature Extraction
Since EEG signals are complex, owing to the nonlinearity and randomness of time-series data, the entropy of the time series is incorporated as a feature[32][33]. The TW is modified separately in each analysis and the entropy is calculated as a characteristic. Several entropy functions exist; however, the Log Energy function stands out for its excellent performance in the analysis of EEG signals, owing to its sensitivity to energy changes, its lower susceptibility to high-frequency noise and rapid amplitude variations, and its ability to characterize the complexity of the EEG sub-bands[34][35]. This entropy function is grounded in wavelet theory: assuming a signal x = [x1 x2 x3 … xn] and a probability distribution function P(xi), where i is the index of the signal's elements, the Log Energy entropy is defined as[36]:

H_LogEn(x) = Σ_{i=1}^{n} (log P(xi))²

under the convention that log(0) = 0. For this study, the five frequency bands of every electrode were processed and the log energy entropy characteristic was extracted.
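As a hedged illustration, the sketch below estimates P(xi) with a histogram and sums the squared logs of the nonzero probabilities; the histogram estimator and the bin count are choices of ours for the example, not details taken from [36]:

```python
import numpy as np

def log_energy_entropy(x, n_bins=16):
    """Log energy entropy of a 1-D signal, H = sum((log P(x_i))^2).

    P is estimated with a histogram; zero-probability bins are skipped,
    matching the log(0) := 0 convention.
    """
    counts, _ = np.histogram(x, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]                      # enforce log(0) := 0
    return float(np.sum(np.log(p) ** 2))

print(log_energy_entropy(np.zeros(100)))                 # 0.0 (constant signal)
print(log_energy_entropy(np.sin(np.linspace(0, 10, 512))) > 0)  # True
```

A constant signal concentrates all probability in one bin (P = 1, log P = 0), so its entropy is zero, while any spread across bins yields a positive value.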
The emotional state analysis is carried out on an individual and group level. On the group level the classification models with different TW are evaluated. For the performance evaluation of the models on an individual level, statistical tests are performed to determine whether there is a statistically significant difference in classification performance between different window lengths and between different classification algorithms.
Classification Analysis
To train and evaluate the models, the k-fold cross-validation technique is used with k = 5, where k is the number of folds into which the data are split. The classifier models compared in this study were K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF). Supervised classifiers were chosen for their ability to achieve more accurate and specific learning, as they are trained to establish direct connections between known patterns and labels. Moreover, these classifiers are widely used in the study of emotions, according to the literature[10] [37] [38] [39] [40] [41]. For the classification analysis, the Machine Learning Toolbox 11.1 module from MATLAB was used. The configuration parameters for every model are listed in Table 2.
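The evaluation protocol can be sketched as follows, with scikit-learn standing in for MATLAB's toolbox and a synthetic feature matrix in place of the real entropy features; the hyperparameters shown are library defaults, not necessarily those of Table 2:

```python
# Sketch: 5-fold cross-validation over the five classifiers compared
# in this study, on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.standard_normal((300, 70))   # 300 blocks x (14 electrodes x 5 bands)
y = rng.integers(0, 3, 300)          # positive / neutral / negative labels

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean ACC = {acc.mean():.3f}")
```

Stratified folds keep the class proportions of the three emotions roughly constant in every fold, which matters for the balanced-metric argument made earlier.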
Results and discussion
Between-subject Study
The performance results for the models at the between-subject level are shown in Table 3, reporting the ACC, AUC and Cohen's Kappa coefficient of each model. The results indicate that, regardless of the classifier model used, the TW that yield the best performance are between 2 and 15 seconds, with the 10-second TW standing out. The KNN model achieves the highest ACC, reaching 87.7 % with the 10-second TW, while the DT model exhibits the worst performance at 62.1 % with the 20-second TW. In terms of AUC, the best-performing model is RF with 0.93 for TW of 2 to 5 seconds, while the DT model shows the worst performance with 0.61 using a 20-second TW. Finally, the model with the best Cohen's Kappa coefficient is KNN with 0.75 in the 10-second TW, and the DT model obtains the lowest result with 0.21 in the 30-second TW.
Each cell reports ACC (%) / AUC / Cohen's Kappa coefficient.

| TW (s) | KNN | SVM | LR | RF | DT |
|---|---|---|---|---|---|
| 1 | 83.3 / 0.83 / 0.66 | 64.1 / 0.69 / 0.28 | 63.7 / 0.69 / 0.27 | 83.6 / 0.92 / 0.67 | 63.4 / 0.66 / 0.27 |
| 2 | 83.3 / 0.83 / 0.67 | 65.1 / 0.71 / 0.30 | 64.7 / 0.71 / 0.30 | 84.3 / 0.93 / 0.68 | 63.4 / 0.66 / 0.27 |
| 3 | 84.8 / 0.85 / 0.70 | 65.9 / 0.72 / 0.31 | 65.2 / 0.72 / 0.30 | 84.8 / 0.93 / 0.69 | 62.8 / 0.65 / 0.28 |
| 4 | 85.4 / 0.85 / 0.71 | 66.1 / 0.72 / 0.32 | 65.7 / 0.72 / 0.31 | 84.4 / 0.92 / 0.69 | 63.6 / 0.66 / 0.28 |
| 5 | 86.5 / 0.86 / 0.72 | 66.9 / 0.73 / 0.33 | 66.5 / 0.73 / 0.31 | 84.6 / 0.93 / 0.70 | 63.9 / 0.66 / 0.25 |
| 10 | 87.7 / 0.88 / 0.75 | 67.2 / 0.73 / 0.35 | 65.8 / 0.73 / 0.33 | 83.4 / 0.87 / 0.65 | 63.8 / 0.67 / 0.29 |
| 15 | 86.7 / 0.87 / 0.74 | 66.0 / 0.72 / 0.30 | 64.1 / 0.72 / 0.31 | 80.6 / 0.89 / 0.63 | 62.9 / 0.67 / 0.25 |
| 20 | 84.3 / 0.84 / 0.69 | 63.2 / 0.70 / 0.29 | 64.0 / 0.70 / 0.29 | 78.8 / 0.85 / 0.58 | 61.2 / 0.62 / 0.27 |
| 30 | 80.0 / 0.80 / 0.54 | 59.0 / 0.64 / 0.17 | 64.4 / 0.69 / 0.23 | 73.8 / 0.62 / 0.44 | 62.0 / 0.73 / 0.21 |
The results indicate that, in general terms, the performance of the models tends to decrease significantly for TW greater than 20 seconds.
Within-subject Results
The within-subject emotion recognition results are shown in Table 4. The KNN model achieves its best performance in TW from 5 to 15 seconds, with the optimal result obtained at the 10-second TW. The SVM model performs better in TW from 2 to 10 seconds, with the 5-second TW being the most outstanding in terms of ACC, AUC, and Cohen's Kappa coefficient. The LR model performs better in TW from 1 to 5 seconds, presenting its best result at the 4-second TW. The RF model performs best in TW from 1 to 10 seconds, with the 4-second TW standing out. Finally, the DT model exhibits better performance in TW from 2 to 10 seconds, achieving its best result at the 5-second TW.
Each cell reports ACC (%) / AUC / Cohen's Kappa coefficient.

| TW (s) | KNN | SVM | LR | RF | DT |
|---|---|---|---|---|---|
| 1 | 82.4 / 0.82 / 0.65 | 83.48 / 0.90 / 0.67 | 81.3 / 0.89 / 0.66 | 86.95 / 0.938 / 0.74 | 78.83 / 0.82 / 0.58 |
| 2 | 82.36 / 0.82 / 0.64 | 84.5 / 0.91 / 0.69 | 81.4 / 0.88 / 0.75 | 87.7 / 0.94 / 0.75 | 80.43 / 0.83 / 0.61 |
| 3 | 83.0 / 0.83 / 0.67 | 84.6 / 0.91 / 0.70 | 81.4 / 0.84 / 0.62 | 86.9 / 0.83 / 0.74 | 81.20 / 0.814 / 0.62 |
| 4 | 83.6 / 0.83 / 0.682 | 85.4 / 0.91 / 0.682 | 82.3 / 0.85 / 0.610 | 87.26 / 0.938 / 0.745 | 80.99 / 0.832 / 0.620 |
| 5 | 85.6 / 0.85 / 0.69 | 85.0 / 0.91 / 0.707 | 81.1 / 0.85 / 0.629 | 87.24 / 0.936 / 0.745 | 82.2 / 0.832 / 0.636 |
| 10 | 85.8 / 0.86 / 0.715 | 83.8 / 0.91 / 0.702 | 61.9 / 0.62 / 0.209 | 85.47 / 0.93 / 0.709 | 79.17 / 0.80 / 0.583 |
| 15 | 86.1 / 0.86 / 0.704 | 81.42 / 0.88 / 0.654 | 67.6 / 0.69 / 0.35 | 82.99 / 0.88 / 0.659 | 74.14 / 0.748 / 0.483 |
| 20 | 82.7 / 0.82 / 0.664 | 79.97 / 0.85 / 0.611 | 71.4 / 0.75 / 0.458 | 81.37 / 0.872 / 0.627 | 70.89 / 0.718 / 0.423 |
| 30 | 79.5 / 0.79 / 0.642 | 72.7 / 0.76 / 0.395 | 67.7 / 0.71 / 0.368 | 72.39 / 0.77 / 0.447 | 67.88 / 0.679 / 0.358 |
These results demonstrate that, regardless of the model chosen from these five, the TW that promote better performance in terms of ACC, AUC, and Cohen's Kappa coefficient are between 2 and 15 seconds. Moreover, the 10-second TW appears to be the most suitable for this type of configuration.
In order to identify possible significant differences in the ACC and AUC values of the classifier models at each time window (TW) length, Friedman nonparametric tests for repeated measures of a single factor were performed. This test was chosen for its robustness to violations of normality and its lower sensitivity to outliers compared to parametric tests such as ANOVA. The results of these tests are presented in Table 5, considering a significance level of 0.05 for the evaluation of statistical significance.
| TW length (s) | χ² (ACC) | p-value (ACC) | χ² (AUC) | p-value (AUC) |
|---|---|---|---|---|
| 1 | 38.93 | <0.001 | 78.17 | <0.001 |
| 2 | 36.69 | <0.001 | 68.70 | <0.001 |
| 3 | 25.491 | <0.001 | 66.80 | <0.001 |
| 4 | 29.0 | <0.001 | 67.01 | <0.001 |
| 5 | 21.285 | <0.001 | 55.83 | <0.001 |
| 10 | 58.42 | <0.001 | 77.66 | <0.001 |
| 15 | 33.543 | <0.001 | 31.717 | <0.001 |
| 20 | 31.5 | <0.001 | 42.65 | <0.001 |
| 30 | 23.04 | <0.001 | 19.87 | <0.001 |
It is observed that in all TW, there are significant differences between the models. This indicates that not only does the length of the TW interfere with performance, but the chosen model also plays a role.
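This statistical procedure can be sketched with SciPy; the per-subject accuracies below are simulated for illustration, so the resulting statistics are not those reported in Tables 5 and 6:

```python
# Sketch: Friedman test across TW lengths (repeated measures over
# subjects), then a one-sided Wilcoxon signed-rank test on one TW pair.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(1)
# Simulated per-subject ACC for four TW lengths (25 subjects, as in the study)
acc_tw5  = rng.normal(0.86, 0.02, 25)
acc_tw10 = rng.normal(0.86, 0.02, 25)
acc_tw15 = rng.normal(0.86, 0.02, 25)
acc_tw20 = rng.normal(0.85, 0.02, 25)

chi2, p_f = friedmanchisquare(acc_tw5, acc_tw10, acc_tw15, acc_tw20)
print(f"Friedman: chi2 = {chi2:.2f}, p = {p_f:.3f}")

# H0: mu_20 <= mu_30, tested one-sided on paired per-subject differences
acc_tw30 = rng.normal(0.80, 0.02, 25)
w, p_w = wilcoxon(acc_tw20, acc_tw30, alternative="greater")
print(f"Wilcoxon: W = {w:.0f}, p = {p_w:.4f}")
```

The Friedman test asks whether any TW differs; the pairwise Wilcoxon tests then localize which TW outperforms which, mirroring the structure of Tables 6 to 8.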
Table 6 presents the results of the Friedman nonparametric hypothesis test for multiple TW sizes and of the Wilcoxon test for TW pairs, for the KNN model.
| Null hypothesis | W | P-value |
|---|---|---|
| μ5 = μ10 = μ15 = μ20 | 5.795 | 0.122 |
| μ1 = μ2 = μ3 = μ4 | 3.603 | 0.308 |
| μ5 ≤ μ3 | 300 | <0.001 |
| μ20 ≤ μ30 | 209 | 0.016 |
When comparing the equality of mean ACC across the TW of 5, 10, 15, and 20 seconds, the null hypothesis cannot be rejected; the same holds when testing equality among the TW of 1, 2, 3, and 4 seconds. However, the 5-second TW performs better than the 3-second TW, and the 20-second TW performs better than the 30-second TW. Since there is no significant difference among the TW of 5, 10, 15, and 20 seconds, these can be considered the best performing in terms of ACC for emotion detection.
The comparison of the mean ACC of the SVM models for different TW is presented in Table 7. The mean ACC at 20 seconds is higher than at 30 seconds, and there is no statistically significant difference among the TW of 5, 10, 15 and 20 seconds.
| Null hypothesis | W | P-value |
|---|---|---|
| μ5 = μ10 = μ15 = μ20 | 2.776 | 0.427 |
| μ1 = μ2 = μ3 | 4.0607 | 0.131 |
| μ5 = μ4 | 134.5 | 0.668 |
| μ20 ≤ μ30 | 173 | <0.001 |
Table 8 shows the results of the comparisons of the ACC means of the LR model between different TW. It is observed that TW less than or equal to 5 seconds present a better performance in emotion detection.
| Null hypothesis | W | P-value |
|---|---|---|
| μ15 = μ10 = μ30 | 4.151 | 0.126 |
| μ1 = μ2 = μ3 = μ4 = μ5 | 8.190 | 0.052 |
| μ5 ≤ μ10 | 2.0 | <0.001 |
Figure 3 shows the AUC averages of the five models. The SVM classifier performs strongly, even though its AUC decreases at the 10, 15, 20 and 30-second TW, while the RF model reaches the highest AUC levels overall. The KNN model shows growth in AUC from the 4-second to the 15-second TW. Finally, as the TW length increases, the classifiers tend to decrease their performance in terms of AUC.
Conclusions
In this study, the performance of five emotion classification models is evaluated, considering different TW sizes in data segmentation and two experimental setups. The first configuration involved the participation of 25 subjects in a between-subjects design. The results indicate that the window size significantly influences the performance of the classifiers. For example, the KNN model shows optimal results for TW sizes between 4 and 15 seconds, while the SVM, LR, RF and DT models excel with 4 to 10-second TW. In conclusion, for a between-subjects configuration, TW of 4 to 15 seconds are recommended.
In the within-subject configuration, the highest performance for the KNN model occurs with TW between 4 and 15 seconds, with an ACC between 83.6 % and 86.1 %, respectively. For the SVM model, the best-performing TW are between 2 and 10 seconds, with an average AUC of 0.91. It is also observed that, with the KNN model, the ACC and AUC results do not differ significantly between the between-subject and within-subject configurations. However, for the LR and SVM models there is a significant difference between configurations: both models perform better in the within-subject configuration.
In general terms, it can be concluded that, in the study of emotions using EEG signals, regardless of the experimental setup or the classifier model employed, the TW that exhibit optimal classifier performance, measured in terms of ACC, AUC and Cohen's Kappa coefficient, lie in the range of 2 to 15 seconds. Ultimately, increasing the TW duration above 20 seconds decreases the performance of the models; accordingly, TW of 20 seconds or longer are not recommended for emotion recognition.
Future research will focus on addressing the limitations identified in this study. This includes: a) conducting additional analyses with a larger sample of participants, b) exploring a comparative analysis between supervised and unsupervised classification methods, and c) considering additional entropy features, such as Threshold Entropy and Shannon Entropy.
Author contributions
A. J. S. Conceptualization (literature review, problem definition, theoretical framework selection, planning of instruments and tools), data curation, formal analysis, investigation, methodology (data analysis planning), project administration, writing original draft, writing review and editing, visualization, and resources. V. A. G. P. Conceptualization (literature review, review of alternative methodologies), methodology (methodology adjustments and peer review), supervision, validation, writing review, and editing. O. A. D. R. Conceptualization (feasibility analysis and preliminary methodological design), project administration, supervision, validation, investigation, funding acquisition, writing review, and editing.