1 Introduction
Researchers in Natural Language Processing (NLP) working on emotion classification have systematically set aside literary corpora for the development and evaluation of their models, mainly because of the complexity of literary discourse. Instead, corpora made up of encyclopedic documents (mainly Wikipedia), journalistic texts (newspapers or magazines) or specialized documents (legal, scientific or technical) have been employed more frequently in recent years [10, 2, 8]. In this work we introduce a new literary emotion corpus for evaluating and validating NLP algorithms on literary emotion classification tasks.
This paper is structured as follows. In Section 2 we review some works related to the development and analysis of Spanish corpora. In Section 3 we describe the LiSSS corpus, and in Section 4 the experimental setup, including the learning corpus CitasIn. The test and validation processes are described in Section 5, together with their respective results. Section 6 describes the availability of the corpus. Finally, in Section 7, we conclude and propose some ideas for future work.
2 Related Works
Several corpora in Spanish have been built and made available to the scientific community [3]; however, only a few of them have been classified into emotion categories. For example, the SAB corpus, composed of tweets in Spanish, was introduced in [6]. These tweets express critiques of different commercial brands. For each tweet, the perceived emotion must be indicated. The SAB corpus consists of 4 548 tweets annotated with 8 predefined emotions: {Trust, Satisfaction, Happiness, Love, Fear, Disaffection, Sadness, Anger}.
Another tweet data set is the TASS corpus [11]. It contains about 70 000 tweets classified using automatic methods into the following categories: {Positive, Negative, Neutral, None}. Tweets in the TASS corpus cover different topics: Politics, Economy, Sport, Music, etc.
A polarity emotion analysis (at word level) is described in [1]. The corpus employed was built with lexicons in 40 languages, annotated into the categories: {Positive, Negative}.
Our LiSSS corpus consists only of literary texts, which makes it particularly useful for studying algorithms for automatic emotion classification and literary text generation. Moreover, five emotion categories were defined for the classification, instead of a binary (positive-negative) scheme. This characteristic of LiSSS could be useful for more complete analyses.
3 LiSSS Corpus
The LiSSS corpus is a small but well-controlled corpus, composed exclusively of literary sentences in Spanish selected from universal literature and tagged manually by a pool of annotators.
3.1 Corpus Annotation
The LiSSS corpus was built manually using literary texts in Spanish from around 200 authors. Besides Spanish-speaking authors, we also include non-Spanish-speaking authors (keeping only official or good-quality translations) in order to enrich the emotional content, the vocabulary and the expressive sense of the corpus.
This corpus is made up of P “sentences”, understood in a broad sense: each “sentence” in this corpus is a complex linguistic object composed of one or several sentences, phrases or paragraphs. Henceforth in this paper, we will call this linguistic object a sentence. The sentences were taken from quotes, stories, novels and poems. The literary genre is homogeneous. The P sentences were classified by n annotators. All annotators in this study have a university-level education and are native Spanish speakers. Each sentence was read and manually classified into five categories: {Anger (A), Love (L), Fear (F), Happiness (H), Sadness/Pain (S)}.
Since a sentence may convey more than one emotion, the annotators could tag it with all the emotions they perceived. The sentences were manually processed to create n text and XML files, one per annotator. In the text version, each file contains P lines, with information structured in three fields:
ID Sentence # Author
Each field is separated by a tab character. The ID field is composed of a sequential number (1,2,3,...) followed by a code (A, L, F, H, S) corresponding to perceived emotions. In the XML version, the same structure is preserved using suitable XML tags.
If a sentence is considered as multi-emotion, it will have as many codes as categories it belongs to. The sentences were selected in order to maintain a balance between the categories, but this is not always guaranteed.
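The text format described above can be read with a few lines of code. The following is a minimal sketch in Python, assuming that each line has the layout ID, tab, sentence, tab, "# author"; the function name parse_line and the handling of the "#" marker are illustrative assumptions, not part of the distributed corpus.

```python
import re

EMOTIONS = "ALFHS"  # Anger, Love, Fear, Happiness, Sadness/Pain

# ID field: a sequential number followed by zero or more emotion codes.
ID_PATTERN = re.compile(r"^(\d+)([ALFHS]*)$")


def parse_line(line):
    """Split one tab-separated corpus line into its parts.

    Assumed layout (hypothetical): ID <TAB> sentence <TAB> # author
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    raw_id, sentence, author = fields
    match = ID_PATTERN.match(raw_id.strip())
    if match is None:
        raise ValueError(f"malformed ID field: {raw_id!r}")
    number, codes = match.groups()
    return {
        "number": int(number),
        # One entry per distinct code, in the fixed order A, L, F, H, S.
        "emotions": sorted(set(codes), key=EMOTIONS.index),
        "sentence": sentence.strip(),
        # Drop the leading '#' marker that precedes the author name.
        "author": author.strip().lstrip("#").strip(),
    }
```

A multi-emotion line such as the one with ID 329AL would yield the number 329 and the emotion list A, L under these assumptions.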
As mentioned, our “sentences” are often composed of several short sentences, forming mini-paragraphs. This preserves as much as possible the coherence of the text and the corresponding emotions. For example, sentence 455 of the emotion Fear (F), by J.P. Sartre:
Todos los hombres tienen miedo. El que no tiene miedo no es normal. No tiene nada que ver con el coraje.
is a three-sentence paragraph:
— Todos los hombres tienen miedo (All men are afraid).
— El que no tiene miedo no es normal (The one who is not afraid is not normal).
— No tiene nada que ver con el coraje (It has nothing to do with courage).
Approximately 10% of the sentences of the LiSSS corpus are mini-paragraphs. The multi-annotated corpus currently has P = 500 literary sentences, one per line. Support sentences (composed of non-literary language), as well as those that are too short (N ≤ 3 words) or too long (N ≥ 50 words), were ignored. A complex and aesthetic vocabulary, in which literary figures such as anaphora or metaphor appear, can be observed in this corpus. The characteristics of the LiSSS corpus are shown in Table 1.
Sentences | Paragraphs | Words per file | Total Words | Spanish Speaking authors | Translated authors | Annotators + vote |
500 | 49 (10%) | 9 401 | 112 812 | 37 | 164 | 12+2 |
We detected some sentences that were tagged with opposite emotions. This phenomenon, which stems from ambiguity, is commonly observed in this genre of text and makes classification difficult. Authors like Leo Tolstoy are known for their strong emotional style: for example, Tolstoy often writes about the contiguity between love and death. To better handle this ambiguity, the classifications performed by the annotators were compiled using two voting strategies, which produce an integrated classification (see Section 3.3).
We also attempted a characterization of the LiSSS corpus using a pool of artificial “annotators” for an experimental classification, i.e. a set of artificial random taggers with no knowledge of the emotions, their distribution or the textual content. The idea is to have an extended test corpus in order to measure the impact on the performance of baseline algorithms. We call the artificially annotated corpus LiSSS/Art; it is used in the experiments of Section 5.
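The text does not detail how the random taggers draw their labels. As an illustration only, the sketch below assumes the simplest option: each artificial annotator picks a single emotion uniformly at random for every sentence. The function name, the seed and the single-label choice are assumptions of this sketch.

```python
import random

EMOTIONS = ["A", "L", "F", "H", "S"]


def random_annotators(num_sentences, num_annotators, seed=0):
    """Return one list of random labels per artificial annotator.

    Assumption: each tagger picks a single emotion uniformly at random,
    ignoring the text, the author and the class distribution.
    """
    rng = random.Random(seed)
    return [
        [rng.choice(EMOTIONS) for _ in range(num_sentences)]
        for _ in range(num_annotators)
    ]


# 15 artificial annotators over the 500 sentences of the corpus.
annotations = random_annotators(num_sentences=500, num_annotators=15)
```

Fixing the seed keeps one draw reproducible; repeating the call with different seeds would give several independent random draws.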
The LiSSS corpus has the advantage of being homogeneous in terms of genre, containing only “literary sentences”, while being heterogeneous in terms of emotion classes. In other emotion corpora, the texts are overloaded with support sentences: linguistic structures that give fluency to the reading and provide the necessary relations between the ideas expressed in literary sentences. This is a disadvantage when the aim is to analyse and process literary texts. Corpora composed of tweets are not ideal for literary purposes because of the presence of noise such as symbols, special characters, cut phrases, pasted words, wrong syntax, etc. This noise was avoided through repeated and careful reading of the LiSSS corpus.
The LiSSS corpus does have the disadvantage of containing a small number of sentences, which makes it unsuitable for training algorithms based on Machine Learning (ML). Its goal, however, is not to be used in the learning process but to test the quality and performance of literary or emotion analysis algorithms.
3.2 Agreement
We define the agreement ci(k, e) over the triplet formed by the annotators ak, k = 1...n, the emotions ej, j ∈ {A, L, F, H, S}, and the sentences i = 1...P, as follows:
where 0 ≤ ci ≤ 1 is a value that represents the emotion agreement among the annotations of sentence i: 0 means no agreement and 1 means perfect agreement.
Considering the k = 1...n annotators, we calculate the weighted mean agreement over all P sentences as:
For example, suppose that n = 8 human annotators ak, k = 1...8, have annotated sentence #75:
75 En sustancia, es una misma cosa odio y amor. # Giordano Bruno
as follows:
a1=ALS, a2=AL, a3=AL, a4=AL, a5=AL, a6=AL, a7=L, a8=F.
Considering Equation (1), for each emotion we have:
Human Voters
We obtain a mean agreement value C = 82.2%, computed with n = 12 annotators over P = 500 sentences. Table 2 shows the agreement among the annotators (the matrix values) and each annotator’s agreement with the voting strategy (the last column). We can see that annotator a11 differs from the others: his lowest agreement value, 59.5%, is with annotator a12. The highest agreement value, 89.5%, is obtained between annotators a2 and a10. The annotators with the highest agreement with the vote are a2 and a4, with values near 91%. The “worst” annotators (in the sense of agreement) are a1, a11 and a12. This information is used in the pilot test classification described in Section 5.
# | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | a10 | a11 | a12 | Voting |
a1 | • | 71.4 | 69.7 | 69.9 | 72.1 | 66.9 | 70.5 | 73.3 | 69.2 | 69.9 | 62.8 | 60.1 | 72.0 |
a2 | 71.4 | • | 87.3 | 88.1 | 84.6 | 75.0 | 85.2 | 83.0 | 87.1 | 89.5 | 63.2 | 61.4 | 91.0 |
a3 | 69.7 | 87.3 | • | 85.5 | 83.9 | 72.4 | 83.9 | 82.1 | 83.9 | 85.9 | 64.6 | 61.6 | 88.2 |
a4 | 69.9 | 88.1 | 85.5 | • | 84.9 | 76.3 | 84.8 | 82.1 | 86.1 | 85.5 | 64.2 | 61.9 | 91.1 |
a5 | 72.1 | 84.6 | 83.9 | 84.9 | • | 74.3 | 82.4 | 83.6 | 82.7 | 83.6 | 65.1 | 61.8 | 87.7 |
a6 | 66.9 | 75.0 | 72.4 | 76.3 | 74.3 | • | 74.5 | 74.3 | 74.1 | 73.3 | 62.2 | 62.4 | 77.6 |
a7 | 70.5 | 85.2 | 83.9 | 84.8 | 82.4 | 74.5 | • | 79.6 | 84.9 | 84.4 | 64.7 | 62.6 | 88.1 |
a8 | 73.3 | 83.0 | 82.1 | 82.1 | 83.6 | 74.3 | 79.6 | • | 79.2 | 81.5 | 65.6 | 62.1 | 85.0 |
a9 | 69.2 | 87.1 | 83.9 | 86.1 | 82.7 | 74.1 | 84.9 | 79.2 | • | 85.6 | 63.8 | 61.8 | 88.9 |
a10 | 69.9 | 89.5 | 85.9 | 85.5 | 83.6 | 73.3 | 84.4 | 81.5 | 85.6 | • | 63.7 | 61.8 | 88.7 |
a11 | 62.8 | 63.2 | 64.6 | 64.2 | 65.1 | 62.2 | 64.7 | 65.6 | 63.8 | 63.7 | • | 59.5 | 65.2 |
a12 | 60.1 | 61.4 | 61.6 | 61.9 | 61.8 | 62.4 | 62.6 | 62.1 | 61.8 | 61.8 | 59.5 | • | 63.2 |
Mean | 71.3 | 81.3 | 80.0 | 80.7 | 79.9 | 73.8 | 79.8 | 78.9 | 79.9 | 80.4 | 66.6 | 64.8 | 82.2 |
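The figures quoted from Table 2 can be re-derived from its last column. The short Python sketch below only recomputes the mean of the Voting column and ranks the annotators by that value; the variable names are illustrative.

```python
# Last column of Table 2: agreement of each annotator with the vote (%).
voting_agreement = {
    "a1": 72.0, "a2": 91.0, "a3": 88.2, "a4": 91.1,
    "a5": 87.7, "a6": 77.6, "a7": 88.1, "a8": 85.0,
    "a9": 88.9, "a10": 88.7, "a11": 65.2, "a12": 63.2,
}

mean_agreement = sum(voting_agreement.values()) / len(voting_agreement)

# Rank annotators from lowest to highest agreement with the vote.
ranked = sorted(voting_agreement, key=voting_agreement.get)
lowest_three = ranked[:3]
highest_three = ranked[-3:]

print(f"mean agreement: {mean_agreement:.1f}%")
print("lowest:", lowest_three, "highest:", highest_three)
```

The mean rounds to 82.2%, the three lowest values belong to a12, a11 and a1, and the three highest to a9, a2 and a4.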
Several sentences have overlapping emotions. They were tagged by the voting algorithm, which processes the classifications of all the annotators. The columns A/x, F/x, H/x, L/x, S/x in Table 3 represent single emotions vs overlapping emotions. For example, the voters tagged 89.5 sentences only as L and 48.5 sentences combining L with other emotions x. The voters tagged 153 sentences of the corpus as multi-emotion. An example of this kind of sentence is number 329, tagged with an identifier belonging to the emotions Anger (A) and Love (L):
329AL Del amor al odio, solo hay mas amor. # Mario Benedetti
(From love to hate, there is only more love)
The matrix in Table 4 shows the mean class distribution, calculated by dividing the number of sentences tagged with each emotion by the number of annotators. We thus have ≈ 18% multi-emotion sentences in the LiSSS corpus. Then, for each class, we obtain the overlapping degree b/(a + b), where a is the mean number of mono-class sentences and b is the mean number of multi-class sentences. This represents the fraction of sentences combining one class with the others. This ambiguity is mainly observed in the pairs of emotions Happiness–Sadness/Pain and Love–Sadness/Pain, with overlaps of HS = 18.7 and LS = 19.9 multi-emotion sentences, respectively. Combined, literary complexity and multiple emotions represent a challenge for classification algorithms.
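As a worked illustration of the overlapping degree b/(a + b), the sketch below applies it to the two figures quoted in the text for Love (89.5 sentences tagged only as L, 48.5 combining L with other emotions). Using these two values as a and b is an assumption made here for illustration; the function name is hypothetical.

```python
def overlap_degree(mono_mean, multi_mean):
    """Fraction b / (a + b) of a class's sentences shared with other classes."""
    total = mono_mean + multi_mean
    if total == 0:
        raise ValueError("class has no sentences")
    return multi_mean / total


# Love (L): 89.5 sentences tagged only as L, 48.5 combining L with others.
love_overlap = overlap_degree(mono_mean=89.5, multi_mean=48.5)
print(f"{love_overlap:.3f}")  # 48.5 / 138
```

Under this reading, roughly 35% of the sentences carrying the Love label also carry another emotion.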
Artificial Voters
For the LiSSS/Art corpus, we have an average agreement CArt (Equation 3) computed over the Ak voters, k = 1..15 (20 random draws).
Since the artificial annotators are equally likely, it is not necessary to show the complete agreement matrix, only its average value, as shown in Table 5. We computed an average agreement of 53.2% for the voting strategy, against 82.2% for the vote of the n = 12 human annotators. These values indicate that the random annotators have no real consensus (they agree on only about half of their votes), whereas the humans show a strong consensus.
3.3 Voting Strategies
We have defined two voting strategies in order to combine in a reasonable way the information provided by the n annotators. Every sentence must have at least one emotion: annotations with no emotion selected are therefore excluded. The first strategy is a simple majority vote and the second is a more elaborate democratic vote.
Majority Vote
This is a winner-take-all strategy: the output is the emotion with the most votes. The output class is therefore always mono-label.
Democratic Vote
We fixed a threshold t = 0.5 (50%) and keep the emotion(s) selected by a fraction of the n voters of at least t. The output is computed as follows: we calculate the probability p(e) = Count(e)/n, e ∈ {A, L, F, H, S}, over all voters. If one or more emotions e have p(e) ≥ t, the process is finished and the output is the concatenation of the emotions having p(e) ≥ t.
If not, the threshold is lowered to t = 0.3 and the output is re-computed. Finally, if no emotion reaches this threshold, we lower it to t = 0.2 and re-compute the output. t = 0.2 seems to be a suitable threshold for the hypothetical case where a human has annotated a sentence with all possible emotions. In this case, each emotion e has at least a probability p(e) = 0.2. With the democratic vote, the output may be multi-labelled.
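The two strategies can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the corpus: the tie-breaking rule of the majority vote and the order in which emotions are concatenated are not specified in the text and are chosen here arbitrarily (fixed order A, L, F, H, S). The example input is the set of eight annotations of sentence #75 from Section 3.2, with the sixth annotation read as AL.

```python
from collections import Counter

EMOTIONS = "ALFHS"  # fixed output order (an assumption of this sketch)
THRESHOLDS = (0.5, 0.3, 0.2)


def emotion_counts(annotations):
    """Count, for each emotion, how many voters selected it."""
    counts = Counter()
    for labels in annotations:
        counts.update(set(labels))  # one vote per emotion per voter
    return counts


def majority_vote(annotations):
    """Winner-take-all: return the single most voted emotion."""
    counts = emotion_counts(annotations)
    # Ties are broken by the order in EMOTIONS (assumption).
    return max(EMOTIONS, key=lambda e: counts[e])


def democratic_vote(annotations):
    """Keep every emotion with p(e) = Count(e)/n at or above the threshold."""
    n = len(annotations)
    counts = emotion_counts(annotations)
    for t in THRESHOLDS:
        kept = [e for e in EMOTIONS if counts[e] / n >= t]
        if kept:
            return "".join(kept)
    return ""  # safeguard: no emotion reaches even the lowest threshold


sentence_75 = ["ALS", "AL", "AL", "AL", "AL", "AL", "L", "F"]
print(majority_vote(sentence_75), democratic_vote(sentence_75))
```

For this input, p(L) = 7/8 and p(A) = 6/8, so the majority vote gives L and the democratic vote gives AL. Note that when every voter selects at least one emotion, the counts sum to at least n over five emotions, so some emotion always has p(e) ≥ 0.2 and the final empty-string fallback is only a safeguard.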
4 Experimental Setup
4.1 Classification Algorithms
The LiSSS corpus was tested using several classification algorithms available in the Weka libraries. In particular, we have employed:
Naïve Bayes, classical implementation,
Naïve Bayes Multinomial (NBM), oriented to textual classification,
Support Vector Machine (SVM), using a polynomial kernel with a multinomial logistic regression calibrator,
A mixture of the three preceding classifiers (NB+SVM+NBM).
We decided to use the Naïve Bayes model because of its wide use in classification work. A different implementation of Naïve Bayes, known as Multinomial Naïve Bayes, was also used because of its high precision on text mining tasks, where it takes into account the estimated frequency of terms [9]. We also tested a standard implementation of SVM to compare the performance of the models on the LiSSS corpus.
These algorithms need to be trained to produce a classification model. The training process must be performed with an independent corpus from the test corpus. We decided to build a learning corpus suitable for this task, adapting it to the five categories of the LiSSS corpus.
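To make the role of term frequencies in Multinomial Naïve Bayes concrete, here is a small self-contained sketch in Python. It is not the Weka implementation used in the experiments: it is a textbook version with add-one (Laplace) smoothing, and the class name, the toy training documents and their labels are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token counts, with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
        self.vocabulary = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            class_total = sum(self.term_counts[label].values())
            # log P(class) + sum of log P(term | class), Laplace-smoothed
            score = math.log(doc_count / total_docs)
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore terms never seen in training
                count = self.term_counts[label][token]
                score += math.log((count + 1) / (class_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical), two documents per class.
train_docs = [
    ["amor", "beso", "pasión"],
    ["amor", "alma", "belleza"],
    ["miedo", "soledad", "dolor"],
    ["miedo", "fracaso", "problema"],
]
train_labels = ["L", "L", "F", "F"]
model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict(["amor", "alma"]), model.predict(["miedo", "dolor"]))
```

On this toy data the first query is assigned to L and the second to F, because the smoothed term frequencies of each query are higher in the corresponding class.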
4.2 Learning Corpus
For the training process, we built a learning corpus. The https://citas.in website contains several thousand documents in Spanish (mostly literary): sentences, paragraphs, quotes, phrases, etc. A large number of documents belonging to different categories (friendship, lovers, beauty, success, happiness, laughter, enmity, deception, anger, fear, etc.) were retrieved from this website. The documents were manually grouped into the five classes of the LiSSS corpus according to their original categories (last column of Table 6).
Emotion | Sentences | Words | Words per sentence | Categories (https://citas.in/temas/) |
CitasIn (total) | 72 790 | 1 352 810 | 18.6 | |
L | 14 738 | 264 339 | 17.9 | alma, amantes, amistad, amor, belleza, beso, esperanza, pasión |
H | 13 647 | 256 697 | 18.8 | felicidad, amistad, diversión, sonrisa, risa, motivación, victoria, éxito, optimismo |
A | 15 043 | 280 784 | 18.7 | egoísmo, enemistad, engaño, envidia, venganza, guerra, infierno, mentira, odio, muerte |
F | 14 773 | 275 059 | 18.6 | necesidad, miedo, dolor, fracaso, indecisión, problema, soledad, suicidio |
S | 14 589 | 275 931 | 18.9 | despedida, tristeza, pena, enfermedad, fracaso, pérdida, sufrimiento, olvidando, llorar, lágrima |
The result is the CitasIn corpus, whose size is adequate for the training phase of the classifiers.
Table 6 shows some features of the CitasIn corpus. We pre-processed the CitasIn corpus before the training phase. The texts were encoded in UTF-8, and we removed the special symbols as well as the stop words, using the Weka libraries and stop lists. We normalized the words by converting capital letters to lower case. Finally, a tokenisation process was applied using Weka. For the learning process, we also removed from the CitasIn corpus the sentences it had in common with the LiSSS corpus.
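The pre-processing steps listed above (lower-casing, removal of special symbols and stop words, tokenisation, and removal of sentences shared with the test corpus) can be approximated as follows. This Python sketch does not reproduce the Weka pipeline: the stop list is a tiny hypothetical sample, tokenisation is a plain whitespace split, and two sentences are treated as identical when their normalised token sequences match, which is an assumption of this sketch.

```python
import re

# Tiny illustrative stop list; the experiments rely on Weka's stop lists.
STOP_WORDS = {"el", "la", "los", "las", "de", "que", "no", "es", "con", "y"}


def preprocess(text):
    """Lowercase, strip symbols, tokenise on whitespace, drop stop words."""
    text = text.lower()
    # Replace anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def remove_overlap(training_sentences, test_sentences):
    """Drop training sentences whose normalised form occurs in the test set."""
    test_keys = {tuple(preprocess(s)) for s in test_sentences}
    return [s for s in training_sentences
            if tuple(preprocess(s)) not in test_keys]


tokens = preprocess("Todos los hombres tienen miedo.")
print(tokens)
```

Comparing normalised forms rather than raw strings means that differences in case or punctuation do not let a test sentence leak into the training data.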
5 Results and Discussion
We characterized the LiSSS corpus through two different experiments. The first one is a test using the LiSSS corpus with the classes defined by the two voting strategies, considering the classifications of all the annotators. The second one is a pilot test using a subset of the LiSSS corpus, where atypical annotators were removed in order to measure the impact of their inconsistent classifications. In both cases, CitasIn was employed as the learning corpus. We measured the classical Precision, Recall and F-score values.
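For reference, the per-class Precision, Recall and F-score and their unweighted mean over the five emotions can be computed as in the following Python sketch. It assumes single-label outputs (as with the majority vote) and a simple average of the per-class F-scores; the function names and the toy labels are illustrative.

```python
def per_class_scores(gold, predicted, classes="ALFHS"):
    """Return {class: (precision, recall, f_score)} for single-label outputs."""
    scores = {}
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f_score)
    return scores


def mean_f_score(scores):
    """Unweighted mean of the per-class F-scores."""
    return sum(f for _, _, f in scores.values()) / len(scores)


# Toy example (hypothetical labels, not taken from the corpus).
gold = ["A", "A", "L", "L", "F", "H", "S", "S"]
pred = ["A", "L", "L", "L", "F", "H", "S", "A"]
scores = per_class_scores(gold, pred)
print(scores["A"], round(mean_f_score(scores), 4))
```

An unweighted mean gives each emotion the same importance regardless of how many sentences it has, so a single poorly recognised class lowers the overall score noticeably.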
5.1 Classification using All Annotators
We experimented using the algorithms presented in Section 4 to validate the performance of the human annotators and the performance of an artificial “mean” annotator (the mean output of the 15 artificial annotators, see Section 3). The LiSSS corpus, re-annotated by the two voting strategies (n = 12 human annotators), was employed as the test corpus. Table 7 shows the average F-score obtained for each human annotator and for the mean of the artificial annotators, taking the democratic and majority votes as references.
Annotator | Majority vote | Democratic vote |
a1 | 62.71 | 31.85 |
a2 | 93.41 | 57.93 |
a3 | 89.96 | 50.23 |
a4 | 92.86 | 73.57 |
a5 | 85.92 | 65.28 |
a6 | 79.11 | 45.62 |
a7 | 88.16 | 64.15 |
a8 | 82.56 | 64.62 |
a9 | 90.49 | 69.74 |
a10 | 89.04 | 56.16 |
a11 | 49.24 | 21.75 |
a12 | 52.57 | 18.00 |
⟨HUMAN⟩ | 79.64 | 51.57 |
⟨ARTIFICIAL⟩ | 24.76 | 7.08 |
We can see that it is more difficult to obtain agreement between annotators (human or artificial) using the democratic vote. Therefore, the classifiers were tested using only the majority class output (the majority emotion label per sentence). Table 8 shows the F-score obtained by each classification algorithm on each emotion.
Algorithm | A | F | H | L | S | F-score mean |
SVM | 55.59 | 55.34 | 57.44 | 50.65 | 4.71 | 44.74 |
NB | 43.08 | 60.08 | 53.68 | 60.29 | 16.33 | 46.69 |
NBM | 54.05 | 66.34 | 77.78 | 67.73 | 33.03 | 59.79 |
NB+SVM+NBM | 51.76 | 60.43 | 67.35 | 70.05 | 22.68 | 54.45 |
The best result was obtained by the NBM algorithm, with a mean F-score of 59.79. This seemingly mediocre result shows the difficulty of classifying emotions in a literary corpus. We detected two main problems in the classification of this type of text. Firstly, the complexity of the lexicon found in the corpus. Secondly, the ambiguity: the mass of about 30% of multi-emotion sentences causes confusion in the classification methods. This behaviour can be seen in the results for the Sadness emotion, which has the highest overlapping score (33.03%) and the lowest F-scores, between 4.71 for SVM and 16.33 for Naïve Bayes. Finally, the mixture of algorithms NB+SVM+NBM obtains the second-best performance, with a mean F-score of 54.45.
5.2 Pilot Test using Selected Combinations of Annotators
For the second experiment, the idea was to verify how much the classification results would change when eliminating the annotators with the lowest or the highest agreement values in the voting strategy. However, we think that there are no “bad” or “good” human annotators in this subjective classification task, only consistent or inconsistent emotion perceptions. We also study the impact on the performance of the classification algorithms of using the artificially annotated corpus LiSSS/Art (see Section 3).
Therefore, we have employed 3 supplementary test corpora. The first one, LiSSS/–, excludes the 3 annotators having the lowest agreement with the vote (a1, a11 and a12).
The second one, LiSSS/+, excludes the 3 annotators having the highest agreement with the vote (a2, a4 and a9); and the last corpus, LiSSS/Art, corresponds to the voting strategy using all 15 artificial annotators.
In this experiment, we tested only the algorithm that obtained the best F-score performance, i.e. the NBM algorithm (see Table 8). The results per emotion are shown in Table 9.
Test corpus | A | F | H | L | S | F-score mean |
LiSSS/– | 53.33 | 66.67 | 77.42 | 67.21 | 34.55 | 59.83 |
LiSSS/+ | 52.17 | 65.69 | 73.10 | 71.04 | 30.08 | 58.41 |
LiSSS/Art | 30.27 | 27.68 | 16.22 | 9.33 | 9.84 | 18.66 |
It can be observed that the suppression of the “inconsistent” annotators (LiSSS/– test) only slightly affects the F-score of the NBM algorithm (it goes from 59.79 to 59.83), with emotions F and S slightly better classified. On the other hand, the suppression of the “consistent” annotators lowers the performance to 58.41 (LiSSS/+ test).
Finally, the F-scores measured on the LiSSS/Art corpus are the lowest of all the experiments, as expected. These results confirm the real complexity of this classification task, and suggest that it could be a good idea to verify the annotators’ agreement in order to build a more coherent test set.
6 Corpus Availability
Version 0.5x12 of the LiSSS corpus (distributed in several files encoded in UTF-8 with Linux EOL, n = 12 annotators) is available on our website under the GPL3 public license:
— n files containing: ID emotion codes, the sentence and the author in plain text and XML.
— 2n files POS tagged (2 formats) of annotated files using Freeling 4.1.
— 2 files containing the output of n voters (democratic and majority vote): ID, emotion(s), sentence and author in plain text and XML.
— 4 files containing the POS tagged version of votes output using Freeling 4.1.
7 Conclusion and Future Work
We have introduced the LiSSS corpus, a new multi-annotated and multi-emotion literary corpus in Spanish. The manual multi-classification has allowed us to establish a suitable voting strategy. The results obtained show that the multi-emotion classification of this kind of document is a very difficult task (for both machines and humans): the low F-score of the annotators in the democratic vote (≈ 51%) seems to confirm it.
We have tested some classical classifiers on the LiSSS corpus. The sentences often belong to two or more classes, and this overlap between the classes prevents the systems from classifying this literary corpus better. We think that automatic classifiers could be enriched through the integration of linguistic and stylistic characteristics, or of rich representations such as word embeddings, to achieve a better classification [12, 4, 7, 5]; but this study is beyond the scope of this paper. The purpose of the LiSSS corpus is to evaluate the efficiency of classification and ML algorithms on a specialized corpus, not to train such algorithms.
Future work is needed to enrich the corpus with a larger number of sentences and more annotators. The scientific community can contribute by modifying or distributing this corpus under the GPL3 license.