1 Introduction
In recent years, there has been a lot of research in Natural Language Processing (NLP) on sentiment analysis [12, 13, 11, 10].
Stance detection can be viewed as a subtask of opinion mining, similar to sentiment analysis. In sentiment analysis, systems determine whether a piece of text is positive, negative, or neutral. Stance detection goes even further and tries to detect whether the author of the text is in favor of or against a given target. The main difference from sentiment analysis is that in stance detection, systems are to determine the author’s favorability towards a given target, and the target may not even be explicitly mentioned in the text. Moreover, the text may express a positive opinion about an entity contained in the text, while one can also infer that the author is against the defined target (an entity or a topic). It has been found difficult to infer stance towards a target of interest from tweets that express opinion towards another entity [8].
Many applications could benefit from automatic stance detection, including information retrieval, textual entailment, and text summarization, in particular opinion summarization.
The same stance towards a target may be expressed by positive or negative language. This phenomenon has not yet been thoroughly investigated. The pioneering work on English tweets [9] annotated a stance dataset with additional sentiment labels and showed that knowing the sentiment label is beneficial for stance detection; however, the authors also state that “even though sentiment can play a key role in detecting stance, sentiment alone is not sufficient”.
Our goal is to examine how stance and sentiment influence each other in the Czech language and to either confirm or reject the hypothesis that sentiment labels are beneficial for stance detection.
The rest of this paper is organized as follows. Section 2 presents the related work. The dataset is described in Section 3. The annotation of sentiment is covered in Section 4. Our approach is presented in Section 5. Conducted experiments are described in Section 6. Finally, we conclude in Section 7.
2 Related Work
The SemEval-2016 task Detecting Stance in Tweets [8] had two subtasks: supervised and weakly supervised stance identification.
The goal of both subtasks was to classify tweets into three classes (In favor, Against, and Neither). Performance was measured by the macro-averaged F1-score of two classes (In favor and Against), denoted F1ma2, and by the micro-averaged F1-score of the same two classes, denoted F1mi2. These measures do not disregard the Neither class, because falsely labelling a Neither instance as In favor or Against still lowers the scores. We also use F1ma2, and additionally report accuracy and the macro-averaged F1-score of all three classes (F1ma3).
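To make the two-class averaging concrete, the following minimal sketch computes F1ma2 and F1ma3 from gold and predicted labels. The label names and function names are our own illustrative choices, not part of the SemEval evaluation script. As a check, it scores a predictor that always answers Against on the stance label totals of the Czech dataset reported in Table 3 (691 In favor, 1263 Against, 684 Neither).

```python
def class_counts(gold, pred, label):
    """Return (true positives, false positives, false negatives) for one label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); taken as 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1-scores over `labels`."""
    scores = [f1_from_counts(*class_counts(gold, pred, label)) for label in labels]
    return sum(scores) / len(scores)


def micro_f1(gold, pred, labels):
    """F1 computed from the counts pooled over `labels`."""
    counts = [class_counts(gold, pred, label) for label in labels]
    tp, fp, fn = (sum(column) for column in zip(*counts))
    return f1_from_counts(tp, fp, fn)


# Hypothetical label names; the counts are the stance totals from Table 3.
gold = ["favor"] * 691 + ["against"] * 1263 + ["neither"] * 684
pred = ["against"] * len(gold)  # majority-class predictor

f1ma2 = macro_f1(gold, pred, ["favor", "against"])
f1ma3 = macro_f1(gold, pred, ["favor", "against", "neither"])
print(round(100 * f1ma2, 1), round(100 * f1ma3, 1))
```

Rounded to one decimal, the two values agree with the Majority Class row for stance in Table 6 (32.4 and 21.6). Because a wrongly predicted Neither instance counts as a false positive for In favor or Against, the Neither class still influences F1ma2 even though its own F1-score is not averaged in.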
The supervised task (subtask A) tested stance towards five targets: Atheism, Climate Change is a Real Concern, Feminist Movement, Hillary Clinton, and Legalization of Abortion. Participants were provided with 2814 labeled training tweets for the five targets.
A detailed distribution of stances for each target is given in Table 1. The distribution is not uniform and there is always a preference towards a certain stance. The distribution reflects the real-world scenario, in which a majority of people tend to take a similar stance [2].
Target Entity | Total | In favor | Against | Neither |
---|---|---|---|---|
Atheism | 733 | 124 (17%) | 464 (63%) | 145 (20%) |
Climate Change is Concern | 564 | 335 (59%) | 26 (5%) | 203 (36%) |
Feminist Movement | 949 | 268 (28%) | 511 (54%) | 170 (18%) |
Hillary Clinton | 934 | 157 (17%) | 533 (57%) | 244 (26%) |
Legalization of Abortion | 883 | 151 (17%) | 523 (59%) | 209 (24%) |
All | 4,063 | 1,035 (25%) | 2,057 (51%) | 971 (24%) |
For the weakly supervised task (subtask B), there were no labeled training data but participants could use a large number of tweets related to the single target: Donald Trump.
The best results (F1ma2 56.0%, F1mi2 67.8%) for subtask A were achieved by an advanced baseline using an SVM classifier with unigrams, bigrams, and trigrams along with character n-grams (2, 3, 4, and 5-gram) as features.
Wei et al. [15] present the best result for subtask B, and they ranked a close second in subtask A of the SemEval stance detection task. They used a convolutional neural network (CNN) designed according to Kim [4].
They initialized the embedding layer with pre-trained word2vec embeddings. The main difference from Kim’s network is the voting scheme. During each training epoch, several iterations were selected to predict the test set. At the end of each epoch, majority voting was applied to determine the label for each sentence. This was done over a specified number of epochs, and finally the same voting was applied to the results of the individual epochs. The training and test data were separated according to the stance targets.
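The two-level voting described above can be sketched as follows. This is only our reading of the description, not the implementation of Wei et al. [15]; the nested-list layout of the predictions and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def majority(labels):
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def vote_over_epochs(predictions):
    """Two-level majority vote.

    predictions[e][i][s] is the label predicted for sentence s by the
    i-th selected iteration of epoch e (assumed layout).
    """
    epoch_votes = []
    for epoch in predictions:
        # zip(*epoch) regroups the iteration-wise predictions per sentence.
        epoch_votes.append([majority(votes) for votes in zip(*epoch)])
    # The same vote is then applied across the per-epoch results.
    return [majority(votes) for votes in zip(*epoch_votes)]


predictions = [
    [["F", "A"], ["F", "N"], ["A", "N"]],  # epoch 1, three iterations, two sentences
    [["A", "A"], ["A", "A"], ["F", "N"]],  # epoch 2
    [["F", "N"], ["F", "N"], ["F", "A"]],  # epoch 3
]
print(vote_over_epochs(predictions))
```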
Mohammad et al. [9] annotated the SemEval-2016 task Detecting Stance in Tweets dataset [8] with sentiment labels and with whether the opinion is expressed towards the given stance target. They performed a detailed analysis of the dataset and conducted several experiments. They showed that the sentiment label is beneficial for stance detection, but that it is not sufficient (F1ma2 56.1%, F1mi2 59.6%).
2.1 Stance Detection in Czech
The initial research on Czech stance detection was done by Krejzl et al. [6]. They collected 1,460 comments from a Czech news server related to two topics: the Czech president “Miloš Zeman” (181 In favor, 165 Against, and 301 Neither) and “Smoking Ban in Restaurants” (168 In favor, 252 Against, and 393 Neither).
Hercig et al. [2] extended the dataset from Krejzl et al. [6]. The detailed annotation procedure was described in [3] (in Czech). The whole corpus was annotated by three native speakers. The distribution of stances for each target is given in Table 2. They evaluated Maximum Entropy, SVM, and two CNN classifiers. We annotated the Czech president “Miloš Zeman” part of this corpus with sentiment labels. We chose this dataset because of its size and better inter-annotator agreement. The best results for this dataset were achieved by the CNN designed according to Kim [4] and the Maximum Entropy classifier.
3 Dataset
The dataset for the target entity “Miloš Zeman” was annotated by one annotator and then 302 comments were also labeled by a second annotator to measure inter-annotator agreement. The dataset for the target entity “Smoking Ban in Restaurants” was independently annotated by two annotators (2,203 comments) and then the majority voting scheme was applied to the gold label selection (a third annotator was used to resolve conflicts). The inter-annotator agreement (Cohen’s κ) is 0.579 for “Miloš Zeman” and 0.423 for “Smoking Ban in Restaurants”.
The inter-annotator agreement for “Smoking Ban in Restaurants” was quite low, so its authors selected the subset of the dataset in which the original two annotators assigned the same label and used it as the gold dataset (1,388 comments).
4 Annotation
We annotated the Czech president - “Miloš Zeman” stance detection dataset with sentiment labels (positive, negative, and neutral).
The whole dataset was annotated by one annotator, and a second annotator then labeled 131 comments so that inter-annotator agreement (Cohen’s κ) could be calculated. The annotators were asked to assign the strongest sentiment to each comment, or the neutral label when the comment is factual (non-subjective), without anticipating further information (context). The inter-annotator agreement (Cohen’s κ) is 0.524 (see the confusion matrix in Table 4) and the accuracy is 71.8%.
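For reference, Cohen’s κ corrects the observed agreement between two annotators for the agreement expected by chance. The sketch below is a minimal implementation on two parallel label lists; the function name and the toy labels are illustrative only and are not taken from the annotated data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e the agreement expected from the annotators' label frequencies.
    Undefined (division by zero) when p_e equals 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy example with hypothetical labels.
annotator_1 = ["positive", "negative", "neutral", "neutral"]
annotator_2 = ["positive", "negative", "neutral", "negative"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

In the toy example the annotators agree on three of four items (observed agreement 0.75), but the chance-corrected κ is noticeably lower, which mirrors the gap between accuracy and κ reported above.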
Sentiment/Stance | In Favor | Against | Neither | SUM |
---|---|---|---|---|
Positive | 164 (6.2%) | 43 (1.6%) | 20 (0.8%) | 227 (8.6%) |
Negative | 116 (4.4%) | 614 (23.3%) | 83 (3.1%) | 813 (30.8%) |
Neutral | 411 (15.6%) | 606 (23.0%) | 581 (22.0%) | 1598 (60.6%) |
SUM | 691 (26.2%) | 1263 (47.9%) | 684 (25.9%) | 2638 (100%) |
Table 3 shows the distribution of sentiment and stance labels in the extended dataset. While the largest share of comments is against the target, the sentiment of most comments is neutral and only a small portion of the dataset is positive. Most of the comments that are in favor of the target are neutral, which means that these comments are non-subjective. The comments against the target, however, are most often negative (614, only narrowly ahead of the 606 neutral ones) and almost none are positive. The comments neither for nor against the target are mostly neutral, as expected. Positive sentiment mostly coincides with being in favor of the target and negative sentiment mostly coincides with being against it, whereas neutral sentiment is spread fairly evenly across the stance labels.
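The statements above can be checked by normalising the counts of Table 3 by rows (stance given sentiment) and by columns (sentiment given stance). The following sketch does exactly that; the counts are copied from Table 3, while the variable and function names are ours.

```python
# Counts copied from Table 3: sentiment label -> stance label -> number of comments.
TABLE_3 = {
    "Positive": {"In Favor": 164, "Against": 43, "Neither": 20},
    "Negative": {"In Favor": 116, "Against": 614, "Neither": 83},
    "Neutral": {"In Favor": 411, "Against": 606, "Neither": 581},
}


def stance_given_sentiment(table):
    """Row-normalised percentages: how each sentiment splits over the stances."""
    result = {}
    for sentiment, row in table.items():
        total = sum(row.values())
        result[sentiment] = {s: round(100 * c / total, 1) for s, c in row.items()}
    return result


def sentiment_given_stance(table):
    """Column-normalised percentages: how each stance splits over the sentiments."""
    stances = next(iter(table.values())).keys()
    result = {}
    for stance in stances:
        total = sum(row[stance] for row in table.values())
        result[stance] = {
            sentiment: round(100 * row[stance] / total, 1)
            for sentiment, row in table.items()
        }
    return result


by_sentiment = stance_given_sentiment(TABLE_3)
by_stance = sentiment_given_stance(TABLE_3)
for sentiment, shares in by_sentiment.items():
    print(sentiment, shares)
for stance, shares in by_stance.items():
    print(stance, shares)
```

For instance, 72.2% of the positive comments are in favor of the target and 75.5% of the negative comments are against it, while the comments against the target split into 48.6% negative and 48.0% neutral.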
We also labeled the comments for the presence of the “Miloš Zeman” entity and the “president” entity. The distribution of entities by stance and sentiment labels is shown in Table 5. The presence of these entities was detected by regular expressions.
(a) Presence of Entities by Stance

Stance | Miloš Zeman present | Miloš Zeman not present | President present | President not present |
---|---|---|---|---|
In Favor | 364 | 327 | 187 | 504 |
Against | 688 | 575 | 333 | 930 |
Neither | 435 | 249 | 212 | 472 |

(b) Presence of Entities by Sentiment

Sentiment | Miloš Zeman present | Miloš Zeman not present | President present | President not present |
---|---|---|---|---|
Positive | 130 | 97 | 69 | 158 |
Negative | 412 | 401 | 216 | 597 |
Neutral | 945 | 653 | 447 | 1151 |
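The regular expressions used for entity detection are not reproduced here, so the following is only a sketch of how such a detector might look. Both patterns are hypothetical: they match a stem followed by arbitrary word characters in order to cover Czech inflected forms (e.g. “Zemanovi”, “prezidentem”), and they may over-match unrelated words that share the stem.

```python
import re

# Hypothetical patterns, not the ones used to build the corpus.
ZEMAN_PATTERN = re.compile(r"\bzeman\w*", re.IGNORECASE)
PRESIDENT_PATTERN = re.compile(r"\bprezident\w*", re.IGNORECASE)


def detect_entities(comment):
    """Return which of the two entities is mentioned in the comment."""
    return {
        "milos_zeman": bool(ZEMAN_PATTERN.search(comment)),
        "president": bool(PRESIDENT_PATTERN.search(comment)),
    }


for text in (
    "Pan prezident Zeman to řekl jasně.",
    "Zemanovi nevěřím.",
    "S prezidentem nesouhlasím.",
    "Kouření v restauracích.",
):
    print(text, detect_entities(text))
```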
The extended corpus annotated with sentiment labels and marked for the presence of entities “Miloš Zeman” and “president” is available for research purposes at http://nlp.kiv.zcu.cz/research/sentiment#stance.
5 The Approach Overview
For all experiments we use the Maximum Entropy classifier from the Brainy machine learning library [5]. We evaluate using 20-fold cross-validation to allow comparison with previous work [2].
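As an illustration of the evaluation protocol, the sketch below generates the train/test index splits for k-fold cross-validation. It is independent of the Brainy library, and the shuffling with a fixed seed is our assumption; the text does not state whether the folds were shuffled or stratified.

```python
import random


def k_fold_indices(n_items, k=20, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every item appears in exactly one test fold; fold sizes differ by at most one.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random, non-stratified folds
    folds = [indices[i::k] for i in range(k)]
    for held_out, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test


# 2638 is the number of comments in the dataset (see Table 3).
folds = list(k_fold_indices(2638, k=20))
print(len(folds), sorted({len(test) for _, test in folds}))
```

With 2638 comments and 20 folds, each test fold contains 131 or 132 comments and the classifier is trained on the remaining ones; the reported scores would then be aggregated over the 20 runs.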
5.1 Preprocessing
We use UDPipe [14] with Czech Universal Dependencies 1.2 models for tokenization, POS tagging, and lemmatization. We further lower-case the text, remove diacritics, and replace every character “y” with the character “i”.
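The three normalisation steps that follow the UDPipe processing can be sketched with the Python standard library alone. The function name is ours, and the order of the steps is an assumption: diacritics are removed before the “y” replacement so that “ý” is also mapped to “i”.

```python
import unicodedata


def normalize(token):
    """Lower-case, strip diacritics, and replace 'y' with 'i'."""
    lowered = token.lower()
    # NFD splits accented letters into a base letter plus combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.replace("y", "i")


for word in ("Miloš", "Výborný", "být", "bít"):
    print(word, "->", normalize(word))
```

After normalisation, spelling variants that differ only in diacritics or in the choice between “i” and “y” (such as “být” and “bít”) collapse into the same form, which reduces feature sparsity at the cost of merging some distinct words.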
5.2 Features
This section describes features used in our experiments.
— Character n-grams (ChNn): Separate binary feature for each character n-gram in the utterance text. We do it separately for different orders n ∈ {5, 7} and remove n-grams with frequency f ≤ 2.
— First Words (FW): Bag of first five words with at least 2 occurrences.
— Last Words (LW): Bag of last five words with at least 2 occurrences.
— Emoticons (E): We used a list of negative emoticons specific to the news commentaries source. The feature captures the presence of an emoticon within the text.
— Unigram Shape (Sh): The occurrence of word shape unigram in the text. Word shape assigns words into one of 24 classes similar to the function specified in [1]. We consider unigrams with frequency f > 2.
— Target (TP): One-hot vector for gold labels of the other task (e.g. sentiment label for stance detection) combined with the presence of the “president” entity (the resulting vector has length 6).
— Target (TZ): One-hot vector for gold labels of the other task (e.g. sentiment label for stance detection) combined with the presence of the “Miloš Zeman” entity (the resulting vector has length 6).
— Text Length (TL): We map the text length into a one-hot vector with length three and use this vector as binary features for the classifier. The text length belongs to one of three equal-frequency bins. Each bin corresponds to a position in the vector.
— Oracle (O): One-hot vector for gold labels of the other task (e.g. sentiment label for stance detection).
— Word n-grams (WNn): Separate binary feature for each word n-gram in the utterance text. We do it separately for different orders n ∈ {1, 2, 3} and remove n-grams with frequency f ≤ 2.
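Several of the features listed above can be sketched in a few lines. The code below is illustrative rather than the implementation used in the experiments, and it rests on three assumptions: the frequency f is counted once per document, the TP and TZ vectors are one-hot over the six (label, entity present) combinations, and the text-length bins are derived from the sorted training lengths. All function names are ours.

```python
from collections import Counter

SENTIMENT_LABELS = ("positive", "negative", "neutral")


def char_ngrams(text, n):
    """Set of character n-grams of order n (binary presence, so duplicates collapse)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def word_ngrams(tokens, n):
    """Set of word n-grams of order n, joined with a space."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def frequent_features(feature_sets, min_count=3):
    """Keep features seen in at least `min_count` documents, i.e. drop those with f <= 2.

    Assumption: f is a document frequency rather than a total occurrence count.
    """
    counts = Counter()
    for features in feature_sets:
        counts.update(features)
    return {feature for feature, count in counts.items() if count >= min_count}


def label_entity_vector(label, entity_present, labels=SENTIMENT_LABELS):
    """One-hot vector over (label, entity present?) pairs: 3 labels x 2 = length 6."""
    vector = [0] * (2 * len(labels))
    vector[2 * labels.index(label) + int(entity_present)] = 1
    return vector


def length_bin_vector(length, training_lengths, bins=3):
    """One-hot vector of the equal-frequency bin that `length` falls into."""
    ordered = sorted(training_lengths)
    cuts = [ordered[len(ordered) * i // bins] for i in range(1, bins)]
    index = sum(length >= cut for cut in cuts)
    return [int(i == index) for i in range(bins)]


documents = ["zeman", "zemanovi", "zemana", "praha"]
vocabulary = frequent_features(char_ngrams(d, 5) for d in documents)
print(vocabulary)                              # 5-grams shared by at least three documents
print(label_entity_vector("negative", True))   # TP/TZ-style vector of length 6
print(length_bin_vector(4, range(1, 10)))      # middle of three equal-frequency bins
```

Each retained n-gram then becomes one binary feature of the classifier, and the fixed-length vectors are appended to the same feature representation.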
6 Experiments
For all experiments we report the macro-averaged F1-score of two classes, F1ma2 (In favor and Against), which is the official metric of the SemEval-2016 stance detection task [8], together with accuracy and the macro-averaged F1-score of all three classes (F1ma3).
Table 6 shows the results of all our experiments. We performed experiments using the gold sentiment labels as features for stance detection and the gold stance labels as features for sentiment analysis (i.e. using the Oracle feature). The results show that adding the Oracle feature to the word and character n-gram feature sets improves the results in all cases. The Oracle feature combined with unigrams and character n-grams also outperforms the previous state-of-the-art results for stance detection by 3.0% F1ma3, 2.6% F1ma2, and 2.2% Acc (absolute differences).
Features | Stance F1ma3 | Stance F1ma2 | Stance Acc | Sentiment F1ma3 | Sentiment F1ma2 | Sentiment Acc |
---|---|---|---|---|---|---|
Random Class | 32.1 | 33.4 | 32.9 | 29.6 | 23.1 | 33.2 |
Majority Class | 21.6 | 32.4 | 47.9 | 25.1 | 00.0 | 60.6 |
Best results from Hercig et al. [2] | 51.3 | 56.4 | 54.9 | - | - | - |
O | 34.0 | 51.1 | 52.5 | 36.7 | 21.9 | 56.2 |
WN1 | 48.1 | 52.0 | 50.6 | 55.1 | 47.5 | 60.9 |
WN1 + O | 51.7 | 56.2 | 54.3 | 59.1 | 52.4 | 64.3 |
WN1 + TP | 50.7 | 55.1 | 53.4 | 58.7 | 51.9 | 64.2 |
WN1 + TZ | 51.5 | 55.8 | 54.1 | 58.9 | 52.2 | 64.0 |
WN1 + TP + TZ | 51.5 | 55.9 | 54.2 | 59.1 | 52.3 | 64.4 |
WN1 + ChN5,7 | 50.3 | 55.2 | 53.9 | 56.4 | 47.1 | 65.1 |
WN1 + ChN5,7 + O | 54.3 | 59.0 | 57.1 | 58.8 | 50.2 | 67.4 |
WN1 + WN2,3 | 50.8 | 55.8 | 53.9 | 57.6 | 49.8 | 64.1 |
WN1 + WN2,3 + O | 53.7 | 58.5 | 56.6 | 59.9 | 52.8 | 65.7 |
Feature set* | 54.2 | 58.8 | 57.3 | 60.1 | 51.8 | 68.3 |
Feature set - ChN5,7 | 54.3 | 58.4 | 57.6 | 61.3 | 54.4 | 67.2 |
Feature set - E | 54.4 | 58.9 | 57.4 | 59.7 | 51.3 | 68.2 |
Feature set - FW | 54.8 | 59.2 | 57.8 | 60.4 | 52.3 | 68.3 |
Feature set - LW | 54.5 | 58.9 | 57.5 | 58.7 | 49.8 | 67.8 |
Feature set - TL | 54.2 | 59.1 | 57.4 | 59.7 | 51.3 | 68.0 |
Feature set - Sh | 54.2 | 58.8 | 57.3 | 59.0 | 50.5 | 67.4 |
Feature set - WN1,2,3 | 54.5 | 58.5 | 57.4 | 58.2 | 49.4 | 67.1 |
Feature set - O | 54.0 | 58.7 | 57.2 | 60.3 | 52.0 | 68.4 |
Feature set - TP | 54.3 | 58.9 | 57.5 | 60.0 | 51.8 | 68.2 |
Feature set - TZ | 54.2 | 58.8 | 57.4 | 60.0 | 51.7 | 68.0 |
Best combination† Stance | 56.2 | 60.3 | 59.1 | 59.4 | 51.0 | 67.7 |
Best combination‡ Sentiment | 54.8 | 58.9 | 57.7 | 62.0 | 54.6 | 68.9 |
* ChN5,7 + E + FW + LW + TL + Sh + WN1,2,3 + O + TP + TZ
† ChN7 + E + Sh + WN1 + O + TP + TZ
‡ ChN5 + E + LW + TL + Sh + WN1,2,3 + O + TP + TZ
Another experiment used features that indicate the presence of the “Miloš Zeman” entity and the “president” entity combined with the gold labels, as in the Oracle feature. Our expectation was that this should improve the results (as it did in English); however, the results show that the information about the presence of the target entity does not lead to better results.
We further performed an ablation study for the combination of features (ChN5,7 + E + FW + LW + TL + Sh + WN1,2,3 + O + TP + TZ). In Table 6 the bold numbers denote the best results for the given column.
The ablation study shows that the FW feature provides little to no information gain for the classifier. We further experimented with combinations of features, which led to the best feature sets for both stance detection and sentiment analysis (see the last two rows of Table 6). Both of these sets contain emoticons, word shape, oracle, and target entity features.
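The ablation rows of Table 6 are easier to read as differences from the full feature set. The sketch below computes, for each removed feature, the change in F1ma3 for stance and for sentiment; the scores are copied from Table 6 and the names are ours. A positive difference means that the classifier scored higher without the feature.

```python
# (stance F1ma3, sentiment F1ma3) copied from Table 6.
FULL_SET = (54.2, 60.1)
WITHOUT = {
    "ChN5,7": (54.3, 61.3),
    "E": (54.4, 59.7),
    "FW": (54.8, 60.4),
    "LW": (54.5, 58.7),
    "TL": (54.2, 59.7),
    "Sh": (54.2, 59.0),
    "WN1,2,3": (54.5, 58.2),
    "O": (54.0, 60.3),
    "TP": (54.3, 60.0),
    "TZ": (54.2, 60.0),
}


def ablation_deltas(full, without):
    """Score without the feature minus score of the full set, rounded to one decimal."""
    return {
        feature: tuple(round(ablated - base, 1) for ablated, base in zip(scores, full))
        for feature, scores in without.items()
    }


deltas = ablation_deltas(FULL_SET, WITHOUT)
for feature, (stance, sentiment) in sorted(deltas.items(), key=lambda item: -item[1][0]):
    print(f"{feature:8s} stance {stance:+.1f}  sentiment {sentiment:+.1f}")
```

Removing FW gives the largest gain for stance (+0.6) and also a small gain for sentiment (+0.3), which is consistent with the observation about FW above, whereas removing the word n-grams costs the most for sentiment (-1.9).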
7 Conclusion
We presented the first Czech dataset annotated with both stance and sentiment labels, including the presence of target entities. We have shown that stance and sentiment can be mutually beneficial and confirmed our initial hypothesis. Moreover, we have outperformed the state-of-the-art results for stance detection in Czech and set new state-of-the-art results for the sentiment analysis part of the dataset.
Our best result outperformed the previous stance detection state of the art by 4.9% F1ma3, 3.9% F1ma2, and 4.2% Acc. The sentiment analysis unigram baseline was outperformed by 6.9% F1ma3, 7.1% F1ma2, and 8.0% Acc.
In the future we plan to extend this analysis to other target entities and to explore the usefulness of labels assigned by trained models instead of gold labels for the Oracle feature.