1 Introduction
Recognition of textual entailment (RTE) [6] is an important task in Natural Language Processing (NLP). RTE plays an important role in question-answering systems, where the entailment status of a provided answer is verified against the expected answer; in multi-document summarization, where entailed sentences should be removed; and in information retrieval systems, where only documents entailed by the query are desirable. However, the classical definition of textual entailment (TE) places two important constraints on the relation between a text (T) and a hypothesis (H). First, the relation is unidirectional: H must be entailed by T; otherwise the pair is non-entailed.
Still, in many NLP applications such as question answering or text summarization, both directions are equally important for removing redundant information from the final text. Second, the definition of TE is strict with respect to expression boundaries: the complete meaning of H must be inferred from the complete meaning of T. This implies that if H entails from T but carries some additional information or lacks some information, the relation is a false TE. Since RTE is a binary decision problem, for a false TE relation it is quite difficult to decide whether the relation is completely negative or mostly positive. For example, in the following T-H pair, H entails from T but lacks some information. So, as per the TE definition, the cited pair is non-entailed.
Eng. Gloss: Samond resigned after defeated in Scotland's Independence election.
Eng. Gloss: First minister Samond resigned.
A possible way around these constraints is the partial textual entailment (PTE) relationship, which extends TE with partial relations. The work [14] first introduced the concept of inferring partial meaning in texts. Later, the work [11] introduced the idea of PTE, explicitly defining and explaining it under the faceted entailment model. In this paper, our contributions are as follows:
- We extend PTE by introducing two categories that define partially entailed relations alongside the classical TE categories.
- We investigate the PTE relation for Indian social media text (SMT), specifically Bengali tweets, and propose a PTE recognition (RPTE) system based on machine learning. Our motivation for choosing SMT is the growing interest of the research community and industry in SMT such as Twitter over the last few years. Processing social media text is more challenging than processing carefully authored traditional text such as news, because it is short, noisy, context dependent and dynamic. We justify the performance of the proposed approach through experimental results.
- We developed a corpus of Bengali tweet pairs, manually annotated with PTE classes. The corpus contains 5916 tweet pairs in total.
The rest of the paper is organized as follows. Section 2 discusses related research in the TE and PTE domains, and Section 3 formalizes the partial entailment categories with appropriate examples. Section 4 reports corpus preparation and Section 5 describes the proposed approach. Section 6 explains the experiments and results, followed by a result analysis in Section 7. Finally, Section 8 draws conclusions and outlines the future scope of the work.
2 Related Work
Our research is closely related to textual entailment, so in this section we discuss related and promising research on PTE as well as TE.
The work [14] first introduced the concept of inferring partial meaning in texts. In that work, the reference answer is broken down into ordered pairs of words (w1, w2) contained in the hypothesis, each accompanied by a semantic relation. Such an ordered pair of words is called a facet and refers to some part of a text's meaning. Then, based on the facet overlap between the student answer and the reference answer, every pair is tagged with one of 8 partial-meaning labels instead of the classical yes/no TE decision. The annotation labels are: Assumed (facets that are assumed), Expressed (facets that are directly expressed), Inferred (facets that are inferred), Contra-Expr (facets directly contradicted by negation), Contra-Infr (facets contradicted by pragmatics), Self-Contra (facets that are both contradicted and implied), Diff-Arg (facets where the core relation is expressed) and Un-addressed (facets not addressed at all).
Partial meaning inference in text is extended into the PTE notion in [11]. The hypothesis is broken down into components, and the system recognizes whether each individual component is entailed by the text or not. The decision of partial entailment is represented by three labels: Exact Match, Lexical Inference, and Syntactic Inference. The system assigns Exact Match based on the overlap of lemmas in the pair. Lexical Inference is based on the semantic relatedness of the words in the facet and the text, measured using WordNet-based Resnik similarity [15]. The Syntactic Inference label depends on the dependency tree of a given facet, obtained using the Lowest Common Ancestor (LCA) of the facet nodes. The works [9] and [16] explored PTE for traditional news text and social media text (SMT). They formalize 4 classes of partial textual entailment and propose a PTE recognition system based on the cosine similarity measure. These works conclude through experiments that PTE exploration is a complex task, and that it becomes more complex for SMT than for traditional text due to the nature of the text.
Vita et al. proposed a PTE recognition system [24] based on Word2vec representations under the faceted entailment model. The system defines two classes of partial textual relationship, expressed and un-addressed, according to the semantic relationship between the facet and the text.
RTE is a well-defined problem in the NLP domain. Since the inception of the automatic RTE task [6], several RTE challenges (RTE-1 to RTE-7) have been conducted on different RTE datasets, and many state-of-the-art RTE systems have been proposed. These systems range from lexical similarity methods to complex and extensive linguistic analysis methods, and the work includes the creation of entailment datasets for heterogeneous domains as well as highly competitive entailment systems.
Lexical-syntactic similarity based approaches use features of T and H such as n-grams, word similarity, synonyms and antonyms [10], and word overlap [1] to determine whether entailment holds. In [25], relatedness between T and H is measured for the RTE task using dependency graphs and lexical chains. The system proposed in [22] forms a sequence of transformations of T towards H, so that the resulting T is identical to H, and then decides the entailment relation. Various machine learning approaches have also been introduced for the RTE problem. These approaches use features such as WordNet and numeric expressions, and syntactic and semantic features [2], and some use different datasets such as a clinical text corpus [18]. Bowman et al. [4] proposed a feature-rich classifier model and a neural network model for TE recognition and evaluated both on the Stanford Natural Language Inference (SNLI) corpus. A very recent work [23] proposed a neural network based approach using word embeddings to recognize TE for tweets. That work evaluated the approach by comparing its performance with earlier promising machine learning based approaches; the authors report results for a batch size of 50 over 100 epochs, which are comparable. Another approach [21] recognizes TE using a navigation algorithm based on distributional semantics. It builds a knowledge graph composed of dictionary definitions of terms extracted from a linguistic resource, following the work of Silva et al. [20]. The navigation algorithm finds paths between the text and the hypothesis in the knowledge graph to decide entailment.
Our current research is related to the works proposed in [9, 16]. There are two substantial differences between these works and ours. Firstly, the earlier works rely only on the cosine similarity score between T and H. It is worth noting that the cosine similarity measure does not capture the semantic meaning of texts. So, we add further features, such as semantic similarity and word-to-vector similarity, for a more robust PTE recognition system. Secondly, we adopt a machine learning based approach rather than the statistical approach followed in the earlier works.
3 Empirical Definition of Partial Textual Entailment
In this section we define the 4 categories of the proposed PTE (PTE-I, PTE-II, PTE-III and PTE-IV). Each category is explained with appropriate examples. During categorization, we break down each T and H into two sections, $X_T$, $Y_T$ and $X_H$, $Y_H$ respectively, to distinguish the partially entailed portion from the additional information. We also use parentheses to mark the boundary of the entailed portion in T and H.
- PTE-I: If the complete meaning of H is inferred from the complete meaning of T, or the complete meaning of T is inferred from H, then the pair belongs to PTE-I. This category preserves the original entailment definition, and the relationship is bi-directional. The relation can be expressed as:
H Entails from T
T Entails from H
For example,
Eng. Gloss: 16th correction bill of constitution has passed, Parliament holds judge removal power.
Eng. Gloss: Constitution correction bill passed to remove judge.
- PTE-II: This category of PTE covers two conditions:
Condition I: If H entails from the whole meaning of T and has additional information, then the pair belongs to PTE-II, represented as,
($X_H$ Entails from T) + $Y_H$
For example,
Eng. Gloss: Kashmir faced the most devastating flood in last 60 years.
Eng. Gloss: Rescue operation by Helicopter and flights started today to face (the most devastating flood condition in Kashmir, India).
Condition II: If T entails from the whole meaning of H and has additional information, then the pair also belongs to PTE-II, represented as,
($X_T$ Entails from H) + $Y_T$
For example,
Eng. Gloss: (Cross fire under ceasefire) in Mariup city of east Ukraine.
Eng. Gloss: Cross fire under ceasefire in Ukraine.
- PTE-III: If a portion of H entails from a portion of T or vice versa, then the pair belongs to PTE-III, represented as,
($X_H$ Entails from T) + $Y_H$
($X_T$ Entails from H) + $Y_T$
For example,
Eng. Gloss: (Russia supported separatist and government army are in open crossfire under Ukraine ceasefire), one citizen death reported.
Eng. Gloss: (Open crossfire is going on even though ceasefire in Ukraine), industrial city at eastern Ukraine is under threat.
- PTE-IV: If neither T entails from H nor H entails from T, then the pair belongs to PTE-IV and is represented as Non-entailed.
For example,
Eng. Gloss: Honorable Supreme court requested all the lawyers of the country to remain alert regarding the opposition of constitution's 16th correction.
Eng. Gloss: The Law minister advocate told that no views from the educationists and prominent figures regarding constitution's 16th correction will be accepted.
4 Corpus Acquisition
We have prepared a corpus of tweets for Indian social media text, specifically Bengali tweets. We collected Bengali tweets on 10 topics based on recent popular Twitter keywords. The chosen topics cover various information domains such as international and national politics, sports, natural disasters, political campaigns and elections. In particular, over a period of one month, from 1st June to 30th June 2017, we collected 31,000 tweets. The number of tweets varies from 1500 to 9000 per topic.
We filtered out retweets and very short, incomplete tweets from the corpus to avoid redundant information and meaningless tweets respectively. Every pre-processed tweet is paired with all the other tweets within its topic. To keep the number of tweet pairs manageable and meaningful for annotation, we applied an automatic filtration process before the annotation task. In the following subsection, we discuss the filtration and annotation process in some detail.
4.1 Annotation and Corpus Statistics
The filtration process concentrates on selecting those pairs that are more likely to belong to PTE-I, PTE-II or PTE-III. To this end, we calculated the cosine similarity of each pair and discarded the pairs with a similarity score of 0.30 or less (on a scale of 0 to 1). We chose this threshold because manual inspection of the pair scores revealed that pairs scoring 0.30 or less are mostly non-entailed. After this initial filtration we obtained a corpus of 7000 tweet pairs for manual annotation. We employed two human annotators, both native Bengali speakers, to annotate every tweet pair with one of the 4 PTE classes. To assess the agreement between the annotators, we measured the Cohen's Kappa coefficient [5] over all annotated pairs. The detailed categorical distribution of the corpus, with annotation agreement, is reported in Table 1. After the complete annotation task, our corpus contains 5916 annotated tweet pairs for the experiments.
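The agreement measure mentioned above can be illustrated with a small, self-contained sketch. It assumes two annotators who each assign exactly one label per pair; the function name `cohen_kappa` and the six example annotations are hypothetical and are not taken from the corpus.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations for six tweet pairs (illustration only).
annotator_1 = ["PTE-I", "PTE-II", "PTE-III", "PTE-IV", "PTE-IV", "PTE-II"]
annotator_2 = ["PTE-I", "PTE-II", "PTE-II", "PTE-IV", "PTE-IV", "PTE-II"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters here because the non-entailed class can dominate the label distribution.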
5 System Description
We formulate the RPTE problem as a multi-class classification problem in which each instance is a T-H pair. We adopt a machine learning based approach to build the proposed RPTE system. The system identifies the class of textual entailment based on various similarity measures of the T-H pair at the lexical and shallow syntactic levels.
A pre-processing phase precedes the computation of the similarity scores. The first step of pre-processing divides the input text into units called tokens. Each token is a word, a number, a punctuation mark or a similar unit. The Bengali POS tagger developed by [7] is used for tokenizing the tweets. Because punctuation can be handled differently across processing steps, we strip all punctuation marks from the whole corpus.
5.1 Features
The features used in our classifiers are different kinds of similarity scores for each T-H pair. These features capture the relatedness between T and H. We give a brief description of each feature below.
5.1.1 Cosine Similarity:
Cosine similarity is widely used in information retrieval to calculate the similarity between documents or sentences. The cosine similarity score represents the relatedness between two n-dimensional vectors. We use binary vectors, with values of either 0 or 1, in our computation. Given two attribute vectors A and B, the cosine similarity is calculated using equation 1:
$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \quad (1)$$
Before calculating the cosine similarity of each pair, we removed stop words and transformed each word into its root form. We follow the stop/junk words of the proposed list 1. We developed a Bengali stemmer to transform each Bengali word into its root form.
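As a minimal sketch of this feature, the snippet below computes cosine similarity over binary term vectors. For 0/1 vectors the dot product reduces to the number of shared terms, so sets are enough. The stop-word list, the identity stemmer and the English gloss used as input are placeholders chosen for illustration; they are not the Bengali resources used in this work.

```python
import math

def binary_cosine(tokens_t, tokens_h):
    """Cosine similarity of two token lists represented as binary vectors."""
    set_t, set_h = set(tokens_t), set(tokens_h)
    if not set_t or not set_h:
        return 0.0
    # Dot product of 0/1 vectors = number of shared terms;
    # each norm = square root of the number of distinct terms.
    return len(set_t & set_h) / math.sqrt(len(set_t) * len(set_h))

def preprocess(text, stop_words, stem=lambda w: w):
    """Lower-case, drop stop words, and map each word to its root form."""
    return [stem(w) for w in text.lower().split() if w not in stop_words]

STOP_WORDS = {"in", "the", "of"}  # placeholder list, not the paper's list
t = preprocess("Cross fire under ceasefire in Mariup city of east Ukraine", STOP_WORDS)
h = preprocess("Cross fire under ceasefire in Ukraine", STOP_WORDS)
score = binary_cosine(t, h)
keep_pair = score > 0.30  # filtration threshold described in Section 4.1
print(round(score, 3), keep_pair)
```

The same score serves two purposes in the pipeline described here: as the corpus filtration criterion and as a classifier feature.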
5.1.2 Semantic Similarity:
The semantic similarity score between T and H is one of the most frequently used features for the RTE task [13]. We model the semantic similarity of two texts (T, H) as a function of the semantic similarity of the constituent words in both texts. To achieve this, our system computes word-to-word similarity using the approach proposed in [17]. Word similarity is computed based on the Bengali WordNet proposed in [8]. Bengali WordNet is a lexical semantic network in which synonyms and word senses are the nodes, and the relations between them are the edges. The word-level similarity scores are accumulated into a sentence-level similarity using equation 2:
where $sim_{\omega_i}(A, B)$ represents the summation of the maximum similarity for each word and $Semantic_{sim}(A, B)$ is a function that returns the semantic similarity score between two tweets. Similarity cannot be computed for all words in the text and hypothesis, either because some words, such as proper nouns, do not appear in WordNet, or because some of these metrics can only be calculated if an information content value has already been calculated for the word sense.
An important point is that the similarity value is based on each of the individual word similarity values, so the overall similarity always reflects the influence of each word and its senses. According to the formulation of the semantic similarity score, similarity values range from 0 to 1.
5.1.3 Word-to-Vector Similarity:
Vector representations of words lend themselves well to expressing similarity between words. We used the Word2Vec toolkit 2 to determine word-to-vector similarity for each T-H tweet pair. Word2Vec is an efficient implementation of deep learning techniques based on two architectures, continuous bag-of-words (CBOW) and skip-gram (SG) [12]. To measure the word2vec similarity of a word, we used the word embedding model for Bengali text proposed in [3]. The trained model has a vocabulary size of 4,36,126 with 300-dimensional vectors and a window size of 5.
Every word vector in T is compared with all the word vectors of H to find the most similar word. The similarity score lies in a range of 0 to 1, denoting minimum and maximum similarity respectively.
The word-level similarity scores are accumulated into a tweet-level similarity measure using equation 3:
where $maxsim(\omega_i)$ is the maximum similarity score of a word in T and N is the total number of unique words in the T-H pair.
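The accumulation step can be sketched as follows, under two stated assumptions: the tweet-level score is the sum of each T word's best cosine match in H divided by N, and negative cosines are clamped to 0 so that the score stays in the 0 to 1 range described above. The two-dimensional toy vectors and the function names are hypothetical; they stand in for the 300-dimensional Bengali embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors, clamped to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, dot / norm) if norm else 0.0

def w2v_similarity(tokens_t, tokens_h, embeddings):
    """Sum of each T word's best match in H, divided by unique words in the pair."""
    # Keep first occurrences only and skip out-of-vocabulary words.
    words_t = [w for w in dict.fromkeys(tokens_t) if w in embeddings]
    words_h = [w for w in dict.fromkeys(tokens_h) if w in embeddings]
    n_unique = len(set(tokens_t) | set(tokens_h))
    if not words_t or not words_h:
        return 0.0
    total = sum(
        max(cosine(embeddings[wt], embeddings[wh]) for wh in words_h)
        for wt in words_t
    )
    return total / n_unique

# Toy 2-d embeddings (illustration only).
toy_embeddings = {"kashmir": (0.0, 1.0), "flood": (1.0, 0.0), "rescue": (3.0, 4.0)}
w2v_score = w2v_similarity(["kashmir", "flood"], ["kashmir", "rescue"], toy_embeddings)
print(round(w2v_score, 4))
```

In this toy pair, "kashmir" matches itself with similarity 1.0 and "flood" finds its best match in "rescue" with 0.6, giving 1.6 over 3 unique words.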
5.1.4 Uni-gram Similarity:
Each uni-gram in T is searched for a match in the corresponding H. The measure $U_{sim}$ is calculated as the fraction of the T uni-grams that match in the corresponding H:
$$U_{sim} = \frac{U_m}{n} \quad (4)$$
where $U_m$ is the maximum matching score and n is the number of unique words in both T and H.
5.1.5 Bi-gram Similarity:
Each bi-gram in T is searched for a match in the corresponding H. The measure $B_{sim}$ is calculated as the fraction of the T bi-grams that match in the corresponding H:
$$B_{sim} = \frac{B_m}{n} \quad (5)$$
where $B_m$ is the maximum matching score and n is the number of unique words in both the tweets.
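A single helper can illustrate both the uni-gram and the bi-gram features. The sketch assumes that the matching score is the number of distinct T n-grams that also occur in H, and that the normalizer is the number of unique words across both tweets; `ngram_similarity` is a hypothetical name, and the English gloss stands in for Bengali tokens.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_similarity(tokens_t, tokens_h, n):
    """Distinct T n-grams found in H, divided by unique words in the pair."""
    unique_words = len(set(tokens_t) | set(tokens_h))
    if unique_words == 0:
        return 0.0
    matches = len(set(ngrams(tokens_t, n)) & set(ngrams(tokens_h, n)))
    return matches / unique_words

t = "cross fire under ceasefire in mariup city of east ukraine".split()
h = "cross fire under ceasefire in ukraine".split()
u_sim = ngram_similarity(t, h, 1)  # uni-gram feature
b_sim = ngram_similarity(t, h, 2)  # bi-gram feature
print(u_sim, b_sim)
```

Here six uni-grams and four bi-grams of T reappear in H, over ten unique words in the pair.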
5.1.6 Longest Common Subsequence (LCS):
The LCS of a T-H pair is the longest sequence of words that is common to both T and H:
where $lcs_m$ is the length of the longest common subsequence and n is the number of unique words in the T-H pair.
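A word-level LCS can be computed with the standard dynamic programming recurrence. The sketch below assumes the feature value is the LCS length divided by the number of unique words in the pair; the function names and the example tokens are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)   # extend the match diagonally
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity(tokens_t, tokens_h):
    """LCS length normalized by the number of unique words in the pair."""
    unique_words = len(set(tokens_t) | set(tokens_h))
    return lcs_length(tokens_t, tokens_h) / unique_words if unique_words else 0.0

t = "open crossfire under ukraine ceasefire".split()
h = "open crossfire even though ceasefire in ukraine".split()
lcs_score = lcs_similarity(t, h)
print(lcs_length(t, h), lcs_score)
```

Unlike the n-gram features, the LCS tolerates gaps but respects word order, which is why "ukraine" and "ceasefire" cannot both be counted in this pair.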
5.1.7 Skip-grams Similarity:
A skip-gram is any combination of n words, taken in their order of occurrence in the sentence, that allows arbitrary gaps. In the present work only 1-skip-bigrams are considered, that is, bigrams with a one-word gap between the two words, in sentence order. The 1-skip-bigram similarity measure $SG_{sim}$ is defined as in equation 7:
$$SG_{sim} = \frac{sg_m}{n} \quad (7)$$
where $sg_m$ is the maximum matching score and n is the number of unique words in both the tweets.
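The 1-skip-bigram feature can be sketched in the same style. This version assumes that a 1-skip-bigram pairs each word with the word exactly two positions later (some definitions also include adjacent pairs), and that matches are normalized by the number of unique words in the pair; the names used are hypothetical.

```python
def one_skip_bigrams(tokens):
    """Ordered word pairs separated by exactly one intervening word."""
    return [(tokens[i], tokens[i + 2]) for i in range(len(tokens) - 2)]

def skip_bigram_similarity(tokens_t, tokens_h):
    """Distinct shared 1-skip-bigrams, divided by unique words in the pair."""
    unique_words = len(set(tokens_t) | set(tokens_h))
    if unique_words == 0:
        return 0.0
    shared = set(one_skip_bigrams(tokens_t)) & set(one_skip_bigrams(tokens_h))
    return len(shared) / unique_words

t = "cross fire under ceasefire in mariup city of east ukraine".split()
h = "cross fire under ceasefire in ukraine".split()
sg_sim = skip_bigram_similarity(t, h)
print(sg_sim)
```

The shared pairs in this example are (cross, under), (fire, ceasefire) and (under, in).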
6 Experiment and Result
In this section we describe the experimental details along with a comparison to the existing RPTE system.
We experimented with four machine learning algorithms: Random Forest (RF), Decision Tree (DT), Logistic Regression (LR) and Sequential Minimal Optimization (SMO). These algorithms were selected because previous research reports that they achieve high performance on multi-class classification problems. We used the WEKA3 machine learning platform [26] to perform our experiments.
We evaluated the system using ten-fold cross validation, testing each classifier on the four-way classification task. Table 2 below shows the feature-wise performance of the classifiers. We first assess performance based on the cosine similarity (CS) feature alone; then on CS plus the lexical similarity features, which comprise uni-gram, bi-gram, LCS and skip-gram similarity; next on CS, lexical and semantic similarity (SS); and finally on all the features, including word-to-vector similarity (W2V). Most classifiers perform best with all features enabled, although Table 2 shows a few exceptions (for example, DT and SMO on PTE-III score slightly higher without W2V). The best performance is achieved by the SMO classifier, with F-measures of 78.1, 64.6, 75.5 and 98.3 for PTE-I, PTE-II, PTE-III and PTE-IV respectively. Using all features, the difference between the best and the worst classifier is 11.5, 9.2, 7.7 and 3.7 for PTE-I, PTE-II, PTE-III and PTE-IV respectively.
| PTE type | Features | RF | DT | LR | SMO |
|---|---|---|---|---|---|
| I | CS | 72.2 | 63.8 | 69.1 | 76.4 |
| | + Lexical | 72.0 | 64.5 | 70.9 | 77.1 |
| | + SS | 72.8 | 65.5 | 70.4 | 77.6 |
| | + W2V | 73.9 | 66.6 | 72.5 | 78.1 |
| II | CS | 60.0 | 46.4 | 57.5 | 61.4 |
| | + Lexical | 60.1 | 50.7 | 56.9 | 62.9 |
| | + SS | 60.9 | 53.7 | 57.3 | 63.5 |
| | + W2V | 63.4 | 55.4 | 58.6 | 64.6 |
| III | CS | 72.9 | 65.4 | 66.6 | 74.8 |
| | + Lexical | 72.5 | 66.1 | 68.3 | 75.4 |
| | + SS | 74.6 | 68.8 | 68.2 | 75.7 |
| | + W2V | 75.3 | 67.8 | 69.5 | 75.5 |
| IV | CS | 93.3 | 94.6 | 94.6 | 97.6 |
| | + Lexical | 94.0 | 94.7 | 93.8 | 97.6 |
| | + SS | 95.0 | 94.6 | 94.9 | 98.3 |
| | + W2V | 95.2 | 94.6 | 94.9 | 98.3 |
We also compared our proposed approach with the approach proposed in [16] (the baseline system) on our dataset. The performance in terms of precision (pre.), recall (rec.) and F-measure is shown in Table 3.
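For reference, per-class precision, recall and F-measure can be derived from gold and predicted labels as in the sketch below. It uses the usual one-vs-rest counts and the balanced F-measure; the function name and the six example labels are hypothetical and do not reproduce any reported result.

```python
def per_class_prf(gold, predicted):
    """Precision, recall and F-measure for each class, one-vs-rest."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores

# Hypothetical gold and predicted classes for six tweet pairs.
gold = ["I", "II", "II", "III", "IV", "IV"]
pred = ["I", "II", "III", "III", "IV", "II"]
results = per_class_prf(gold, pred)
for label, (p, r, f) in results.items():
    print(f"PTE-{label}: pre={p:.2f} rec={r:.2f} F={f:.2f}")
```

Reporting the measures per class, rather than a single accuracy, keeps the large non-entailed class from hiding weaker performance on the partial entailment classes.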
7 Result Analysis
Although the results show only average accuracy, we believe they are very promising, because the PTE task is more challenging than classical TE recognition. More specifically, identifying PTE-II and PTE-III is more difficult than identifying PTE-I and PTE-IV. The system recognizes PTE-II and PTE-III with F-measures of 64.6 and 75.5 respectively.
This average performance may be because additional information in T or H proportionally decreases the similarity scores, which leads to incorrect class identification. For example:
Eng. Gloss: Scotland will be with United Kingdom.
Eng. Gloss: (United Kingdom is united), at the end of all prediction, finally reported that 55% voters voted against independence of Scotland.
In the above example, H contains much more additional information than entailed information. This large amount of additional information yields very poor similarity values, so the pair is recognized as non-entailed even though it is an example of PTE-II. Similar identification difficulties arise for PTE-III.
The accuracy of PTE-I and PTE-IV recognition is below the average accuracy of state-of-the-art RTE systems. The main reason may be the nature of social media text, which is very different from traditional text and poses various processing challenges. In many instances we found that the traditional cosine similarity score does not represent the actual closeness between expressions. The soft cosine measure [19] may represent relatedness more accurately, and exploring it is part of our future work. Our approach identifies PTE-IV best, with an F-measure of 98.3, higher than for the other PTE categories. This is expected, as the similarity scores of most non-entailed pairs remain within a specific lower range.
Table 2 shows that classifier performance generally improves as more features are added. A significant difference in performance is found when both syntactic and semantic features are used instead of syntactic features alone. We also found that the classifiers gain more accuracy by using all the features.
8 Conclusion and Future Work
Exploring various classes of partial entailment for Indian social media text is the core contribution of this work. We proposed an automatic RPTE approach for Bengali social media text, where information is more challenging to process due to the unique style of writing. We conducted experiments and showed that our proposed approach outperforms previously introduced approaches.
Our future work is to make the system more robust and applicable to code-mixed social media text. Unicode Bengali tweets are less noisy than code-mixed tweets, so RPTE for that text genre would be more difficult.