1 Introduction
Determining sentence similarity is a core task in natural language processing with a wide range of applications. In chatbots, for example, sentence similarity is used to estimate how closely the meaning of the user input matches a button text. Such applications therefore need a robust similarity algorithm that works across a variety of domains. The main reason for inferring meaning from raw text is that natural language understanding (NLU) aims to build systems that understand user utterances and trigger meaningful results based on the user input. See Figure 1 for an example.
Multitask learning schemes [13], along with supervised and unsupervised approaches, have led to improved results on NLP tasks.
A simple approach [6] uses Word Mover's Distance (WMD), which measures the dissimilarity between two sentences as the minimum distance that the embedded words of one sentence need to travel to reach the embedded words of the other. A more recent approach [14] to sentence-level semantic similarity is based on unsupervised learning from conversational data. This approach processes sentences in a high-dimensional space and does not give good results on short sentences, which makes it hard to learn sentence embeddings directly. The most recent sentence encoder models [4], the Transformer encoder and the Deep Averaging Network (DAN), trade off accuracy against computational resource requirements. Moreover, they require building deep neural networks (DNNs) or more sophisticated architectures and training the model on a large corpus.
Here, we propose methods based on cosine similarity combined with a sliding window and weighted N-grams. The proposed approach has a fairly simple architecture and outperforms the latest Universal Sentence Encoder technique [4].
2 Related Work
Sentence similarity has many interesting applications, such as conversational agents with script strategies [1] and the Internet. Recent work in natural language processing has contributed valuable solutions for calculating the semantic similarity between words and sentences. Although much research has been done on measuring long-text similarity, the computation of sentence similarity is far from perfect [7,5,8]. We propose to compute the similarity between a very short sentence (1-3 words) and a lengthy one. Bag-of-words cosine similarity does not take word order into account. For example, "Do I not look good?" and "I do not look good." have a cosine similarity score of 100%. For document similarity, a weighted N-gram scheme over cosine similarity is suggested in Section 3.2.2. We took the N-gram weighting formula from [3].
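The word-order limitation of bag-of-words cosine similarity can be checked with a short sketch. The tokenization rule (lowercasing and keeping only letters and apostrophes) and the function names are assumptions made for illustration, not part of the proposed method.

```python
import math
import re
from collections import Counter

def bow_vector(sentence):
    """Lowercase the sentence and count word occurrences (punctuation dropped)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def bow_cosine(a, b):
    """Cosine similarity between the bag-of-words count vectors of two sentences."""
    va, vb = bow_vector(a), bow_vector(b)
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Both sentences contain exactly the same words, only in a different order.
score = bow_cosine("Do I not look good?", "I do not look good.")
print(round(score, 6))
```

Because both sentences map to the same word counts, the score is 1 up to floating-point rounding, even though one is a question and the other a statement.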
Unsupervised word embeddings represent words as vectors that preserve semantic information [10]. The word-wise sum or average of these vectors also produces a vector with the potential to encode meaning. The mean was used as a baseline in [11]. The sum of word embeddings, first considered in [10] for short phrases, was found to be an effective model for summarization in [9].
The cosine distance, commonly used when comparing embeddings, is the same whether the sum or the mean of the word embeddings is used, because the two differ only by a positive scale factor. Both the sum and the mean are computationally inexpensive, given that pre-trained word embeddings are available. Deep learning solutions [12] handle sentence similarity for variable-length input, but they require a large amount of training data and are resource-heavy to train and maintain.
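A small numerical check illustrates why cosine similarity does not distinguish between the sum and the mean of word embeddings. The three-dimensional vectors below are made-up values standing in for real pre-trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def vec_sum(vectors):
    """Component-wise sum of a list of vectors."""
    return [sum(col) for col in zip(*vectors)]

def vec_mean(vectors):
    """Component-wise mean of a list of vectors."""
    return [s / len(vectors) for s in vec_sum(vectors)]

# Hypothetical word vectors for a three-word and a two-word sentence.
sentence_a = [[0.2, 0.7, 0.1], [0.9, 0.1, 0.3], [0.4, 0.4, 0.8]]
sentence_b = [[0.3, 0.6, 0.2], [0.8, 0.2, 0.5]]

sim_sum = cosine(vec_sum(sentence_a), vec_sum(sentence_b))
sim_mean = cosine(vec_mean(sentence_a), vec_mean(sentence_b))
print(abs(sim_sum - sim_mean) < 1e-12)
```

The mean is the sum divided by the sentence length, and scaling a vector by a positive constant does not change its direction, so both representations give the same cosine value.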
3 Model Architecture
The proposed methods use word embeddings for word representation and cosine similarity for calculating the similarity score.
3.1 Word Embedding and Cosine Stacks
Word Embeddings. Word embeddings computed using diverse methods are basic building blocks for Natural Language Processing (NLP) and Information Retrieval (IR). They capture the similarities between words [2]. Since our approach naturally depends on a word embedding, we chose FastText [3] over other embeddings for two reasons. First, it takes subword information into account: each word w is represented as a bag of character N-grams. This means that for previously unseen words (e.g. due to typos) the model can make an educated guess at the meaning, which allows it to learn reliable representations for rare words. It also allows the model to capture the meaning of suffixes and prefixes. Second, and most importantly, this subword-based approach provides very good word vectors even when trained on small datasets.
Cosine Similarity. The cosine similarity between two vectors (or two sentences in the vector space) is the cosine of the angle between them. This metric measures orientation, not magnitude. It can be seen as a comparison between sentences in a normalized space, because we do not take into consideration the magnitude of each word count (tf-idf) of each document, but only the angle between the sentences. To obtain the equation for cosine similarity, we simply rearrange the equation of the dot product between two vectors.
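The bag-of-character-N-grams idea can be sketched in a few lines. This is only an illustration of the subword decomposition, not FastText's own code: the function name is hypothetical, the N-gram range of 3 to 6 and the `<` and `>` word-boundary markers are assumed defaults, and the special whole-word sequence is omitted.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word` after wrapping it in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", 3, 3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']

# A misspelled word still shares many n-grams with the correctly spelled one,
# which is why an unseen typo can receive a sensible vector.
shared = set(char_ngrams("similarity")) & set(char_ngrams("similarty"))
print(len(shared) > 0)
```

A word vector built from these pieces can be formed for any string, which is the property the paragraph above relies on for typos and rare words.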
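The rearrangement mentioned above can be written out explicitly. The symbols are chosen here for illustration: a and b are two n-dimensional vectors and θ is the angle between them.

```latex
\mathbf{a} \cdot \mathbf{b}
  = \lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert \cos\theta
\quad\Longrightarrow\quad
\cos\theta
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

Dividing by both norms is what removes the dependence on magnitude, leaving only the orientation of the two vectors.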
3.2.1 Sliding Window with Average Weighted Word Vectors
In language, the meaning of a sentence is reflected by the words in it. Older methods represent a sentence by the weighted average of its word embeddings and compare sentences using cosine similarity. However, since we are comparing short and long sentences, taking the weighted average over a long sentence does not help. Moreover, it reduces the weight of the main action verb in the overall representation, which in turn affects the sentence similarity. To overcome this, we apply a sliding window (Fig. 2) to the long sentence, so that the main action verb carries the same weight in both inputs.
After applying the sliding window to S2, we get a list of substrings S2'. To obtain a vector representation of every window, we iterate through S2' and take the weighted average of the word embeddings in each window, then compute its cosine similarity with S1. The final similarity score for S1 and S2 is the maximum score obtained from the window comparisons. In a chatbot application, the number of false positives must be kept very low for a good user experience. We therefore tried the weighted N-gram approach to further reduce false positives.
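The sliding-window comparison can be sketched as follows. Several details are assumptions for illustration only: the toy three-dimensional embeddings stand in for pre-trained FastText vectors, the window size is set to the number of words in the short sentence, and the word weights default to uniform because the weighting scheme is not specified here.

```python
import math

# Hypothetical toy embeddings; a real system would load pre-trained vectors.
EMB = {
    "i": [0.1, 0.1, 0.0], "want": [0.2, 0.3, 0.1], "to": [0.1, 0.0, 0.1],
    "book": [0.9, 0.1, 0.2], "a": [0.0, 0.1, 0.1],
    "flight": [0.1, 0.9, 0.3], "tomorrow": [0.2, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_vector(words, weights=None):
    """Weighted average of the known word vectors (uniform weights by default)."""
    weights = weights or {}
    known = [w for w in words if w in EMB]
    total = sum(weights.get(w, 1.0) for w in known)
    dim = len(next(iter(EMB.values())))
    if total == 0:
        return [0.0] * dim
    return [sum(weights.get(w, 1.0) * EMB[w][d] for w in known) / total
            for d in range(dim)]

def sliding_window_similarity(short, long_):
    """Maximum cosine similarity between S1 and every window of S2."""
    s1, s2 = short.lower().split(), long_.lower().split()
    size = min(len(s1), len(s2))  # assumed window size: length of the short sentence
    windows = [s2[i:i + size] for i in range(len(s2) - size + 1)]
    target = avg_vector(s1)
    return max(cosine(target, avg_vector(w)) for w in windows)

button_text = "book a flight"
user_input = "i want to book a flight tomorrow"
window_score = sliding_window_similarity(button_text, user_input)
whole_score = cosine(avg_vector(button_text.split()), avg_vector(user_input.split()))
print(window_score > whole_score)
```

In this toy example the window containing "book a flight" matches the button text exactly, while averaging over the whole user input dilutes the match, which is the effect the sliding window is meant to avoid.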
3.2.2 Weighted N-gram Vectors
N-grams are consecutive sequences of N words; for example, trigrams are all possible three-word substrings of a given sentence. To compare two sentences, each sentence is tokenized into unigrams, bigrams and trigrams.
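As a minimal sketch of this tokenization step, the helper below splits on whitespace and returns word N-grams as tuples. The function name and the example sentence are illustrative assumptions.

```python
def word_ngrams(sentence, n):
    """Return all runs of n consecutive words in the sentence, as tuples."""
    words = sentence.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "show me my account balance"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
print(trigrams)
# [('show', 'me', 'my'), ('me', 'my', 'account'), ('my', 'account', 'balance')]
```

A sentence with fewer than N words simply yields an empty list for that N, which matters when one of the inputs is a one- or two-word button text.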
For every unigram of sentence S1, find similarity with every unigram of sentence S2 and select the maximum score as match score for that unigram. All the selected unigram scores are averaged over to get a final unigram score:
Likewise, for every bigram of S1, find similarity with every bigram of S2 and select maximum score as match for that bigram. All the selected bigram scores are averaged over to get a final bigram score:
The final similarity score of the sentences is taken as the weighted sum of the final similarity scores of unigrams, bigrams and trigrams:
where
As discussed in Section 3.1, we use cosine similarity on averaged word embeddings to calculate the similarity between N-grams.
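The scoring procedure described in this section can be sketched as below. The weights (0.2, 0.3, 0.5), the toy embeddings, and the renormalization over the N-gram orders that actually exist for both sentences are assumptions for illustration; the weighting formula taken from [3] is not reproduced here.

```python
import math

# Hypothetical toy embeddings; a real system would load pre-trained vectors.
EMB = {
    "i": [0.1, 0.1, 0.0], "want": [0.2, 0.3, 0.1], "to": [0.1, 0.0, 0.1],
    "book": [0.9, 0.1, 0.2], "a": [0.0, 0.1, 0.1],
    "flight": [0.1, 0.9, 0.3], "tomorrow": [0.2, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_vector(words):
    """Plain average of the known word vectors in an n-gram."""
    known = [EMB[w] for w in words if w in EMB]
    dim = len(next(iter(EMB.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def word_ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_score(s1_words, s2_words, n):
    """Average, over the n-grams of S1, of the best cosine match among the n-grams of S2."""
    g1, g2 = word_ngrams(s1_words, n), word_ngrams(s2_words, n)
    if not g1 or not g2:
        return None  # this n-gram order does not exist for one of the sentences
    best = [max(cosine(avg_vector(a), avg_vector(b)) for b in g2) for a in g1]
    return sum(best) / len(best)

def weighted_ngram_similarity(s1, s2, weights=(0.2, 0.3, 0.5)):
    """Weighted sum of unigram, bigram and trigram scores (weights are assumed values)."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    total, weight_sum = 0.0, 0.0
    for n, weight in zip((1, 2, 3), weights):
        score = ngram_score(w1, w2, n)
        if score is not None:
            total += weight * score
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

match = weighted_ngram_similarity("book a flight", "i want to book a flight tomorrow")
mismatch = weighted_ngram_similarity("tomorrow", "book a flight")
print(match > mismatch)
```

Note that the score is directional: it averages over the N-grams of the first sentence, so swapping the arguments can change the result.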
4 Results
Here, we describe the dataset, which consists of conversational data from a chatbot-builder-based NLP engine environment. We then compare the approaches described in Sections 3.1 and 3.2 with the latest sentence similarity approach based on Google's Universal Sentence Encoder.
4.1 Dataset
Although many datasets are accessible, there are currently no suitable benchmarks (or even standard text sets) for evaluating the similarity between long and short sentences. We release a dataset that is specific to the conversational agent problem. The dataset is structured into two columns: the first holds the long sentence, which imitates user input, and the second holds the short sentence, which typically resembles the button text in a chat conversation.
4.2 Sentence Similarity
A testing instance is a pair of button text and user input. The similarity score between each user input and button text is calculated, and the comparison is categorized as positive or negative based on this score. A comparison is deemed positive if the similarity score is above the threshold (0.9) and negative if it is below the threshold. We used precision, recall and F1 score as performance metrics for evaluating our solution.
Both proposed approaches outperformed Google's Universal Sentence Encoder in recall and F1 score, and the weighted N-gram approach scored highest on all four metrics; see Table 2.
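The evaluation procedure can be sketched as follows. The scores and gold labels are made-up values, and treating a score exactly equal to the threshold as negative is an assumption, since only "above" and "below" are specified.

```python
def classify(scores, threshold=0.9):
    """Positive if the score is strictly above the threshold (equality assumed negative)."""
    return [s > threshold for s in scores]

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 from parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Hypothetical similarity scores and gold labels for six test pairs.
scores = [0.95, 0.40, 0.92, 0.85, 0.97, 0.10]
labels = [True, False, False, True, True, False]

predicted = classify(scores)
precision, recall, f1 = precision_recall_f1(predicted, labels)
print(round(precision, 4), round(recall, 4), round(f1, 4))

# Consistency check against the reported weighted N-gram precision and recall.
reported_f1 = round(f1_from(0.9226, 0.9408), 4)
print(reported_f1)  # 0.9316
```

The last lines recompute F1 from the reported precision and recall of the weighted N-gram approach, which agrees with the reported F1 value.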
Approaches/Metrics | Google Universal Sentence Encoder | Sliding Window with avg. Weighted Vectors | Weighted N-gram Vectors |
---|---|---|---|
Recall | 0.0789 | 0.2593 | 0.9408 |
Precision | 0.9022 | 0.6507 | 0.9226 |
F1 Score | 0.1451 | 0.3708 | 0.9316 |
Accuracy | 0.9256 | 0.9298 | 0.9880 |
5 Conclusion
In the development stages of chatbots, the existing bot platforms provided ML solutions that required large amounts of training data from developers. In addition, the platform had to manage multiple datasets continuously, and retraining the model every time made the process complex and expensive.
In this paper, we propose the sliding window with average weighted word vectors and the weighted N-gram vectors for building the input semantics vector. The proposed method replaces the sentence embedding approach with a simple word-embedding-based sentence representation and does not need a large dataset for training.
We are encouraged by the performance of our approaches and will apply them to other text classification tasks in the near future. We also plan to improve the word representation using dependency and constituency parsing information, and to apply vector similarity measures other than cosine to further improve the results.