1 Introduction
Deep learning models have achieved impressive results on numerous Natural Language Processing (NLP) tasks such as Neural Machine Translation (NMT) [1], Named Entity Recognition (NER) [2], Text Summarization [3], and Text Classification [4]. Among these, text classification is one of the most important and challenging tasks in NLP; it aims to assign predefined categories to natural language texts. It is useful in many applications, such as social media analysis, sentiment analysis, business analytics, and feedback analysis. Since there is no complete set of predefined rules for natural languages, classification algorithms struggle to capture the complex semantics of text.
1.1 Prior Work
Feature representation is a crucial problem in text classification. Initially, bag-of-words models based on unigrams, bigrams, and higher-order n-grams were used for feature representation. Later, Mikolov et al. [5] proposed distributed representations of words to address data sparsity and the loss of semantic information. Character embeddings and sentence embeddings are other types of embeddings used for text classification. Word2vec [6] and GloVe [7] are two pre-trained word embeddings commonly used for text classification.
Different deep learning architectures have been applied to text classification to learn different kinds of features. Socher et al. [8] proposed the Recursive Neural Network for text classification by modelling sentence representations. Since text classification has a sequential nature, Recurrent Neural Networks (RNNs) and their variants, the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM), have been used to learn long-distance dependencies, i.e., the semantics of text. Kim [4] proposed a Convolutional Neural Network (CNN) for text classification that takes pre-trained word embeddings as input and learns n-gram features. A CNN captures local correlations of spatial or temporal structure but loses the wider context of the text. To take advantage of both models, Lai et al. [9] proposed a recurrent convolutional neural network for text classification that first provides context to each word using an RNN and then uses a CNN to extract n-gram features. Zhou et al. [10] proposed the C-LSTM network for text classification, which applies a CNN first and then an LSTM.
Since both architectures apply the CNN and the RNN sequentially, errors in the CNN are propagated to the RNN and vice versa. This sequential design may therefore lead to erroneous n-gram features, semantic features, or long-distance dependencies.
1.2 About Our Work
To address the above problem, we propose a novel architecture for text classification based on a gated fusion of n-gram features and semantic features.
The goal of this work is robust text classification by modelling both n-gram features and long-distance dependencies, i.e., the semantics of the text (represented as a sentence embedding), more concretely. Figure 1 presents a high-level block diagram of the proposed network.
2 Proposed Models
As discussed in the previous section, both n-gram features and semantic features were considered for classification by taking advantage of a CNN, an LSTM, and a sentence encoder. To avoid propagating errors from n-gram feature extraction to semantic feature extraction, or vice versa, we trained both networks independently and minimized the sum of the losses of the two sub-networks. The CNN was used to capture n-gram features. For long-distance dependencies, i.e., semantic features, one model used a Bidirectional LSTM and the other used the Universal Sentence Encoder. The outputs of the two sub-networks were combined using a gated fusion equation. Details of the two models are discussed in the following sections.
2.1 Model 1: Gated Fusion on CNN and Bidirectional LSTM
The basic idea is to process the input sentence in two parallel networks, as shown in Figure 2. First, we tokenize the sentence by splitting on whitespace to obtain the word sequence. To represent words as distributed dense vectors, we used L1-dimensional GloVe [7] pre-trained word embeddings. Unseen words (words not present in the GloVe vocabulary) were represented by L1-dimensional dense vectors initialized uniformly at random. The input sequence length was set to the average sentence length L2. In this way, a word embedding matrix of dimension L2 × L1 was created for each input sentence.
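For concreteness, the following is a minimal sketch of this input preparation. It assumes the GloVe vectors have already been loaded into a Python dictionary `glove` mapping each word to its L1-dimensional vector; the example values of L1 and L2, the random-initialization range, and zero-padding of short sentences are assumptions, since the paper does not specify them.

```python
import numpy as np

L1, L2 = 300, 40   # assumed example values for embedding dimension / average sentence length

def sentence_matrix(sentence, glove, rng=np.random.default_rng(0)):
    tokens = sentence.split()               # tokenize by splitting at whitespace
    rows = []
    for word in tokens[:L2]:                # truncate to the fixed length L2
        vec = glove.get(word)
        if vec is None:                     # unseen word: uniform random initialization
            vec = rng.uniform(-0.25, 0.25, size=L1)
        rows.append(vec)
    while len(rows) < L2:                   # pad short sentences with zero vectors (assumption)
        rows.append(np.zeros(L1))
    return np.stack(rows)                   # L2 x L1 word embedding matrix
```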
For extracting n-gram features, we used Kim's CNN model [4] as the baseline. The word embedding matrix of a sentence was passed to four parallel convolution layers with filter sizes f1 × L1, f2 × L1, f3 × L1, and f4 × L1. The filter heights were set to f1 = 1, f2 = 2, f3 = 3, and f4 = 5, where the height is the number of words convolved together to capture unigram, bigram, and higher-order n-gram features.
We used 128 filters of each size. After each convolution layer, a max-pooling layer selected the most important feature from that convolution's output, giving 128 features per layer. The features from all four layers were concatenated into a dense vector of size 512, which constitutes the n-gram features of the sentence.
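A minimal sketch of this parallel-CNN branch is shown below; Keras is assumed (the paper does not name a framework), and L1 = 300, L2 = 40 are placeholder values for the embedding dimension and average sentence length.

```python
from tensorflow.keras import layers, Input, Model

L1, L2 = 300, 40                                   # assumed example values
inputs = Input(shape=(L2, L1))                     # word embedding matrix of one sentence

pooled = []
for height in (1, 2, 3, 5):                        # filter heights f1..f4
    conv = layers.Conv1D(filters=128, kernel_size=height, activation="relu")(inputs)
    pooled.append(layers.GlobalMaxPooling1D()(conv))   # most important feature per filter

ngram_features = layers.Concatenate()(pooled)      # 4 x 128 = 512-dimensional n-gram features
cnn_branch = Model(inputs, ngram_features)
```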
Since LSTMs are able to capture long-distance dependencies in sequential data, we fed the word embedding matrix of a sentence to a bidirectional LSTM layer to model long-distance dependencies, i.e., semantic features. The bidirectional LSTM provides both the forward and the backward context of the text to the network. We used 256 hidden units in each LSTM. The output of this layer was a dense vector of size 512 (the concatenated outputs of the forward and backward LSTMs), which can be interpreted as the semantics (long-distance dependencies) of the complete text.
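The corresponding Bi-LSTM branch, under the same assumptions as the CNN sketch above:

```python
from tensorflow.keras import layers, Input, Model

L1, L2 = 300, 40                                                     # assumed example values
inputs = Input(shape=(L2, L1))
semantic_features = layers.Bidirectional(layers.LSTM(256))(inputs)   # 2 x 256 = 512-dim vector
lstm_branch = Model(inputs, semantic_features)
```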
The CNN branch output and the LSTM branch output were given as input to the gated fusion equation (Z). The output of the gated fusion was passed to a dropout layer, then to a fully connected layer, and finally a softmax layer was used for classification:
Equation (1) is the gated fusion equation, where g is a nonlinear activation function; in our experiments we used ReLU. Equation (2) defines t, the weightage gate: t is the weight given to the n-gram features and (1 - t) is the weight given to the long-distance dependencies, i.e., the semantic features of the text. The features generated by the CNN branch and the bidirectional LSTM branch are averaged to obtain y, and the gate is computed by applying a sigmoid to a learnable transformation of y. The intention of the gated fusion is to let the model learn to choose between the features generated by the CNN and by the Bi-LSTM: some texts are best classified based on short-distance dependencies and others based on long-distance dependencies, so the model decides the weightage of the two by itself.
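Since Equations (1) and (2) are referenced above but not reproduced in this text, the following is a plausible reconstruction from the description, where h_cnn and h_lstm denote the 512-dimensional CNN and Bi-LSTM feature vectors and W_t, b_t are the assumed learnable gate parameters:

```latex
\begin{align}
  Z &= g\bigl(t \odot h_{\mathrm{cnn}} + (1 - t) \odot h_{\mathrm{lstm}}\bigr) \tag{1} \\
  t &= \sigma\!\left(W_t\, y + b_t\right), \qquad
  y = \tfrac{1}{2}\bigl(h_{\mathrm{cnn}} + h_{\mathrm{lstm}}\bigr) \tag{2}
\end{align}
```

One way to realize this gate, sketched in Keras under the same assumptions as the branch sketches above (the paper gives no implementation details):

```python
from tensorflow.keras import layers

def gated_fusion(h_cnn, h_lstm, dim=512):
    y = layers.Average()([h_cnn, h_lstm])            # y: element-wise average of the two branches
    t = layers.Dense(dim, activation="sigmoid")(y)   # weightage gate t = sigmoid(W_t y + b_t)
    one_minus_t = layers.Lambda(lambda a: 1.0 - a)(t)
    z = layers.Add()([layers.Multiply()([t, h_cnn]),
                      layers.Multiply()([one_minus_t, h_lstm])])
    return layers.Activation("relu")(z)              # g = ReLU
```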
2.2 Model 2: Gated Fusion on CNN and Universal Sentence Encoder
For the n-gram features, we first construct the word embedding matrix from the input sentence as described for Model 1 and then apply the same sequence of convolutions. The kernel sizes, number of kernels, pooling, and output dimensions are the same as in Model 1. For the semantic features, the complete input sentence is encoded by the pre-trained Universal Sentence Encoder.
Similar to the previous model, the features generated by the CNN branch and by the Universal Sentence Encoder are averaged to obtain y, and the CNN output and the sentence encoder output are passed to the gated fusion equation (Equation 1), as in Model 1.
The output of the fusion layer was then passed to a fully connected layer, and finally a softmax layer was used to predict the class.
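As a sketch, the Universal Sentence Encoder features used by this model can be obtained from the publicly released TensorFlow Hub module; the module version is an assumption, since the paper does not state which release was used.

```python
import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder (version 4 assumed).
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
semantic_features = use(["the movie was surprisingly good"])   # tensor of shape (1, 512)
```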
3 Datasets and Experiment Details
In this section, we first present the benchmark datasets used in our experiments and then the experiment details.
3.1 Datasets
We carried out our experiments on four benchmark datasets. Table 1 shows the distribution of training and test data for all four datasets, along with the number of classes to be predicted.
Dataset | Train Data | Test Data | Classes |
---|---|---|---|
MR | 10662 | CV | 2 |
TREC | 5952 | 500 | 6 |
AG News | 120000 | 7600 | 4 |
SUBJ | 10000 | CV | 2 |
MR: Movie review dataset with positive and negative reviews. The aim is to identify whether a review is positive or negative [12]. There is no predefined test set, so fivefold cross-validation (CV) was performed.
TREC: This dataset contains six types of questions. The objective is to identify the class of a given question [13].
AG News: A topic classification dataset. The aim is to classify news articles into different classes [14].
SUBJ: The task is to classify a sentence as subjective or objective [15]. There is no predefined test set, so fivefold cross-validation (CV) was performed here as well.
3.2 Experiment Details
For all datasets, training was carried out using mini-batch gradient descent with a batch size of 64. Binary cross-entropy loss was used for binary classification, and categorical cross-entropy loss for the other datasets. We used Adam as the optimizer. The number of epochs was set to 50 for each model, although for all datasets the models converged well before 50 epochs. We used a dropout rate of 0.5 to reduce overfitting.
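A hedged sketch of this training setup, again assuming Keras; `model` stands for Model 1 or Model 2 assembled from the branch sketches above (with the 0.5 dropout applied inside the model), and `x_train`, `y_train` are the prepared embedding matrices and one-hot labels.

```python
from tensorflow.keras.optimizers import Adam

def train(model, x_train, y_train, num_classes):
    loss = "binary_crossentropy" if num_classes == 2 else "categorical_crossentropy"
    model.compile(optimizer=Adam(), loss=loss, metrics=["accuracy"])
    model.fit(x_train, y_train,
              batch_size=64,   # mini-batch size from the paper
              epochs=50)       # upper bound; convergence is reported well before 50 epochs
```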
4 Results and Discussion
In this section, we first present the results obtained on the benchmark datasets and then discuss them.
4.1 Results
We evaluated the proposed methods on the benchmark datasets and compared them with state-of-the-art results to show the effectiveness of the approach. The results of both models are shown in Table 2. We report overall accuracy, i.e., the number of correct predictions divided by the total number of predictions on the test set; fivefold cross-validation is used for the datasets without a predefined test set. We observe that combining n-gram features with the features generated by the Universal Sentence Encoder achieves remarkable results: these features surpass all results achieved by the multichannel CNN model, and compared to the C-LSTM model we obtain a 1.2% accuracy improvement on the TREC dataset. Using n-gram features together with the features generated by the RNN, we achieve results approximately on par with the other models.
Models | MR | AG News | TREC | SUBJ |
---|---|---|---|---|
Yoon Kim [4] | 81.5 | 86.1 | 93.6 | 93.4 |
CharCNN [14] | 77.0 | 78.3 | 76.0 | — |
WCCNN [16] | 83.8 | 85.6 | 91.2 | — |
KPCN [16] | 83.3 | 88.4 | 93.5 | — |
C-LSTM [10] | — | — | 94.6 | — |
BiLSTM-CRF | 82.3 | — | — | — |
F-Dropout [4] | 79.1 | — | — | 93.6 |
Model 1 | 79.71¹ / 80.39² | 88.26 | 93.8 | 92.61¹ / 93.6² |
Model 2 | 83.4¹ / 84.43² | 88.75 | 95.8 | 94.95¹ / 95.85² |
4.2 Discussion
It is observed that with the gated fusion of the n-gram features from the CNN and the features generated by the RNN, the model achieves only comparable results, whereas with the gated fusion of the n-gram features from the CNN and the semantics from the pre-trained Universal Sentence Encoder, the model achieves state-of-the-art results.
Compared with existing methods that use both a CNN and an RNN for text classification, the proposed model performs better. It is also noticeable that the gated fusion can choose between classification based on short-distance dependencies and classification based on long-distance dependencies.
5 Conclusion
In this work, we proposed a method for text classification using n-gram features and semantic features, captured by a convolutional neural network (CNN) and the Universal Sentence Encoder, respectively. The proposed models achieve state-of-the-art results on four common benchmark datasets. The proposed method can also be applied to many other natural language processing tasks, such as sentence similarity and machine translation (as an encoder).