1 Introduction
Social media platforms generate enormous volumes of text data, which has sparked renewed interest in processing these data to uncover their underlying meaning in a broader setting. Because Twitter data are publicly accessible and handled transparently, they can be used to investigate novel natural language processing (NLP) and data mining approaches, such as sentiment analysis [4]. Sentiment analysis extracts the personal information, opinion, or polarity communicated in phrases or paragraphs.
Sentiment analysis of data from social media platforms such as Twitter is a valuable technique that offers real-time monitoring and decision-making capabilities in the battle against the COVID-19 pandemic, as it can extract actionable information from raw data. Many nations have implemented measures such as isolation, quarantine, lockdown, and social distancing in response to the pandemic, and public fears about these measures are widely expressed on social media [13, 18].
However, different ethnicities and cultures express their ideas in different ways. No matter the topic (health, politics, sports, or entertainment), people in one nation may react more passionately than those in another. Data-driven machine learning (ML) techniques can capture and predict such reactions [11, 23].
ML algorithms are widely utilized in health informatics [12, 5], pandemic prediction [13, 31], autism prediction [16], and many other fields. Many researchers have used ML systems to analyze Twitter sentiment. Villavicencio et al. [29] used the Naïve Bayes classifier to analyze COVID-19 vaccination tweets in the Philippines and obtained 81.77% accuracy; the classifier was tested on 11,974 manually tagged tweets. Khan et al. [17] used the Naïve Bayes classifier to assign sentiment scores to 50,000 COVID-19 tweets and found 19% positive and 70% negative tweets. The authors of [15] employed deep learning classifiers to categorize 600 COVID-19-related tweets by sentiment; among the classification methods in their research, H-SVM achieved the highest accuracy (86%), recall (69%), and F1-score (77%). Gupta et al. [9] investigated Twitter users' perceptions of the impact of weather on SARS-CoV-2 transmission.
The research filtered relevant tweets (n = 28,555) using 11 ML algorithms and classified annotated tweets (n = 2,442) into sentiment labels. In the relevant tweet dataset, 40.4% of tweets were ambiguous about weather's influence, 33.5% indicated no effect, and 26.1% indicated some effect on SARS-CoV-2 transmission. Latent Dirichlet Allocation (LDA) modeling has also been used to identify COVID-19-related topics from Twitter data [6, 1].
The researchers in [27] assessed machine learning classifiers on 7,528 COVID-19 tweets; automatic Twitter annotation yielded 93% accuracy in that trial. This body of work indicates that ML techniques have been widely employed for COVID-19 tweet sentiment analysis and categorization. However, no research has explicitly investigated ensemble ML models for sentiment analysis of COVID-19 tweets.
Nemez [22] employed a trained Recurrent Neural Network (RNN) to assess the percentages of positive, neutral, and negative attitudes in a coronavirus-related Twitter dataset; the RNN forecasts showed 24.8% more positive tweets on May 13-14, 2020. Rustam et al. [27] examined RF, XGBoost, SVC, ETC, DT, and LSTM for sentiment analysis. LSTM performed worst in that trial because its training data were insufficient.
Chakraborty et al. [3] used a fuzzy approach with a Gaussian membership function to predict Twitter sentiment with 79% accuracy. Prior research suggests that sentiment analysis of Twitter data is difficult owing to the diversity, spelling errors, and non-standard sentence patterns of user-generated content. This study analyzes COVID-19-related Twitter data using ensemble machine learning methods and deep learning models.
Voting, bagging, stacking, and BERT (Bidirectional Encoder Representations from Transformers) were tested for COVID-19 Twitter sentiment analysis. The data come from the Coronavirus tweets NLP - Text Classification Kaggle competition and already contain a sentiment class for each tweet [20].
The following section presents the proposed scheme for sentiment analysis of COVID-19 tweets. Section 3 then discusses the ML and deep learning model findings in detail. The final section concludes the article and highlights its limitations.
1.1 Proposed Scheme
The methodological overview of the sentiment analysis process is shown in Figure 1. In the data acquisition step, COVID-19-related tweets were collected from Twitter. The collected dataset was then preprocessed, followed by word representation, classification, and performance measurement.
1.2 Data Acquisition
We collected English-language tweets related to the coronavirus that were posted on Twitter between January 1, 2020, and December 31, 2020. The tweets were sourced from several countries around the world across the pandemic timeline and are available at [20].
A set of predefined and widely used science and news media terms related to the coronavirus, such as “COVID-19”, “coronavirus”, “lockdown”, “isolation”, “quarantine”, “pandemic”, and “ncov-2020”, was used to collect tweets.
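As an illustration, such keyword-based filtering can be sketched as follows; the helper name and sample tweets are hypothetical, since the actual collection pipeline behind [20] is not described here.

```python
# Minimal sketch of keyword-based tweet filtering, using the predefined
# terms listed above; the helper and sample tweets are illustrative only.
KEYWORDS = {"covid-19", "coronavirus", "lockdown", "isolation",
            "quarantine", "pandemic", "ncov-2020"}

def is_corona_related(tweet_text: str) -> bool:
    """Keep a tweet if it mentions any predefined coronavirus term."""
    text = tweet_text.lower()
    return any(keyword in text for keyword in KEYWORDS)

tweets = ["Lockdown extended for two more weeks", "Great match last night!"]
corona_tweets = [t for t in tweets if is_corona_related(t)]  # keeps the first
```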
The data consist of 41,157 training tweets and 3,798 testing tweets. Table 1 shows sample tweet data for sentiment classification, and Figure 2 shows the class distribution of the five-class data.
Table 1. Sample tweet data for sentiment classification.

| OriginalTweet | Sentiment |
| --- | --- |
| The Home Depot is limiting the number of customers allowed into its stores at any one time | Positive |
| I SERIOUSLY DOUBT anyone will be voting for ANY Republican. Please wear a mask take hand sanitizer and vote these bastards out | Extremely Negative |
| I thought I would save more money by being quarantined but online shopping determined that was a lie. ??? #CoronaCrisis | Extremely Positive |
1.3 Data Preprocessing
Raw data must be treated in a preprocessing stage before they can be used successfully with machine learning algorithms; this stage prepares the data for modeling. The system performs its data preprocessing with the assistance of natural language processing techniques [24].
First, the text data are converted to lowercase. Special characters, URLs, HTML tags, and stop words are then removed, and contractions are expanded. The stop word list defined in the Python NLTK package is used for this procedure, and a custom function is developed to substitute the contractions. Finally, a spelling check is carried out to reduce the likelihood of confusion.
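A minimal sketch of this cleaning pipeline, assuming the NLTK stop word list named above, might look as follows; the contraction map is illustrative, since the paper's custom function is not shown.

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

# Illustrative contraction map; the actual custom function is not reported.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_tweet(text: str) -> str:
    text = text.lower()                                  # lowercase first
    for contraction, expansion in CONTRACTIONS.items():  # expand contractions
        text = text.replace(contraction, expansion)
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)                # remove special characters
    return " ".join(w for w in text.split() if w not in STOP_WORDS)
```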
The text data are then subjected to a further round of tokenization [14], normalization, and lemmatization. Tokenization, stemming, and normalization are three critical NLP functions used to preprocess text before classification:
Tokenization: In natural language processing (NLP), tokenization divides text content into smaller components, each called a token. In this work, every single word is turned into a token [8].
Stemming: Stemming converts the morphological forms of a word back to its stem, under the assumption that each form is semantically related to the others. The stem need not be a term already present in the dictionary; nevertheless, after stemming is complete, all of the stem's variants should map to this form. When utilizing a stemmer, two things need to be taken into consideration [21]:
(a) It is reasonable to presume that the various morphological variants of a word have the same core meaning, and they should thus all be mapped to the same stem.
(b) It is essential not to confuse words that do not have the same meaning with one another.
These two rules are sufficient as long as the resulting stems are useful for the text mining and language processing programs we use. In most contexts, stemming is understood to function as a mechanism that improves recall. Its influence is weaker in languages with a relatively simple morphology than in languages with a more complex morphology.
Normalization: Normalization is the process of converting non-standard text into its typical form. People occasionally use a term unconventionally to convey their meaning [19]; such content has to be reformatted into its proper form, and any spelling errors need to be corrected.
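A short sketch of these steps with NLTK's off-the-shelf tools is shown below; the sample sentence is illustrative.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

tokens = word_tokenize("Quarantined shoppers are buying online")  # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]                 # e.g. "buying" -> "buy"
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]       # dictionary forms
```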
Extracting features: Text feature extraction is the procedure of extracting features from text and representing them as a vector of real numbers [26, 10].
In this study, we used TF-IDF, a technique that generates a vector of real-valued features for each text; the value of each feature depends on how often a specific word occurs in the text, weighted by how rarely the word occurs across the corpus.
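For example, with scikit-learn's TfidfVectorizer (a plausible choice here, though the paper's exact vectorizer settings are not reported), each cleaned tweet becomes one row of a sparse feature matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["stores limit customers",          # illustrative cleaned tweets
          "wear mask vote",
          "online shopping quarantine"]

vectorizer = TfidfVectorizer()        # default settings; the paper's are unstated
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per tweet
print(X.shape)                        # (3, number_of_distinct_terms)
```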
2 Building Models
Four different ML models were built using the preprocessed tweets. The models were trained on the training dataset, while their performance was evaluated on both the training and test datasets. The ML models are analyzed in detail in the following subsections.
2.1 Analyzing Machine Learning Models
2.1.1 Voting
A voting ensemble is a machine learning technique that produces a single final prediction by combining the predictions of multiple machine learning models [28]. Because every model is trained on all the training data, each should develop its own characteristics. For regression tasks, the result is the mean of the predictions made by the models.
For classification problems, two methods are available to estimate the final output: hard voting and soft voting. Voting's primary purpose is to enhance generalization by correcting flaws specific to each model, which is especially valuable when the individual models already perform well on a predictive modeling problem.
2.1.2 Bagging
Bagging subsamples the training data to improve a single classifier's generalization performance; overfitting models benefit from this strategy. Bagging comprises two steps: bootstrapping and aggregating. Bootstrapping resamples the data using random sampling with replacement, so the subsamples of training data overlap. The final prediction is then obtained by averaging (regression) or voting (classification) over the models trained on the subsamples. Because the classifier's hyperparameters do not vary from one subsample to another, this strategy does little to reduce bias; it is computationally expensive and primarily reduces variance, generalizing better when the model is overfitted but not when it is underfitted.
2.1.3 Stacking
Stacking ensembles employ a weighted combination rather than letting all models contribute equally to the forecast. A stacking model has base models and a meta-model (a model that learns how to combine the predictions of the base models); linear regression is typically used as the meta-model for regression, and logistic regression for classification. The meta-model is trained on out-of-sample base model predictions: (1) data not used to train the base models are fed to them, (2) the base models make predictions, and (3) these predictions and the ground-truth labels are used to fit the meta-model. For regression problems, the predicted values are used as meta-features; for binary classification, the predicted probability of the positive class is usually the input; and for multi-class classification, the predicted values for all classes are used.
2.1.4 BERT Classifier
BERT is a deep learning model that excels at NLP tasks. Its deep bidirectional representation can be fine-tuned with just one additional output layer [2]. This paper used BERT-Base, which has 12 layers (transformer blocks), 768 hidden units, and 12 self-attention heads, for a total of 110M optimized parameters. BERT employs a vocabulary of 30,000 WordPiece embeddings [30].
The input representation is the sum of the token, segment, and position embeddings. During preprocessing, [CLS] and [SEP] were used as the classification token and sentence separator, respectively, and the sentiment categorization output layer operates on the [CLS] representation.
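The following sketch shows how such an input is built with the Hugging Face tokenizer; the bert-base-uncased checkpoint is an assumption consistent with the BERT-Base setup described above.

```python
from transformers import BertTokenizer

# bert-base-uncased is assumed; the paper only states that BERT-Base was used.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("online shopping determined that was a lie")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'online', 'shopping', 'determined', 'that', 'was', 'a', 'lie', '[SEP]']
```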
3 Experimental Results and Discussion
This section briefly discusses the study of different ensemble ML algorithms for categorizing user sentiment into different labels (extremely positive, positive, extremely negative, negative, and neutral). The ML models were created and examined using the scikit-learn package [25] and the Python programming language.
The manually labeled dataset was randomly split 80/20 between the training and testing phases, so 80% of the data served as training data and 20% as testing data. The grid search tuning approach [7] was used to tune the hyperparameters, which regulate how the algorithms learn, and thereby identify the best hyperparameters for the utilized models.
The algorithms' performance was evaluated using precision, recall, and the F1-score. The experiments were conducted to discover the best parameters for each method used to classify the sentiment of the COVID-19 tweets; tweets with five classes were used throughout.
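A sketch of this split, grid search, and evaluation with scikit-learn follows; the estimator and parameter grid are illustrative, since the searched values are not reported.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# X is the TF-IDF matrix and y the five sentiment labels from the steps above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.1, 1, 10]},  # illustrative grid
                    scoring="f1_macro", cv=5)
grid.fit(X_train, y_train)
print(classification_report(y_test, grid.predict(X_test)))  # precision/recall/F1
```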
First, popular machine learning algorithms were applied with the TF-IDF data representation: Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbors (KNN), Random Forest, Gradient Boosting, and AdaBoost; Figure 3 shows the results obtained. The best models (those with accuracy above 50%) were then selected to build the ensemble models (voting, bagging, and stacking). Finally, the BERT model was used.
3.1 Voting Classifier (VC) Setup
To obtain the final predicted labels, hard voting (also known as majority voting) was used in this study among the Decision Tree (DT), Support Vector Classifier (SVC), and Logistic Regression (LR) models. The precision, recall, and F1-score for the VC model on the test dataset were 98.9%, 99.5%, and 99.3%, respectively.
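A minimal sketch of this setup with scikit-learn is shown below; hyperparameters are left at their defaults, since the tuned values are not listed.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hard (majority) voting over the three base models named above.
vc = VotingClassifier(estimators=[("dt", DecisionTreeClassifier()),
                                  ("svc", SVC()),
                                  ("lr", LogisticRegression(max_iter=1000))],
                      voting="hard")
vc.fit(X_train, y_train)     # X_train, y_train from the 80/20 split above
y_pred = vc.predict(X_test)  # majority label across the three models
```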
3.2 Bagging Classifier (BC) Setup
In bagging, the outputs from the predictive models trained on the subsamples are applied to a voting scheme for better categorization. The base estimator for training the BC model in this investigation was Logistic Regression.
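A sketch of this setup follows, assuming default bagging settings; the number of estimators and the Logistic Regression configuration are not reported, so the values below are assumptions.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Bootstrap-resampled copies of the training data, each fitting one
# Logistic Regression; n_estimators=10 is an assumption.
bc = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                       n_estimators=10, bootstrap=True, random_state=42)
# (older scikit-learn versions name the first argument base_estimator)
bc.fit(X_train, y_train)
```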
3.3 Stacking Classifier (SC) Setup
The proposed SC model’s design consisted of two levels. The VC and BC models discussed above made up the first layer of the SC model, and a logistic regression model made up the second layer.
For every observation in the training and test sets, the two first-layer models generated predictions, and these predictions served as input features for the second-layer LR model.
The second-layer model then delivered the final result based on these input features. The SC model's accuracy, precision, recall, and F1-score on the training set were 64%, 64%, 65%, and 65%, respectively (see Table 2).
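A sketch of this two-level design with scikit-learn's StackingClassifier, reusing the vc and bc models from the previous subsections, might look as follows.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# First layer: the VC and BC models defined above; second layer: Logistic
# Regression fit on their out-of-sample predictions (cv controls the folds).
sc = StackingClassifier(estimators=[("voting", vc), ("bagging", bc)],
                        final_estimator=LogisticRegression(max_iter=1000),
                        cv=5)
sc.fit(X_train, y_train)
y_pred = sc.predict(X_test)
```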
3.4 BERT Setup
The BERT process was divided into two stages: pre-training and fine-tuning. During pre-training, the BERT architecture was trained on several tasks using unlabeled data so that it could be reused later. For fine-tuning, BERT was then trained on the data used in this research, namely the tweets from the COVID-19 event. The fine-tuning parameters included a learning rate of 10^-5, a batch size of 32, and a maximum of 15 epochs.
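A sketch of this fine-tuning configuration with the Hugging Face Trainer API follows; the checkpoint name and the dataset objects (train_ds, test_ds) are assumptions, as only the hyperparameters above are reported.

```python
from transformers import (BertForSequenceClassification, Trainer,
                          TrainingArguments)

# Five sentiment classes, per the dataset description; the checkpoint and
# dataset objects are illustrative assumptions.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=5)

args = TrainingArguments(output_dir="bert-covid-sentiment",
                         learning_rate=1e-5,            # 10^-5, as stated above
                         per_device_train_batch_size=32,
                         num_train_epochs=15)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds,  # pre-tokenized COVID-19 tweets
                  eval_dataset=test_ds)
trainer.train()
```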
Within the framework of the BERT-based sentiment categorization method, the following parameters were used: (a) the total number of effective classes is five; (b) the learning rate is 10^-5. Table 2 and Figure 4 present the findings of the performance evaluation of sentiment categorization using BERT.
The best performance achieved was 0.74 (precision), 0.74 (recall), and 0.74 (F1-score) for classifying the five sentiment classes.
4 Conclusion
In this paper, the sentiment classification of the COVID-19 tweets dataset was investigated by comparing two sentiment classification schemes. The first scheme included ensemble ML models to classify tweets into five classes.
The Stacking Classifier showed the highest F1-score of 65% in this scheme, while the Voting Classifier and Bagging Classifier models also showed promising results, indicating that ensemble ML models can be used for sentiment analysis. The second scheme performed sentiment classification using BERT.
The classification results achieved by BERT were better than those of the first scheme, reaching 74% F1-score, 74% precision, and 74% recall for the classification of the five sentiment classes. Future studies may focus on trying different encoders, such as variants of BERT and Word2vec, for text embedding to find the most suitable encoding for the classifiers and obtain better outcomes.