1 Introduction
According to certain studies, mental illness can impair a person’s physical health as well as her/his intellect, feelings, and behavior (or all three) [50, 32].
450 million people are affected by mental health problems such as depression, schizophrenia, attention-deficit hyperactivity disorder (ADHD), autism spectrum disorder (ASD), etc. [50]. Early diagnosis of mental illness is a fundamental step in better understanding mental health problems and providing care.
Mental illness is usually diagnosed based on self-reporting by individuals in specific surveys designed to diagnose specific patterns of feelings or social interactions, in contrast to the diagnosis of other chronic illnesses, which are based on tests and measurements in research settings [19].
In these uncertain times, with COVID-19 tormenting the world, many people have reported clinical anxiety or depression. This could be due to lockdown, limited social activities, higher unemployment rates, economic depression, and work-related fatigue.
The American Foundation for Suicide Prevention reported that individuals experience anxiety (53%) and sadness (51%) more regularly now than before COVID-19 became widespread. Within the past decade, social media has changed social interaction.
In addition to sharing data and news, people share their daily activities, experiences, hopes, emotions, etc., generating reams of data online.
This textual data provides information that can be utilized to design systems to predict people’s mental health. Moreover, because social interaction is currently limited, people are compelled to express their thoughts on social media, which gives them an open stage to share their opinions with others and find help [35].
Studies that address mental illness primarily utilized deep learning [42, 35] and traditional machine learning [36, 8] models. Recently, Transformer models [47] have gained attention with improvements in Natural Language Processing (NLP) [11, 16, 49] and Computer Vision (CV) [51, 24].
In this work, we adopted the encoder part of the vanilla Transformer [47] to encode multiple texts (title and post) simultaneously. We hypothesize that encoding multiple texts with the same model can improve performance on the mental illness detection problem.
We also conducted extensive experiments on late fusion methods to merge the outputs of the proposed model efficiently. In addition, we applied traditional Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) approaches as baselines against which to compare the proposed model for automatically detecting mental disorders in social media texts.
We used the Reddit (reddit.com) user data proposed by Murarka and Radhakrishnan [35] to determine mental illness; Table 1 presents instances of the dataset. The rest of the paper is structured as follows: Section 2 describes the studies on mental illness in the literature. Section 3 explains the problem and gives dataset insights.
No. | Reddit Post | Label |
1 | all the ideas that normally disappear as soon as we reach for a writing device will be captured and started. imagine all the projects we will begin and never finish! | ADHD |
2 | i know this is long and i don’t know if a lot of people will read this but i really just want to help. i had 2 panic attacks over the end of february and first day of march. i went to the doctor and had my blood work | Anxiety |
3 | for example, did you ever notice that you had manic, hypomanic, depressive, etc. episodes? did you ever notice that sometimes you were “sad” and other times you were “excessively happy”? i’m in a sticky | Bipolar |
4 | i just feel so trapped and i *have* to do something about it. i don’t know where i’ll go or what i’ll do to get by. i just can’t stay here any longer. | Depression |
5 | this is probably going to incite a lot of disagreement, maybe even anger, but that’s okay; i’m going to say it anyway. anyone else tired of being told that just talking about your problems will solve your ptsd? | PTSD |
6 | synesthesia. what is synesthesia? according to google, “synesthesia is a condition in which one sense (for example, hearing) is simultaneously perceived as if by one or more additional senses such as sight. | None |
Section 4 gives details of the methodology applied to detect mental disorders with baseline models. Section 5 presents results and their analysis. Section 6 concludes the paper with possible future work.
2 Related Work
Recently, individuals have been using social media to communicate and seek advice on mental health issues. This has motivated researchers to take the information and apply various NLP and ML approaches to help individuals who may want assistance. Initially, many researchers focused on Twitter text [37, 7, 10]; later, the focus shifted to the Reddit platform [25, 17, 7, 52].
A wide range of approaches has been applied to mental health text analysis, from traditional ML to advanced DL. ML refers to creating computational algorithms or statistical models capable of extracting hidden patterns from data [39, 44].
Over time, an increasing number of ML models have been created to analyze healthcare data [36, 8]. Traditional ML approaches require a significant amount of feature engineering, which is an essential step in most application scenarios for achieving good predictive performance and run time [15].
Contextual content is created using words. Important insights into text classification can be gained from its structure and order [6, 2]. In the literature, several researchers have extracted word n-grams to classify user content in social media. Kim et al. [25] used word n-grams to detect mental illness from Reddit posts.
Another study [23] utilized word n-grams to generate and evaluate artificial mental health records for NLP. Coppersmith et al. [10] employed character-level language models to estimate how likely a user with mental health concerns is to produce a given sequence of characters.
Benton et al. [7] determined different types of mental health disorders by applying neural multi-task learning (MTL), regression, and multi-layer perceptron single-task learning (STL) models.
Abussa et al. [1] trained the Support Vector Machines to distinguish 200 text messages into two classes: “ADHD or not.” The most crucial step was eliminating the acronym ADHD from the messages before learning, and further information concerning attention disorders was removed from the texts.
The goal was to see how well the Support Vector Machine learns when keywords and even semantically relevant material are unavailable. Deep feed-forward neural network has outperformed typical ML models in a variety of data mining tasks [5, 3, 2, 4], and it has been used in the study of clinical and genetic data to predict mental health disorders.
To diagnose depression, Orabi et al. [37] used word embeddings in combination with a range of neural network models such as CNNs and RNNs. To conduct binary classification on mental health textual posts, Gkotsis et al. [17] used Feed Forward Neural Networks, CNNs, traditional ML such as Support Vector Machine, and Linear classifiers. Sekulic and Strube [41] detected depression, ADHD, anxiety, and other types of mental illnesses by training a binary classifier for each disease with Hierarchical Attention Networks.
The most recent work on this was the CNN-based classification model of Kim et al. [25], who trained a separate binary classifier for each type of mental disorder to conduct the detection. Hu and Sokolova [21] investigated the potential factors that influence a person’s mental health during the COVID-19 pandemic by applying ML classifiers such as Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Gradient Boosting (GB).
They have also presented an analysis of the feature selection technique called LIME (Local, Interpretable Model-agnostic Explanations) [40]. In a recent study, Shatte et al. [42] applied ML techniques to the mental health domain.
They reviewed the literature across four key application domains: detection and diagnosis; prognosis, treatment, and support; public health applications; and research and clinical administration. Another study examined the recently developed field of DL methods in psychiatry. It concentrated on DL and on integrating statistical ML with semantically interpretable computational models of brain dynamics or behaviour [14].
A variety of cutting-edge NN models were employed in DL-based methods. The CLPsych shared task series played a significant part in developing mental health detection. CNN, RNN, LSTM, and BiLSTM were found to be the most commonly applied models.
In today’s research world, TL is extremely important. Researchers attempt to acquire greater accuracy and performance in several research studies by using several types of transformers. Murarka et al. [35] examined three approaches for identifying and diagnosing mental illness on the Reddit dataset, including LSTM, BERT, and RoBERTa.
RoBERTa outperformed the other two methods. Dhanalaxmi et al. [12] employed RoBERTa to categorize COVID-19-related informative tweets, and their method yielded the best results. Mathur et al. [33] applied LSTM with an attention mechanism to estimate suicidal intent using temporal psycholinguistics.
Shickel et al. [43] utilized a deep transfer learning model to predict emotional valence in mental health text and achieved the highest performance with BERT. Moreover, they claimed that in automatic mental health systems, where labeled data is frequently scarce, recent transfer learning algorithms should become a crucial component.
Du et al. [13] looked at approaches for identifying suicide-related psychiatric stresses in Twitter data using deep learning-based approaches and a transfer learning approach that uses an existing clinical text annotation dataset. They demonstrated the advantages of deep learning-based techniques compared to conventional machine learning algorithms.
Additionally, it was discovered that the transfer learning technique might potentially reduce annotation work and further improve performance. To automatically detect public opinions, behavioral intentions, and attitudes concerning COVID-19 vaccinations from Tweets, Cagliero and Garza [29] used transfer learning with a pre-trained BERT model.
They showed that transfer learning models outperformed traditional machine learning models. To summarize, ML and DL techniques have been used in health care problems as efficient methods using text on social media platforms due to their ability to outperform naive learning models significantly [9].
Motivated by these models, we proposed a Transformer model with late fusion methods to combine the title and post of the dataset into the model to detect the mental disorders of individuals. To the best of our knowledge, none of the prior studies have applied the Transformer model with late fusion models for the mental illness problem.
3 Problem Description and Dataset
3.1 Mental Illness Problem
The mental illness problem is a multi-class classification problem where a given text is classified into one of the following six mental disorder classes:
– ADHD: A mental condition that impairs the ability to focus, maintain stillness, and control one’s actions (common in children).
– Anxiety: A feeling of uneasiness, fear, and dread.
– Bipolar: A mental health condition marked by extreme mood swings, including emotional highs and lows.
– Depression: A widespread and significant medical condition that has a negative impact on how someone feels, thinks, and acts.
– PTSD: A condition that some people experience after going through a stressful, terrifying, or deadly experience.
– None: No mental illness.
3.2 Dataset
Murarka et al. [35] developed a benchmark multi-class dataset from the Reddit social media platform for mental illness detection.
The dataset comprises a total of
Table 2 presents the number of posts for each mental illness class.
4 Model
4.1 Transformer Model
The Transformer model [47] has gained interest due to its state-of-the-art performance in NLP tasks such as machine translation [48, 30] and sequence tagging [46, 20]. It comprises an encoder-decoder architecture that processes sequential data in parallel without a recurrent network.
Instead of paying attention to the last state of the encoder, as is common with RNNs, the encoder architecture in Transformer extracts information from the whole sequence. This allows the decoder to assign greater weight to a certain input element for each output element.
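To make the idea of attending to the whole sequence concrete, the following is a minimal sketch of the scaled dot-product attention from the vanilla Transformer [47], written in plain Python. It is an illustration only: the learned query, key, and value projections, multiple heads, and positional encodings are omitted, and the toy token vectors are made-up numbers rather than anything used in our experiments.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Each argument is a list of token vectors (lists of floats).
    Returns the attended vectors and the attention weights.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted sum of all value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return outputs, weights


# Toy 3-token sequence with 2-dimensional embeddings (illustrative values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn = self_attention(tokens, tokens, tokens)
```

Every row of `attn` sums to one, so each token's new representation is a convex combination of all tokens in the sequence rather than a function of a single final state.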
In this study, we proposed Transformer models based on the vanilla Transformer proposed by Vaswani et al. [47] and used the encoder module of the Transformer to perform classification by mapping the data to the mental illness classes. The architecture of the Transformer model is shown in Figure 2.
Let
where
How do title and post contribute to the predictions? Over the years, various fusion techniques (e.g., early fusion or late fusion) have been developed for prediction in computer vision [22, 18] and NLP tasks [45, 34].
Since there are two parts for each instance (title and post), we also applied late fusion combining the outputs of each model at the classification layer. Moreover, we tried various combinations of methods in experimental settings (e.g., concatenation, average, maximum, minimum, weighted average).
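The fusion operations listed above can be sketched as follows. This is a hedged illustration in plain Python, not the implementation used in our experiments: the function name `fuse`, the parameter `title_weight`, and the example scores are hypothetical, and the sketch assumes that each branch (title and post) produces one score per class.

```python
def fuse(title_scores, post_scores, method="concatenation", title_weight=0.5):
    """Combine the per-class outputs of the title and post branches."""
    if method == "concatenation":
        # The concatenated vector would be passed to a further
        # classification layer, which is not modelled here.
        return list(title_scores) + list(post_scores)
    if len(title_scores) != len(post_scores):
        raise ValueError("element-wise fusion needs vectors of equal length")
    pairs = list(zip(title_scores, post_scores))
    if method == "average":
        return [(t + p) / 2 for t, p in pairs]
    if method == "weighted_average":
        # title_weight is a hypothetical mixing coefficient.
        return [title_weight * t + (1 - title_weight) * p for t, p in pairs]
    if method == "maximum":
        return [max(t, p) for t, p in pairs]
    if method == "minimum":
        return [min(t, p) for t, p in pairs]
    raise ValueError(f"unknown fusion method: {method}")


CLASSES = ["ADHD", "Anxiety", "Bipolar", "Depression", "PTSD", "None"]

# Made-up branch outputs for a single instance.
title_out = [0.10, 0.40, 0.05, 0.35, 0.05, 0.05]
post_out = [0.05, 0.20, 0.05, 0.60, 0.05, 0.05]

fused = fuse(title_out, post_out, method="average")
predicted = CLASSES[fused.index(max(fused))]
```

In this toy instance the title alone would favour Anxiety, while the averaged scores favour Depression, which illustrates how the two inputs can jointly change the prediction.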
4.2 Baseline Models
Since the mental illness dataset used in this study is relatively new, we applied ML, DL, and TL algorithms to get baseline scores. The models are summarised as follows:
– Machine Learning Classifiers: We applied four different ML classifiers, including Random Forest, Linear Support Vector Machine, Multinomial Naive Bayes, and Logistic Regression, using the scikit-learn library.
– Deep Learning Methods: We applied base DL models: LSTM, BiLSTM, and CNN. The pre-trained embeddings were used as the input layer, and the softmax layer as the output layer of the models.
– Transfer Learning Methods: Transformer-based pre-trained language models (PLMs) such as BERT [11], RoBERTa [31], and AlBERT [27] have shown state-of-the-art performance in many downstream NLP tasks. The PLMs used in NLP problems, called transfer learning models, yielded top results in various NLP tasks without critical task-specific design changes [28, 11]. We employed the BERT, AlBERT, and RoBERTa models in this study.
5 Results and Analysis
5.1 Experimental Setting
We implemented the proposed DL and TL models using the PyTorch library [38]. The Adam optimizer [26] was used with an epsilon value of
We utilized pre-trained language models (BERT [11], RoBERTa [31], etc.) to convert words into embeddings. To tokenize the words, we set the maximum length
For TL models, we added an output layer with a softmax function for training and set the learning rate to
In ML models, the number of features in each experiment was set to 1,000, i.e., we used the n-grams with the highest TF-IDF values. For the combination of word n-grams, the length of
Since the dataset was already pre-processed by eliminating URLs or usernames containing sensitive material, we did not apply any pre-processing techniques before classification. We fine-tuned the models using the development set of the dataset.
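The n-gram feature selection described above for the ML models can be sketched in plain Python as follows. Several details are assumptions made only for illustration: the n-gram length range (1 to 2 here) is a placeholder, ranking n-grams by their summed TF-IDF weight over the corpus is one possible reading of "highest TF-IDF values", and the helper names are hypothetical. A library vectorizer may weight and select features differently.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max (range is an assumption)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]


def top_tfidf_ngrams(docs, k=1000, n_min=1, n_max=2):
    """Rank n-grams by summed TF-IDF over the corpus and keep the top k."""
    counts = [Counter(word_ngrams(d, n_min, n_max)) for d in docs]
    # Number of documents in which each n-gram occurs.
    doc_freq = Counter(gram for c in counts for gram in c)
    totals = Counter()
    for c in counts:
        length = sum(c.values())
        for gram, freq in c.items():
            # Smoothed IDF keeps n-grams found in every document positive.
            idf = math.log((1 + len(docs)) / (1 + doc_freq[gram])) + 1
            totals[gram] += (freq / length) * idf
    return [gram for gram, _ in totals.most_common(k)]


# Tiny illustrative corpus; k=1000 would be used on the real data.
docs = ["i had a panic attack",
        "i feel so trapped",
        "panic attack again today"]
vocabulary = top_tfidf_ngrams(docs, k=5)
```

The selected `vocabulary` would then define the feature columns given to the classifiers.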
Table 3 shows the parameter settings of DL and the proposed transformer models. We evaluated the models using the following three metrics: micro precision, micro recall, and micro F1-Score.
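As a worked check of how such scores are derived, the sketch below recomputes precision, recall, and F1 from the confusion matrix in Table 5, assuming that rows are true classes and columns are predictions. Averaging the per-class scores gives roughly 89.86, 89.58, and 89.65, which coincides with the concatenation row of Table 4. Pooling all decisions (micro-averaging) gives 89.58 for all three metrics, because with exactly one label per post micro precision, recall, and F1 all equal accuracy. This is an observation from the tables, not a description of the evaluation code.

```python
# Confusion matrix from Table 5: rows are true classes, columns are predictions.
CLASSES = ["ADHD", "Anxiety", "Bipolar", "Depression", "PTSD", "None"]
CONFUSION = [
    [224, 10, 5, 6, 3, 0],
    [1, 222, 1, 19, 5, 0],
    [9, 8, 211, 16, 3, 1],
    [3, 8, 12, 219, 6, 0],
    [0, 19, 9, 7, 213, 0],
    [0, 1, 1, 1, 1, 244],
]


def per_class_scores(matrix):
    """Return a (precision, recall, f1) tuple for every class."""
    scores = []
    for i in range(len(matrix)):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_average(scores):
    """Unweighted mean of the per-class precision, recall, and F1."""
    n = len(scores)
    return tuple(sum(s[j] for s in scores) / n for j in range(3))


def micro_average(matrix):
    """With one label per post, micro P = R = F1 = accuracy."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    accuracy = correct / total
    return accuracy, accuracy, accuracy


macro_p, macro_r, macro_f1 = macro_average(per_class_scores(CONFUSION))
micro_p, micro_r, micro_f1 = micro_average(CONFUSION)
# macro: about 0.8986, 0.8958, 0.8965; micro: about 0.8958 for all three
```

Because every class has the same number of test posts (248), the macro and support-weighted averages are identical here.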
5.2 Main Results
Table 6 presents the proposed models’ results and comparison with baseline models. In this Table, “ML Algorithms” indicates traditional ML algorithms. The “LinearSVC” indicates Linear Support Vector Classifier, “LR” indicates Logistic Regression, “NB” indicates Naive Bayes, and “RF” indicates Random Forest classifier.
Method | Precision | Recall | F1 score |
Concatenation | 89.86 | 89.58 | 89.65 |
Average | 88.06 | 87.70 | 87.78 |
Weighted Average | 86.68 | 85.69 | 85.85 |
Maximum | 87.81 | 87.50 | 87.60 |
Minimum | 88.04 | 87.84 | 87.86 |
True \ Predicted | ADHD | Anxiety | Bipolar | Depression | PTSD | None |
ADHD | 224 | 10 | 5 | 6 | 3 | 0 |
Anxiety | 1 | 222 | 1 | 19 | 5 | 0 |
Bipolar | 9 | 8 | 211 | 16 | 3 | 1 |
Depression | 3 | 8 | 12 | 219 | 6 | 0 |
PTSD | 0 | 19 | 9 | 7 | 213 | 0 |
None | 0 | 1 | 1 | 1 | 1 | 244 |
Category | Model | F1 | Precision | Recall |
Proposed | Transformer | 89.65 | 89.86 | 89.58 |
Classical Machine Learning | LinearSVC | 77.18 | 77.66 | 77.15 |
Classical Machine Learning | LR | 77.87 | 78.24 | 77.89 |
Classical Machine Learning | NB | 66.49 | 72.18 | 66.73 |
Classical Machine Learning | RF | 70.85 | 72.46 | 70.50 |
Deep Learning | CNN | 81.64 | 82.84 | 82.65 |
Deep Learning | LSTM | 83.73 | 84.10 | 83.60 |
Deep Learning | BiLSTM | 83.84 | 84.06 | 83.74 |
Transfer Learning | BERT | 80.82 | 80.87 | 80.85 |
Transfer Learning | AlBERT | 80.45 | 80.90 | 80.38 |
Transfer Learning | RoBERTa | 84.41 | 85.10 | 84.41 |
State-of-the-Art | RoBERTa | 89 | 89 | 89 |
“DL Algorithms” indicates the DL algorithms used in this study, namely CNN, LSTM, and BiLSTM. “TL Algorithms” refers to the pre-trained TL algorithms applied to the Reddit corpus, i.e., BERT, XLNet, AlBERT, and RoBERTa. Using traditional ML algorithms, overall, best results (F1 =
In DL models, the overall best results are achieved with BiLSTM (F1 =
Since RoBERTa pre-trained model in TL methods yielded the best results, we used RoBERTa pre-trained embeddings as the input layer of the DL models (CNN, LSTM, and BiLSTM) and the proposed Transformer model.
The state-of-the-art RoBERTa [35] model was trained on title + post text, which is different from our RoBERTa model as we trained it on posts only. Among the baseline models (ML, DL, and TL), RoBERTa outperformed the traditional ML and DL models with an F1 score of
Overall, we obtained the highest score with the proposed Transformer model with the
Table 5 shows the confusion matrix of the proposed Transformer model with the concatenation late fusion method. The model is good at predicting non-illness samples. However, it confuses the classes anxiety, bipolar, depression, and PTSD with one another.
The terms
This exhibits the actual potential of our approach since it does not depend solely on the mention of class names in the post but also has a deep awareness of the post’s context.
5.3 Data
To understand the impact of the dataset comprising titles and posts, we performed experiments with the proposed Transformer and the baseline models using title and post separately. Table 7 presents the F1, Precision, and Recall scores of the Transformer, traditional ML, DL, and TL models.
Category | Model | Title F1 | Title Precision | Title Recall | Post F1 | Post Precision | Post Recall |
Proposed | Transformer | 70.09 | 70.46 | 70.09 | 83.63 | 83.90 | 83.53 |
Classical Machine Learning | LinearSVC | 65.48 | 65.78 | 65.52 | 77.18 | 77.66 | 77.15 |
Classical Machine Learning | LR | 65.37 | 66 | 65.52 | 77.87 | 78.24 | 77.89 |
Classical Machine Learning | NB | 62.46 | 68.56 | 62.37 | 66.49 | 72.18 | 66.73 |
Classical Machine Learning | RF | 61.63 | 62.02 | 61.63 | 70.85 | 72.46 | 70.50 |
Deep Learning | CNN | 69.94 | 70.85 | 69.76 | 81.64 | 82.84 | 82.65 |
Deep Learning | LSTM | 71.46 | 72.07 | 71.44 | 83.73 | 84.10 | 83.60 |
Deep Learning | BiLSTM | 70.12 | 70.65 | 69.89 | 83.84 | 84.06 | 83.74 |
Transfer Learning | BERT | 70.06 | 70.29 | 69.96 | 80.82 | 80.87 | 80.85 |
Transfer Learning | AlBERT | 67.37 | 67.58 | 67.34 | 80.45 | 80.90 | 80.38 |
Transfer Learning | RoBERTa | 70.68 | 71.27 | 70.56 | 84.41 | 85.10 | 84.41 |
We obtained the best results with the posts (F1 =
To understand the impact of the methods, Table 8 presents the class-wise results of the Transformer using the title and post separately and concatenating them.
Class | Title Precision | Title Recall | Title F1 | Post Precision | Post Recall | Post F1 | Title + Post Precision | Title + Post Recall | Title + Post F1 |
ADHD | 68 | 77 | 72 | 85 | 87 | 86 | 95 | 90 | 92 |
Anxiety | 60 | 69 | 64 | 72 | 83 | 77 | 83 | 90 | 86 |
Bipolar | 64 | 55 | 59 | 82 | 78 | 80 | 88 | 85 | 87 |
Depression | 69 | 65 | 67 | 80 | 77 | 78 | 82 | 88 | 85 |
PTSD | 67 | 65 | 66 | 86 | 83 | 84 | 92 | 86 | 89 |
None | 95 | 90 | 92 | 98 | 94 | 96 | 100 | 98 | 99 |
We performed experiments to gain insights into the class-wise performance of the proposed Transformer model on title and post text separately and on their combination. The Transformer model obtained a
The best performing class among the mental disorders was
5.4 Late Fusion
To extend the impact of the data on the problem, we applied late fusion (Figure 1). The results of the methods used in late fusion are shown in Table 4. We used RoBERTa [31] pre-trained embeddings in the models with the same parameters for each model (See Table 3 for hyperparameter settings of the models).
The results showed that all methods improved the results compared to the Transformer model using only one input (title or post). We achieved the highest score (F1 =
Method | Metric | ADHD | Anxiety | Bipolar | Depression | PTSD | None |
Concatenation | Precision | 95 | 83 | 88 | 82 | 92 | 100 |
Concatenation | Recall | 90 | 90 | 85 | 88 | 86 | 98 |
Concatenation | F1 | 92 | 86 | 87 | 85 | 89 | 99 |
Average | Precision | 94 | 82 | 86 | 78 | 92 | 98 |
Average | Recall | 90 | 87 | 83 | 86 | 82 | 97 |
Average | F1 | 92 | 85 | 84 | 82 | 87 | 97 |
Weighted Average | Precision | 94 | 69 | 88 | 83 | 88 | 98 |
Weighted Average | Recall | 84 | 91 | 82 | 77 | 84 | 97 |
Weighted Average | F1 | 89 | 79 | 85 | 80 | 86 | 97 |
Maximum | Precision | 94 | 81 | 81 | 80 | 90 | 100 |
Maximum | Recall | 89 | 86 | 86 | 83 | 84 | 97 |
Maximum | F1 | 92 | 84 | 84 | 82 | 87 | 98 |
Minimum | Precision | 92 | 81 | 87 | 81 | 89 | 98 |
Minimum | Recall | 92 | 86 | 81 | 88 | 82 | 98 |
Minimum | F1 | 92 | 84 | 84 | 84 | 86 | 98 |
It can be observed that concatenation achieved the highest or tied-highest F1 score on every class among the late fusion methods. At the same time, the differences between the late fusion methods are small. This suggests that late fusion can be applied to datasets containing two or more texts per instance to increase performance.
6 Conclusion
The present Covid-19 outbreak and globally forced isolation were our primary motivations for multi-class mental illness detection efforts. We believe that social media platforms have become the most widely used communication medium for individuals, allowing them to express themselves without fear of judgment.
We applied the Transformer model with fusion methods and state-of-the-art traditional ML, DL, and TL-based methods to the multi-class mental illness detection problem. The best results (see Table 4) were obtained with the Transformer model with concatenation late fusion method (F1 score =
In the future, we plan to develop a multi-label mental illness dataset, which would be more reflective of the real situation than a multi-class dataset, as a post can indicate more than one mental disorder (e.g., depression and anxiety) instead of exactly one.
We can also use data augmentation techniques on top of existing mental health data [35]. Moreover, we plan to apply other TL-based models, such as DistilBERT. Ensemble modeling will also be considered to improve classification performance.