1 Introduction
Lemmatization is an important data preparation step in many Natural Language Processing (NLP) tasks, such as Information Extraction (IE) and Information Retrieval (IR). The aim of lemmatization is to determine the base form (lemma) of a word [11].
A number of approaches have been developed for lemmatization, ranging from rule-based techniques [19] and simple statistical methods [39] to modern deep-learning methods: see, for example, the Stanford CoreNLP [27] and neural lemmatizers for Bengali [8] and for German, Czech and Arabic [24].
In our work, we treat lemmatization as a tree classification problem [14], i.e., we apply supervised machine learning with decision trees constructed according to the grammatical features of the language.
Researchers have faced difficulties when lemmatizing words with different approaches. The main difficulty of rule-based lemmatization is that existing rules are hard to adapt to new classification tasks [32]. When social media texts are processed, collecting a predefined dictionary can be impractical because language variation is high [22].
For low-resource languages, it is hard to collect corpora and compile dictionaries [23]. Part-of-speech (POS) tagging, one of the preliminary steps of lemmatization, is also difficult because some languages have up to 30 different word forms for the same normalized word [32].
Our method is a direct supervised approach to word lemma classification. We examine whether syntactic models can be computed using only datasets in the form of wordform–lemma dictionaries. We present an open-source multilingual lemmatizer based on the Random Forest Classifier that has been shown to support twenty-five languages. This lemmatizer continues our previous work [1], where we used the Decision Tree Regression method. That model produced character shift errors that led to poor accuracy; the lemmatizer presented in this paper avoids them because it uses a classification algorithm instead of regression.
We compare our lemmatizer with UDPipe, an open-source tool for lemmatization. Our evaluation shows that our classification tree-based lemmatizer achieves much better results than UDPipe when it is provided with a sufficient amount of training data.
This paper is organized as follows. We begin with a brief review of related work on lemmatization in Section 2. In Section 3, we describe the datasets, explain the method of generating vectors from the words in the dataset based on the character co-occurrence matrix and the TF-IDF vectorizer [31], present our approach based on Decision Tree and Random Forest Classifiers, and give the steps of our lemmatization algorithm. In Section 4, we present the obtained results. Section 5 concludes the paper and outlines future extensions and possible research directions.
2 Related Work
To identify papers related to the present research, we searched Google Scholar and Semantic Scholar. Our query terms included language-independent word lemmatization, neural architectures for lemmatization, and machine learning for lemmatization, among others. We sorted the resulting papers from each query by citation count and took at least the top three. We considered a paper only if it introduced an original method or algorithm.
2.1 Rule-based Approaches
Conventional algorithms for text lemmatization are based on rules. It is worth mentioning that rules can be expressed by means of fuzzy [13] or predicate [37] logic. Logical rules applied to finite-state transducers define, with the help of a lexicon, morphotactic and orthographic alternations.
As a result, a system based on such rules can solve several tasks, such as stemming, lemmatization, and full morphological analysis [2, 10]. The advantages of such an approach include transparency of the algorithm’s outcome and the possibility of fine-tuning.
However, there are also disadvantages, such as the problem of out-of-vocabulary (OOV) words, which requires intensive manual maintenance of a vocabulary of many thousands of words.
In addition, there are approaches that automatically generate rules from a dataset of pairs consisting of a word and its normal form. For instance, [25] used a decision tree to predict particular letters of the transformed word based on the letters of the past-tense form.
Another approach relies on relational learning with decision lists applied to English verbs in the past tense [30].
2.2 Statistical Approaches
Various approaches to NLP have been influenced by statistical methods such as the Hidden Markov Model (HMM) and Conditional Random Fields (CRF), among others.
Researchers adopted HMMs for POS tagging and for approximating language models in speech recognition systems. These methods have difficulty estimating transition probabilities from a small amount of data.
Besides, to achieve good accuracy, such methods need large manually annotated corpora to approximate the probabilities [16, 12, 4].
2.3 Neural Approaches
Nowadays, neural approaches prevail over a great variety of other algorithms in the task of text lemmatization. The advantages of artificial neural networks include simplicity of development, the possibility of multi-task learning, and applicability to multi-criteria optimization.
Conventional language models can be easily presented in terms of a universal neural estimator.
The most popular idea in this field is the sequence-to-sequence model (S2S), which can be used for contextual lemmatization. The main idea behind the S2S model is the attention mechanism, which leads to good accuracy performance and to reducing the number of parameters to be computed [28].
3 Methods and Data
3.1 Datasets
For this research, we used Lemmatization lists [29] for 23 languages, publicly available under the Open Database License (ODbL); see Table 1.
Language | Code | Language group | Word pairs | Source |
Asturian | ast | Romance | 108,792 | Lemmatization lists |
Bulgarian | bg | Slavic/Baltic | 30,323 | Lemmatization lists |
Catalan | ca | Romance | 591,534 | Lemmatization lists |
Czech | cs | Slavic/Baltic | 36,400 | Lemmatization lists |
English | en | Germanic | 41,649 | Lemmatization lists |
Estonian | et | Ural/Altaic | 80,536 | Lemmatization lists |
Farsi | fa | Iranian | 6,273 | Lemmatization lists |
French | fr | Romance | 223,999 | Lemmatization lists |
Galician | gl | Romance | 392,856 | Lemmatization lists |
German | de | Germanic | 358,473 | Lemmatization lists |
Hungarian | hu | Ural/Altaic | 39,898 | Lemmatization lists |
Irish | ga | Gaelic | 415,502 | Lemmatization lists |
Italian | it | Romance | 341,074 | Lemmatization lists |
Manx Gaelic | gv | Gaelic | 67,177 | Lemmatization lists |
Portuguese | pt | Romance | 850,264 | Lemmatization lists |
Romanian | ro | Romance | 314,810 | Lemmatization lists |
Russian | ru | Slavic/Baltic | 2,657,468 | Zaliznjak dictionary |
Scottish Gaelic | gd | Gaelic | 51,624 | Lemmatization lists |
Slovak | sk | Slavic/Baltic | 858,414 | Lemmatization lists |
Slovenian | sl | Slavic/Baltic | 99,063 | Lemmatization lists |
Spanish | es | Romance | 496,591 | Lemmatization lists |
Swedish | sv | Germanic | 675,137 | Lemmatization lists |
Turkish | tr | Ural/Altaic | 1,337,898 | Zargan dictionary |
Ukrainian | uk | Slavic/Baltic | 193,704 | Lemmatization lists |
Welsh | cy | Gaelic | 359,224 | Lemmatization lists |
Additionally, for the Russian language we used Zaliznjak’s dictionary [41], and for Turkish we used the Zargan dictionary [18].
The language group representation of our data is unbalanced: most languages are Romance or Slavic/Baltic, followed by the Gaelic and Germanic languages. The distribution of the collected data by the number of words per language group is presented in Fig. 1 and Table 2.
Language groups | Total number of words |
Slavic/Baltic | 3,875,372 |
Romance | 3,319,920 |
Ural/Altaic | 1,458,332 |
Germanic | 1,075,259 |
Gaelic | 893,527 |
Iranian | 6,273 |
Total | 10,628,683 |
We can observe that the Ural/Altaic group, represented by only three languages, is larger than the Germanic and Gaelic groups by the number of wordform–lemma pairs.
This is because of the very large Turkish dataset. The same effect can be observed for the Slavic/Baltic group, mainly because of the Russian data.
3.2 Method
3.2.1 Character Co-occurrence Embeddings
For converting words in wordform–lemma pairs to vectors, we used the following method for building the character co-occurrence matrix.
Calculating TF-IDF. All the words were converted to vectors at the character level by the TF-IDF vectorizer in the scikit-learn implementation of the method based on the works [26] and [21]:
1. Calculate the term frequency (character frequency cf, in our case) as
$$cf(c, w) = \frac{\mathrm{count}(c, w)}{\mathrm{len}(w)},$$
where c stands for a character, w for a word, count(c, w) for the number of occurrences of c in w, and len(w) for the length of the word in characters.
2. Calculate the inverse document frequency (inverse word frequency iwf, in our case) as
$$iwf(c, W) = \log \frac{N}{|\{ w \in W \mid c \in w \}|},$$
where N is the total number of words in the corpus W, and |{ w ∈ W | c ∈ w }| is the number of words where the character c appears (cf(c, w) ≠ 0).
3. Calculate the term frequency-inverse document frequency (character frequency-inverse word frequency, in our case) as
$$mx_{w,c} = cf(c, w) \cdot iwf(c, W),$$
where the size of the matrix mx is defined by the number nw of words in the corpus and the number nc of unique characters found in all of the words in the corpus.
Calculating the Co-occurrence Matrix. We multiplied the transpose of the matrix mx by the matrix itself to find the co-occurrence matrix as
$$C = mx^{T} \cdot mx,$$
which yields a matrix of size (nc, nc), every row or column of which serves as the embedding for the corresponding character.
The character co-occurrence embeddings can store the character semantic distribution information [18] in the word context for a given language, reflecting the phonetic patterns and their similarity [34].
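As a minimal sketch of this construction, the snippet below builds character embeddings with scikit-learn's `TfidfVectorizer` in character mode. The word list is a toy stand-in for a real dictionary, and the vectorizer's default settings (smoothed inverse frequency, L2 row normalization) are an assumption here; they may differ in detail from the configuration actually used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy word list standing in for the wordforms of one language (illustrative only).
words = ["walk", "walked", "walking", "talk", "talked"]

# Character-level TF-IDF: each word plays the role of a document,
# each single character the role of a term.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 1))
mx = vectorizer.fit_transform(words)       # sparse matrix of shape (n_w, n_c)

# Co-occurrence matrix: transpose of mx times mx, shape (n_c, n_c).
cooc = (mx.T @ mx).toarray()

# vocabulary_ maps each character to its column in mx, hence its row in cooc.
char_embedding = {ch: cooc[i] for ch, i in vectorizer.vocabulary_.items()}
```

Each value in `char_embedding` is a vector of length nc, so a word of k characters can be represented by stacking k such vectors.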
3.2.2 Decision Tree Classifier
For lemmatization, we used the Decision Tree Classification as the base, extending it to an ensemble method called Random Forest Classifier as explained below.
Our selection of this classifier was based on the fact that only the K-Nearest Neighbors Classifier [3], the Radius Neighbors Classifier and tree algorithms support multiclass-multioutput [31] or multitask classification. However, the first two algorithms require reducing the number of features to fewer than ten and have a complexity of $O(N^2)$ or $O(N \log N)$, whereas tree algorithms do not require dimensionality reduction and have a complexity of $O(n_{samples} \, n_{features} \log(n_{samples}))$. The reason for selecting the Random Forest technique among the tree algorithms is explained in Section 4.1.
The Decision Tree method has been known since ancient times [5]. It was first formalized by Hovland and Hunt in the late 1950s and further elaborated in [36]. The classifier builds a tree starting from the root question: the feature that separates the elements into two groups according to a criterion (Gini coefficient, entropy, or variance) [15, 35, 33], such that each group contains similar elements; in our case, the entropy (Information Gain) criterion was used. The process continues iteratively for each group until a stopping criterion is met, which can be:
— the specified depth of the tree is achieved,
— all the items on a leaf are of the same class or one item is left on a leaf,
— no more than N elements are left on a leaf, or
— further branching does not enhance the homogeneity of items on a leaf beyond a certain value.
In our case, we perform multiclass classification, and the data is represented in the form
$$(x, Y) = (x_1, x_2, \ldots, x_k, Y_1, Y_2, \ldots, Y_m),$$
where $x_1, \ldots, x_k$ are the independent variables associated with the features and $Y_1, \ldots, Y_m$ are the dependent variables or targets. The information gain (IG) criterion is based on the concept of entropy, heavily used by physicists in thermodynamics [9] and introduced for information by Shannon [35]. It is defined as follows:
$$H(T) = -\sum_{i=1}^{j} p_i \log_2 p_i,$$
where $p_1, p_2, \ldots, p_j$ are fractions that sum up to 1 and show the share of each class in the child node that results from a split in the tree [40]. So, the formal criterion can be calculated as
$$IG(T, a) = H(T) - H(T \mid a),$$
where H(T) is the entropy of the parent node and H(T | a) is the entropy of a child.
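The criterion can be illustrated with a short sketch. The helper names `entropy` and `information_gain` are hypothetical, and the child entropy is taken here as the size-weighted average over the children of a split, which is the standard definition and an assumption about the exact form used in this work.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# A perfectly separating split: two balanced classes go to two pure children.
parent = np.array(["a", "a", "b", "b"])
left, right = parent[:2], parent[2:]
gain = information_gain(parent, [left, right])
```

In this toy case the parent has an entropy of one bit and both children are pure, so the split recovers the full bit of information.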
3.2.3 Random Forest Classifier Method
To counter some of the disadvantages of the Decision Tree Classifier, which include easy overfitting and non-robustness [20], we exploited the Random Forest ensemble technique [6]. Here, the method uses a random subset of the training set drawn with replacement; the most discriminative thresholds are drawn at random for each subset, and the best of these randomly generated thresholds is picked as the splitting rule (thus we employed a heuristic methodology similar to Variable Neighborhood Search [17]). Despite the relatively low classification power of each individual tree in the forest, the cumulative classification power is increased through averaging (by canceling out the errors) and voting processes [31]. This usually reduces the model variance at the expense of a modest increase in the bias.
The scikit-learn [31] implementation of the Random Forest allows for bootstrapping (using a random subset of the dataset to train each estimator instance, which leads to a leaner and more robust model) and for parallelized computation to speed up the training process. The module also averages the estimators’ probabilistic predictions [31], contrary to the original paper’s method, in which each classifier votes for a single class [6].
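The difference between averaging probabilistic predictions and single-class voting can be seen on a toy example. The probabilities below are invented for illustration and do not come from any trained model.

```python
import numpy as np

# Class probabilities predicted by three hypothetical trees for one sample
# (columns: class 0, class 1).
proba = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.4, 0.6]])

# Averaging: take the mean probability per class, then pick the largest.
soft_choice = int(np.argmax(proba.mean(axis=0)))

# Voting: each tree votes for its own most likely class, majority wins.
votes = np.argmax(proba, axis=1)
hard_choice = int(np.bincount(votes, minlength=proba.shape[1]).argmax())
```

Here the averaged prediction selects class 0, because one tree is very confident, while majority voting selects class 1.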
3.2.4 Lemmatization Algorithm
The steps we used for each language can be described as follows:
— given the dictionary of wordform–lemma pairs, assign them as independent (X) and dependent (Y) variables for applying the machine learning approach;
— prepare the character co-occurrence matrix where each row or column will serve as an embedding vector for the corresponding symbol;
— encode the words in X by the character embeddings, producing vectors with the length of the longest word in the corpus and flattening them;
— encode the words in Y by the character ordinal number, to carry out the multiclass classification task;
— split the dataset 90%/10% for training and testing;
— train the Random Forest Classifier model with bootstrapping, 10 estimators, and entropy as the criterion;
— test the model.
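The steps above can be sketched end to end on a toy dictionary. This is an illustration under stated assumptions, not the released implementation: the word pairs, the zero padding of short words, the helper names `encode_x`, `encode_y` and `decode_y`, and the fixed random seeds are all hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy wordform-lemma pairs (illustrative; real dictionaries are far larger).
pairs = [("walked", "walk"), ("walking", "walk"), ("walks", "walk"),
         ("talked", "talk"), ("talking", "talk"), ("talks", "talk"),
         ("jumped", "jump"), ("jumping", "jump"), ("jumps", "jump"),
         ("played", "play"), ("playing", "play"), ("plays", "play")]
forms = [f for f, _ in pairs]
lemmas = [l for _, l in pairs]

# Character embeddings: rows of the co-occurrence matrix of char-level TF-IDF.
tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 1), lowercase=False)
mx = tfidf.fit_transform(forms)
cooc = (mx.T @ mx).toarray()
index = tfidf.vocabulary_                  # character -> row of cooc
n_c = cooc.shape[0]
max_len = max(len(w) for w in forms + lemmas)

def encode_x(word):
    """Stack character embeddings, zero-padded to max_len, then flatten."""
    out = np.zeros((max_len, n_c))
    for i, ch in enumerate(word[:max_len]):
        if ch in index:                    # unseen characters stay all-zero
            out[i] = cooc[index[ch]]
    return out.ravel()

def encode_y(word):
    """Character ordinals, padded with 0 up to max_len positions."""
    return [ord(ch) for ch in word] + [0] * (max_len - len(word))

def decode_y(codes):
    """Turn predicted ordinals back into a string, dropping the padding."""
    return "".join(chr(int(c)) for c in codes if c > 0)

X = np.array([encode_x(w) for w in forms])
Y = np.array([encode_y(w) for w in lemmas])

# 90/10 split, then a forest with bootstrapping, 10 estimators and entropy.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=0)
model = RandomForestClassifier(n_estimators=10, criterion="entropy",
                               bootstrap=True, random_state=0)
model.fit(X_train, Y_train)

predicted = [decode_y(row) for row in model.predict(X_test)]
expected = [decode_y(row) for row in Y_test]
```

Each position of the lemma is a separate output of the classifier, which is why a multiclass-multioutput model is needed; comparing `predicted` with `expected` gives the test accuracy.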
We compare our lemmatization algorithm with the UDPipe system as a baseline. The baseline system, UDPipe Future [38], is an updated version of UDPipe. Both UDPipe versions have a lemmatizer based on the edit-tree classification method.
We use UDPipe Future because it was one of the top-performing entries in the lemmatization evaluation. In the CoNLL 2018 UD Shared Task, it ranked 1st, 3rd and 3rd in the three official metrics MLAS, LAS and BLEX, respectively.
4 Results
4.1 Model Selection
For selecting the best model, we compared four tree classifier algorithms: Extra Trees [14], Extra Tree, Decision Tree [7] and Random Forest [6].
As already mentioned, we compare only tree classification algorithms, because only these algorithms and the K-Nearest Neighbors and Radius Neighbors algorithms are compatible with multiclass-multioutput tasks in the Python sklearn module implementation [31]. However, the nearest-neighbor algorithms require feature dimensionality reduction and a significant amount of time to test on large datasets, so we omitted them.
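A comparison of this kind can be sketched as follows. The data are synthetic, the hyperparameters are illustrative, and the exact-match accuracy (a sample counts as correct only if every output is correct) is an assumption about the metric, computed by hand because scikit-learn's built-in accuracy scoring does not accept multiclass-multioutput targets.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

# Synthetic multiclass-multioutput task: two target columns per sample.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
Y = np.column_stack([(X[:, 0] * 3).astype(int), (X[:, 1] * 4).astype(int)])
X_train, X_test = X[:180], X[180:]
Y_train, Y_test = Y[:180], Y[180:]

candidates = {
    "ExtraTrees": ExtraTreesClassifier(n_estimators=10, criterion="entropy",
                                       random_state=0),
    "ExtraTree": ExtraTreeClassifier(criterion="entropy", random_state=0),
    "DecisionTree": DecisionTreeClassifier(criterion="entropy",
                                           random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=10,
                                           criterion="entropy",
                                           random_state=0),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, Y_train)              # all four accept a 2-D target
    pred = clf.predict(X_test)
    # Exact-match accuracy over all outputs of a sample.
    scores[name] = float(np.mean(np.all(pred == Y_test, axis=1)))
```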
As Table 3 shows, the Random Forest algorithm holds the majority of the leading test set accuracy results, followed closely by the Extra Trees algorithm. On average, both algorithms achieve the same accuracy score of 0.72.
Language | ExtraTrees | ExtraTree | DecisionTree | RandomForest |
Manx Gaelic | 0.39 | 0.33 | 0.39 | 0.39 |
Farsi | 0.40 | 0.28 | 0.31 | 0.38 |
Scottish Gaelic | 0.47 | 0.38 | 0.44 | 0.45 |
Estonian | 0.45 | 0.36 | 0.41 | 0.47 |
Czech | 0.46 | 0.40 | 0.44 | 0.48 |
Bulgarian | 0.48 | 0.42 | 0.45 | 0.50 |
English | 0.50 | 0.31 | 0.40 | 0.48 |
Hungarian | 0.50 | 0.44 | 0.46 | 0.51 |
Asturian | 0.71 | 0.63 | 0.66 | 0.71 |
Irish | 0.73 | 0.66 | 0.75 | 0.73 |
Slovenian | 0.74 | 0.67 | 0.70 | 0.74 |
German | 0.76 | 0.68 | 0.69 | 0.74 |
Romanian | 0.78 | 0.68 | 0.77 | 0.79 |
Russian | 0.79 | 0.75 | 0.77 | 0.79 |
French | 0.81 | 0.74 | 0.77 | 0.81 |
Portuguese | 0.86 | 0.81 | 0.84 | 0.87 |
Spanish | 0.87 | 0.80 | 0.84 | 0.87 |
Welsh | 0.87 | 0.85 | 0.87 | 0.88 |
Galician | 0.89 | 0.83 | 0.86 | 0.89 |
Catalan | 0.89 | 0.86 | 0.87 | 0.89 |
Swedish | 0.89 | 0.81 | 0.84 | 0.88 |
Ukrainian | 0.90 | 0.85 | 0.88 | 0.91 |
Italian | 0.91 | 0.86 | 0.88 | 0.91 |
Slovak | 0.92 | 0.85 | 0.89 | 0.92 |
Turkish | 0.95 | 0.90 | 0.96 | 0.96 |
Average | 0.72 | 0.65 | 0.69 | 0.72 |
To make an informed choice between the two algorithms, we calculated the weighted average test set accuracy score, weighting by the number of words available for each language.
Table 4 shows that the leader is Random Forest with its weighted average test sample accuracy score of 0.8405, leaving the Extra Trees algorithm behind with its 0.8396 score.
Language | Num. words | Weight | ExtraTrees | ExtraTree | DecisionTree | RandomForest |
Manx Gaelic | 67,177 | 0.6% | 0.0025 | 0.0021 | 0.0025 | 0.0025 |
Farsi | 6,273 | 0.1% | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
Scottish Gaelic | 51,624 | 0.5% | 0.0023 | 0.0019 | 0.0021 | 0.0022 |
Estonian | 80,536 | 0.8% | 0.0034 | 0.0027 | 0.0031 | 0.0035 |
Czech | 36,400 | 0.3% | 0.0016 | 0.0014 | 0.0015 | 0.0016 |
Bulgarian | 30,323 | 0.3% | 0.0014 | 0.0012 | 0.0013 | 0.0014 |
English | 41,649 | 0.4% | 0.0020 | 0.0012 | 0.0016 | 0.0019 |
Hungarian | 39,898 | 0.4% | 0.0019 | 0.0017 | 0.0017 | 0.0019 |
Asturian | 108,792 | 1.0% | 0.0072 | 0.0064 | 0.0068 | 0.0072 |
Irish | 415,502 | 3.9% | 0.0286 | 0.0258 | 0.0294 | 0.0284 |
Slovenian | 99,063 | 0.9% | 0.0069 | 0.0062 | 0.0065 | 0.0069 |
German | 358,473 | 3.4% | 0.0257 | 0.0229 | 0.0233 | 0.0250 |
Romanian | 314,810 | 3.0% | 0.0230 | 0.0200 | 0.0228 | 0.0233 |
Russian | 2,657,468 | 25.0% | 0.1982 | 0.1867 | 0.1932 | 0.1987 |
French | 223,999 | 2.1% | 0.0171 | 0.0156 | 0.0163 | 0.0171 |
Portuguese | 850,264 | 8.0% | 0.0690 | 0.0644 | 0.0669 | 0.0693 |
Spanish | 496,591 | 4.7% | 0.0405 | 0.0373 | 0.0392 | 0.0405 |
Welsh | 359,224 | 3.4% | 0.0295 | 0.0286 | 0.0295 | 0.0297 |
Galician | 392,856 | 3.7% | 0.0328 | 0.0306 | 0.0316 | 0.0329 |
Catalan | 591,534 | 5.6% | 0.0494 | 0.0476 | 0.0484 | 0.0495 |
Swedish | 675,137 | 6.4% | 0.0567 | 0.0518 | 0.0536 | 0.0562 |
Ukrainian | 193,704 | 1.8% | 0.0165 | 0.0156 | 0.0160 | 0.0165 |
Italian | 341,074 | 3.2% | 0.0291 | 0.0275 | 0.0284 | 0.0291 |
Slovak | 858,414 | 8.1% | 0.0742 | 0.0684 | 0.0722 | 0.0743 |
Turkish | 1,337,898 | 12.6% | 0.1201 | 0.1135 | 0.1203 | 0.1207 |
Total | 10,628,683 | 100.0% | 0.8396 | 0.7813 | 0.8183 | 0.8405 |
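The weighting can be reproduced approximately from the published figures. The sketch below uses the word counts of Table 1 and the Random Forest accuracies of Table 3; because those accuracies are rounded to two decimals, the result is only expected to be close to the reported score, not identical to it.

```python
import numpy as np

# (word pairs, Random Forest test accuracy) per language, from Tables 1 and 3.
data = {
    "Manx Gaelic": (67177, 0.39), "Farsi": (6273, 0.38),
    "Scottish Gaelic": (51624, 0.45), "Estonian": (80536, 0.47),
    "Czech": (36400, 0.48), "Bulgarian": (30323, 0.50),
    "English": (41649, 0.48), "Hungarian": (39898, 0.51),
    "Asturian": (108792, 0.71), "Irish": (415502, 0.73),
    "Slovenian": (99063, 0.74), "German": (358473, 0.74),
    "Romanian": (314810, 0.79), "Russian": (2657468, 0.79),
    "French": (223999, 0.81), "Portuguese": (850264, 0.87),
    "Spanish": (496591, 0.87), "Welsh": (359224, 0.88),
    "Galician": (392856, 0.89), "Catalan": (591534, 0.89),
    "Swedish": (675137, 0.88), "Ukrainian": (193704, 0.91),
    "Italian": (341074, 0.91), "Slovak": (858414, 0.92),
    "Turkish": (1337898, 0.96),
}

counts = np.array([n for n, _ in data.values()])
accuracy = np.array([a for _, a in data.values()])

# Each language contributes in proportion to its share of all word pairs.
weights = counts / counts.sum()
weighted_accuracy = float(np.average(accuracy, weights=counts))
```

The largest weights belong to Russian and Turkish, so these two languages dominate the weighted score.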
To measure the affinity of the compared algorithms for each language, we calculated the correlation coefficient between the test sample accuracy scores by language for each pair of algorithms.
As can be seen from Table 5, the results yielded by the compared algorithms are highly correlated with each other, so no algorithm shows an obvious preference for any specific language.
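As an illustration, and assuming the coefficient is the Pearson correlation (the text does not name the variant), the correlation for one pair of algorithms can be computed from the per-language accuracies of Table 3 as follows.

```python
import numpy as np

# Per-language test accuracies from Table 3, in the table's row order.
extra_trees = np.array([0.39, 0.40, 0.47, 0.45, 0.46, 0.48, 0.50, 0.50, 0.71,
                        0.73, 0.74, 0.76, 0.78, 0.79, 0.81, 0.86, 0.87, 0.87,
                        0.89, 0.89, 0.89, 0.90, 0.91, 0.92, 0.95])
random_forest = np.array([0.39, 0.38, 0.45, 0.47, 0.48, 0.50, 0.48, 0.51,
                          0.71, 0.73, 0.74, 0.74, 0.79, 0.79, 0.81, 0.87,
                          0.87, 0.88, 0.89, 0.89, 0.88, 0.91, 0.91, 0.92,
                          0.96])

def pearson(x, y):
    """Pearson correlation: covariance over the product of deviations."""
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r = pearson(extra_trees, random_forest)

# Cross-check against NumPy's built-in correlation matrix.
r_numpy = float(np.corrcoef(extra_trees, random_forest)[0, 1])
```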
4.2 Experiments
The results we obtained using the Random Forest classifier models for 25 languages are shown in Fig. 2. As can be seen from the figure, test accuracy clearly depends on the volume of the dataset the model was trained on: Manx Gaelic, Farsi, Scottish Gaelic, Estonian, Czech, Bulgarian, English and Hungarian score the lowest test accuracy (from 0.38 to 0.51), and these are also the languages for which we had the least data available.
We can also observe that, despite having the largest dataset, we obtained a relatively low test accuracy score for Russian (more than 2.3 million word pairs, test accuracy 0.79), which we attribute to the grammatical complexity of the Russian language. Moreover, the top five languages by test accuracy include two Slavic languages, Ukrainian (193,704 word pairs, test accuracy 0.91) and Slovak (858,414 word pairs, test accuracy 0.92), which significantly outperform Czech (little data: 36,400 word pairs, test accuracy 0.48) and Russian (complex grammar) from the same group.
Turkish has the best test accuracy score (0.96) and a substantial amount of data (1.3 million word pairs). However, the dictionary we had for this language was essentially a wordform–stem dictionary, so we can disqualify this result, while noting that our algorithm might be exceptionally good at stemming tasks. It is also worth mentioning that our baseline, UDPipe, gives the stem instead of the lemma when used for Turkish.
Another factor that might affect the test accuracy is the maximum word length for a given language. Table 6 shows a correlation coefficient (9) of 0.28 between this length and the test accuracy score. The maximum word length in a language may indicate the presence of a set of grammar rules regulating the construction of words, and these rules can be generalized by the Random Forest Classifier algorithm if they do not have many exceptions.
Variable | Test accuracy | Train accuracy | Max word length | Number of letters | Word pairs number |
Test accuracy | 1 | 0.64 | 0.28 | −0.03 | 0.49 |
Train accuracy | | 1 | 0.05 | 0.10 | −0.01 |
Max word length | | | 1 | 0.11 | 0.21 |
Number of letters | | | | 1 | −0.28 |
Word pairs number | | | | | 1 |
This explanation, again, is contradicted by Romanian, which has a relatively large dataset of more than 300 thousand word pairs and a longest word of 53 letters (longer than the longest Turkish word of 50 letters) but scores only 0.79 for test accuracy, next to Russian on the scale. Other Romance languages, such as Italian, Catalan, Galician, Spanish, Portuguese and French, scored in the range of 0.81 to 0.91 for test accuracy, so we can conclude that Romance languages respond well to our lemmatization method, with the exception of Romanian.
4.3 Comparison with UDPipe
Comparing our test sample scores with the lemmatization results obtained through the UDPipe API yielded the results shown in Fig. 3.
UDPipe has no models for Manx Gaelic, Asturian and Welsh, so we were able to make the comparison only for the remaining languages.
Our lemmatizer outperformed the UDPipe models on all languages except Farsi, Estonian, English and Hungarian. The languages on which our lemmatizer performed poorly presumably had insufficient training data (see Fig. 3), which can be addressed in the future.
5 Conclusion and Future Work
The lemmatization method presented in this paper showed good potential for use on different languages from different language groups and is worth further development on larger datasets of the tested languages, as well as on Asian and African languages.
For future work, we plan to explore feature importance for different languages, such as which parts of a word are deemed more significant in a language for inducing the word’s normal form.
In addition, building and possibly interpreting the decision tree diagrams produced for each language by the algorithm can be a very important step towards improving the accuracy of the algorithm.