SciELO - Scientific Electronic Library Online

 

Computación y Sistemas

Online version ISSN 2007-9737; print version ISSN 1405-5546

Abstract

AKHMETOV, Iskander; PAK, Alexandr; UALIYEVA, Irina and GELBUKH, Alexander. Highly Language-Independent Word Lemmatization Using a Machine-Learning Classifier. Comp. y Sist. [online]. 2020, vol.24, n.3, pp.1353-1364. Epub 09-Jun-2021. ISSN 2007-9737. https://doi.org/10.13053/cys-24-3-3775.

Lemmatization is the process of finding the base morphological form (lemma) of a word. It is an important step in many natural language processing, information retrieval, and information extraction tasks, among others. We present an open-source, language-independent lemmatizer based on the Random Forest classification model, a supervised machine-learning algorithm that builds an ensemble of decision trees; here the trees are constructed to reflect the grammatical features of the language. The lemmatizer requires no manual hard-coding of rules, yet it remains simple and interpretable. Of the twenty-five languages we work on, UDPipe has models for twenty-two, and on those we compare the performance of our lemmatizer with that of the UDPipe lemmatizer. Our method shows good performance on languages from various language groups and is easily extensible to other languages. The source code of our lemmatizer is publicly available.
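
The abstract does not spell out which features or labels the classifier uses. As a minimal sketch of how lemmatization can be cast as a classification problem, the Python below assumes that each (word, lemma) training pair is encoded as a suffix-rewrite rule (how many trailing characters to strip, and what to append) and that word-final character n-grams serve as features. All names here (`suffix_rule`, `apply_rule`, `suffix_features`, `SuffixVoteLemmatizer`) are hypothetical, and the class is a plain majority-vote stand-in, not a Random Forest and not the authors' implementation.

```python
from collections import Counter, defaultdict
from typing import List, Tuple

Rule = Tuple[int, str]  # (number of trailing chars to strip, suffix to append)


def suffix_rule(word: str, lemma: str) -> Rule:
    """Encode a (word, lemma) pair as a suffix-rewrite rule (the class label)."""
    # Find the length of the longest common prefix of word and lemma.
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return len(word) - i, lemma[i:]


def apply_rule(word: str, rule: Rule) -> str:
    """Apply a suffix-rewrite rule to a word to obtain its predicted lemma."""
    strip, add = rule
    return word[: len(word) - strip] + add


def suffix_features(word: str, max_len: int = 3) -> List[str]:
    """Word-final substrings of length 1..max_len (assumed feature set)."""
    return [word[-k:] for k in range(1, max_len + 1) if k <= len(word)]


class SuffixVoteLemmatizer:
    """Stand-in classifier (assumption): most frequent rule per word-final suffix."""

    def __init__(self) -> None:
        self.votes = defaultdict(Counter)

    def fit(self, pairs: List[Tuple[str, str]]) -> "SuffixVoteLemmatizer":
        for word, lemma in pairs:
            rule = suffix_rule(word, lemma)
            for feat in suffix_features(word):
                self.votes[feat][rule] += 1
        return self

    def lemmatize(self, word: str) -> str:
        # Prefer the longest suffix seen in training; otherwise return the word.
        for feat in reversed(suffix_features(word)):
            if feat in self.votes:
                rule = self.votes[feat].most_common(1)[0][0]
                return apply_rule(word, rule)
        return word


# Toy English training pairs, invented for illustration only.
pairs = [("walked", "walk"), ("jumped", "jump"), ("cats", "cat"),
         ("dogs", "dog"), ("studies", "study")]
model = SuffixVoteLemmatizer().fit(pairs)
print(model.lemmatize("talked"), model.lemmatize("parties"))
```

Because the labels are rewrite rules and not whole lemmas, the label set stays small and the same encoding applies to any language with predominantly suffixing morphology. In a fuller version, the rule labels and encoded features could be passed to a tree-ensemble classifier such as scikit-learn's `RandomForestClassifier` in place of the voting table.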

Keywords: Lemmatization; natural language processing; text preprocessing; Random Forest classifier; Decision Tree classifier.
