Application of the LDA Model for Obtaining Topics from the WIKICORPUS

Martínez Guzmán, Gerardo; Bernábe Loranca, María Beatriz; Cerón Garnica, Carmen; Serrano Pérez, Jonathan; Archundia Sierra, Etelvina

doi:10.13053/cys-26-1-4171

Computación y Sistemas

On-line version ISSN 2007-9737; print version ISSN 1405-5546

Abstract

MARTINEZ GUZMAN, Gerardo et al. Application of the LDA Model for Obtaining Topics from the WIKICORPUS. Comp. y Sist. [online]. 2022, vol.26, n.1, pp.281-293. Epub 08-Aug-2022. ISSN 2007-9737. https://doi.org/10.13053/cys-26-1-4171.

A fundamental problem in the analysis of large volumes of text is discovering the topics described in the documents, and one of the most useful applications is the extraction of topics from a document corpus. Such is the case of Wikicorpus, which consists of approximately 250,000 documents totaling 250 million words. In this work, a system based on the Latent Dirichlet Allocation (LDA) model has been developed to automatically select the words of the corpus and, based on their frequency in the documents, indicate whether or not they may belong to a certain topic, thereby classifying words without human intervention. Due to the large amount of information in the corpus, a Serial-Parallel Algorithm (SPA) was implemented in C/C++ with OpenMP for parallel programming; because all threads must share certain variables in the parallel stages, a shared-memory architecture was adopted.
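
As a rough illustration of the shared-memory idea mentioned above, the following minimal sketch counts word frequencies over a toy corpus with OpenMP. It is not the authors' SPA: the `Document` type, the `count_word_frequencies` function, and the integer word-id encoding are all hypothetical names assumed for this example. Documents are divided among threads, while the frequency table is a single shared variable that each thread updates atomically.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy representation (an assumption, not from the paper):
// each document is a list of word ids in the range [0, vocab_size).
using Document = std::vector<int>;

// Counts how often each word id occurs across the whole corpus.
// Documents are split among threads; the counts table is shared by all
// threads, so every increment is protected by an atomic update.
std::vector<long> count_word_frequencies(const std::vector<Document>& corpus,
                                         int vocab_size) {
    assert(vocab_size > 0);
    std::vector<long> counts(static_cast<std::size_t>(vocab_size), 0);
    long* shared = counts.data();  // shared among all threads
    const long n_docs = static_cast<long>(corpus.size());

    // Parallel stage: iterations (documents) are distributed over threads.
    #pragma omp parallel for schedule(dynamic)
    for (long d = 0; d < n_docs; ++d) {
        for (int w : corpus[static_cast<std::size_t>(d)]) {
            assert(w >= 0 && w < vocab_size);
            #pragma omp atomic
            shared[w] += 1;
        }
    }
    return counts;
}
```

Compiled with an OpenMP-enabled compiler (for example with the `-fopenmp` flag in GCC), the outer loop runs in parallel; without OpenMP support the pragmas are ignored and the same code runs serially. Atomic updates are only one way to handle a shared table; per-thread partial counts merged at the end are a common alternative when contention is high.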

Keywords: Corpus; generative model; Dirichlet distribution; latent topics; parallelization; algorithm; C/C++ programming.
