SIMTEX: An Approach for Detecting and Measuring Textual Similarity based on Discourse and Semantics


Iria da Cunha1, Jorge Vivaldi1, Juan-Manuel Torres-Moreno2,3, and Gerardo Sierra2,4


1 University Institute for Applied Linguistics (Universitat Pompeu Fabra), Barcelona, Spain.

2 LIA/Agorantic/Université d'Avignon et des Pays de Vaucluse, Avignon, France.

3 École Polytechnique de Montréal, Montréal (Québec), Canada.

4 Universidad Nacional Autónoma de México/Instituto de Ingeniería, Mexico DF, México.


Article received on 20/01/2014.
Accepted on 21/03/2014.



Automatic systems for detecting and measuring textual similarity are currently being developed for use in various Natural Language Processing (NLP) tasks. Most of these systems rely on surface linguistic features or statistical information; few researchers use deep linguistic information. In this work, we present an algorithm for detecting and measuring textual similarity that takes into account the information provided by the discourse relations of Rhetorical Structure Theory (RST) and the lexical-semantic relations included in EuroWordNet. We apply the algorithm, called SIMTEX, to texts written in Spanish, but the methodology is potentially language-independent.

Keywords: Textual similarity, discourse, semantics, paraphrase.





We acknowledge the support of Mexico's National Council of Science and Technology (Conacyt), grant number 178248, and of Project UNAM-DGAPA-PAPIIT number IN400312. We also acknowledge the support of the Spanish projects RICOTERM 4 (FFI2010-21365-C03-01) and APLE 2 (FFI2012-37260), a Juan de la Cierva grant (JCI-2011-09665), and an Ibero-America Young Teachers and Researchers Santander Grant 2013.



