Document Indexing with a Concept Hierarchy

Gelbukh, Alexander; Sidorov, Grigori; Guzmán-Arenas, Adolfo

Computación y Sistemas

On-line version ISSN 2007-9737; Print version ISSN 1405-5546

Comp. y Sist. vol.8 n.4 Ciudad de México Apr./Jun. 2005

Articles

Alexander Gelbukh, Grigori Sidorov and Adolfo Guzmán-Arenas

Natural Language Processing Laboratory,
Center for Computing Research (CIC), National Polytechnic Institute (IPN),
Av. Juan de Dios Bátiz s/n, Esq. Mendizábal, Col. Zacatenco, CP 07738, DF, México.

E-mail: gelbukh@gelbukh.com, sidorov@cic.ipn.mx, a.guzman@acm.org

www.Gelbukh.com

Article received on April 13, 2004; accepted on March 15, 2005

Abstract

This paper considers the task of selecting, from a large hierarchical concept dictionary (a thesaurus or ontology), the concepts that describe the contents of a given document. A statistical method of document indexing driven by such a dictionary is proposed. The method is insensitive to inaccuracies in the dictionary, which allows for semi-automatic translation of the hierarchy into different languages. The problem of handling non-terminal nodes, and especially top-level nodes, in the hierarchy is discussed. Methods that agree with common sense are presented for automatically assigning weights to the nodes and links in the hierarchy. The application of the method in the Classifier system is discussed.

Keywords: Document Characterization, Document Comparison, Ontology, Statistical Methods.

Summary

We consider the task of selecting the concepts that describe the contents of a given document. The concepts are chosen from a large hierarchical dictionary (a thesaurus or an ontology). A statistical method is proposed for building an index of documents, guided by such a dictionary. The method is robust to errors in the dictionary, which makes it possible to translate the dictionary semi-automatically into various languages. The problem of using non-terminal nodes, and especially high-level nodes, in the hierarchy is discussed. Methods are presented for automatically weighting the nodes and links in the hierarchy in a way that agrees with common-sense criteria. The application of the method in the Classifier system is discussed.

Keywords: Document Characterization, Document Comparison, Ontology, Statistical Methods.
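
The abstract gives no formulas, so the following is only a minimal sketch of the general idea of dictionary-driven indexing, not the authors' method. It assumes a toy hierarchy in which terminal nodes carry weighted keyword lists and each link to a parent carries a weight. The names PARENT, KEYWORDS and index_document, as well as all weights and words, are hypothetical and chosen purely for illustration. Link weights below 1 are one possible way to keep general, higher-level nodes from dominating simply because they collect evidence from all of their descendants.

```python
import re
from collections import defaultdict

# Hypothetical toy hierarchy (not from the article): child -> (parent, link weight).
PARENT = {
    "physics": ("science", 0.5),
    "biology": ("science", 0.5),
    "football": ("sport", 0.5),
    "science": ("ROOT", 1.0),
    "sport": ("ROOT", 1.0),
}

# Hypothetical keyword lists attached to terminal nodes, with per-word weights.
KEYWORDS = {
    "physics": {"quantum": 1.0, "particle": 1.0, "energy": 0.5},
    "biology": {"cell": 1.0, "gene": 1.0, "energy": 0.5},
    "football": {"goal": 1.0, "match": 0.5},
}


def index_document(text, top_n=3):
    """Return up to top_n (concept, score) pairs describing the text."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = defaultdict(float)
    for concept, keywords in KEYWORDS.items():
        # Evidence for a terminal node: weighted count of its keywords.
        hit = sum(keywords.get(word, 0.0) for word in words)
        # Propagate the evidence upward, attenuated by the link weights.
        node, weight = concept, 1.0
        while node != "ROOT":
            scores[node] += hit * weight
            parent, link = PARENT[node]
            weight *= link
            node = parent
    # Keep only concepts with some evidence; break ties alphabetically.
    ranked = sorted(
        ((c, s) for c, s in scores.items() if s > 0),
        key=lambda cs: (-cs[1], cs[0]),
    )
    return ranked[:top_n]


if __name__ == "__main__":
    sample = "The quantum particle carries energy; each cell needs energy."
    for concept, score in index_document(sample):
        print(concept, score)
```

In this sketch a word that is misplaced in one keyword list shifts a score only slightly, because every concept is scored from the accumulated evidence of many words. This is one plausible reading of why a statistical approach can tolerate inaccuracies in the dictionary.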

Acknowledgments

The work was partially supported by the Mexican Government (SNI, CONACyT, CGPI-IPN).
