Polibits
On-line version ISSN 1870-9044
Polibits no. 38 México Jul./Dec. 2008
Special section: natural language processing
Named Entity Recognition in Hindi using Maximum Entropy and Transliteration
Sujan Kumar Saha1, Partha Sarathi Ghosh2, Sudeshna Sarkar3, and Pabitra Mitra4
1 Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: sujan.kr.saha@gmail.com).
2 HCL Technologies, Bangalore, India (email: partha.silicon@gmail.com).
3 Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: shudeshna@gmail.com).
4 Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: pabitra@gmail.com).
Manuscript received July 10, 2008.
Manuscript accepted for publication October 22, 2008.
Abstract
Named entities are perhaps the most important indexing elements in text for most information extraction and mining tasks. Building a Named Entity Recognition (NER) system is challenging when proper resources are not available. Gazetteer lists are often used in the development of NER systems. For many resource-poor languages, gazetteer lists of adequate size are not available, although relevant lists are sometimes available in English. Proper transliteration makes the English lists useful for NER in such languages. In this paper, we describe a Maximum Entropy based NER system for Hindi. We explore different features applicable to the Hindi NER task and incorporate gazetteer lists to improve the system's performance. These lists are collected from the web and are in English. To make them useful for Hindi NER, we propose a two-phase transliteration methodology. A considerable performance improvement is observed after using the transliteration-based gazetteer lists in the system. The proposed transliteration-based gazetteer preparation methodology is also applicable to other languages. Apart from Hindi, we have applied the transliteration approach to the Bengali NER task and also achieved a performance improvement.
Key words: Gazetteer list preparation, named entity recognition, natural language processing, transliteration.
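The abstract does not spell out the two phases of the transliteration methodology. One plausible reading, assumed here purely for illustration, is that English gazetteer entries and Hindi tokens are each mapped to a shared intermediate key, and a token counts as a gazetteer hit when the keys coincide. The mapping tables, function names, and the consonant-skeleton key below are hypothetical and much cruder than anything a real system would need.

```python
# Toy sketch of gazetteer matching through a shared intermediate key.
# All tables and names are illustrative assumptions, not the paper's method.

# Assumed phase 1: English string -> intermediate key.
EN_DIGRAPHS = [("sh", "S"), ("ch", "C"), ("th", "t"), ("dh", "d"),
               ("bh", "b"), ("kh", "k"), ("gh", "g"), ("ph", "p")]
EN_VOWELS = set("aeiou")

def english_to_key(word):
    """Lowercase, fold aspirated digraphs, and drop vowels."""
    w = word.lower()
    for src, dst in EN_DIGRAPHS:
        w = w.replace(src, dst)
    return "".join(ch for ch in w if ch.isalpha() and ch not in EN_VOWELS)

# Assumed phase 2: Hindi (Devanagari) string -> the same intermediate key.
# Only a handful of consonants are covered; vowel signs are dropped.
HI_CONSONANTS = {
    "क": "k", "ख": "k", "ग": "g", "घ": "g",
    "च": "C", "छ": "C", "ज": "j",
    "त": "t", "थ": "t", "द": "d", "ध": "d", "न": "n", "ण": "n",
    "प": "p", "फ": "p", "ब": "b", "भ": "b", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "श": "S", "ष": "S", "स": "s", "ह": "h",
}

def hindi_to_key(word):
    """Keep mapped consonants only; everything else is ignored."""
    return "".join(HI_CONSONANTS.get(ch, "") for ch in word)

def build_gazetteer(english_names):
    """Index English names by intermediate key, skipping empty keys."""
    keys = {english_to_key(name) for name in english_names}
    keys.discard("")
    return keys

def in_gazetteer(hindi_token, gazetteer_keys):
    """Binary feature: does the Hindi token match any English entry?"""
    key = hindi_to_key(hindi_token)
    return key != "" and key in gazetteer_keys

locations = build_gazetteer(["Kanpur", "Nagpur"])
print(in_gazetteer("कानपुर", locations))
print(in_gazetteer("किताब", locations))
```

The resulting boolean could then be supplied as one feature among others to a classifier. A key this coarse will conflate many distinct names, so it should be read as a starting point for experimentation only.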