Named Entity Recognition in Hindi using Maximum Entropy and Transliteration

Saha, Sujan Kumar; Ghosh, Partha Sarathi; Sarkar, Sudeshna; Mitra, Pabitra

Polibits

Online version ISSN 1870-9044

Polibits no. 38, México, Jul./Dec. 2008

Special section: natural language processing

Sujan Kumar Saha¹, Partha Sarathi Ghosh², Sudeshna Sarkar³, and Pabitra Mitra⁴

¹ Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: sujan.kr.saha@gmail.com).

² HCL Technologies, Bangalore, India (email: partha.silicon@gmail.com).

³ Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: shudeshna@gmail.com).

⁴ Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India (email: pabitra@gmail.com).

Manuscript received July 10, 2008.
Manuscript accepted for publication October 22, 2008.

Abstract

Named entities are among the most important indexing elements in text for most information extraction and mining tasks. Building a Named Entity Recognition (NER) system is challenging when adequate resources are not available. Gazetteer lists are often used in developing NER systems, but for many resource-poor languages gazetteer lists of sufficient size do not exist, although relevant lists are sometimes available in English. With proper transliteration, these English lists become usable in NER tasks for such languages. In this paper, we describe a Maximum Entropy based NER system for Hindi and explore different features applicable to the Hindi NER task. To improve the system's performance, we incorporate gazetteer lists that were collected from the web and are in English. To make these English lists usable in the Hindi NER task, we propose a two-phase transliteration methodology. Using the transliteration-based gazetteer lists in the system yields a considerable performance improvement. The proposed transliteration-based gazetteer preparation methodology is also applicable to other languages: apart from Hindi, we applied the transliteration approach to a Bengali NER task and also obtained a performance improvement.

Key words: Gazetteer list preparation, named entity recognition, natural language processing, transliteration.
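
The abstract does not spell out the two phases of the transliteration method, so the following Python sketch is only an illustration of the general idea under stated assumptions: English gazetteer entries and Hindi words are each mapped into a shared intermediate key, and a Hindi word that lands on a gazetteer key fires a binary feature that could be fed to a Maximum Entropy classifier. The consonant-skeleton key, the tiny Devanagari table, and all function names are hypothetical simplifications, not the authors' intermediate alphabet or implementation.

```python
# Hypothetical, deliberately tiny Devanagari consonant table (assumption).
# A real system would need a full mapping and would handle vowels,
# conjuncts and nasalization explicitly instead of ignoring them.
DEVANAGARI_CONSONANTS = {
    "क": "k", "ल": "l", "त": "t", "न": "n", "प": "p", "र": "r",
    "ब": "b", "म": "m", "स": "s", "द": "d", "ग": "g",
}

ENGLISH_VOWELS = set("aeiou")


def collapse_repeats(symbols):
    """Join symbols, dropping immediately repeated ones (e.g. 'll' -> 'l')."""
    out = []
    for s in symbols:
        if not out or out[-1] != s:
            out.append(s)
    return "".join(out)


def english_to_key(name):
    """Phase 1 (assumed): English name -> intermediate consonant skeleton."""
    letters = [c for c in name.lower()
               if c.isalpha() and c not in ENGLISH_VOWELS]
    return collapse_repeats(letters)


def hindi_to_key(word):
    """Phase 2 (assumed): Hindi word -> the same intermediate skeleton.

    Characters missing from the table (vowel signs, virama, unlisted
    consonants) are skipped, which is lossy and can cause false matches.
    """
    symbols = [DEVANAGARI_CONSONANTS[c] for c in word
               if c in DEVANAGARI_CONSONANTS]
    return collapse_repeats(symbols)


def build_gazetteer(english_names):
    """Index an English gazetteer list by intermediate key."""
    return {english_to_key(n) for n in english_names}


def gazetteer_feature(word, gazetteer):
    """Binary feature: 1 if the Hindi word matches a gazetteer entry."""
    key = hindi_to_key(word)
    return 1 if key and key in gazetteer else 0


if __name__ == "__main__":
    locations = build_gazetteer(["Kolkata", "Kanpur"])
    for hindi_word in ["कोलकाता", "कानपुर", "किताब"]:
        print(hindi_word, gazetteer_feature(hindi_word, locations))
```

Matching in an intermediate space avoids having to generate exact Hindi spellings from English, at the cost of some false matches. A coarse key such as this one would conflate many unrelated words, so a practical intermediate alphabet would need to preserve more phonetic detail.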

