SciELO - Scientific Electronic Library Online

 



Polibits

Online version ISSN 1870-9044

Abstract

SINGH, Thoudam Doren and BANDYOPADHYAY, Sivaji. Semi-Automatic Parallel Corpora Extraction from Comparable News Corpora. Polibits [online]. 2010, n.41, pp.11-17. ISSN 1870-9044.

A parallel corpus is a necessary resource in many multilingual and cross-lingual natural language processing applications, including Machine Translation and Cross-Lingual Information Retrieval. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, a technique has been developed that extracts a parallel corpus between Manipuri, a morphologically rich and resource-constrained Indian language, and English from comparable news corpora collected from the web. A medium-sized Manipuri-English bilingual lexicon and a list of Manipuri-English transliterated entities have been developed and used in this work. Using morphological information for the agglutinative and inflective Manipuri language further improves the alignment quality based on the similarity measure. A high level of performance is desirable, since errors in sentence alignment cause further errors in systems that use the aligned text. The system has been evaluated, and an error analysis has been carried out. The technique proves effective for the Manipuri-English language pair and is extendable to other resource-constrained, agglutinative and inflective Indian languages.
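The abstract does not give the similarity measure itself, so the following is only a minimal sketch of how a lexicon-based sentence similarity with suffix stripping might look. Everything in it is an assumption made for illustration: the function names (`strip_suffixes`, `similarity`, `align`), the overlap formula, the threshold, and the placeholder tokens and suffixes, which are not real Manipuri data.

```python
from typing import Dict, List, Set, Tuple


def strip_suffixes(token: str, suffixes: List[str]) -> str:
    """Crude stand-in for morphological analysis (assumption):
    remove the longest matching suffix, if any."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def similarity(src: List[str], tgt: List[str],
               lexicon: Dict[str, Set[str]], suffixes: List[str]) -> float:
    """Fraction of tokens covered by lexicon matches (hypothetical measure).
    A source token counts as matched if one of its lexicon translations,
    looked up by surface form or by stripped stem, occurs in the target."""
    tgt_set = {t.lower() for t in tgt}
    matched = 0
    for token in src:
        translations = (lexicon.get(token)
                        or lexicon.get(strip_suffixes(token, suffixes))
                        or set())
        if translations & tgt_set:
            matched += 1
    longest = max(len(src), len(tgt))
    return matched / longest if longest else 0.0


def align(src_sents: List[List[str]], tgt_sents: List[List[str]],
          lexicon: Dict[str, Set[str]], suffixes: List[str],
          threshold: float = 0.3) -> List[Tuple[int, int, float]]:
    """For each source sentence, keep the best-scoring target sentence
    if its score reaches the (assumed) threshold."""
    pairs = []
    for i, src in enumerate(src_sents):
        best_j, best_score = -1, 0.0
        for j, tgt in enumerate(tgt_sents):
            score = similarity(src, tgt, lexicon, suffixes)
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Placeholder data: "aaa", "bbb", "ccc" and the suffixes are invented tokens.
lexicon = {"aaa": {"house"}, "bbb": {"river"}, "ccc": {"market"}}
suffixes = ["da", "gi"]
src_sents = [["aaada", "bbb"], ["ccc", "zzz"]]
tgt_sents = [["the", "market", "opened"], ["house", "near", "river"]]
pairs = align(src_sents, tgt_sents, lexicon, suffixes)
```

In this toy run the inflected placeholder token `aaada` only matches its lexicon entry after suffix stripping, which illustrates why morphological information can raise the similarity score of a correct sentence pair. A list of transliterated named entities could be merged into the same lexicon dictionary under the same assumptions.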

Keywords: Parallel corpora; similarity measure; bilingual lexicon; morphology; named entity list.


 

Creative Commons License: All the content of this journal, except where otherwise noted, is licensed under a Creative Commons License.