Cross Script Hindi English NER Corpus from Wikipedia release_zvq6trwx3vdc5iarfhw54m5gzm

by Mohd Zeeshan Ansari, Tanvir Ahmad, Md Arshad Ali

Released as an article.

2018  

Abstract

Text generated on social media platforms is essentially mixed-lingual. Mixing languages in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend on the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems faces obstacles due to the unavailability of standard evaluation corpora. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a Cross Script Hindi-English corpus from Wikipedia category pages. The corpus is annotated using the standard CoNLL-2003 categories PER, LOC, ORG, and MISC. It is evaluated with a variety of machine learning algorithms, and favorable results are achieved.

Archived Files and Locations

application/pdf  392.0 kB
file_ei32i7rey5bu7grfhvnlbix7zi
arxiv.org (repository)
web.archive.org (webarchive)
Preserved and Accessible
Type  article
Stage   submitted
Date   2018-10-08
Version   v1
Language   en
arXiv  1810.03430v1
Work Entity
Access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: eb6eb60d-7e37-4503-aef5-6f284eb2c597