Cross Script Hindi English NER Corpus from Wikipedia
release_zvq6trwx3vdc5iarfhw54m5gzm
by
Mohd Zeeshan Ansari, Tanvir Ahmad, Md Arshad Ali
2018
Abstract
Text generated on social media platforms is essentially mixed-lingual. Language
mixing in any form creates considerable difficulty for language processing
systems. Moreover, advances in language processing research depend on the
availability of standard corpora. The development of mixed-lingual Indian Named
Entity Recognition (NER) systems is facing obstacles due to the unavailability
of standard evaluation corpora. Such corpora may be mixed-lingual in nature,
with text written in multiple languages but predominantly in a single script.
The motivation of our work is to emphasize the automatic generation of such
corpora in order to encourage mixed-lingual Indian NER. The paper presents the
preparation of a Cross Script Hindi-English corpus from Wikipedia category
pages. The corpus is successfully annotated using the standard CoNLL-2003
categories of PER, LOC, ORG, and MISC. It is evaluated with a variety of
machine learning algorithms, and favorable results are achieved.
Archived Files and Locations
application/pdf, 392.0 kB (file_ei32i7rey5bu7grfhvnlbix7zi)
arxiv.org (repository), web.archive.org (webarchive): 1810.03430v1