Language Segmentation
by
David Alfter
2015
Abstract
Language segmentation consists of finding the boundaries where one language
ends and another begins in a text written in more than one language.
This is important for all natural language processing tasks. The problem can be
solved by training language models on language data. However, for low- or
no-resource languages, too little data is available to train such models. I
therefore investigate
whether unsupervised methods perform better than supervised methods when it is
difficult or impossible to train supervised approaches. Special focus is
given to difficult texts, i.e. texts that are short (one sentence) or that
contain abbreviations, low-resource languages, or non-standard language. I
compare three approaches: supervised n-gram language models, unsupervised
clustering, and weakly supervised n-gram language model induction. I devised the
weakly supervised approach specifically to deal with difficult texts.
To test the approach, I compiled a small corpus of different text
types, ranging from one-sentence texts to texts of about 300 words. The weakly
supervised language model induction approach works well on short and difficult
texts, outperforming the clustering algorithm and reaching scores in the
vicinity of the supervised approach. The results look promising, but there is
room for improvement and a more thorough investigation should be undertaken.
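
To make the supervised baseline named in the abstract concrete, here is a minimal sketch of word-level language segmentation with character bigram language models. It is an illustration under stated assumptions, not the author's implementation: the add-one smoothing, the word-boundary markers, the whitespace tokenization, the two toy training sentences, and all function names are hypothetical choices made for this example.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # word-boundary markers (an assumption of this sketch)


def train(text):
    """Count character bigrams and their left contexts over whitespace-split words."""
    bigrams, contexts = Counter(), Counter()
    for word in text.lower().split():
        padded = BOS + word + EOS
        for a, b in zip(padded, padded[1:]):
            bigrams[a + b] += 1
            contexts[a] += 1
    return bigrams, contexts


def log_prob(word, model, alphabet_size):
    """Add-one smoothed log-probability of a word under one bigram model."""
    bigrams, contexts = model
    padded = BOS + word.lower() + EOS
    return sum(
        math.log((bigrams[a + b] + 1) / (contexts[a] + alphabet_size))
        for a, b in zip(padded, padded[1:])
    )


def segment(text, models):
    """Label each word with the language whose model scores it highest."""
    # Alphabet = every character seen in training, plus one slot for unseen ones.
    alphabet = {ch for bigrams, _ in models.values() for gram in bigrams for ch in gram}
    size = len(alphabet) + 1
    return [
        (word, max(models, key=lambda lang: log_prob(word, models[lang], size)))
        for word in text.split()
    ]


def boundaries(labelled):
    """Indices of words whose language differs from the preceding word's."""
    return [i for i in range(1, len(labelled)) if labelled[i][1] != labelled[i - 1][1]]


# Toy training data; real models would need far more text per language.
models = {
    "en": train("the quick brown fox jumps over the lazy dog"),
    "de": train("der schnelle braune fuchs springt über den faulen hund"),
}
labelled = segment("the hund", models)
print(labelled, boundaries(labelled))
```

Labelling each word independently keeps the sketch short, but it ignores context, so a single ambiguous word can produce a spurious boundary. Short and non-standard texts, the cases the abstract singles out as difficult, are where such a baseline is most likely to struggle.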
arXiv: 1510.01717v1