Topic Identification Of Noisy Texts: Statistical Approaches
by
K. Abainia
2015
Abstract
This paper addresses the problem of automatic topic identification in noisy Arabic texts. Several existing works in this field apply statistical and machine learning approaches to different text categories. Unfortunately, most of the proposed approaches are suited to clean, long texts. In this investigation, we carried out a comparative study of two different statistical approaches based on tf-idf, using different configurations of each approach to broaden the comparison. Furthermore, an in-house corpus called ANTSIX, which contains discussion forum texts related to 6 different topics, was created to evaluate the proposed approaches.
Experimental results show that the two statistical approaches are suitable for topic identification of noisy Arabic texts, but each technique has advantages and drawbacks.
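As a rough illustration of what tf-idf based topic identification can look like, the sketch below builds one tf-idf profile per topic and assigns a new text to the topic whose profile is closest by cosine similarity. It is not the method evaluated in the paper: the whitespace tokenizer, the smoothed idf formula, the function names (build_idf, tfidf, cosine, train, identify), and the toy English training texts are all assumptions made for the example, and the ANTSIX corpus is not used.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; noisy forum text would need more care.
    return text.lower().split()


def build_idf(docs):
    # Smoothed inverse document frequency over the training texts.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    # Length-normalized term frequency times idf; terms unseen in training are dropped.
    tf = Counter(tokenize(text))
    total = sum(tf.values()) or 1
    return {term: (count / total) * idf[term] for term, count in tf.items() if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled):
    # labeled: list of (topic, text) pairs; each topic profile sums its texts' tf-idf vectors.
    idf = build_idf([text for _, text in labeled])
    profiles = {}
    for topic, text in labeled:
        profile = profiles.setdefault(topic, Counter())
        for term, weight in tfidf(text, idf).items():
            profile[term] += weight
    return idf, profiles


def identify(text, idf, profiles):
    # Pick the topic whose profile is most similar to the text's tf-idf vector.
    vec = tfidf(text, idf)
    return max(profiles, key=lambda topic: cosine(vec, profiles[topic]))


# Hypothetical toy data, for illustration only.
training = [
    ("sport", "the team won the football match"),
    ("sport", "great goal in the match"),
    ("tech", "new phone with fast processor"),
    ("tech", "the laptop processor is fast"),
]
idf, profiles = train(training)
```

With this toy data, calling identify("football goal", idf, profiles) selects the sport profile, because neither word occurs in the tech texts. Short, noisy inputs often share few or no terms with any profile, which is one reason such texts are harder to classify than clean, long ones.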
article-journal
Stage published
Date 2015-06-01