Short Text Document Clustering using Distributed Word Representation and Document Distance

by Supavit KONGWUDHIKUNAKORN, Kitsana WAIYAMAI

Published in Walailak Journal of Science and Technology by College of Graduate Studies, Walailak University.

Volume 16, p. 107-119 (2018)

Abstract

This paper presents a method for clustering short text documents such as instant messages, SMS messages, and news headlines. The vocabulary of each text is expanded using external knowledge sources and represented with distributed word representations. Clustering uses the K-means algorithm with Word Mover's Distance as the distance metric. Experiments compared the clustering quality of this method with that of several leading methods on large datasets drawn from BBC headlines, SearchSnippets, StackExchange, and Twitter. On all datasets, the proposed algorithm produced document clusters with higher accuracy, precision, F1-score, and Adjusted Rand Index. We also observe that a description of each cluster can be inferred from the keywords it contains.
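
To make the clustering step in the abstract concrete, here is a minimal sketch. It is not the authors' implementation and rests on several assumptions. The embedding table `EMB` is a hypothetical 2-D toy, not a pretrained model. The exact Word Mover's Distance is replaced by the relaxed lower bound (RWMD), which avoids an optimal-transport solver. Because a mean centroid is not defined under a transport distance, the K-means centroid update is replaced by a medoid update. The vocabulary expansion step is omitted. The helper names `nbow`, `rwmd`, and `cluster` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical 2-D toy embeddings (assumption); a real system would load
# pretrained distributed word representations instead.
EMB = {
    "goal": (1.0, 0.0), "match": (0.9, 0.1), "striker": (0.8, 0.2),
    "stock": (0.0, 1.0), "market": (0.1, 0.9), "shares": (0.2, 0.8),
}

def nbow(tokens):
    """Normalised bag-of-words weights over in-vocabulary tokens.
    Assumes the document has at least one token present in EMB."""
    counts = Counter(t for t in tokens if t in EMB)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def one_sided(a, b):
    """Cost of moving each word of `a` to its nearest word in `b`."""
    return sum(wt * min(math.dist(EMB[w], EMB[v]) for v in b)
               for w, wt in a.items())

def rwmd(doc_a, doc_b):
    """Relaxed Word Mover's Distance: a lower bound on exact WMD."""
    a, b = nbow(doc_a), nbow(doc_b)
    return max(one_sided(a, b), one_sided(b, a))

def cluster(docs, k, iters=10):
    """K-means-style loop with a medoid update in place of a mean centroid."""
    n = len(docs)
    dist = [[rwmd(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    # Deterministic farthest-point initialisation, starting from document 0.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))
    labels = []
    for _ in range(iters):
        # Assignment step: each document joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: the new medoid minimises total distance to its members.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster is empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels

docs = [["goal", "match"], ["stock", "market"],
        ["striker", "goal"], ["shares", "stock"]]
labels = cluster(docs, k=2)
print(labels)
```

On this toy input the two sports headlines and the two finance headlines should end up in separate clusters. With real data, the full pairwise distance matrix grows quadratically in the number of documents, so an exact WMD solver or pruning would need to be weighed against runtime.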

Type: article-journal
Stage: published
Date: 2018-03-26
Journal Metadata
Open Access Publication
In DOAJ
In ISSN ROAD
Not in Keepers Registry
ISSN-L:  1686-3933