Multidocument Arabic Text Summarization Based on Clustering and Word2Vec to Reduce Redundancy
release_qhu772l35zaqxntyskuynkl3ve

by Waheeb, Khan, Chen, Shang

Published in Information by MDPI AG.

2020   Volume 11, p59

Abstract

Arabic is one of the most semantically and syntactically complex languages in the world. Text summarization is a key challenge in text mining, so we propose an unsupervised score-based method that combines the vector space model, continuous bag of words (CBOW), clustering, and a statistically based method. Multidocument text summarization suffers from noisy data, redundancy, diminished readability, and sentence incoherency. In this study, we adopt a preprocessing strategy to solve the noise problem and use the word2vec model for two purposes: first, to map the words to fixed-length vectors and, second, to obtain the semantic relationship between each vector based on the dimensions. Similarly, we use a k-means algorithm for two purposes: (1) selecting the distinctive documents and tokenizing these documents into sentences, and (2) using another iteration of the k-means algorithm to select the key sentences based on the similarity metric, which overcomes the redundancy problem and generates the initial summary. Lastly, we use weighted principal component analysis (W-PCA) to map the sentences' encoded weights based on a list of features. This selects the highest set of weights, which corresponds to the important sentences, to address the incoherency and readability problems. We adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the evaluation measure to examine our proposed technique and compare it with state-of-the-art methods. Finally, an experiment on the Essex Arabic Summaries Corpus (EASC) using the ROUGE-1 and ROUGE-2 metrics showed promising results in comparison with existing methods.
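
The sentence-level k-means pass described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D word vectors stand in for trained CBOW embeddings, English tokens stand in for preprocessed Arabic text, sentence vectors are plain averages of word vectors, cosine similarity is assumed as the similarity metric, and all function names are hypothetical. Document-level clustering and the W-PCA ranking step are omitted.

```python
import math

# Hypothetical toy embeddings; a real system would load trained CBOW vectors.
TOY_VECTORS = {
    "rain": [1.0, 0.0], "storm": [0.9, 0.1], "flood": [0.8, 0.2],
    "market": [0.0, 1.0], "prices": [0.1, 0.9], "trade": [0.2, 0.8],
}
SENTENCES = [
    "rain storm flood",
    "storm rain",
    "market prices trade",
    "prices market",
]


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def sentence_vector(sentence, word_vectors):
    """Average the vectors of known words (assumed sentence encoding)."""
    known = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not known:
        dim = len(next(iter(word_vectors.values())))
        return [0.0] * dim
    return mean_vector(known)


def kmeans(points, k, iterations=20):
    """Small k-means using cosine similarity and deterministic seeding."""
    k = min(k, len(points))
    # Seed with the first point, then repeatedly add the point that is
    # least similar to every centroid chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(min(points, key=lambda p: max(cosine(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean_vector(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def select_key_sentences(sentences, word_vectors, k):
    """Keep one sentence per cluster: the one most similar to its centroid."""
    vectors = [sentence_vector(s, word_vectors) for s in sentences]
    labels, centroids = kmeans(vectors, k)
    chosen = []
    for j in range(len(centroids)):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            chosen.append(max(members, key=lambda i: cosine(vectors[i], centroids[j])))
    # Preserve the original sentence order in the draft summary.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    print(select_key_sentences(SENTENCES, TOY_VECTORS, k=2))
```

Keeping a single representative per cluster is what removes near-duplicate sentences in this sketch; the choice of k then bounds the length of the initial summary.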

Archived Files and Locations

application/pdf  1.2 MB
file_ze7kgwqemnftxlswl6mapaczme
res.mdpi.com (publisher)
web.archive.org (webarchive)
application/pdf  2.2 MB
file_6edocjopm5arrn44fesfvi2szi
res.mdpi.com (web)
web.archive.org (webarchive)
Type  article-journal
Stage   published
Date   2020-01-23
Language   en
Container Metadata
Open Access Publication
In DOAJ
In ISSN ROAD
In Keepers Registry
ISSN-L:  2078-2489
Work Entity
Access all versions, variants, and formats of this work (e.g., preprints)
Catalog Record
Revision: 680b98ee-2498-48b0-8eba-66a496ae9ac5