Efficient Use of Resources for Statistical Machine Translation

by Karunesh Kumar Arora, Shyam Sunder Agrawal

Published in DESIDOC Journal of Library & Information Technology by Defence Scientific Information and Documentation Centre.

2017, Volume 37, p. 307

Abstract

Machine translation has great potential to expand the audience for ever-increasing digital collections. The success of data-driven machine translation systems is governed by the volume of parallel data on which these systems are modelled. For languages that lack such resources in large quantities, optimum utilisation can be assured only through the quality of the available data. A morphologically rich language like Hindi poses a further challenge, owing to the larger number of orthographic inflections for a given word and the presence of non-standard word spellings in the corpus. This increases the likelihood of encountering words that are unseen in the training corpus. The objective of this paper is to reduce redundancy in the available corpus and to utilise other resources as well, so as to make the best use of them. The number of words unseen by the translation model is reduced through text noise removal, spelling normalisation and the use of English WordNet (EWN). The test case presented here is for the English-Hindi language pair. The results achieved are promising and set an example for other morphologically rich languages to optimise their resources and improve the performance of the translation system.

Archived Files and Locations

application/pdf  757.3 kB
file_nxkw76p7abbn5goph2dwryc6ke
web.archive.org (webarchive)
publications.drdo.gov.in (web)
Preserved and Accessible
Type  article-journal
Stage   published
Date   2017-10-23
Journal Metadata
Open Access Publication
Not in DOAJ
Not in Keepers Registry
ISSN-L:  0974-0643
Catalog Record
Revision: 89119cd3-5b8c-41de-9e4a-3c09b222624a