Disambiguating Speech Intention via Audio-Text Co-attention Framework: A Case of Prosody-semantics Interface

by Won Ik Cho, Jeonghwa Cho, Woo Hyun Kang, Nam Soo Kim

Released as an article.

2019  

Abstract

Understanding the intention of an utterance is challenging for prosody-sensitive cases, especially when the utterance appears in written form. The main concerns are detecting the directivity or rhetoricalness of an utterance and distinguishing the type of question. Since the task inevitably involves issues of both prosody and semantics, identification is expected to benefit from observations of the human language processing mechanism. In this paper, we tackle the task with attentive recurrent neural networks that exploit acoustic and textual features, using a manually created speech corpus that contains only syntactically ambiguous utterances requiring prosody for disambiguation. We found that co-attention frameworks over audio-text data, namely multi-hop attention and cross-attention, outperform previously suggested speech-based/text-aided networks. From this, we infer that identifying the genuine intention of ambiguous utterances involves recognizing the interaction between auditory and linguistic processes.
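
To make the cross-attention idea concrete, below is a minimal sketch (not the authors' released code) of an audio-text cross-attention classifier in PyTorch. The BiLSTM encoders, layer sizes, 39-dimensional acoustic frames, 300-dimensional token embeddings, and 7-way intention label set are illustrative assumptions, not details taken from the paper.

    # Sketch only: each modality is encoded with a BiLSTM, then attends to the
    # other modality's hidden states before a joint classification layer.
    import torch
    import torch.nn as nn

    class AudioTextCrossAttention(nn.Module):
        def __init__(self, audio_dim=39, text_dim=300, hidden=128, n_classes=7):
            super().__init__()
            # BiLSTM encoders for the acoustic and textual sequences
            self.audio_rnn = nn.LSTM(audio_dim, hidden, batch_first=True, bidirectional=True)
            self.text_rnn = nn.LSTM(text_dim, hidden, batch_first=True, bidirectional=True)
            # Cross-attention: audio queries over text, and text queries over audio
            self.att_a2t = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
            self.att_t2a = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
            self.classifier = nn.Linear(4 * hidden, n_classes)

        def forward(self, audio, text):
            # audio: (B, T_a, audio_dim); text: (B, T_t, text_dim)
            h_a, _ = self.audio_rnn(audio)   # (B, T_a, 2*hidden)
            h_t, _ = self.text_rnn(text)     # (B, T_t, 2*hidden)
            # Each modality attends over the other's hidden states
            a2t, _ = self.att_a2t(h_a, h_t, h_t)
            t2a, _ = self.att_t2a(h_t, h_a, h_a)
            # Pool both attended sequences and classify the joint representation
            pooled = torch.cat([a2t.mean(dim=1), t2a.mean(dim=1)], dim=-1)
            return self.classifier(pooled)

    # Example: a batch of 2 utterances with 120 acoustic frames and 20 tokens
    logits = AudioTextCrossAttention()(torch.randn(2, 120, 39), torch.randn(2, 20, 300))

A multi-hop variant, as named in the abstract, would instead apply attention repeatedly, feeding one modality's attended summary back as the query for the next hop rather than attending in both directions at once.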

Archived Files and Locations

application/pdf  393.3 kB
file_77uzzcg76rbezi3fl2lp4ihnl4
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2019-10-21
Version   v1
Language   en
arXiv  1910.09275v1
Work Entity
access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: 56dead56-8d27-4f44-a451-4e1ef7c19f90