Disambiguating Speech Intention via Audio-Text Co-attention Framework: A
Case of Prosody-semantics Interface
release_n56ub5bex5hbnli5npe7emjina
by
Won Ik Cho, Jeonghwa Cho, Woo Hyun Kang, Nam Soo Kim
2019
Abstract
Understanding the intention of an utterance is challenging in some
prosody-sensitive cases, especially when it appears in written form. The main
concern is to detect the directivity or rhetoricalness of an utterance and to
distinguish the type of question. Since issues of both prosody and semantics
inevitably arise, the identification is expected to benefit from observations
of human language processing mechanisms. In this
paper, we tackle the task with attentive recurrent neural networks that exploit
acoustic and textual features, using a manually created speech corpus that
contains only syntactically ambiguous utterances that require prosody
for disambiguation. We found that co-attention frameworks on audio-text
data, namely multi-hop attention and cross-attention, can perform better than
previously suggested speech-based/text-aided networks. From this, we infer that
understanding the genuine intention of ambiguous utterances involves
recognizing the interaction between auditory and linguistic processes.
Archived Files and Locations
application/pdf 393.3 kB
file_77uzzcg76rbezi3fl2lp4ihnl4
arxiv.org (repository), web.archive.org (webarchive)
1910.09275v1