A multimodal simultaneous interpretation prototype: Who said what

X Wang, M Utiyama, E Sumita - … of the 15th Biennial Conference of …, 2022 - aclanthology.org
Abstract
“Who said what” is essential for users to understand video streams that have more than one speaker, but conventional simultaneous interpretation systems merely present “what was said” in the form of subtitles. Because translations unavoidably have delays and errors, users often find it difficult to trace the subtitles back to speakers. To address this problem, we propose a multimodal SI system that presents users with “who said what”. Our system takes audio-visual approaches to recognize the speaker of each sentence, and then annotates its translation with the textual tag and face icon of the speaker, so that users can quickly understand the scenario. Furthermore, our system is capable of interpreting video streams in real time on a single desktop equipped with two Quadro RTX 4000 GPUs, owing to an efficient sentence-based architecture.
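The annotation step the abstract describes can be sketched minimally: each sentence-level translation is prefixed with the recognized speaker's textual tag so that subtitles convey "who said what". The speaker names, sentences, and the `annotate` helper below are invented for illustration; the paper's actual audio-visual speaker recognition and face-icon rendering are not reproduced here.

```python
# Sketch, under assumptions: speaker identity is already recognized per
# sentence (the paper does this with audio-visual cues); we only show
# how a translated sentence could be tagged with its speaker.
from dataclasses import dataclass


@dataclass
class Subtitle:
    speaker: str      # speaker label recognized upstream (hypothetical)
    translation: str  # sentence-level translation text


def annotate(subtitles):
    """Prefix each translated sentence with its speaker tag."""
    return [f"[{s.speaker}] {s.translation}" for s in subtitles]


stream = [
    Subtitle("Alice", "Good morning, everyone."),
    Subtitle("Bob", "Let's begin the meeting."),
]
for line in annotate(stream):
    print(line)
```

Printing one tagged line per sentence mirrors the subtitle view the system presents; the real system additionally attaches a face icon to each tag.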