Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture

M Fujimoto, H Kawai - Interspeech, 2021 - isca-archive.org
Abstract
This paper addresses a noise-robust automatic speech recognition (ASR) method under the constraints of real-time, one-pass, and single-channel processing. Under such strong constraints, single-channel speech enhancement becomes a key technology because methods that require multiple passes or batch processing, such as acoustic model adaptation, are unsuitable. However, single-channel speech enhancement often degrades ASR performance due to speech distortion. To overcome this problem, we propose a noise-robust acoustic modeling method based on the stream-wise transformer model. The proposed method accepts as input multi-stream features obtained by multiple single-channel speech enhancement methods and, using multi-head attention to attend to the noteworthy stream, selectively uses the feature stream appropriate to the noise environment. Because the proposed method computes attention along the stream direction instead of the time-series direction, it is capable of real-time and low-latency processing. Comparative evaluations reveal that the proposed method successfully improves the accuracy of ASR in noisy environments and reduces the number of model parameters even under strong constraints.
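The idea of attending over streams rather than over time can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the function name `stream_wise_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the `(T, S, D)` input layout are all assumptions made for illustration. It only shows that, when attention is taken across the S enhanced feature streams of a single frame, each frame's output depends on that frame alone, which is what makes frame-by-frame (low-latency) processing possible.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def stream_wise_attention(feats, w_q, w_k, w_v, num_heads):
    """Multi-head self-attention across feature streams (hypothetical sketch).

    feats: array of shape (T, S, D) with T frames, S streams
           (e.g. outputs of S enhancement front ends), D feature dims.
    w_q, w_k, w_v: (D, D) projection matrices (assumed, untrained here).
    Returns (output of shape (T, S, D), weights of shape (T, H, S, S)).
    """
    T, S, D = feats.shape
    if D % num_heads != 0:
        raise ValueError("D must be divisible by num_heads")
    d_head = D // num_heads

    def split_heads(x):
        # (T, S, D) -> (T, H, S, d_head)
        return x.reshape(T, S, num_heads, d_head).transpose(0, 2, 1, 3)

    q = split_heads(feats @ w_q)
    k = split_heads(feats @ w_k)
    v = split_heads(feats @ w_v)

    # Scores compare streams with each other within the same frame only;
    # no term mixes information from different time steps.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # (T, H, S, S)
    weights = softmax(scores, axis=-1)

    out = weights @ v                                   # (T, H, S, d_head)
    out = out.transpose(0, 2, 1, 3).reshape(T, S, D)    # merge heads
    return out, weights
```

Calling the function on the first frame alone gives the same result for that frame as calling it on the whole utterance, since no operation crosses the time axis. A real model would add learned projections, an output layer, and a way to pool the streams before the acoustic model; those details are not specified in the abstract and are left out here.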