Improving very deep time-delay neural network with vertical-attention for effectively training CTC-based ASR systems

S Li, X Lu, R Takashima, P Shen, T Kawahara, H Kawai
2018 IEEE Spoken Language Technology Workshop (SLT), 2018. ieeexplore.ieee.org
Very deep neural networks have recently been proposed for speech recognition and achieve significant performance gains. They have excellent potential for integration with end-to-end (E2E) training, and connectionist temporal classification (CTC) has shown great potential in E2E acoustic modeling. In this study, we investigate deep architectures and techniques that are suitable for CTC-based acoustic modeling, and we propose a very deep residual time-delay CTC neural network (VResTD-CTC). Selecting a suitable deep architecture optimized with the CTC objective function is crucial for obtaining state-of-the-art performance. Excellent performance can be obtained by selecting deep architectures for non-E2E ASR systems that model tied-triphone states. However, these optimized structures are not guaranteed to achieve better or even comparable performance on E2E (e.g., CTC-based) systems that model dynamic acoustic units. To solve this problem and further improve system performance, we introduce a vertical-attention mechanism that reweights the residual blocks at each time step. Speech recognition experiments show that our proposed model significantly outperforms DNN and LSTM-based (both bidirectional and unidirectional) CTC baseline models.
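
The abstract does not give the formula for vertical attention, so the following is only a minimal sketch of the general idea of reweighting residual-block outputs at each time step. It assumes a softmax over per-block scores computed with a dot product against a single learned vector; the names `vertical_attention` and `score_vector`, the scoring function, and the toy inputs are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vertical_attention(block_outputs, score_vector):
    """Combine residual-block outputs with per-time-step weights.

    block_outputs: nested list shaped [num_blocks][num_frames][dim],
                   one feature vector per block and time step.
    score_vector:  hypothetical learned vector of length dim; the
                   dot-product scoring is an assumption of this sketch.
    Returns (combined, weights), shaped [num_frames][dim] and
    [num_frames][num_blocks].
    """
    num_blocks = len(block_outputs)
    num_frames = len(block_outputs[0])
    dim = len(score_vector)
    combined, all_weights = [], []
    for t in range(num_frames):
        # Score every block at this time step, then normalize across blocks.
        scores = [
            sum(w * x for w, x in zip(score_vector, block_outputs[b][t]))
            for b in range(num_blocks)
        ]
        weights = softmax(scores)
        # Weighted sum across the depth ("vertical") axis.
        frame = [
            sum(weights[b] * block_outputs[b][t][d] for b in range(num_blocks))
            for d in range(dim)
        ]
        combined.append(frame)
        all_weights.append(weights)
    return combined, all_weights


# Toy input: 3 blocks, 2 time steps, 2-dimensional features.
blocks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
# A zero score vector gives every block the same weight at each time step.
combined, weights = vertical_attention(blocks, score_vector=[0.0, 0.0])
```

In this sketch the weights are recomputed for every frame, so different time steps can emphasize different depths of the network. A trained system would learn the scoring parameters jointly with the CTC objective.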