Comparison of real-time multi-speaker neural vocoders on CPUs

K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, H. Kawai
Acoustical Science and Technology, 2022
Because the WaveNet vocoder has an autoregressive architecture built from large convolutional layers, its inference is very slow and it is not suitable for applications that require low latency. To solve this problem, many neural vocoders have been proposed that can synthesize high-quality speech waveforms in real time. In addition, to reduce the degradation that occurs when neural vocoders synthesize the speech of unseen speakers, many approaches train neural vocoders on large multi-speaker datasets so that they acquire the ability to synthesize such speech [4]. These approaches have recently become popular because they can be combined with multi-speaker TTS or many-to-many voice conversion, and some lightweight multi-speaker neural vocoders can now achieve real-time synthesis on CPUs [5, 6].

In this paper, we investigate the performance of real-time, multi-speaker neural vocoders: HiFi-GAN [5], multiband WaveRNN with data-driven linear prediction (MWDLP) [6], and LPCNet [7]. Although official implementations of these vocoders have been released by their authors, a unified performance comparison has not been conducted. Hence, this paper reveals the performance of these models under both single-speaker and multi-speaker training. Regarding synthesis speed, we compare how the real-time factor (RTF) changes between single-core and multi-core CPUs. We believe this investigation is important because the synthesis speed of non-autoregressive models such as HiFi-GAN tends to improve remarkably as the number of CPU cores is increased [8]. Additionally, our previous research showed that the acoustic features proposed in LPCNet are robust enough that no perceptible degradation occurs in the TTS output [9]; hence, we also use these acoustic features in HiFi-GAN and MWDLP. Experimental results show that, for both single-speaker and multi-speaker synthesis, HiFi-GAN is superior in both quality and synthesis speed.
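To make the RTF comparison concrete: the real-time factor is the ratio of the time spent synthesizing a waveform to the duration of that waveform, so RTF < 1 means faster-than-real-time synthesis. The following minimal Python sketch shows one way such a measurement could be taken; the vocoder object, its synthesize() method, and the feature array are hypothetical placeholders for illustration, not the interfaces of the released implementations compared in this paper.

    import time
    import numpy as np

    def measure_rtf(vocoder, features, sample_rate=24000, n_runs=5):
        """Estimate the real-time factor (RTF) of a vocoder.

        RTF = synthesis time / duration of the synthesized audio,
        so RTF < 1.0 means faster-than-real-time synthesis.

        `vocoder` is assumed to expose a synthesize(features) method
        returning a 1-D waveform array; this interface is hypothetical.
        """
        rtfs = []
        for _ in range(n_runs):
            start = time.perf_counter()
            waveform = vocoder.synthesize(features)   # placeholder API
            elapsed = time.perf_counter() - start
            audio_seconds = len(waveform) / sample_rate
            rtfs.append(elapsed / audio_seconds)
        # median over runs reduces the effect of warm-up and scheduling noise
        return float(np.median(rtfs))

    # Example usage (hypothetical vocoder object and acoustic features):
    # rtf = measure_rtf(hifigan, acoustic_features)
    # print(f"RTF: {rtf:.3f}")

Repeating the measurement while pinning the process to one core versus several cores gives the single-core and multi-core RTF values discussed above.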
2. Real-time multi-speaker neural vocoder

We briefly introduce LPCNet, HiFi-GAN, and MWDLP, which can execute multi-speaker speech synthesis in real time on CPUs. Although the original LPCNet study evaluated only single-speaker speech synthesis, additional work [10] reported that LPCNet can perform multi-speaker speech synthesis. Although low-delay, real-time synthesis is possible for autoregressive models such as MWDLP and LPCNet, this study focuses on real-time synthesis assuming batch processing.
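As their names suggest, LPCNet and MWDLP combine a neural network with linear prediction: at each sample, a linear predictor forms an estimate from the previous output samples, and the network models only the residual excitation. The sketch below illustrates that per-sample prediction step in plain NumPy; it is a conceptual illustration only, with placeholder coefficients and a random stand-in for the network-predicted excitation, and is not taken from either paper's implementation.

    import numpy as np

    def lpc_predict(prev_samples, lpc_coeffs):
        """One linear-prediction step, as used conceptually in LPCNet/MWDLP.

        prev_samples: the last M synthesized samples, most recent last.
        lpc_coeffs:   M prediction coefficients a_1 .. a_M.
        Returns p_t = sum_i a_i * s_{t-i}.
        """
        return float(np.dot(lpc_coeffs, prev_samples[::-1]))

    # Toy example: a 4th-order predictor with illustrative coefficients.
    order = 4
    a = np.array([0.5, 0.2, 0.1, 0.05])   # placeholder coefficients, not learned values
    history = np.zeros(order)             # last `order` synthesized samples

    samples = []
    for t in range(16000):                 # one second at a toy 16 kHz rate
        p_t = lpc_predict(history, a)
        e_t = np.random.randn() * 1e-3     # stand-in for the network-predicted excitation
        s_t = p_t + e_t                    # output sample = prediction + excitation
        samples.append(s_t)
        history = np.roll(history, -1)
        history[-1] = s_t

Because the predictor handles the strongly correlated part of the waveform, the network only has to model the excitation, which is what keeps these autoregressive vocoders light enough for real-time CPU synthesis.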