Deep neural network-based power spectrum reconstruction to improve quality of vocoded speech with limited acoustic parameters

T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, H. Kawai
Acoustical Science and Technology, 2018 (jstage.jst.go.jp)
The vocoder in statistical parametric speech synthesis (SPSS) is the module that converts acoustic features, estimated from linguistic information by acoustic models, into speech waveforms. Vocoders ranging from a simple mel-log spectrum approximation (MLSA) filter with pulse excitation and mel-cepstrum [1] to high-quality ones such as STRAIGHT [2] and WORLD [3] have been investigated. However, these high-quality vocoders are designed to analyze and resynthesize high-quality speech, and they require a large number of acoustic parameters to synthesize speech with the same quality as the original; they are not designed for TTS. Because of constraints on the number of parameters, the parameter set must be reduced to apply these high-quality vocoders to SPSS [4]. This reduction deteriorates synthesis quality even if the acoustic model estimates the acoustic parameters perfectly. In other words, speech quality in TTS reaches a ceiling determined by vocoder performance; herein a method to raise this upper limit is investigated (a sketch of the parameter-reduction bottleneck is given below).

In SPSS, an acoustic model is trained from speech corpora and maximum-likelihood model parameters are estimated. Recently, deep neural networks (DNNs) have been introduced for acoustic model training in SPSS, improving synthesis accuracy compared with the conventional hidden Markov model (HMM) [5, 6]. Additionally, corpus-dependent high-quality vocoders based on DNNs have been investigated [7, 8], whereas the conventional high-quality vocoders described above [2, 3] are corpus-independent. Although corpus-dependent DNN vocoders improve speech quality over the conventional STRAIGHT vocoder in both HMM- and DNN-based speech synthesis [7], their synthesis quality depends greatly on the estimation accuracy of glottal closure instants [9]. Neural network-based vocoders such as WaveNet and SampleRNN [8] require a very large speech corpus for high-quality synthesis.
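To make the bottleneck concrete, the following sketch (our illustration, not code from the paper; it assumes the third-party pyworld and pysptk packages and a 16 kHz corpus) compresses WORLD's full-resolution spectral envelope to a low-order mel-cepstrum, as is typical in SPSS, and resynthesizes. The detail lost in this round trip is the quality ceiling the proposed method targets.

```python
# Illustrative sketch only: WORLD analysis/synthesis with the spectral
# envelope compressed to a limited mel-cepstral parameter set.
import numpy as np
import pyworld     # WORLD vocoder bindings (assumed dependency)
import pysptk      # SPTK bindings for mel-cepstral conversion (assumed)
import soundfile as sf

x, fs = sf.read("speech.wav")        # hypothetical 16 kHz input file
x = np.ascontiguousarray(x, dtype=np.float64)
assert fs == 16000                   # alpha below assumes 16 kHz

# WORLD analysis: F0, full-resolution spectral envelope, aperiodicity.
f0, sp, ap = pyworld.wav2world(x, fs)

# Compress the envelope to a 24th-order mel-cepstrum (a typical SPSS
# parameter budget), then expand it back; spectral detail beyond the
# model order is smoothed out here.
order, alpha = 24, 0.41
mc = pysptk.sp2mc(sp, order=order, alpha=alpha)
fftlen = (sp.shape[1] - 1) * 2
sp_limited = pysptk.mc2sp(mc, alpha=alpha, fftlen=fftlen)

# Resynthesize: "vocoded" speech with limited acoustic parameters.
y = pyworld.synthesize(f0, np.ascontiguousarray(sp_limited), ap, fs)
sf.write("vocoded.wav", y, fs)
```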
This paper proposes a method to improve vocoded speech quality that is applicable to arbitrary vocoders with limited acoustic parameters. The proposed method reconstructs high-quality speech signals from the vocoded ones with DNNs. As a first attempt, a neural network is trained to reconstruct the power spectrum of the original speech waveform from that of the vocoded one. Experiments are conducted on both analysis-synthesis speech and SPSS with a Japanese female speech corpus. Whereas conventional DNN-based postfiltering reconstructs the mel-cepstral and STRAIGHT spectral coefficients from the over-smoothed acoustic parameters estimated by HMM-based acoustic models [10], the proposed method directly reconstructs the power spectrum of the vocoded speech waveform, raising the upper limit of vocoders with limited acoustic parameters.
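For concreteness, a minimal PyTorch sketch of the core mapping follows. This is our reconstruction under stated assumptions, not the authors' code: a feed-forward network trained with a mean-squared-error loss maps frame-wise log power spectra of vocoded speech to those of the time-aligned original recordings; the layer sizes, FFT length, and training loop are illustrative.

```python
# Hypothetical sketch: DNN mapping vocoded -> natural log power spectra.
import torch
import torch.nn as nn

FFT_BINS = 513  # e.g., a 1024-point FFT yields 513 bins (assumed)

class PowerSpectrumReconstructor(nn.Module):
    def __init__(self, bins=FFT_BINS, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, bins),
        )

    def forward(self, x):  # x: (n_frames, bins) log power spectra
        return self.net(x)

model = PowerSpectrumReconstructor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Dummy stand-ins for time-aligned (vocoded, natural) frame pairs;
# real training would iterate over a speech corpus.
vocoded = torch.randn(256, FFT_BINS)
natural = torch.randn(256, FFT_BINS)

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(vocoded), natural)
    loss.backward()
    opt.step()
```

At synthesis time, the reconstructed power spectrum would be combined with phase information (for instance, the phase of the vocoded waveform) and inverted back to a time-domain signal; the exact spectral configuration and inversion procedure are not specified in this excerpt.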