DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

Anonymous Authors

Abstract

Denoising diffusion probabilistic models (DDPMs) are expressive generative models that have been successfully applied to various speech synthesis tasks. However, their expensive sampling makes them hard to use in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model capable of high-fidelity and efficient speech synthesis. DiffGAN-TTS is built on denoising diffusion generative adversarial networks (GANs), which adopt an expressive model to approximate the denoising distribution. As a result, DiffGAN-TTS can take large denoising steps, which makes the generation process efficient. We show with multi-speaker TTS experiments that DiffGAN-TTS can generate high-fidelity speech samples with only 4 denoising steps. To further accelerate inference, we present an active shallow diffusion mechanism. We design a two-stage training scheme in which a basic TTS acoustic model trained at stage one provides strong prior information for a DDPM trained at stage two. Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.
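To make the few-step sampling idea concrete, the following is a minimal sketch of a reverse process with T=4 large denoising steps. It assumes the common denoising diffusion GAN formulation, in which a generator predicts a clean sample from the noisy input and the next state is drawn from the Gaussian posterior q(x_{t-1} | x_t, x_0). The names `posterior_sample`, `sample`, and `generator`, the variance schedule `betas`, and the toy generator are all hypothetical and are not taken from the DiffGAN-TTS implementation.

```python
import math
import random

def posterior_sample(x0_pred, x_t, t, betas, rng):
    """Draw x_{t-1} ~ q(x_{t-1} | x_t, x0_pred) for a 1-indexed step t."""
    # Cumulative product of (1 - beta) up to step t-1; equals 1 when t == 1.
    alpha_bar_prev = math.prod(1.0 - b for b in betas[:t - 1])
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alpha_bar_prev * alpha_t
    # Standard DDPM posterior mean coefficients and standard deviation.
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    std = math.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t)
    return [coef_x0 * a + coef_xt * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0_pred, x_t)]

def sample(generator, cond, num_values, betas, seed=0):
    """Run len(betas) denoising steps, starting from Gaussian noise."""
    rng = random.Random(seed)
    x_t = [rng.gauss(0.0, 1.0) for _ in range(num_values)]
    for t in range(len(betas), 0, -1):
        # Assumption: the generator predicts a clean sample x_0 from (x_t, t, cond).
        x0_pred = generator(x_t, t, cond)
        x_t = posterior_sample(x0_pred, x_t, t, betas, rng)
    return x_t

# Hypothetical schedule with T=4 large steps, and a toy stand-in generator
# that ignores x_t and simply returns the conditioning values.
betas = [0.1, 0.4, 0.7, 0.9]
mel = sample(lambda x_t, t, cond: list(cond), [0.5] * 8, 8, betas)
```

In this sketch the posterior variance is zero at the final step (t=1), so the output of the last step is the generator's prediction itself. In a real system the generator would be a trained network conditioned on text and speaker information, and `mel` would be a mel-spectrogram rather than a flat list of eight values.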

Proposed Approach Overview



Visualization of the denoising process of DiffGAN-TTS (T=4) at inference.



Visualization of the denoising process of DiffGAN-TTS (two-stage) at inference.



Multi-speaker TTS (in Mandarin Chinese)

1. Text: 感情这个东西很痛苦,我现在不太喜欢谈恋爱。 (English: "Feelings are a painful thing; I am not that keen on dating right now.")

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

2. Text: 哥们儿别慌,我们等得起。 (English: "Don't panic, buddy, we can afford to wait.")

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

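3. Text: 一个个儿傻乎乎的,多纯情多浪漫,真是比白痴还白痴。 (English: "Every one of them is so silly, so innocent and romantic, truly more foolish than a fool.")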

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

4. Text: 你先休息,我去单位晃晃,顺便通知下其他哥们儿。 (English: "You rest first; I'll drop by the office and let the other guys know while I'm there.")

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

5. Text: 你们那儿天真冷,我们这儿还穿单衣呢。 (English: "It is really cold where you are; here we are still wearing light clothes.")

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

6. Text: 你是学习摄影的,可以把你家的小狗当模特。 (English: "You study photography, so you could use your family's puppy as a model.")

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

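7. Text: 检察官也建议宜尽快带当事人进行治疗。 (English: "The prosecutor also recommended that the person concerned be taken for treatment as soon as possible.")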

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

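8. Text: 别人和我讲话,我一开口对方既惊喜又惊讶。 (English: "When others talk to me, as soon as I open my mouth they are both delighted and astonished.")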

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

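9. Text: 灶房里两袋土豆是全家一冬的蔬菜。 (English: "The two sacks of potatoes in the kitchen are the whole family's vegetables for the winter.")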

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)

10. Text: 他选中指挥这一远征队的人是太监郑和。 (English: "The person he chose to command this expedition was the eunuch Zheng He.")

Ground Truth FastSpeech 2 GANSpeech DiffSpeech DiffGAN-TTS(T=1) DiffGAN-TTS(T=2) DiffGAN-TTS(T=4) DiffGAN-TTS(Two-stage)


Speaker variations in DiffGAN-TTS (T=4)

1. Text: 你是学习摄影的,可以把你家的小狗当模特。 (English: "You study photography, so you could use your family's puppy as a model.")

Speaker 1 Speaker 2 Speaker 3 Speaker 4 Speaker 5 Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10

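2. Text: 一个个儿傻乎乎的,多纯情多浪漫,真是比白痴还白痴。 (English: "Every one of them is so silly, so innocent and romantic, truly more foolish than a fool.")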

Speaker 1 Speaker 2 Speaker 3 Speaker 4 Speaker 5 Speaker 6 Speaker 7 Speaker 8 Speaker 9 Speaker 10


Ablation study

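Text 1: 别人和我讲话,我一开口对方既惊喜又惊讶。 (English: "When others talk to me, as soon as I open my mouth they are both delighted and astonished.")
Text 2: 灶房里两袋土豆是全家一冬的蔬菜。 (English: "The two sacks of potatoes in the kitchen are the whole family's vegetables for the winter.")
DiffGAN-TTS (T=4)
Without mel loss
Without feature matching loss
With added latent variable z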