DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs
Abstract
Denoising diffusion probabilistic models (DDPMs) are expressive generative models and have been successfully applied to various speech synthesis tasks. However, their expensive sampling makes it hard to apply DDPMs in real-time speech processing applications. In this paper, we introduce DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model that achieves high-fidelity and efficient speech synthesis. DiffGAN-TTS is built on denoising diffusion generative adversarial networks (GANs), which adopt an expressive model to approximate the denoising distribution. DiffGAN-TTS allows large denoising steps, which makes the generation process efficient. We show with multi-speaker TTS experiments that DiffGAN-TTS is able to generate high-fidelity speech samples within only 4 denoising steps. To further accelerate inference, we present an active shallow diffusion mechanism. A two-stage training scheme is designed, in which a basic TTS acoustic model trained at stage one provides strong prior information for a DDPM trained at stage two. Our experiments show that DiffGAN-TTS can achieve high synthesis performance with only 1 denoising step.
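As a rough illustration of what "only 4 denoising steps" means at inference, here is a minimal NumPy sketch of a few-step reverse process in the general style of denoising diffusion GANs. It is not the authors' implementation. It assumes that the generator predicts a clean mel-spectrogram from the noisy input and that the next sample is drawn from the standard DDPM posterior q(x_{t-1} | x_t, x_0). The names `posterior_coefficients`, `sample`, and `toy_generator`, the noise schedule `betas`, and the mel shape are all hypothetical.

```python
import numpy as np


def posterior_coefficients(betas):
    """Coefficients of the Gaussian DDPM posterior q(x_{t-1} | x_t, x_0)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    # alpha_bar at the previous step, with alpha_bar = 1 before the first step.
    alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))
    coef_x0 = np.sqrt(alpha_bar_prev) * betas / (1.0 - alpha_bar)
    coef_xt = np.sqrt(alphas) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
    variance = betas * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
    return coef_x0, coef_xt, variance


def sample(generator, shape, betas, rng):
    """Run len(betas) denoising steps, starting from Gaussian noise."""
    coef_x0, coef_xt, variance = posterior_coefficients(betas)
    x_t = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        # Assumption: the generator outputs an estimate of the clean sample.
        x0_pred = generator(x_t, t)
        mean = coef_x0[t] * x0_pred + coef_xt[t] * x_t
        # No noise is added at the final step (its posterior variance is zero).
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x_t = mean + np.sqrt(variance[t]) * noise
    return x_t


# Hypothetical stand-in for a trained, text-conditioned generator:
# it ignores its inputs and always returns the same "clean" mel-spectrogram.
toy_target = np.zeros((80, 100))  # 80 mel bins x 100 frames (assumed shape)


def toy_generator(x_t, t):
    return toy_target


betas = np.linspace(0.1, 0.9, 4)  # T = 4 large steps; the values are illustrative
mel = sample(toy_generator, toy_target.shape, betas, np.random.default_rng(0))
```

With so few steps each step is large, so the true denoising distribution is far from a simple Gaussian; that is why the abstract stresses an expressive model for the denoising distribution. Setting `betas` to a single value gives a one-step sampler under the same assumptions.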
Proposed Approach Overview
Visualization of the denoising process of DiffGAN-TTS (T=4) at inference.
Visualization of the denoising process of DiffGAN-TTS (two-stage) at inference.
Multi-speaker TTS (in Mandarin Chinese)
1. Text: 感情这个东西很痛苦,我现在不太喜欢谈恋爱。 (English: "Romantic feelings are painful; I don't much like dating right now.")
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
2. Text: 哥们儿别慌,我们等得起。 (English: "Don't panic, buddy; we can afford to wait.")
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
3. Text: 一个个儿傻乎乎的,多纯情多浪漫,真是比白痴还白痴。 (English: "Every one of them is so silly, so innocent and romantic, truly dumber than an idiot.")
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
4. Text: 你先休息,我去单位晃晃,顺便通知下其他哥们儿。 (English: "You rest first; I'll drop by the office and let the other guys know while I'm at it.")
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
5. Text: 你们那儿天真冷,我们这儿还穿单衣呢。 (English: "It's really cold where you are; over here we're still wearing light clothes.")
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
6. Text: 你是学习摄影的,可以把你家的小狗当模特。 (English: "You study photography, so you could use your family's puppy as a model.")
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
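7. Text: 检察官也建议宜尽快带当事人进行治疗。 (English: "The prosecutor also recommended that the person concerned be taken for treatment as soon as possible.")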
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
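8. Text: 别人和我讲话,我一开口对方既惊喜又惊讶。 (English: "When others talk to me, as soon as I open my mouth they are both delighted and surprised.")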
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
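9. Text: 灶房里两袋土豆是全家一冬的蔬菜。 (English: "The two sacks of potatoes in the kitchen are the whole family's vegetables for the winter.")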
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
10. Text: 他选中指挥这一远征队的人是太监郑和。 (English: "The man he chose to command this expedition was the eunuch Zheng He.")
Ground Truth | FastSpeech 2 | GANSpeech | DiffSpeech | DiffGAN-TTS(T=1) | DiffGAN-TTS(T=2) | DiffGAN-TTS(T=4) | DiffGAN-TTS(Two-stage) |
---|---|---|---|---|---|---|---|
Speaker variations in DiffGAN-TTS (T=4)
1. Text: 你是学习摄影的,可以把你家的小狗当模特。 (English: "You study photography, so you could use your family's puppy as a model.")
Speaker 1 | Speaker 2 | Speaker 3 | Speaker 4 | Speaker 5 | Speaker 6 | Speaker 7 | Speaker 8 | Speaker 9 | Speaker 10 |
---|---|---|---|---|---|---|---|---|---|
2. Text: 一个个儿傻乎乎的,多纯情多浪漫,真是比白痴还白痴。 (English: "Every one of them is so silly, so innocent and romantic, truly dumber than an idiot.")
Speaker 1 | Speaker 2 | Speaker 3 | Speaker 4 | Speaker 5 | Speaker 6 | Speaker 7 | Speaker 8 | Speaker 9 | Speaker 10 |
---|---|---|---|---|---|---|---|---|---|
Ablation study
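Text | 别人和我讲话,我一开口对方既惊喜又惊讶。 (English: "When others talk to me, as soon as I open my mouth they are both delighted and surprised.") | 灶房里两袋土豆是全家一冬的蔬菜。 (English: "The two sacks of potatoes in the kitchen are the whole family's vegetables for the winter.") |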
---|---|---|
DiffGAN-TTS (T=4) | | |
Without Mel loss | | |
Without feature matching loss | | |
Add latent variable z | | |