Audio Sample - parakeet 0.4 documentation
The main processes of TTS include:
- Convert the original text into characters/phonemes through the text frontend module.
- Convert characters/phonemes into acoustic features, such as linear spectrograms, mel spectrograms, or LPC features, through acoustic models.
- Convert acoustic features into waveforms through vocoders.
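The three stages above compose into a single pipeline. The following is a minimal sketch of that composition only: every function name here (`toy_frontend`, `toy_acoustic_model`, `toy_vocoder`, `synthesize`) is a hypothetical placeholder rather than a Parakeet API, and each stage returns dummy values instead of running a trained model.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases for the data passed between stages.
Symbols = List[str]            # characters or phonemes
Features = List[List[float]]   # one feature vector per frame
Waveform = List[float]         # audio samples


def toy_frontend(text: str) -> Symbols:
    """Stage 1 placeholder: text -> symbols (lowercase letters and spaces)."""
    return [ch for ch in text.lower() if ch.isalpha() or ch == " "]


def toy_acoustic_model(symbols: Sequence[str], n_bins: int = 4) -> Features:
    """Stage 2 placeholder: symbols -> features (one dummy frame per symbol)."""
    return [[(ord(s) % 16) / 16.0] * n_bins for s in symbols]


def toy_vocoder(features: Features, hop: int = 2) -> Waveform:
    """Stage 3 placeholder: features -> waveform (`hop` samples per frame)."""
    return [frame[0] for frame in features for _ in range(hop)]


def synthesize(
    text: str,
    frontend: Callable[[str], Symbols] = toy_frontend,
    acoustic_model: Callable[[Symbols], Features] = toy_acoustic_model,
    vocoder: Callable[[Features], Waveform] = toy_vocoder,
) -> Waveform:
    """Chain the three stages: text -> symbols -> features -> waveform."""
    return vocoder(acoustic_model(frontend(text)))


if __name__ == "__main__":
    print(len(synthesize("Hello world")))
```

Because each stage is passed in as a callable, an acoustic model or vocoder can be swapped without touching the other stages, which is what makes pairings such as those shown on this page possible.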
When training Tacotron2, TransformerTTS and WaveFlow, we use the English single-speaker TTS dataset LJSpeech by default. When training SpeedySpeech, FastSpeech2 and ParallelWaveGAN, however, we use the Chinese single-speaker dataset CSMSC by default.
In the future, Parakeet will mainly use Chinese TTS datasets for default examples.
Here, we will display three types of audio samples:
- Analysis/synthesis (ground-truth spectrograms + Vocoder)
- TTS (Acoustic model + Vocoder)
- Chinese TTS with/without text frontend (mainly tone sandhi)
Analysis/synthesis
Audio samples generated from ground-truth spectrograms with a vocoder.
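When the spectrograms in question are mel spectrograms, the frequency axis is warped onto the mel scale before the vocoder sees it. The following is a minimal sketch of that mapping, assuming the HTK-style formula; the feature extraction actually used for these samples may rely on a different mel variant, and the function names are illustrative only.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel using the HTK-style formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_mels bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


if __name__ == "__main__":
    # The band count and frequency range are arbitrary illustration values.
    for f in mel_center_frequencies(n_mels=8, f_min=0.0, f_max=8000.0):
        print(round(f, 1))
```

The centers are dense at low frequencies and sparse at high ones, which is why a mel spectrogram has far fewer bins than the linear spectrogram it is derived from.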
LJSpeech (English)
| GT | WaveFlow |
|---|---|
| [audio samples] | [audio samples] |
CSMSC (Chinese)
| GT (converted to 24 kHz) | ParallelWaveGAN |
|---|---|
| [audio samples] | [audio samples] |
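The ground-truth column above is resampled so that it matches the vocoder's output rate. The following is a minimal sketch of sample-rate conversion using naive linear interpolation, assuming for illustration a 48 kHz source reduced to 24 kHz. `resample_linear` is a hypothetical helper, not the method used to prepare these samples. It also applies no low-pass filter, so a real pipeline would normally use a band-limited resampler to avoid aliasing.

```python
from typing import List, Sequence


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation between neighboring input samples.

    Illustration only: no anti-aliasing filter is applied before downsampling.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out: List[float] = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional index into the input
        left = int(pos)
        if left >= len(samples) - 1:       # at or past the last input sample
            out.append(float(samples[-1]))
            continue
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[left + 1] * frac)
    return out


if __name__ == "__main__":
    ramp = [float(i) for i in range(8)]
    print(resample_linear(ramp, 48000, 24000))
```

Halving the rate this way keeps every second sample of the input; the interpolation only matters for non-integer ratios or for upsampling.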
TTS
Audio samples generated by a TTS system. Text is first transformed into a spectrogram by a text-to-spectrogram model, and the spectrogram is then converted into raw audio by a vocoder.
| TransformerTTS + WaveFlow | Tacotron2 + WaveFlow |
|---|---|
| [audio samples] | [audio samples] |

| SpeedySpeech + ParallelWaveGAN | FastSpeech2 + ParallelWaveGAN |
|---|---|
| [audio samples] | [audio samples] |
Chinese TTS with/without text frontend
We provide a complete Chinese text frontend module in Parakeet. Text normalization and grapheme-to-phoneme conversion (G2P) are the most important modules in the text frontend. We assume that the texts are already normalized, and mainly compare the effect of the G2P module here.
We use FastSpeech2 + ParallelWaveGAN here.
| With Text Frontend | Without Text Frontend |
|---|---|
| [audio samples] | [audio samples] |
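Tone sandhi, the main difference between the two columns above, means that a syllable's tone changes depending on its neighbors. The following is a minimal sketch of one well-known rule, third-tone sandhi, in which a third tone followed by another third tone is pronounced as a second tone. `apply_third_tone_sandhi` is a hypothetical helper, not Parakeet's frontend. It assumes pinyin syllables with a trailing tone digit and ignores word and prosodic boundaries, which a real frontend has to consider.

```python
from typing import List


def apply_third_tone_sandhi(syllables: List[str]) -> List[str]:
    """Change tone 3 to tone 2 when the next syllable is also tone 3.

    Input: pinyin syllables with a trailing tone digit, e.g. ["ni3", "hao3"].
    The check uses the original (citation) tones, not the already-changed ones.
    """
    out = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            out[i] = syllables[i][:-1] + "2"
    return out


if __name__ == "__main__":
    print(apply_third_tone_sandhi(["ni3", "hao3"]))
```

Without this step the acoustic model receives the dictionary tones, so a phrase such as "ni3 hao3" would be synthesized with two full third tones rather than the "ni2 hao3" a native speaker produces.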