Audio Sample (parakeet 0.4 documentation)

A TTS pipeline has three main stages:

Convert the original text into characters/phonemes through the text frontend module.
Convert characters/phonemes into acoustic features, such as linear spectrograms, mel spectrograms, or LPC features, through acoustic models.
Convert acoustic features into waveforms through vocoders.
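
The three stages can be sketched as a chain of functions. The following is a toy illustration only: `toy_frontend`, `toy_acoustic_model`, `toy_vocoder`, and `synthesize` are hypothetical names, not Parakeet APIs, and the "models" emit dummy numbers so that only the data flow (text to tokens, tokens to feature frames, frames to samples) is shown.

```python
from typing import List

Features = List[List[float]]  # frames x feature dimensions
Waveform = List[float]


def toy_frontend(text: str) -> List[str]:
    """Stand-in for a text frontend: lowercase, keep letters and spaces as tokens."""
    return [ch for ch in text.lower() if ch.isalpha() or ch == " "]


def toy_acoustic_model(tokens: List[str], n_dims: int = 4,
                       frames_per_token: int = 2) -> Features:
    """Stand-in for an acoustic model: emit dummy feature frames for each token."""
    frames: Features = []
    for tok in tokens:
        value = (ord(tok) % 32) / 32.0  # arbitrary placeholder feature value
        frames.extend([[value] * n_dims for _ in range(frames_per_token)])
    return frames


def toy_vocoder(features: Features, hop_length: int = 8) -> Waveform:
    """Stand-in for a vocoder: expand every frame into hop_length samples."""
    samples: Waveform = []
    for frame in features:
        level = sum(frame) / len(frame)
        samples.extend([level] * hop_length)
    return samples


def synthesize(text: str) -> Waveform:
    """Chain the three stages: frontend -> acoustic model -> vocoder."""
    tokens = toy_frontend(text)
    features = toy_acoustic_model(tokens)
    return toy_vocoder(features)


wav = synthesize("Hello, TTS!")
print(len(wav))  # 9 tokens * 2 frames * 8 samples = 144
```

Keeping the stages as separate functions mirrors how the samples on this page pair different acoustic models with different vocoders.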

When training Tacotron2, TransformerTTS, and WaveFlow, we use the English single-speaker TTS dataset LJSpeech by default. When training SpeedySpeech, FastSpeech2, and ParallelWaveGAN, we use the Chinese single-speaker dataset CSMSC by default.

In the future, Parakeet will mainly use Chinese TTS datasets for default examples.

Here, we will display three types of audio samples:

Analysis/synthesis (ground-truth spectrograms + Vocoder)
TTS (Acoustic model + Vocoder)
Chinese TTS with/without text frontend (mainly tone sandhi)

Analysis/synthesis

Audio samples generated from ground-truth spectrograms with a vocoder.
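
The "analysis" half of analysis/synthesis means extracting a spectrogram from the recorded waveform; the vocoder then resynthesizes audio from it. As a rough sketch of that extraction step, the code below computes a linear magnitude spectrogram with a Hann window and a plain DFT in pure Python. The frame length, hop length, and toy sine input are assumptions chosen for illustration, not the settings used for the samples here, and a real pipeline would use an FFT library and typically a mel filterbank.

```python
import cmath
import math
from typing import List


def frame_signal(samples: List[float], frame_length: int,
                 hop_length: int) -> List[List[float]]:
    """Split a waveform into overlapping frames (a trailing partial frame is dropped)."""
    return [
        samples[start:start + frame_length]
        for start in range(0, len(samples) - frame_length + 1, hop_length)
    ]


def magnitude_spectrum(frame: List[float]) -> List[float]:
    """Apply a Hann window to one frame and return DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(w * cmath.exp(-2j * math.pi * k * i / n)
                for i, w in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def linear_spectrogram(samples: List[float], frame_length: int = 64,
                       hop_length: int = 16) -> List[List[float]]:
    """Return a frames x bins magnitude spectrogram."""
    return [magnitude_spectrum(f)
            for f in frame_signal(samples, frame_length, hop_length)]


# Toy "ground-truth" audio: a 1 kHz sine sampled at 16 kHz.
sample_rate = 16000
waveform = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

spec = linear_spectrogram(waveform)
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
print(len(spec), len(spec[0]), peak_bin)  # 13 frames, 33 bins, peak at bin 4 (1 kHz)
```

With 64-sample frames at 16 kHz, each bin spans 250 Hz, so the 1 kHz tone lands in bin 4.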

LJSpeech (English)

GT	WaveFlow
[audio samples]	[audio samples]

CSMSC (Chinese)

GT (converted to 24 kHz)	ParallelWaveGAN
[audio samples]	[audio samples]

TTS

Audio samples generated by a TTS system. Text is first transformed into a spectrogram by a text-to-spectrogram (acoustic) model, and the spectrogram is then converted into raw audio by a vocoder.

TransformerTTS + WaveFlow	Tacotron2 + WaveFlow
[audio samples]	[audio samples]

SpeedySpeech + ParallelWaveGAN	FastSpeech2 + ParallelWaveGAN
[audio samples]	[audio samples]

Chinese TTS with/without text frontend

We provide a complete Chinese text frontend module in Parakeet. Text normalization and grapheme-to-phoneme conversion (G2P) are the most important modules in the text frontend. We assume that the texts are already normalized, and mainly compare the G2P module here.

We use FastSpeech2 + ParallelWaveGAN here.
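
To give a sense of what tone sandhi handling in a G2P module involves, here is a minimal sketch of one well-known Mandarin rule: when two third-tone syllables are adjacent, the first is pronounced with the second tone. This is not Parakeet's implementation. The function name `apply_third_tone_sandhi` is hypothetical, the input is assumed to be the pinyin syllables of a single word with trailing tone digits, and runs of three or more third tones (whose realization depends on word grouping) are handled only in the simplest way.

```python
from typing import List


def apply_third_tone_sandhi(syllables: List[str]) -> List[str]:
    """Change tone 3 to tone 2 for any syllable directly followed by a tone-3 syllable.

    Decisions are made on the original tones, so a run such as 3-3-3 becomes 2-2-3.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            result[i] = syllables[i][:-1] + "2"
    return result


print(apply_third_tone_sandhi(["ni3", "hao3"]))     # ['ni2', 'hao3']
print(apply_third_tone_sandhi(["zhong1", "guo2"]))  # ['zhong1', 'guo2'] (unchanged)
```

A frontend that skips such rules passes the dictionary tones straight to the acoustic model, which is one way the two columns of samples can differ.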

With Text Frontend	Without Text Frontend
[audio samples]	[audio samples]
