Audio Samples from StyleTTS 2

Audio Samples from "StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models"

Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

StyleTTS 2	JETS	VITS	StyleTTS

This page contains a set of audio samples in support of the paper. Some examples are randomly selected directly from the sets we used for evaluation.

All utterances were unseen during training, and some were selected to match demo samples of non-public models (e.g., NaturalSpeech 1 & 2 or VALL-E) for comparison purposes.

For more samples, you can download our metadata, which contains all audio clips used for evaluation together with the survey results, here.

1. Single-Speaker (LJSpeech, In-Distribution Texts)
2. Single-Speaker (LJSpeech, Out-Of-Distribution Texts)
3. Multi-Speaker (VCTK)
4. Zero-shot Speaker Adaptation (LibriTTS)
5. Longform Narration
6. Speech Expressiveness
7. Speech Diversity
8. Ablation Study

1. Single-Speaker (LJSpeech, In-Distribution Texts)

Text: After the construction and action of the machine had been explained, the doctor asked the governor what kind of men he had commanded at Goree,

Ground Truth	StyleTTS 2	NaturalSpeech	JETS	VITS	StyleTTS

Text: The lax discipline maintained in Newgate was still further deteriorated by the presence of two other classes of prisoners who ought never to have been inmates of such a jail.

Ground Truth	StyleTTS 2	NaturalSpeech	JETS	VITS	StyleTTS

Text: Maltby and Co. would issue warrants on them deliverable to the importer, and the goods were then passed to be stored in neighboring warehouses.

Ground Truth	StyleTTS 2	NaturalSpeech	JETS	VITS	StyleTTS

Text: it is not possible to state with scientific certainty that a particular small group of fibers come from a certain piece of clothing.

Ground Truth	StyleTTS 2	NaturalSpeech	JETS	VITS	StyleTTS

2. Single-Speaker (LJSpeech, Out-Of-Distribution Texts)

This section contains out-of-distribution (OOD) samples with ground truth audio taken from LibriVox. All 40 clips used in our experiment can be downloaded here. Note the difference in sample quality between this section and the previous one.

Text: Then leaving the corpse within the house they go themselves to and fro about the city and beat themselves, with their garments bound up by a girdle

Ground Truth	StyleTTS 2	JETS	VITS	StyleTTS

Text: Write your name and address clearly. Mail a note and a duplicate list at the time you send the box.

Ground Truth	StyleTTS 2	JETS	VITS	StyleTTS

Text: Indeed, she is said to have angled with Napoleonic strategy for that same offer, and to have won it only after a sharp struggle of wits.

Ground Truth	StyleTTS 2	JETS	VITS	StyleTTS

Text: not as a lifeless thing, but with the same enjoyment of rest as gladdened the hearts of the two beings, who, with gratitude and love

Ground Truth	StyleTTS 2	JETS	VITS	StyleTTS

3. Multi-Speaker (VCTK)

This section contains samples from our multi-speaker VCTK model, alongside the reference audio clips that were used to generate these samples. Note that StyleTTS 2 faithfully replicates the speaking styles (e.g., background noise, pitch tone, voice, etc.) of the reference audio, so its output can sound closer to the reference than the ground truth does.

Text	Since then physicists have found that it is not reflection, but refraction by the raindrops which causes the rainbows.	Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.	She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.	Jim Wallace, the justice minister, acknowledged that prisoner numbers were a concern.
Ground Truth
Reference
StyleTTS 2
VITS

4. Zero-shot Speaker Adaptation (LibriTTS)

This section contains samples from our multi-speaker LibriTTS model. For each model, the size of the training set is provided. Note that all speakers are unseen during the training process.

Text	Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.	Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.	And lay me down in my cold bed and leave my shining lot.	The army found the people in poverty and left them in comparative wealth.
Ground Truth
Reference
StyleTTS 2 (~245 Hours)
VALL-E (~60k Hours)
NaturalSpeech 2 (~44k Hours)
StyleTTS (~245 Hours)
VITS (~245 Hours)
YourTTS (~300 Hours)

It is noteworthy that our model matches the generalization capabilities of VALL-E, including maintaining the acoustic environment and the speaker's emotion, but with roughly 250 times less training data. The following samples are taken from the official VALL-E demo page.
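The data ratio quoted above can be checked against the approximate training-set sizes listed in the table (about 60k hours for VALL-E, about 245 hours for StyleTTS 2). The snippet below is only a quick arithmetic check on those rounded figures; the variable names are illustrative.

```python
# Approximate training-set sizes in hours, as listed in the table above.
VALL_E_HOURS = 60_000
STYLETTS2_HOURS = 245

# How many times more audio VALL-E was trained on.
ratio = VALL_E_HOURS / STYLETTS2_HOURS
print(f"VALL-E used about {ratio:.0f}x more training audio than StyleTTS 2")
```

Because both inputs are rounded, the result should be read as an order-of-magnitude figure, consistent with the "roughly 250 times" stated above.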

Acoustic Environment Maintenance

Text: As friends thing I definitely I've got more male friends.

Reference	StyleTTS 2	VALL-E	Ground Truth

Text: Everything is run by computer but you got to know how to think before you can do a computer.

Reference	StyleTTS 2	VALL-E	Ground Truth

Text: Then out in LA you guys got a whole another ball game within California to worry about.

Reference	StyleTTS 2	VALL-E	Ground Truth

Speaker’s Emotion Maintenance

Text: We have to reduce the number of plastic bags.

Emotion	Anger	Sleepy	Amused	Disgusted
Reference
StyleTTS 2
VALL-E

6. Speech Expressiveness

The samples below were synthesized using texts generated by GPT-4 in four distinct emotions: happiness, sadness, anger, and surprise. These samples were generated using both LJSpeech and LibriTTS models, in support of Figure 2.

Additionally, we demonstrate the potential to synthesize expressive speech from an unseen speaker for this task using the first speaker (1221-135767) from section 4.

Text	Happy: We are happy to invite you to join us on a journey to the past, where we will visit the most amazing monuments ever built by human hands.	Sad: I am sorry to say that we have suffered a severe setback in our efforts to restore prosperity and confidence.	Angry: The field of astronomy is a joke! Its theories are based on flawed observations and biased interpretations.	Surprised: I can't believe it! You mean to tell me that you have discovered a new species of bacteria in this pond?
LJSpeech
Unseen Speaker

Text	Happy: He was a merry fellow, this Jack Sheppard, and his exploits were the talk of the town.	Sad: He was condemned to death, and suffered on the gallows at Tyburn, protesting his innocence to the last, and leaving behind him a touching farewell to his wife and children.	Angry: We must angrily reject the status quo that benefits only the rich!	Surprised: Holmes, you astound me! How did you deduce that the murderer was none other than the victim's own brother?
LJSpeech
Unseen Speaker

Style Transfer

Since our model disentangles speech and style vectors, it is capable of style transfer to any input text. This is achieved by first sampling a style conditioned on an emotional text and then synthesizing the target text with this emotional style vector.

The ensuing samples were synthesized with styles sampled through style diffusion conditioned on texts with explicit emotions. Note that neither the target text nor the reference audio contains any emotional content.
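The two-step procedure described above (sample a style from an emotional text, then reuse that style for a different target text) can be sketched as a data flow. This is a toy illustration only: `sample_style`, `synthesize`, and `STYLE_DIM` are hypothetical stand-ins invented for this sketch, not the StyleTTS 2 API, and no audio is produced.

```python
import hashlib
import random
from typing import Dict, List, Union

STYLE_DIM = 8  # hypothetical size, chosen only for this toy example


def sample_style(prompt_text: str) -> List[float]:
    """Stand-in for text-conditioned style sampling (NOT the real diffusion sampler).

    Returns a pseudo-random vector that depends only on the prompt text.
    """
    digest = hashlib.sha256(prompt_text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(STYLE_DIM)]


def synthesize(target_text: str, style: List[float]) -> Dict[str, Union[str, List[float]]]:
    """Stand-in for the synthesizer: records which text and style would be used."""
    return {"text": target_text, "style": style}


emotional_prompt = "We must angrily reject the status quo that benefits only the rich!"
target_text = (
    "Yea, his honourable worship is within, but he hath a godly minister "
    "or two with him, and likewise a leech."
)

# Step 1: sample a style from the emotional prompt only.
angry_style = sample_style(emotional_prompt)
# Step 2: speak the neutral target text with that style.
utterance = synthesize(target_text, angry_style)
print(utterance["text"])
print(len(utterance["style"]), "style dimensions")
```

The point of the sketch is that the emotional prompt influences the result only through the style vector, so the target text itself can stay emotionally neutral.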

Text: Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Angry	Happy	Sad	Surprised

7. Speech Diversity

Text: How much variation is there?

StyleTTS 2
VITS
FastDiff
StyleTTS 2 (unseen speaker)

8. Ablation Study

This section provides samples from ablated models using text from the test-clean subset of the LibriTTS dataset. The differences may be nuanced, so below we have outlined each model variant and its implications:

Baseline: Our proposed model, StyleTTS 2.
No Style Diffusion: This variant encodes style vectors from random references rather than sampling them from style diffusion, as in the original StyleTTS. The model is identical to the baseline model, except style diffusion is not used during inference. This modification impacts all aspects of the speech, including pauses, emotions, speaking rates, and sound quality, as these factors are highly correlated with the style vector, which in turn most significantly affects naturalness in our experiment.
No Prosodic Style Encoder: This model does not have a prosodic style encoder in its architecture. The primary impact of this is on the sound quality due to the divergent gradient from joint training. However, the prosody and pauses are generally natural due to the use of style diffusion and SLM adversarial training.
No SLM Discriminator: This model does not include adversarial training with speech language models (SLMs). The naturalness of the speech, specifically the prosody and pauses, may be compromised. In some cases, sound quality may also be affected as SLM discriminators also capture acoustic mismatches.
No Differentiable Upsampler: In this version, gradients are not propagated back to the duration predictor during SLM adversarial training. The primary effect is on the speech's pauses (duration), while the prosody remains largely unaffected.
No OOD Texts: This model does not use out-of-distribution (OOD) texts during adversarial training. Its impact on OOD texts is similar to that of the model lacking SLM discriminators.
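To see why the "No Differentiable Upsampler" variant above mainly affects pauses and durations, it helps to compare a hard upsampler with a smooth one. The sketch below is not the upsampler used in StyleTTS 2; it swaps in a generic Gaussian-kernel soft alignment, and the function names and the `sigma` parameter are assumptions made for illustration. It shows that rounding-based repetition ignores a small change in predicted duration (so no gradient can reach the duration predictor), while a soft alignment responds to it smoothly.

```python
import math
from typing import List


def hard_upsample(durations: List[float]) -> List[int]:
    """Repeat each token index round(d) times; the rounding blocks gradients."""
    frames: List[int] = []
    for i, d in enumerate(durations):
        frames.extend([i] * int(round(d)))
    return frames


def soft_alignment(durations: List[float], n_frames: int, sigma: float = 1.0) -> List[List[float]]:
    """Soft frame-to-token weights from Gaussian kernels centred on each token."""
    centers: List[float] = []
    end = 0.0
    for d in durations:
        centers.append(end + d / 2.0)  # midpoint of the token's time span
        end += d
    weights: List[List[float]] = []
    for t in range(n_frames):
        # Score each token by its distance from the centre of frame t.
        logits = [-((t + 0.5 - c) ** 2) / (2 * sigma ** 2) for c in centers]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])  # softmax over tokens
    return weights


durations = [2.0, 3.0, 1.0]
nudged = [2.2, 3.0, 1.0]  # slightly longer first token

# The hard alignment is identical for both inputs.
print(hard_upsample(durations) == hard_upsample(nudged))

# The soft alignment shifts smoothly with the duration.
before = soft_alignment(durations, n_frames=6)
after = soft_alignment(nudged, n_frames=6)
print(before[1][0], after[1][0])
```

In a framework with automatic differentiation, the smooth dependence in `soft_alignment` is what would let a loss on the output speech (for example, from an SLM discriminator) adjust the predicted durations.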

Text: The answer to this will depend upon the length of the play, for upon the length depends the hour at which the curtain rises.

Baseline	No Style Diffusion	No Prosodic Style Encoder	No SLM Discriminator	No Differentiable Upsampler	No OOD Texts

Text: Well, sir, we never make coffee but in the afternoon. Would you like a good bavaroise, or a decanter of orgeat?

Baseline	No Style Diffusion	No Prosodic Style Encoder	No SLM Discriminator	No Differentiable Upsampler	No OOD Texts

Text: On the evening of the day of Alexandra's call at the Shabatas', a heavy rain set in.

Baseline	No Style Diffusion	No Prosodic Style Encoder	No SLM Discriminator	No Differentiable Upsampler	No OOD Texts

Text: Ojo was hungry, though; so he divided the piece of bread upon the table and ate his half for breakfast, washing it down with fresh, cool water from the brook.

Baseline	No Style Diffusion	No Prosodic Style Encoder	No SLM Discriminator	No Differentiable Upsampler	No OOD Texts