Notes on Finetuning · yl4579/StyleTTS2 · Discussion #81

I've made a few notes during finetuning runs and figured we could pool our insights into one discussion to help everyone iterate efficiently. I don't claim these are anything more than my own observations and a compilation of useful notes. Take them with a grain of salt, especially since development is moving so rapidly as of writing this. I am not affiliated with the authors of this lovely TTS model.
Also take a look through the closed issues if you're running into trouble. There is some useful information in them.

Teaching the model new features

Text dataset quality

Robustness

Artifacts

Finetune training Stages

Base

Style Diffusion

SLM Adversarial Training

Errors, Crashes

One or both of the following conditions are present:



The input text is too long. If this is happening during training, check your dataset and split up extremely long sentences into more manageable ones. Make sure that if you use a custom OOD text, you split sentences on punctuation and ensure they don't become entire paragraphs. Anything that would take you longer than 10 seconds to speak is probably a candidate for splitting in half.
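A minimal sketch of that splitting step, in case it helps. It is not part of the StyleTTS2 repository: the function names are made up for illustration, and the speaking rate of 15 characters per second is purely an assumption used to approximate the "10 seconds to speak" rule of thumb, so tune it for your speaker.

```python
import re
from typing import List

# Assumed speaking rate used to estimate duration (hypothetical value, tune it).
CHARS_PER_SECOND = 15.0
MAX_SECONDS = 10.0


def estimate_seconds(text: str) -> float:
    """Rough duration estimate from character count (an assumption, not a measurement)."""
    return len(text) / CHARS_PER_SECOND


def split_long_text(text: str, max_seconds: float = MAX_SECONDS) -> List[str]:
    """Split text into sentences, then split overlong sentences at clause punctuation."""
    # First pass: break after sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    result: List[str] = []
    for sentence in sentences:
        if estimate_seconds(sentence) <= max_seconds:
            result.append(sentence)
            continue
        # Second pass: break after commas, semicolons, and colons, then
        # greedily regroup clauses so each piece stays under the limit.
        # A single clause that is itself too long is kept as-is.
        clauses = [c.strip() for c in re.split(r"(?<=[,;:])\s+", sentence) if c.strip()]
        current = ""
        for clause in clauses:
            candidate = f"{current} {clause}".strip()
            if current and estimate_seconds(candidate) > max_seconds:
                result.append(current)
                current = clause
            else:
                current = candidate
        if current:
            result.append(current)
    return result
```

Running every line of the training list and the OOD text through something like `split_long_text` before phonemizing should catch paragraph-sized entries. Check the pieces by ear if you can, since character count is only a crude proxy for spoken length.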


batch_size must be 2 or greater, or you will run into this error.


If this appears when your finetuning is trying to begin the SLM Adversarial Training, then your diff_epoch is set to a later epoch than joint_training. For example, diff_epoch = 5 with joint_training = 4 is not valid. You want joint_training to be the bigger number.
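The ordering rule can be checked before launching a run. The helper below is a hypothetical sketch, not something the repository provides. It takes the two values as plain integers, named as they are in this post, so how you read them out of your config file is up to you. Treating equal values as invalid is my own cautious reading of "the bigger number".

```python
def check_stage_epochs(diff_epoch: int, joint_training: int) -> None:
    """Raise if style diffusion would start at or after SLM adversarial training."""
    # The post only states that joint_training must be the bigger number;
    # rejecting equal values as well is an assumption made here.
    if joint_training <= diff_epoch:
        raise ValueError(
            f"diff_epoch ({diff_epoch}) must be smaller than "
            f"joint_training ({joint_training})"
        )


# Example from the post: this combination is rejected.
try:
    check_stage_epochs(diff_epoch=5, joint_training=4)
except ValueError as err:
    print(f"Invalid config: {err}")
```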


If you are running finetuning across multiple GPUs, your chosen batch_size may be too small and result in each GPU only getting a batch of 1. Increase the batch_size.
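Both batch size conditions come down to the same arithmetic, roughly sketched below. The function names are hypothetical, and the even split across GPUs is an assumption. The real split depends on how the training script distributes batches, so treat this as a conservative sanity check and not a guarantee.

```python
def per_gpu_batch(batch_size: int, num_gpus: int = 1) -> int:
    """Smallest per-GPU batch, assuming the batch is divided evenly (floor)."""
    return batch_size // max(num_gpus, 1)


def batch_size_is_safe(batch_size: int, num_gpus: int = 1) -> bool:
    """True if every GPU would receive at least 2 samples under that assumption."""
    return per_gpu_batch(batch_size, num_gpus) >= 2


# batch_size 4 on 4 GPUs leaves one sample per GPU, which is too small.
print(batch_size_is_safe(4, num_gpus=4))  # False
print(batch_size_is_safe(8, num_gpus=4))  # True
```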

Mixed-precision Training

If you don't want to run training in full precision, you can now run finetuning in mixed precision.

VRAM Usage

VRAM Usage Strategies

Checkpoints

Resuming Finetuning

Logging

Operating system support

Hardware Requirements

Quality comparisons

Dataset: Custom dataset for Garrus from Mass Effect. 30 emotions/styles tagged as speaker IDs, total duration about 5h50min. Custom text preprocessing. The audio includes a flanger effect. This is not a model error; it is desired for a Turian voice. A custom out-of-domain (OOD) text dataset was used for SLM adversarial training.
Epoch: 5 for all examples. There is still room for improvement with more epochs.
Sampling: alpha=0.3, beta=0.7, diffusion_steps=10, embedding_scale=1

Text:
You are reading a discussion page on Github, imagine that! I think the human saying is: "Git good!" Wonder why they didn't choose "that" name.

(The "quoted" words are used for extra emphasis in my dataset)

Phoneme version:
juː ɑːɹ ɹˈiːdɪŋ ɐ dɪskˈʌʃən pˈeɪdʒ ˌɔn ɡˈɪthʌb , ɪmˈædʒɪn ðˈæt ! ˈaɪ θˈɪŋk ðə hjˈuːmən sˈeɪɪŋ ɪz : `` ɡˈɪt ɡˈʊd '' ! wˈʌndɚ wˌaɪ ðeɪ dˈɪdnt tʃˈuːz `` ðˈæt '' nˈeɪm .

batch_size: 4, max_len: 100, without style diffusion finetuning, without SLM adversarial finetuning:
https://voca.ro/11DDidEhJac5

batch_size: 6, max_len: 800, without style diffusion finetuning, without SLM adversarial finetuning:
https://voca.ro/18PaQ8F248Hu

batch_size: 4, max_len: 100, with style diffusion finetuning, without SLM adversarial finetuning:
https://voca.ro/1bulPARTI2mn

batch_size: 2, max_len: 175, with style diffusion finetuning, with SLM adversarial finetuning:
https://voca.ro/1aqmfuqHS51N
Since running at a batch_size of 2 takes forever, I only trained the final 2 epochs with SLM adversarial finetuning. Before that, I trained up to epoch 4 with a batch_size of 4 and a max_len of 100.

batch_size: 4, max_len: 800, with style diffusion finetuning, with SLM adversarial finetuning:
https://voca.ro/11QGDDMhWsNU

These quality examples don't reflect the maximum quality possible and are just for illustration purposes. :>

Hope this is useful.