Notes on Finetuning · yl4579/StyleTTS2 · Discussion #81

I've made a few notes during finetuning runs and figure we could maybe pool our insights into one discussion to help everyone iterate efficiently. I don't claim these to be anything more than my own observations/compilation of useful notes. Take them with a grain of salt, especially since there is such rapid development happening as of writing this. I am not affiliated with the authors of this lovely TTS model.
Also take a look through the closed issues if you're running into trouble. There is some useful information in them.

Teaching the model new features

Is possible.
If your dataset contains punctuation that was not present in the original dataset for the finetuned checkpoint (but is accepted by the espeak phonemizer and passed through), then the model will learn the new features relatively quickly if they have a good match in the audio files.
It is also possible to make the model forget unwanted features by never including them.

Text dataset quality

The phonemization+tokenization processing of StyleTTS2 distinguishes between opening and closing quote pairs and tags them differently in the tokenized phoneme transcription. The model picks up on these differences. But this does not work for stray quotes.
Preprocess the text carefully to maximize the amount of sentence punctuation that espeak carries through to the phonemized output. Ensure that punctuation matches pauses. This will make the model a lot more predictable and less likely to skip over punctuation.
The LibriTTS dataset has poor punctuation and a mismatch of spoken/unspoken pauses with the transcripts. This is a common oversight in many datasets.
It also lacks variety in punctuation. In the field, you may encounter texts with creative use of dashes, pauses and combinations of quotes and punctuation. LibriTTS lacks those cases. But the model can learn these!
Additionally, LibriTTS has stray quotes in some texts, or begins a sentence with a quote. These things reduce quality a little (or a lot, sometimes). You will want to filter those out.
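A rough sketch of the kind of filter I mean, for the quote issues described above. The function names are made up for illustration and the rules are only my own guess at "stray quote": an odd number of straight quotes, mismatched curly quotes, or a line that begins with a quote. Adjust to your own dataset conventions.

```python
def has_quote_problems(text: str) -> bool:
    """Return True if the transcript has unbalanced or leading double quotes."""
    stripped = text.strip()
    # An odd number of straight double quotes means at least one is stray.
    if stripped.count('"') % 2 != 0:
        return True
    # Curly quotes must pair up: every opening quote needs a closing one.
    if stripped.count("\u201c") != stripped.count("\u201d"):
        return True
    # The notes above also flag sentences that begin with a quote.
    return stripped.startswith(('"', "\u201c"))


def filter_transcripts(lines):
    """Keep only the transcripts without quote problems (hypothetical helper)."""
    return [line for line in lines if not has_quote_problems(line)]
```

Run it over the text column of your training list before phonemizing, and review what it throws out by hand before deleting anything.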

Robustness

StyleTTS2 seems quite robust overall.
The model can be trained with audiofiles that have baked-in effects. They will be reproduced.

Artifacts

Artifacts seem to be, in part, the result of a max_len that is too short (in addition to poor audio cleanup and a low quality transcription, of course).
At a max_len of 100 (=1.25 seconds), finetuning is possible, but the start and end of generated audio may accumulate distortion and pops.
At a max_len of 800 (=10 seconds), quality is excellent even after one epoch and improves on subsequent iterations. This length covers the majority of audio datasets. (As you may know, the free standard datasets adhere to the duration limits that autoregressive models like Tacotron 1/2 established years ago, because their attention mechanism implodes after 10-12 seconds.)
max_len of 400 and 600 also works well.
For small datasets (shorter than 1 hour) you can generally set this value much higher than on big datasets, since overall the VRAM usage is lower to start with.
Providing a clean reference audio file to compute the style helps a great deal in mitigating artifacts.
Consider adding 100ms of silence to the end of all audio files in your dataset, and add a stop-token to the end of all of your dataset sentences. More details further down in the comment discussions. This works really well and can massively reduce artifacts at the end of long generated sentences.
Refer to the repo author's comment below if you wish to use $ as a stoptoken.
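To make the numbers above a bit more tangible, here is a small sketch of the max_len to seconds conversion and of the 100ms silence padding. The sample rate of 24000 and hop length of 300 are assumptions (they match the 100 = 1.25 seconds and 800 = 10 seconds figures above), and both helper names are made up. The padding works on a plain list of PCM samples; how you load and save your audio is up to you.

```python
SAMPLE_RATE = 24000  # assumption: sample rate used by your config
HOP_LENGTH = 300     # assumption: mel-spectrogram hop length used by your config


def max_len_to_seconds(max_len: int) -> float:
    """Convert max_len (counted in mel frames) to seconds of audio."""
    return max_len * HOP_LENGTH / SAMPLE_RATE


def append_silence(samples, silence_ms: int = 100):
    """Return a copy of the PCM samples with trailing digital silence added."""
    n_pad = SAMPLE_RATE * silence_ms // 1000
    return list(samples) + [0] * n_pad
```

If your files use a different sample rate or hop length, change the two constants, otherwise the conversion will be off.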

Finetune training Stages

Base

You can train with both Style Diffusion and SLM Adversarial Training disabled.
If your dataset is relatively normal (read: boring, human, similar to the LibriTTS voices), skipping finetuning style diffusion can work, but won't deliver all that this model architecture is capable of.

Style Diffusion

config parameter name: diff_epoch
This parameter starts counting at 0. For example, to start diffusion training on epoch 5, set this parameter to 4 (5 - 1).
You can disable style diffusion training by setting diff_epoch to a value that is larger than your total number of epochs.
For large datasets, having it start on the second epoch can work.
For smaller datasets, start it on a later epoch as you will need to iterate through many epochs anyway. Look at the defaults of the finetune config file that ships with this repo.
Not finetuning this stage saves some VRAM at the expense of worse inference quality.
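Since the off-by-one is easy to get wrong, here is a tiny sketch of the arithmetic. Both helper names are hypothetical, this is not code from the repo, just the rule from above written down.

```python
def start_epoch_to_config(start_epoch: int) -> int:
    """Convert a 1-based 'start on epoch N' into the 0-based config value."""
    if start_epoch < 1:
        raise ValueError("start_epoch is 1-based and must be at least 1")
    return start_epoch - 1


def disabled_value(total_epochs: int) -> int:
    """Return a value larger than the total number of epochs, which disables the stage."""
    return total_epochs + 1
```

So "start on epoch 5" becomes a config value of 4, and with 50 total epochs any value above 50 keeps the stage switched off.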

SLM Adversarial Training

config parameter name: joint_epoch
This parameter starts counting at 0. For example, to start SLM adversarial training on epoch 10, set this parameter to 9 (10 - 1).
joint_epoch must be set to a higher number than diff_epoch, or you will encounter an error. You cannot run SLM Adversarial Training before you begin running Style Diffusion training.
You can disable SLM adversarial training by setting joint_epoch to a value that is larger than your total number of epochs.
Additional noteworthy parameter: batch_percentage
Defaults to 0.5. Adds (batch_size * 0.5) SLM Adversarial samples to each batch.
Calculate your batch size accordingly with some spare VRAM if you plan to make use of it.
(For example, if the previous batch size was 6 without SLM adv. training, the new batch size would be 4, since 4 * 0.5 = 2 samples get added, totaling 6 again.)
SLM Adv training is (moderately) heavy on computational resources and VRAM.
If you run out of memory, lower your batch_size by 2 and resume finetuning from your last saved checkpoint. Do not lower your batch_size below 2.
Make sure to save your checkpoints at the right intervals so you do not lose progress.
You can also be cheeky and finish a finetuning run with SLM Adv. training disabled, and then resume finetuning from your final checkpoint, with SLM enabled, and adjusted batch_size, for a few extra epochs.
Not finetuning this stage lowers computational cost and lowers inference quality.
You can reduce the VRAM usage of this stage by adjusting the min_len and max_len under the slmadv_params section in the config file within reason.
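The two rules of thumb from this section (joint_epoch after diff_epoch, and shrinking the batch size to make room for the SLM share) can be sketched like this. The function names are made up, and the batch size estimate is only my own back-of-the-envelope reading of batch_percentage, not something taken from the training code.

```python
def check_stage_order(diff_epoch: int, joint_epoch: int) -> None:
    """Raise if SLM adversarial training would start before style diffusion."""
    if joint_epoch <= diff_epoch:
        raise ValueError(
            f"joint_epoch ({joint_epoch}) must be larger than diff_epoch ({diff_epoch})"
        )


def batch_size_with_slm(previous_batch_size: int, batch_percentage: float = 0.5) -> int:
    """Estimate a batch_size whose total, including the extra SLM share,
    stays at or below the batch size that fit into VRAM before."""
    estimate = int(previous_batch_size / (1 + batch_percentage))
    # Never go below 2, as noted above.
    return max(estimate, 2)
```

With the example from above, a previous batch size of 6 gives 4. Treat the result as a starting point and still leave some VRAM headroom.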

Errors, Crashes

RuntimeError: Calculated padded input size per channel: (5 x 4). Kernel size: (5 x 5). Kernel size can't be greater than actual input size

One or both of the following conditions are present:

A max_len of less than 100
Audiofiles that are significantly shorter than 1 second.
The fix is very simple though: Remove short <1s audiofiles or merge them into longer files with merged transcripts. Ensure that max_len is at least 100.
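To find the offending files, something along these lines should do. It is a sketch that assumes your dataset consists of plain PCM WAV files (Python's built-in wave module cannot read other formats), and the helper names are made up.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def split_by_duration(paths, min_seconds: float = 1.0):
    """Split paths into (long_enough, too_short) lists."""
    keep, short = [], []
    for path in paths:
        if wav_duration_seconds(path) >= min_seconds:
            keep.append(path)
        else:
            short.append(path)
    return keep, short
```

The second list is what you would remove or merge into longer files, together with the matching transcripts.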

Codepage error under Windows ( UnicodeDecodeError: 'charmap' codec can't decode byte .. )
Check the Operating Systems section below.

RuntimeError: The expanded size of the tensor (SOME NUMBER HERE) must match the existing size (512) at non-singleton dimension 1.

The input text is too long. If this is happening during training, check your dataset and split up extremely long sentences into more manageable ones. Make sure that if you use a custom OOD text, you split sentences on punctuation and ensure they don't become entire paragraphs. Anything that would take you longer than 10 seconds to speak is probably a candidate for splitting in half.
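A minimal sketch of the splitting I have in mind. The function name and the max_chars limit of 150 are assumptions: the limit is just a stand-in for "roughly 10 seconds of speech" that you will want to tune for your speaker and language. A single sentence that is longer than the limit is left whole, so check those by hand.

```python
import re


def split_long_text(text: str, max_chars: int = 150):
    """Split text at sentence-ending punctuation, then pack sentences into chunks."""
    # Split after ., ! or ? followed by whitespace; the punctuation stays attached.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```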

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

batch_size must be 2 or greater, or you will run into this error.

UnboundLocalError: local variable 'ref' referenced before assignment

If this appears when your finetuning is trying to begin the SLM Adversarial Training, then your diff_epoch is set to a later epoch than joint_epoch. For example, diff_epoch = 5, joint_epoch = 4 is not valid. You would want joint_epoch to be the bigger number.

RuntimeError: Given groups=1, weight of size [1, 1, 3], expected input[1, 221, 1] to have 1 channels, but got 221 channels instead

If you are running finetuning across multiple GPUs, your chosen batch_size may be too small and result in each GPU only getting a batch of 1. Increase the batch_size.

Mixed-precision Training

If you don't want to run training in full precision, you can now run finetuning at mixed-precision.

accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Path/To/Your/config_ft.yml
Mixed-precision training only works with a single GPU:
Multi-GPU accelerate finetuning/ second-stage training is currently bugged. Check the main repo page or issue 7 for more information if you think you can help fix that.
You can expect minor ( ~10% ) savings in VRAM and minor ( ~5% ) speed improvements if you run it in mixed precision.
Since this means you can bump up the max_len a little bit, you essentially get a small quality improvement for free.

VRAM Usage

Using the default config_ft.yml finetuning config as base, with a 5h50min dataset, training for 5 epochs.
With disabled Style Diffusion and SLM Adversarial training:
batch_size: 4 , max_len: 100 = ~22GB VRAM. Fits onto a 4090 without problems. Training took ~3 hours.
batch_size: 6, max_len: 800 = ~74GB VRAM. Fits onto an A100 without trouble. Training took ~2 hours.
With enabled Style Diffusion (epoch 2+) and disabled SLM Adversarial training:
batch_size: 4 , max_len: 100 = ~23.1GB VRAM. Fits onto a 4090. Training took <4 hours.
batch_size: 4 , max_len: 100 , using accelerate mixed_precision=fp16 = ~21GB VRAM. 4-5% speed boost.
With enabled Style Diffusion (epoch 2+) and SLM Adversarial training (epoch 4+):
batch_size: 4 , max_len: 100 = ~28GB VRAM. Impossible on a 24GB card at this batch-size.
batch_size: 4 , max_len: 100 , using accelerate mixed_precision=fp16 = ~26.6GB VRAM. Still not feasible.
batch_size: 2 , max_len: 175 , using accelerate mixed_precision=fp16 = <19GB VRAM. Fits onto a 4090. Only ran this for the Joint Training epochs.
batch_size: 4, max_len: 800 = ~76.5GB VRAM. Fits onto an A100 without trouble. Training took <3 hours.

VRAM Usage Strategies

You can be smart about VRAM usage by interrupting finetuning and resuming with other parameters:
First, run the training with Style Diffusion finetuning enabled. Set parameters that make good use of your available VRAM.
Stop the training before you reach the epoch at which SLM Adversarial Training begins. Make sure a checkpoint is saved.
Lower the batch size by half, and then resume finetuning from your last saved checkpoint for the number of epochs you intended to run with SLM Adversarial Training. Now it should fit into VRAM whereas before it did not.
That way you don't have to touch the max_len and can benefit from a relatively speedy initial training run, and suffer through fewer epochs with reduced batch_size to finish training.
You can halve the batch_size and double the max_len to stay roughly within the same amount of utilized VRAM, if you keep all other parameters the same. VRAM usage grows a little on subsequent epochs, so keep some spare capacity for long runs.
Halving the batch_size will roughly double the time it takes to run an epoch, but won't negatively impact quality.
Halving the max_len will negatively impact quality. If you can, reduce the batch_size instead.
If you have access to a GPU with 48 or 80GB of VRAM, just set reasonable parameters from the get-go and let it finish.
Do not go below max_len: 100 or batch_size: 2 ever.
If you are running the training/finetuning across several GPUs, you must set a batch_size that provides each GPU with at least a batch of size 2. (For example if you have 4 GPUs, then 8 is the minimal possible batch_size )
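The hard limits from this section are simple enough to check before you launch a run. This is only a sketch with a made-up function name, and it assumes the batch is split evenly across the GPUs.

```python
def check_finetune_params(batch_size: int, max_len: int, num_gpus: int = 1):
    """Return a list of problems; the list is empty if the limits are respected."""
    problems = []
    if max_len < 100:
        problems.append("max_len must be at least 100")
    # Each GPU needs a batch of at least 2 (assumes an even split across GPUs).
    min_batch_size = 2 * num_gpus
    if batch_size < min_batch_size:
        problems.append(
            f"batch_size must be at least {min_batch_size} for {num_gpus} GPU(s)"
        )
    return problems
```

For 4 GPUs this flags anything below a batch_size of 8, same as the example above.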

Checkpoints

No, you're not missing a checkpoint. They start their numbering at 0.
So epoch_2nd_00000.pth is your completed first epoch, and not an empty checkpoint.
If you terminate training before the first epoch completes, you will be left with no checkpoint and just a logfile.
The checkpoint size is around 1.89GB without Style Diffusion + SLM Adv.
The checkpoint size is around 2.1GB with Style Diffusion + SLM Adv.

Resuming Finetuning

Set these parameters:
pretrained_model: "Models/YourModelName/epoch_2nd_00123.pth"
load_only_params: false
This will continue from the given model checkpoint, and retain the optimizer settings.
There is no quality penalty for resuming like this. (I'm glad we're past those years of training TTS models, ahah)
Be sure to set the total number of epochs to be larger than the checkpoint epoch, or training will conclude before doing anything.
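If you script your runs, a small sanity check like the following can catch the "training concludes immediately" case. It is a sketch: the function name is made up, and the epochs key for the total number of epochs is an assumption, so check what your config file actually calls it.

```python
import re


def resume_settings(checkpoint_path: str, total_epochs: int) -> dict:
    """Build the config values for resuming and check the epoch count."""
    match = re.search(r"epoch_2nd_(\d+)\.pth$", checkpoint_path)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {checkpoint_path}")
    checkpoint_epoch = int(match.group(1))
    if total_epochs <= checkpoint_epoch:
        raise ValueError(
            f"total epochs ({total_epochs}) must be larger than the "
            f"checkpoint epoch ({checkpoint_epoch})"
        )
    return {
        "pretrained_model": checkpoint_path,
        "load_only_params": False,
        "epochs": total_epochs,  # assumed key name for the total epoch count
    }
```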

Logging

Adjust the config parameter log_dir: to point to a folder of your choice.
Give the folder a unique name for each run if you want to keep an overview, e.g. "Models/MyCoolTTSModel"
Logs and checkpoints will be saved there, so ensure it is on a drive with enough space.
You can point a tensorboard at that folder to look at graphs.
Here is an example tensorboard screenshot for a 5 epochs run: https://imgur.com/7FN1zLQ

Operating system support

Finetuning on Linux works as-is.
Finetuning on Windows is possible, but you must set this environment variable: PYTHONUTF8=1 either system-wide, or in the terminal session you're using, before invoking the finetuning script.
You can set the variable in CMD like this: set PYTHONUTF8=1
You can set the variable in a PowerShell like this: $Env:PYTHONUTF8 = 1
You can verify that Python is using UTF-8 mode by entering a python shell and using these commands:
import sys
print(sys.flags.utf8_mode)
It will print 1 if it is enabled.
This will switch the file loading/saving operations to use UTF-8, otherwise you'll get an error about an unsupported codepage.
Also, you will want to specify paths using / forward slashes rather than the \ backslash notation common on Windows, just to be on the safe side.
Mixed-precision works under Windows as well.
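If your file lists were generated on Windows and are full of backslashes, you can convert them with pathlib instead of doing a blind search and replace. A small sketch (the helper name is made up, the path in the usage note is just an example):

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path: str) -> str:
    """Rewrite a Windows-style path so that it uses forward slashes."""
    return PureWindowsPath(path).as_posix()
```

For example, to_forward_slashes(r"C:\Data\wavs\clip_001.wav") gives "C:/Data/wavs/clip_001.wav". Paths that already use forward slashes pass through unchanged.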

Hardware Requirements

CPU performance matters since a lot of data is shuffled around. 16 CPU cores are a good baseline.
Trying to run this on 8 CPU cores on a fast GPU may bottleneck the process.
Regular system RAM usage is not very high. If you have 16-ish GB system RAM, you should be fine unless your dataset is truly huge.
You cannot have enough VRAM for this. More is better. More VRAM means bigger max_len and batch_size.

Quality comparisons

Dataset: Custom dataset for Garrus from Mass Effect. 30 emotions/styles tagged as speaker IDs, total duration about 5h50min. Custom text preprocessing. Audio includes flanger, this is not a model error but quite desired for a Turian voice. Using a custom OutOfDomain text dataset for SLM AT.
Epoch: 5 , for all examples. There is still room for improvement with more epochs.
Sampling: alpha=0.3, beta=0.7, diffusion_steps=10, embedding_scale=1

Text:
You are reading a discussion page on Github, imagine that! I think the human saying is: "Git good!" Wonder why they didn't choose "that" name.

(The "quoted" words are used for extra emphasis in my dataset)

Phoneme version:
juː ɑːɹ ɹˈiːdɪŋ ɐ dɪskˈʌʃən pˈeɪdʒ ˌɔn ɡˈɪthʌb , ɪmˈædʒɪn ðˈæt ! ˈaɪ θˈɪŋk ðə hjˈuːmən sˈeɪɪŋ ɪz : `` ɡˈɪt ɡˈʊd '' ! wˈʌndɚ wˌaɪ ðeɪ dˈɪdnt tʃˈuːz `` ðˈæt '' nˈeɪm .

batch_size: 4 , max_len: 100, without style diffusion finetuning, without slm adversarial finetuning :
https://voca.ro/11DDidEhJac5

batch_size: 6, max_len: 800, without style diffusion finetuning, without slm adversarial finetuning :
https://voca.ro/18PaQ8F248Hu

batch_size: 4 , max_len: 100, with style diffusion finetuning, without slm adversarial finetuning :
https://voca.ro/1bulPARTI2mn

batch_size: 2 , max_len: 175, with style diffusion finetuning, with slm adversarial finetuning :
https://voca.ro/1aqmfuqHS51N
Since running in batchsize 2 takes forever, I only trained for 2 final epochs with slm adversarial finetuning, and prior to that, up to epoch 4 with batchsize of 4 with 100 max_len.

batch_size: 4, max_len: 800, with style diffusion finetuning, with slm adversarial finetuning :
https://voca.ro/11QGDDMhWsNU

These quality examples don't reflect the maximum quality possible and are just for illustration purposes. :>

Hope this is useful.