Examples of a good fine-tune? · yl4579/StyleTTS2 · Discussion #65

I tinkered with the config_ft.yml file and found that I can run style diffusion and SLM adversarial training in one session on my 4090. batch_size is set to 2 and batch_percentage is set to 1. Note that this can also work on a 7900 XTX. I'm using virtual console mode, and I used nvtop to close any program eating up VRAM. Epochs is set to 100 because DiscLM is usually at 0. I'm using Vokan as the base model.
Most of the audio files I gathered had background music and noise, so I ran resemble-enhance (denoising via the Gradio app version, not the command-line version) and the Audacity plug-in Acon Digital DeVerberate 3 on them. Then I used the Audacity plug-in Trim Extend to add 200 milliseconds to the beginning and end of each audio file.
Violet Parr: 8 minutes of audio. I set Max_Len to 270. slmadv_params min_len is set to 270 and slmadv_params max_len is set to 270.
https://vocaroo.com/15gtNvhjeFi0
Hiccup: 7 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1b5ewbnxQ2aa
Branch: 11 minutes of audio. I set Max_Len to 256. slmadv_params min_len is set to 256 and slmadv_params max_len is set to 256. Model couldn't pronounce bouhuhuh-ned and buhuhuhuh.
https://vocaroo.com/1kwWiqOQuEvg
https://vocaroo.com/1f2epsXNHBtH
Poppy: 15 minutes of audio. I set Max_Len to 270. slmadv_params min_len is set to 270 and slmadv_params max_len is set to 270.
https://vocaroo.com/157F3j65bYLE
Arnold Shortman: 4 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1ovdUEp7rVl2
https://vocaroo.com/1j7VQQSLo4Zo
Mr. Delicious: 4 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1gJ3lSN6ut9t
Helga G. Pataki: 14 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1bGX3i15t9Wm
https://vocaroo.com/1erNAm3ZH9n9
https://vocaroo.com/1cjvt2EHB6xm
Merida: 11 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/118XYZ0Oq4t7
Judy Hopps: 21 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1jf0jlUL3kOw
Wilbur Robinson: 7 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1agReLZAIx0c
Zuko: 6 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280. Model couldn't pronounce Katara.
https://vocaroo.com/1i9Sn1TvaPTr
https://vocaroo.com/1bkYnTkqhsyj
Connor: 56 seconds of audio taken from the film Ruby Gillman, Teenage Kraken, plus 5 minutes of audio taken from interviews with Jaboukie Young-White (Connor's voice actor). I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/17W0YDHnkgky
Mara Jade (Heidi Shannon): 2 minutes and 30 seconds of audio taken from the video game Star Wars: Jedi Knight - Mysteries of the Sith, plus 5 minutes of audio taken from ElevenLabs. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1n04GhNXlon9
Luke Skywalker (Mark Hamill): 2 minutes of audio taken from a digital copy of Return of the Jedi and 7 minutes and 50 seconds taken from a 1983 interview. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280. Model couldn't pronounce Coruscant.
https://vocaroo.com/13KuwZNIMhuv
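
A quick way to double-check the settings listed above before starting a run is to grep them out of the config. This is only a sketch: the function name show_ft_settings is made up for illustration, and it assumes the keys are spelled epochs, batch_size and max_len at the top level, with min_len, max_len and batch_percentage indented under slmadv_params. Check that against your own config_ft.yml.

```shell
#!/usr/bin/env bash
# Print the fine-tuning settings discussed above from a YAML config, with line numbers.
# Assumption: top-level keys epochs, batch_size, max_len, and an slmadv_params block
# whose indented children include min_len, max_len, batch_percentage.
# show_ft_settings is a hypothetical helper name, not part of StyleTTS2.
show_ft_settings() {
    local cfg="$1"
    grep -nE '^(epochs|batch_size|max_len):|^slmadv_params:|^[[:space:]]+(min_len|max_len|batch_percentage):' "$cfg"
}

# Usage (the path is an assumption, adjust it to where your config lives):
# show_ft_settings Configs/config_ft.yml
```

Note that the indented pattern matches any min_len / max_len / batch_percentage child, so if another block in your config uses the same names it will show up too.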

Edit:
I rented an H100 from RunPod again. I edited the config_ft.yml file with micro: I set batch_size to 4 and max_len to 500, and I set slmadv_params min_len to 100, slmadv_params max_len to 500, and batch_percentage to 1. Now SLM adversarial training has started to work.
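
If you'd rather not edit the file by hand, the same H100 values could be applied with sed. This is a sketch, not what I actually ran (I used micro). The function name apply_h100_settings is made up, and it assumes batch_size and max_len are top-level keys and that min_len, max_len and batch_percentage sit indented under slmadv_params. It writes to a new file so the original config stays untouched.

```shell
#!/usr/bin/env bash
# Sketch: write a copy of the config with the H100 values from the post.
# Assumptions: top-level batch_size / max_len keys, and an slmadv_params block
# with indented min_len, max_len, batch_percentage. Trailing comments on the
# edited lines are dropped. apply_h100_settings is a hypothetical helper name.
apply_h100_settings() {
    local src="$1" dst="$2"
    sed -E \
        -e 's/^(batch_size:).*/\1 4/' \
        -e 's/^(max_len:).*/\1 500/' \
        -e '/^slmadv_params:/,/^[^[:space:]]/ s/^([[:space:]]+min_len:).*/\1 100/' \
        -e '/^slmadv_params:/,/^[^[:space:]]/ s/^([[:space:]]+max_len:).*/\1 500/' \
        -e '/^slmadv_params:/,/^[^[:space:]]/ s/^([[:space:]]+batch_percentage:).*/\1 1/' \
        "$src" > "$dst"
}

# Usage (paths are assumptions):
# apply_h100_settings Configs/config_ft.yml Configs/config_ft_h100.yml
```

The address range limits the indented edits to the slmadv_params block, so a max_len under some other block is left alone.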

Here's a screenshot of the vram usage.

This is what I did on RunPod.
I update the package lists.

apt update

I install these.

apt install aria2 p7zip-full curl jq micro

I use the pwd command to find the current directory / file path information.

pwd

I put the training dataset in a zip file and then upload it to either https://catbox.moe/ or https://litterbox.catbox.moe/ (which lets you upload a 1 GB file).

I download the Vokan base model with aria2

aria2c -x 16 -s 16 -k 1M https://archive.org/download/epoch_2nd_00012/epoch_2nd_00012.pth

or with wget

wget https://archive.org/download/epoch_2nd_00012/epoch_2nd_00012.pth

or with the gofile downloader.

https://github.com/ltsdw/gofile-downloader

Link for the Vokan model.
https://huggingface.co/ShoukanLabs/Vokan

I download the dataset zip file with aria2.

aria2c -x 16 -s 16 -k 1M https://files.catbox.moe/XXXXXXXX.zip

I extract the zip file.

7z x XXXXXXXX.zip

I download the gofile upload script file.

aria2c https://raw.githubusercontent.com/Sushrut1101/GoFile-Upload/master/upload.sh

I make the script readable and executable.

chmod +rx upload.sh

I upload the pth file to https://gofile.io/

./upload.sh model.pth
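
The RunPod steps above can be sketched as one script. This is just a convenience wrapper under a few assumptions: the run, runpod_setup and upload_model names are made up, DRY_RUN=1 (the default here) only prints each command instead of running it, the -y flag on apt install is my addition so it doesn't stop to ask, and the dataset name keeps the XXXXXXXX placeholder, which you need to replace with your own file.

```shell
#!/usr/bin/env bash
# Sketch of the RunPod steps above as one script.
# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 to really run them.
# DATASET_ZIP keeps the XXXXXXXX placeholder from the post; fill in your own.
DRY_RUN="${DRY_RUN:-1}"
DATASET_ZIP="${DATASET_ZIP:-XXXXXXXX.zip}"
MODEL_FILE="${MODEL_FILE:-model.pth}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Install tools, fetch the base model and dataset, and prepare the upload script.
runpod_setup() {
    run apt update
    run apt install -y aria2 p7zip-full curl jq micro   # -y added for non-interactive use
    run aria2c -x 16 -s 16 -k 1M https://archive.org/download/epoch_2nd_00012/epoch_2nd_00012.pth
    run aria2c -x 16 -s 16 -k 1M "https://files.catbox.moe/${DATASET_ZIP}"
    run 7z x "$DATASET_ZIP"
    run aria2c https://raw.githubusercontent.com/Sushrut1101/GoFile-Upload/master/upload.sh
    run chmod +rx upload.sh
}

# After training: upload the resulting checkpoint to gofile.
upload_model() {
    run ./upload.sh "$MODEL_FILE"
}

# Usage:
#   runpod_setup      # before training
#   upload_model      # after training
```

Running it once in dry-run mode first is a cheap way to confirm the URLs and file names before anything is downloaded.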