ShoukanLabs/Vokan · Hugging Face

A StyleTTS2 fine-tune, designed for expressiveness.

Vokan is an advanced fine-tuned StyleTTS2 model crafted for authentic and expressive zero-shot performance, and designed to serve as a better base model for further fine-tuning in the future! It leverages a diverse dataset and extensive training to generate high-quality synthesized speech. Trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, Vokan aims for authenticity and naturalness across various accents and contexts. With more than 6 days' worth of audio data and 672 diverse and expressive speakers, Vokan captures a wide range of vocal characteristics, which contributes to its performance. Although the amount of training data is less than the original StyleTTS2 model used, the inclusion of a broad array of accents and speakers enriches the model's vector space. Training required significant computational resources: 300 hours on a single H100 and a further 600 hours on a single 3090.
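To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes "6+ days" means at least 6 full days (144 hours) of audio, so the per-speaker number is a lower bound on the average, not a measured statistic. The function and constant names are made up for illustration and are not part of any Vokan or StyleTTS2 code.

```python
# Figures quoted in the model card. "6+ days" is treated as a lower
# bound of exactly 6 days; this is an assumption, not a published number.
AUDIO_DAYS_MIN = 6
NUM_SPEAKERS = 672
H100_HOURS = 300
GPU_3090_HOURS = 600


def audio_hours(days: float) -> float:
    """Convert days of audio into hours."""
    return days * 24.0


def minutes_per_speaker(days: float, speakers: int) -> float:
    """Mean minutes of audio per speaker, assuming nothing about the real distribution."""
    return audio_hours(days) * 60.0 / speakers


def total_gpu_hours(*hours: float) -> float:
    """Sum wall-clock GPU hours across machines (not normalized for GPU speed)."""
    return float(sum(hours))


if __name__ == "__main__":
    print(f"audio: at least {audio_hours(AUDIO_DAYS_MIN):.0f} hours")
    print(f"mean per speaker: at least {minutes_per_speaker(AUDIO_DAYS_MIN, NUM_SPEAKERS):.1f} minutes")
    print(f"training: {total_gpu_hours(H100_HOURS, GPU_3090_HOURS):.0f} GPU hours in total")
```

Under that assumption the dataset works out to at least 144 hours of audio, or roughly 12.9 minutes per speaker on average, and 900 wall-clock GPU hours of training. The GPU total simply adds hours on two very different cards, so treat it as a measure of elapsed time and not of compute.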

You can read more about it in our article on DagsHub!

Vokan Samples!


Acknowledgements!

DagsHub: Special thanks to DagsHub for sponsoring GPU compute resources as well as offering an amazing versioning service, enabling efficient model training and development. A shoutout to Dean in particular!
camenduru: Thanks to camenduru for their expertise in cloud infrastructure and model training, which played a crucial role in the development of Vokan! Please give them a follow!

Conclusion!

V2 is currently in the works, aiming to be bigger and better in every way, including multilingual support! This is where you come in: if you have any large single-speaker datasets you'd like to contribute, in any language, you can add them to our Vokan dataset, a large community dataset that combines many smaller single-speaker datasets into one big multi-speaker one. You can upload your Uberduck- or FakeYou-compliant datasets via the Vokan bot on the ShoukanLabs Discord Server. The more data we have, the better the models we produce will be!

This model is also available on DagsHub.

Citations!

@misc{li2023styletts,
      title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
      author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani},
      year={2023},
      eprint={2306.07691},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

@misc{zen2019libritts,
      title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
      author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
      year={2019},
      eprint={1904.02882},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald,
"CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit",  
The Centre for Speech Technology Research (CSTR),
University of Edinburgh

License!

MIT