unsloth/Llama-OuteTTS-1.0-1B · Hugging Face

Important Sampling Considerations

When using OuteTTS version 1.0, it is crucial to use the settings specified in the Sampling Configuration section.

The repetition penalty implementation is particularly important: this model requires the penalty to be applied over a recent window of 64 tokens, not across the entire context. Penalizing the entire context will cause the model to produce broken or low-quality output.

Currently, llama.cpp delivers the most reliable and consistent output quality by default. Both llama.cpp and EXL2 support this windowed sampling approach, while the Hugging Face Transformers library does not provide it natively.

To address this limitation, I've implemented a windowed repetition penalty for the Hugging Face Transformers backend in the OuteTTS library, which significantly improves output quality and resolves sampling issues, providing comparable results to llama.cpp.
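To illustrate the idea (this is a minimal sketch, not the OuteTTS library's actual implementation, and the function name is illustrative): a windowed repetition penalty only down-weights tokens that appeared among the most recently generated ids, leaving tokens from earlier in the context untouched.

```python
def windowed_repetition_penalty(logits, generated_ids, penalty=1.1, window=64):
    """Penalize logits of tokens seen in the last `window` generated ids.

    Sketch of the windowed scheme described above; positive logits are
    divided by the penalty, negative logits multiplied, mirroring the
    common repetition-penalty convention.
    """
    recent = set(generated_ids[-window:])  # only the last `window` tokens
    out = list(logits)
    for token_id in recent:
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out
```

Because only the trailing 64 ids are considered, a token repeated early in a long context regains its original probability once it leaves the window, which is what keeps long TTS generations stable.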

OuteTTS Version 1.0

This update brings significant improvements in speech synthesis and voice cloning—delivering a more powerful, accurate, and user-friendly experience in a compact size.

What's New

1. Prompt Revamp & Dependency Removal

2. New Audio Encoder Model

3. Voice Cloning

4. Auto Text Alignment & Numerical Support

5. Multilingual Capabilities

Video Showcase


Quick Start Guide

Getting started with OuteTTS is simple:

Installation

🔗 Installation instructions

Basic Usage

import outetts

# Initialize the interface
interface = outetts.Interface(
    config=outetts.ModelConfig.auto_config(
        model=outetts.Models.VERSION_1_0_SIZE_1B,
        # For the llama.cpp backend:
        backend=outetts.Backend.LLAMACPP,
        quantization=outetts.LlamaCppQuantization.FP16,
        # For the Transformers backend, use instead:
        # backend=outetts.Backend.HF,
    )
)

# Load the default speaker profile
speaker = interface.load_default_speaker("EN-FEMALE-1-NEUTRAL")

# Or create your own speaker profiles in seconds and reuse them instantly
# speaker = interface.create_speaker("path/to/audio.wav")
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")

# Generate speech
output = interface.generate(
    config=outetts.GenerationConfig(
        text="Hello, how are you doing?",
        generation_type=outetts.GenerationType.CHUNKED,
        speaker=speaker,
        sampler_config=outetts.SamplerConfig(
            temperature=0.4
        ),
    )
)

# Save to file
output.save("output.wav")

More Configuration Options

For advanced settings and customization, visit the official repository:
🔗 interface_usage.md

Usage Recommendations

Speaker Reference

The model is designed to be used with a speaker reference. Without one, it generates random vocal characteristics, often leading to lower-quality outputs. The model inherits the referenced speaker's emotion, style, and accent. When generating speech in other languages with the same speaker, you may notice the model retaining the original accent.

Multilingual Application

It is recommended to create a speaker profile in the language you intend to use. This helps achieve the best results in that specific language, including tone, accent, and linguistic features.

While the model supports cross-lingual speech, it still relies on the reference speaker. If the speaker has a distinct accent—such as British English—other languages may carry that accent as well.

Optimal Audio Length

Temperature Setting Recommendations

Testing shows that a temperature of 0.4 is an ideal starting point for accuracy (with the sampling settings below). However, some voice references may benefit from higher temperatures for enhanced expressiveness or slightly lower temperatures for more precise voice replication.

Verifying Speaker Encoding

If the cloned voice quality is subpar, check the encoded speaker sample.

interface.decode_and_save_speaker(speaker=your_speaker, path="speaker.wav")

The DAC audio reconstruction model is lossy, and samples with clipping, excessive loudness, or unusual vocal features may introduce encoding issues that impact output quality.
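A quick pre-check can catch the most common of these problems before encoding. The helper below is illustrative and not part of the OuteTTS API; it simply flags a normalized waveform in which too many samples sit at full scale, a typical sign of clipping.

```python
def check_reference_audio(samples, clip_threshold=0.99, max_clip_frac=0.001):
    """Return True if a normalized (-1.0..1.0) waveform looks safe to encode.

    Illustrative helper, not part of OuteTTS: flags audio where more than
    `max_clip_frac` of samples are at or near full scale (clipping), which
    a lossy codec like DAC tends to reconstruct poorly.
    """
    if not samples:
        return False
    clipped = sum(1 for s in samples if abs(s) >= clip_threshold)
    return clipped / len(samples) <= max_clip_frac
```

If the check fails, re-recording at a lower input gain, or normalizing the reference to leave some headroom, is usually enough to fix the encoded speaker sample.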

Sampling Configuration

For optimal results with this TTS model, use the following sampling settings.

Parameter            Value
Temperature          0.4
Repetition Penalty   1.1
Repetition Range     64
Top-k                40
Top-p                0.9
Min-p                0.05
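For reference, the recommended settings can be collected in one place like this. The key names below mirror the table, but the actual OuteTTS SamplerConfig keyword arguments may differ, so treat them as illustrative.

```python
# Recommended sampling settings from the table above, as a plain dict.
# Key names are illustrative; check the OuteTTS SamplerConfig for the
# exact keyword arguments it accepts.
RECOMMENDED_SAMPLING = {
    "temperature": 0.4,
    "repetition_penalty": 1.1,
    "repetition_range": 64,   # penalize only the most recent 64 tokens
    "top_k": 40,
    "top_p": 0.9,
    "min_p": 0.05,
}
```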

Model Specifications

Training Parameters

Pre-Training

Fine-Tuning

License Information

Acknowledgments

Ethical Use Guidelines

This text-to-speech model is intended for legitimate applications that enhance accessibility, creativity, and communication. Prohibited uses include impersonation without consent, creation of deliberately misleading content, generation of harmful or harassing material, distribution of synthetic audio without proper disclosure, voice cloning without permission, and any use that violates applicable laws, regulations, or copyrights.