Text-to-Speech

The point of this model was to see whether it is possible to train a multilingual TTS model with very little data. The quality of the results varies from language to language: Spanish is fluent, Finnish and Swedish are accented but serviceable, Turkish is rough, and French is a total mess. Estonian and Hungarian have not been evaluated yet.

Training was conducted using a subset of 3000 samples from each language in the Common Voice dataset.
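
Drawing comparable per-language subsets is straightforward with the Hugging Face `datasets` library. The sketch below only illustrates the idea; the dataset version (`mozilla-foundation/common_voice_17_0`) and the shuffle seed are assumptions, not the preprocessing actually used for this model.

```python
# Sketch: 3000-sample-per-language subsets of Common Voice.
# The dataset version and seed are assumptions; the real preprocessing
# pipeline for this model may differ. Common Voice is gated on the Hub,
# so `huggingface-cli login` (after accepting the terms) may be required.
from datasets import load_dataset

LANGS = ["fi", "sv-SE", "es", "fr", "tr", "hu", "et"]  # Common Voice config names

subsets = {}
for lang in LANGS:
    ds = load_dataset("mozilla-foundation/common_voice_17_0", lang, split="train")
    # Shuffle before selecting so the subset is not biased by upload order.
    subsets[lang] = ds.shuffle(seed=42).select(range(3000))
```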

Training Configuration

Training arguments (from the `multilang_loss` run configuration):

```yaml
- --learning_rate
- "0.0001"
- --batch_size_per_gpu
- "2000"
- --batch_size_type
- frame
- --max_samples
- "96"
- --grad_accumulation_steps
- "16"
- --max_grad_norm
- "0.3"
- --epochs
- "200"
- --num_warmup_updates
- "5000"
- --save_per_updates
- "10000"
- --keep_last_n_checkpoints
- "-1"
- --last_per_updates
- "5000"
- --tokenizer
- custom
- --bnb_optimizer
```
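
For reference, the list above can be reassembled into a single training launch. This is a sketch only: the entry-point module (`f5_tts.train.finetune_cli`) is assumed from the upstream F5-TTS repository, and flags not shown in this card (such as the dataset and experiment names) are omitted.

```python
# Sketch: relaunching fine-tuning with the argument list above.
# The entry point (f5_tts.train.finetune_cli) is an assumption based on the
# upstream F5-TTS repo; flags not listed in this card are omitted.
import subprocess
import sys

train_args = [
    sys.executable, "-m", "f5_tts.train.finetune_cli",
    "--learning_rate", "0.0001",
    "--batch_size_per_gpu", "2000",     # measured in frames, per --batch_size_type
    "--batch_size_type", "frame",
    "--max_samples", "96",
    "--grad_accumulation_steps", "16",
    "--max_grad_norm", "0.3",
    "--epochs", "200",
    "--num_warmup_updates", "5000",
    "--save_per_updates", "10000",
    "--keep_last_n_checkpoints", "-1",  # -1 keeps every checkpoint
    "--last_per_updates", "5000",
    "--tokenizer", "custom",
    "--bnb_optimizer",                  # 8-bit optimizer via bitsandbytes
]
subprocess.run(train_args, check=True)
```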

Inference Parameters

`{dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4}`
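
These match the F5-TTS base DiT architecture, so the checkpoint can be loaded with the upstream inference utilities. The sketch below is not a definitive recipe: the file paths are placeholders, and the function signatures are assumptions based on the SWivid/F5-TTS repository at the time of writing.

```python
# Sketch: loading this checkpoint with the upstream F5-TTS inference
# utilities and the architecture parameters above. Paths are placeholders;
# signatures are assumptions based on the SWivid/F5-TTS repository.
from f5_tts.model import DiT
from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
)

model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

model = load_model(
    DiT,
    model_cfg,
    "model.safetensors",     # placeholder: checkpoint file from this repo
    vocab_file="vocab.txt",  # placeholder: the custom multilingual vocab
)
vocoder = load_vocoder()

# Clean up the reference clip and transcript, then synthesize Spanish speech.
ref_audio, ref_text = preprocess_ref_audio_text("ref.wav", "Reference transcript.")
wav, sample_rate, _ = infer_process(
    ref_audio, ref_text, "Hola, ¿cómo estás?", model, vocoder
)
```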

Thanks

Thanks to Amos Wallgren, Calvin Guillot and Begüm Çelik for quality assurance.

Base model: SWivid/F5-TTS (this model is a fine-tune)
Dataset used for training: Common Voice