The point of this model was to see whether it is possible to train a multilingual TTS model with very little data. Output quality varies by language: Spanish is fluent, Finnish and Swedish are accented but intelligible, Turkish is poor, and French is largely unusable. Estonian and Hungarian have not been evaluated.
Training was conducted using a subset of 3000 samples from each language in the Common Voice dataset.
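The per-language subsets could be drawn along these lines; a minimal sketch assuming the Hugging Face `datasets` library, where the exact Common Voice release (`common_voice_17_0` here) and the sampling seed are assumptions rather than details from this card:

```python
from datasets import load_dataset

# Common Voice language configs on the Hub; "sv-SE" is the Swedish config name.
LANGS = ["fi", "sv-SE", "es", "fr", "tr", "hu", "et"]

subsets = {}
for lang in LANGS:
    ds = load_dataset("mozilla-foundation/common_voice_17_0", lang, split="train")
    # Shuffle with a fixed seed, then keep at most 3000 clips per language.
    subsets[lang] = ds.shuffle(seed=42).select(range(min(3000, len(ds))))
```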
Training Configuration
- `--learning_rate 0.0001`
- `--batch_size_per_gpu 2000`
- `--batch_size_type frame`
- `--max_samples 96`
- `--grad_accumulation_steps 16`
- `--max_grad_norm 0.3`
- `--epochs 200`
- `--num_warmup_updates 5000`
- `--save_per_updates 10000`
- `--keep_last_n_checkpoints -1`
- `--last_per_updates 5000`
- `--tokenizer custom`
- `--bnb_optimizer`
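The flags above match the argument names of the upstream F5-TTS fine-tuning CLI, so the full launch likely looked roughly like the sketch below; the script path and the `--dataset_name` value are assumptions, while the remaining flag/value pairs are taken from the list above:

```python
import subprocess

# Hedged reconstruction of the training launch; only the flag/value pairs come
# from the model card, the entry point and dataset name are assumptions.
cmd = [
    "accelerate", "launch", "src/f5_tts/train/finetune_cli.py",
    "--dataset_name", "multilingual_cv",   # hypothetical name of the prepared dataset
    "--learning_rate", "0.0001",
    "--batch_size_per_gpu", "2000",
    "--batch_size_type", "frame",          # batch size counted in mel frames, not clips
    "--max_samples", "96",
    "--grad_accumulation_steps", "16",
    "--max_grad_norm", "0.3",
    "--epochs", "200",
    "--num_warmup_updates", "5000",
    "--save_per_updates", "10000",
    "--keep_last_n_checkpoints", "-1",     # -1 keeps every saved checkpoint
    "--last_per_updates", "5000",
    "--tokenizer", "custom",
    "--bnb_optimizer",                     # boolean flag: 8-bit optimizer via bitsandbytes
]
subprocess.run(cmd, check=True)
```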
Inference Parameters
{dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4}
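These are the architecture hyperparameters of the DiT backbone, passed as the model config when the checkpoint is loaded. A minimal loading-and-synthesis sketch, assuming the upstream F5-TTS inference utilities; the checkpoint, vocab, and reference-audio filenames are hypothetical:

```python
import soundfile as sf
from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
)
from f5_tts.model import DiT

# Architecture parameters from this card (base-sized DiT backbone).
model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)

vocoder = load_vocoder()  # defaults to the Vocos vocoder
model = load_model(
    DiT,
    model_cfg,
    "model.pt",              # hypothetical checkpoint filename from this repo
    vocab_file="vocab.txt",  # hypothetical vocab file for the custom tokenizer
)

# Clip/normalize the voice prompt, then synthesize.
ref_audio, ref_text = preprocess_ref_audio_text("ref.wav", "Transcript of the reference clip.")
wav, sr, _ = infer_process(ref_audio, ref_text, "Hola, ¿cómo estás?", model, vocoder)
sf.write("out.wav", wav, sr)
```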
Thanks
Thanks to Amos Wallgren, Calvin Guillot and Begüm Çelik for quality assurance.
Base model
SWivid/F5-TTS