CoreML conversion of Qwen/Qwen3-TTS (0.6B) for on-device inference on Apple platforms.
Supports English and Chinese text-to-speech synthesis.
| Model | Description | Size |
|---|---|---|
| qwen3_tts_lm_prefill_v9 | LM KV-cache prefill (text + speaker conditioning) | ~2.8 GB |
| qwen3_tts_lm_decode_v10 | Autoregressive LM decode (CB0 codec token generation) | ~1.8 GB |
| qwen3_tts_cp_prefill | Code predictor prefill (CB1-15 conditioning) | ~432 MB |
| qwen3_tts_cp_decode | Code predictor decode (CB1-15 generation) | ~420 MB |
| qwen3_tts_decoder_10s | Audio decoder (16-codebook codes → 24 kHz waveform) | ~436 MB |
| speaker_embedding_official.npy | Default speaker embedding (1024-dim) | 4 KB |
Total: ~5.9 GB
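If you want to experiment with the converted models outside of FluidAudio, they can be loaded directly with Core ML. A minimal sketch, assuming each package has been compiled to a `.mlmodelc` bundle in a local directory; the `loadLmModels` helper and bundle names are illustrative, not part of this repo's API:

```swift
import CoreML

// Sketch: load two of the converted models; names mirror the table above.
// Assumes each model has been compiled to a .mlmodelc bundle
// (e.g. with MLModel.compileModel(at:)) inside `modelDir`.
func loadLmModels(from modelDir: URL) throws -> (prefill: MLModel, decode: MLModel) {
    let config = MLModelConfiguration()
    config.computeUnits = .all // let Core ML schedule ANE/GPU/CPU

    let prefill = try MLModel(
        contentsOf: modelDir.appendingPathComponent("qwen3_tts_lm_prefill_v9.mlmodelc"),
        configuration: config
    )
    let decode = try MLModel(
        contentsOf: modelDir.appendingPathComponent("qwen3_tts_lm_decode_v10.mlmodelc"),
        configuration: config
    )
    return (prefill, decode)
}
```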
The synthesis pipeline runs in six stages:

1. Text tokens + speaker embedding
2. LM prefill (KV-cache initialization)
3. LM decode (CB0 codec tokens, temperature=0.9, top_k=50; see the sampling sketch after this list)
4. Code predictor prefill + decode (CB1-15 per frame)
5. Audio decoder (16 codebooks → 24 kHz waveform)
6. Silence trimming → final audio
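For intuition on step 3, here is a minimal sketch of top-k sampling with temperature over one frame of CB0 logits. It uses the parameters stated above (temperature=0.9, top_k=50) but is an illustration, not FluidAudio's actual decode code:

```swift
import Foundation

// Sketch: sample one CB0 codec token from a logits vector using
// temperature scaling and top-k filtering (temperature=0.9, top_k=50).
func sampleTopK(logits: [Float], k: Int = 50, temperature: Float = 0.9) -> Int {
    // Keep the k highest-scoring token indices.
    let topK = Array(logits.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(k))
    // Softmax over the temperature-scaled top-k logits.
    let scaled = topK.map { $0.element / temperature }
    let maxLogit = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    // Draw one index from the resulting categorical distribution.
    var r = Float.random(in: 0..<total)
    for (i, e) in exps.enumerated() {
        r -= e
        if r <= 0 { return topK[i].offset }
    }
    return topK[topK.count - 1].offset
}
```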
Example usage:

```swift
import FluidAudioTTS

// Load the converted models from a local directory, then synthesize.
let manager = Qwen3TtsManager()
try await manager.loadFromDirectory(modelDir)

let wav = try await manager.synthesize(
    text: "Hello world",
    tokenIds: [9707, 1879, ...], // Pre-tokenized with the Qwen3 processor
    useSpeaker: true
)
```
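The decoder emits a 24 kHz mono waveform. Below is a hedged sketch of persisting such a buffer with AVFoundation; it assumes the samples arrive as `[Float]` in [-1, 1] (the actual return type of `synthesize` may differ), and `writeAudio` is a hypothetical helper:

```swift
import AVFoundation

// Sketch: write a synthesized 24 kHz mono waveform to disk.
// Assumes `samples` is [Float] PCM in [-1, 1]; adapt to the actual
// return type of synthesize(...). File type is inferred from the URL.
func writeAudio(_ samples: [Float], to url: URL, sampleRate: Double = 24_000) throws {
    guard let format = AVAudioFormat(standardFormatWithSampleRate: sampleRate, channels: 1),
          let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                        frameCapacity: AVAudioFrameCount(samples.count))
    else { throw NSError(domain: "audio", code: -1) }
    buffer.frameLength = AVAudioFrameCount(samples.count)
    samples.withUnsafeBufferPointer { src in
        buffer.floatChannelData![0].update(from: src.baseAddress!, count: samples.count)
    }
    let file = try AVAudioFile(forWriting: url, settings: format.settings)
    try file.write(from: buffer)
}
```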
See [FluidAudio](https://github.com/FluidInference/FluidAudio) for the full Swift framework and usage details.
Converted using coremltools from the original PyTorch weights. Conversion scripts are in the mobius repository.
License: Apache-2.0, inherited from Qwen/Qwen3-TTS.