
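Ultra-fast on-device multilingual TTS that runs natively on ONNX.
Gaining 8,500+ stars per month: demand for efficient on-device TTS is growing, and ONNX's cross-platform support attracts developers.
Supports offline Chinese TTS and protects privacy; suited to quickly integrating speech synthesis into iOS apps for the Chinese market.
Well suited to iOS apps that need offline multilingual speech synthesis, such as reading assistants and navigation announcements.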
Supertonic is a lightning-fast, on-device multilingual text-to-speech system designed for local inference with minimal overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
- Pass lang="na" to let Supertonic process the text language-agnostically when you don't know the input language; no separate language adapters needed
- Expression tags (such as <laugh>, <breath>, <sigh>) bring natural human nuance into generated speech without prompt engineering or reference audio
- Languages: Arabic (ar), Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Latvian (lv), Lithuanian (lt), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Turkish (tr), Ukrainian (uk), Vietnamese (vi)
Not sure which language your text is in? Pass lang="na" and Supertonic will handle the input in a language-agnostic way; no explicit language tag is required.
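The fallback described above can be sketched as a small helper that maps whatever language code an application has to a value Supertonic accepts. This is only an illustration: `resolve_lang` and the idea of an upstream language detector are hypothetical and not part of the supertonic package; the code set is copied from the language list in this README.

```python
from typing import Optional

# Language codes listed in this README for Supertonic 3.
SUPPORTED_LANGS = frozenset({
    "ar", "bg", "hr", "cs", "da", "nl", "en", "et", "fi", "fr", "de",
    "el", "hi", "hu", "id", "it", "ja", "ko", "lv", "lt", "pl", "pt",
    "ro", "ru", "sk", "sl", "es", "sv", "tr", "uk", "vi",
})

LANGUAGE_AGNOSTIC = "na"  # value documented above for unknown input language


def resolve_lang(detected: Optional[str]) -> str:
    """Return a supported code, or "na" when the language is unknown.

    `detected` is whatever an upstream detector (hypothetical, not part of
    Supertonic) reports. Region suffixes such as "pt-BR" are reduced to "pt".
    """
    if not detected:
        return LANGUAGE_AGNOSTIC
    base = detected.strip().lower().replace("_", "-").split("-")[0]
    return base if base in SUPPORTED_LANGS else LANGUAGE_AGNOSTIC
```

The result can be passed as the lang argument of tts.synthesize, so unknown or unsupported codes degrade to language-agnostic processing instead of raising an error.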
- supertonic serve, a local HTTP server with native /v1/tts and OpenAI-compatible /v1/audio/speech endpoints. See the serve documentation.
- release/supertonic-2 branch.
- supertonic PyPI package! Install via pip install supertonic. For details, visit the supertonic-py documentation.

Install the Python SDK and generate speech immediately. On the first run, Supertonic downloads the model assets from Hugging Face automatically.
pip install supertonic
from supertonic import TTS
# First run downloads the model from Hugging Face automatically.
tts = TTS(auto_download=True)
style = tts.get_voice_style(voice_name="M1")
text = "Supertonic is a lightning fast, on-device TTS system."
wav, duration = tts.synthesize(
text=text,
lang="en", # Language code (e.g., "en", "ko", "na" for language-agnostic)
voice_style=style, # Voice style object
total_steps=8, # Quality: 5 (low) to 12 (high), default 8 (medium)
speed=1.05, # Speed: 0.7 (slow) to 2.0 (fast)
)
# wav: numpy array of shape (1, num_samples,) with dtype=np.float32, sampled at 44100 Hz
# duration: numpy array of shape (1,) containing the duration of the generated audio in seconds
tts.save_audio(wav, "output.wav")
# import soundfile as sf
# sf.write("output.wav", wav.squeeze(), 44100)
print(f"Generated {duration[0]:.2f}s of audio")
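The comments in the example above give ranges for total_steps (5 to 12) and speed (0.7 to 2.0). The README does not say whether the SDK enforces them, so one option is to validate the values before calling synthesize. The sketch below is a hypothetical wrapper, not part of the supertonic package; the speed default is simply the value used in the example, not a documented default.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SynthesisParams:
    """Hypothetical holder for the two tuning knobs shown in the quick start."""

    total_steps: int = 8   # documented default (medium quality)
    speed: float = 1.05    # value from the quick start, not a documented default

    def __post_init__(self) -> None:
        # Reject values outside the ranges given in the quick-start comments.
        if not 5 <= self.total_steps <= 12:
            raise ValueError(f"total_steps must be in [5, 12], got {self.total_steps}")
        if not 0.7 <= self.speed <= 2.0:
            raise ValueError(f"speed must be in [0.7, 2.0], got {self.speed}")


params = SynthesisParams(total_steps=10, speed=0.9)
kwargs = asdict(params)  # {'total_steps': 10, 'speed': 0.9}
# wav, duration = tts.synthesize(text=text, lang="en", voice_style=style, **kwargs)
```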
The Python SDK can also run Supertonic as a local HTTP service. This is useful when you want to call Supertonic from tools that already speak HTTP, such as local agents, browser extensions, Electron apps, workflow automation tools, or OpenAI-compatible audio clients.
pip install 'supertonic[serve]'
supertonic serve --host 127.0.0.1 --port 7788
Once running, use the native POST /v1/tts endpoint or the OpenAI-compatible POST /v1/audio/speech endpoint. The server also exposes interactive OpenAPI docs at http://127.0.0.1:7788/docs. See the supertonic-py serve guide for request examples, batch synthesis, and custom Voice Builder JSON import.
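As a rough illustration of calling the OpenAI-compatible endpoint from Python, the sketch below builds the request with the standard library only. The JSON field names (model, input, voice) follow OpenAI's audio/speech API; the model value "supertonic" is a placeholder and voice "M1" is borrowed from the Python quick start, so both are assumptions about what the server accepts. The serve guide is the authoritative source for the request schema.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:7788"  # host and port from the serve command above


def build_speech_request(text: str, voice: str = "M1",
                         model: str = "supertonic") -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible endpoint.

    `model` is a placeholder and `voice` is an assumption; check the serve
    guide for the values the server actually accepts.
    """
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str) -> None:
    """Send the request to a running server and save the returned bytes."""
    with urllib.request.urlopen(build_speech_request(text), timeout=60) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Only synthesize_to_file touches the network, so the request can be inspected or logged before a server is running. The audio container of the response depends on the server's defaults, which this sketch does not assume.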
First, clone the repository:
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic
Before running the examples, download the ONNX models and preset voices, and place them in the assets directory:
Note: The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
- macOS: brew install git-lfs && git lfs install
- Generic: see https://git-lfs.com for installers
git lfs install
git clone https://huggingface.co/Supertone/supertonic-3 assets
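If Git LFS was not initialized before cloning, the large files arrive as small text pointers rather than real model weights. A quick check like the following can catch that before running the examples. It assumes the model files use the .onnx extension and live somewhere under assets; the README does not describe the exact layout, so adjust the glob as needed.

```python
from pathlib import Path
from typing import List

# Git LFS pointer files are small text stubs that start with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: Path) -> bool:
    """True if the file is an unfetched Git LFS pointer, not real content."""
    with path.open("rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unfetched_models(assets_dir: str = "assets") -> List[Path]:
    """Return .onnx files under assets_dir that are still LFS pointers."""
    return sorted(p for p in Path(assets_dir).rglob("*.onnx") if is_lfs_pointer(p))
```

A non-empty result usually means git lfs install followed by git lfs pull inside the assets directory is still needed.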
Some language examples need native runtimes:
- brew install onnxruntime is enough; the Go example auto-detects Homebrew paths.
- brew install openjdk@17 works.

Then run the Python example:
cd py
uv sync
uv run example_onnx.py
This generates outputs/output.wav using the default preset voice.
Node.js Example (Details)
cd nodejs
npm install
npm start
Browser Example (Details)
cd web
npm install
npm run dev
Java Example (Details)
cd java
mvn clean install
mvn exec:java
C++ Example (Details)
cd cpp
mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
C# Example (Details)
cd csharp
dotnet restore
dotnet run
Go Example (Details)
cd go
go mod download
go run example_onnx.go helper.go
Swift Example (Details)
cd swift
swift build -c release
.build/release/example_onnx
Rust Example (Details)
cd rust
cargo build --release
./target/release/example_onnx
iOS Example (Details)
cd ios/ExampleiOSApp
xcodegen generate
open ExampleiOSApp.xcodeproj
In Xcode: Targets → ExampleiOSApp → Signing: select your Team, then choose your iPhone as run destination and build.
Supertonic 3 is designed for practical on-device inference: compact enough to run locally, while staying competitive with much larger open TTS systems.
Evaluated on the Minimax-MLS-test benchmark, Supertonic 3 stays within a competitive WER/CER range against much larger open TTS models such as VoxCPM2, while preserving a lightweight on-device deployment path. Asterisked languages (*) use CER; the others use WER.
| Lang | VoxCPM2 | OmniVoice | Qwen3-TTS | Supertonic 2 | Supertonic 3 |
|---|---|---|---|---|---|
| arabic* | 4.14 | 1.74 | — | — | 2.14 |
| czech | 23.73 | 2.40 | — | — | 3.02 |
| dutch | 0.84 | 0.77 | — | — | 1.47 |
| english | 2.11 | 2.02 | 2.25 | 2.52 | 2.06 |
| finnish | 2.29 | 3.94 | — | — | 5.40 |
| french | 4.41 | 4.74 | 3.82 | 5.09 | 4.89 |
| german | 0.85 | 0.96 | 0.52 | — | 0.86 |
| greek | 3.22 | 2.96 | — | — | 3.54 |
| hindi* | 5.85 | 5.14 | — | — | 5.34 |
| indonesian | 1.25 | 1.67 | — | — | 1.34 |
| italian | 1.74 | 1.29 | 1.40 | — | 1.75 |
| japanese* | 3.35 | 3.81 | 3.67 | — | 4.61 |
| korean* | 4.70 | 3.22 | 4.07 | 3.65 | 3.26 |
| polish | 1.30 | 0.64 | — | — | 1.63 |
| portuguese | 1.74 | 1.40 | 1.21 | 1.52 | 2.48 |
| romanian | 22.39 | 2.29 | — | — | 2.19 |
| russian | 3.31 | 4.53 | 4.48 | — | 3.99 |
| spanish | 1.34 | 0.99 | 0.75 | 1.81 | 1.13 |
| turkish | 0.88 | 2.18 | — | — | 1.00 |
| ukrainian | 5.85 | 0.71 | — | — | 1.23 |
| vietnamese | 1.48 | 0.79 | — | — | 4.49 |
Lower is better.
An asterisk (*) indicates CER (character error rate); all other rows use WER (word error rate). Dashes (—) indicate that the model does not officially support the language or that no result is available.
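For readers unfamiliar with the metrics in the table, WER and CER are both edit distances between a reference transcript and the transcript recognized from the synthesized audio, divided by the reference length. The sketch below shows the textbook definition, expressed in percent on the assumption that the table values are percentages. It is not the benchmark's scoring script, which may apply text normalization that is not reproduced here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return 100.0 * edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

CER is the usual choice for languages where whitespace word boundaries are unreliable, which matches the asterisked rows in the table.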
Compared with Supertonic 2, Supertonic 3 reduces repeat and skip failures, improves speaker similarity across the shared-language set, and expands language coverage from 5 to 31 languages. It keeps the v2-compatible public ONNX interface, so existing integrations can move to v3 with the same inference contract.
Supertonic 3 runs fast on CPU, even compared with larger baselines measured on A100 GPU, and uses substantially less memory. The open-weight fixed-voice setting does not require a GPU, which makes local, browser, and edge deployment much easier.
At about 99M parameters across the public ONNX assets, Supertonic 3 is much smaller than 0.7B to 2B class open TTS systems. The smaller model size is a practical advantage for download size, startup time, and on-device inference.
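To put the parameter counts in perspective, a back-of-the-envelope estimate multiplies the number of parameters by the bytes used per parameter. The README does not state the numeric precision of the published ONNX assets, so the precisions below are illustrative assumptions and actual file sizes may differ.

```python
PARAMS = 99_000_000  # "about 99M parameters" from the text above

# Bytes per parameter for common storage precisions (illustrative).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def approx_weight_size_mb(num_params: int, precision: str) -> float:
    """Weights-only size in megabytes (10**6 bytes), ignoring file overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


print(approx_weight_size_mb(PARAMS, "fp32"))         # 396.0
print(approx_weight_size_mb(PARAMS, "fp16"))         # 198.0
print(approx_weight_size_mb(700_000_000, "fp32"))    # 2800.0
print(approx_weight_size_mb(2_000_000_000, "fp32"))  # 8000.0
```

Under the same precision, a 0.7B to 2B model needs roughly 7 to 20 times the storage of a 99M model.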
This open-weight repository focuses on fixed-voice, local TTS and does not include an official voice-cloning pipeline. If you want to bring your own voice to local Supertonic deployment, Voice Builder turns a short reference recording into version-specific JSON files for Supertonic 2 and Supertonic 3, so the same custom voice can move with you across supported Supertonic versions.
For a managed creation workflow, Supertonic 3 is now officially available in Supertone Play and the Supertone API. Use them when you want hosted content creation tools, diverse commercially usable preset voices, zero-shot voice cloning, or API-based integration without managing local model files. You can also listen to Supertonic 3 zero-shot samples on the official showcase.
Try it now: Experience Supertonic in your browser with our Interactive Demo, or get started with pre-trained models from Hugging Face Hub.
Watch Supertonic running on a Raspberry Pi, demonstrating on-device, real-time text-to-speech synthesis:
https://github.com/user-attachments/assets/ea66f6d6-7bc5-4308-8a88-1ce3e07400d2
Experience Supertonic on an Onyx Boox Go 6 e-reader in airplane mode, achieving an average RTF of 0.3× with zero network dependency:
https://github.com/user-attachments/assets/64980e58-ad91-423a-9623-78c2ffc13680
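The real-time factor (RTF) quoted above is commonly defined as synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. Assuming that definition, an RTF of 0.3 corresponds to about 3 seconds of compute for 10 seconds of speech. The helper names below are illustrative and not part of Supertonic.

```python
import time
from typing import Callable, Tuple


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def timed_rtf(synthesize: Callable[[], float]) -> Tuple[float, float]:
    """Run `synthesize`, which must return the audio duration in seconds.

    Returns (elapsed_seconds, rtf).
    """
    start = time.perf_counter()
    audio_s = synthesize()
    elapsed = time.perf_counter() - start
    return elapsed, real_time_factor(elapsed, audio_s)
```

With the Python SDK, the callable could wrap tts.synthesize and return float(duration[0]), since the quick start shows duration as an array of seconds.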
Turns any webpage into audio in under one second, delivering lightning-fast, on-device text-to-speech with zero network dependency—free, private, and effortless:
https://github.com/user-attachments/assets/cc8a45fc-5c3e-4b2c-8439-a14c3d00d91c
We provide ready-to-use TTS inference examples across multiple ecosystems:
| Language/Platform | Path | Description |
|---|---|---|
| Python | py/ | ONNX Runtime inference |
| Node.js | nodejs/ | Server-side JavaScript |
| Browser | web/ | WebGPU/WASM inference |
| Java | java/ | Cross-platform JVM |
| C++ | cpp/ | High-performance C++ |
| C# | csharp/ | .NET ecosystem |
| Go | go/ | Go implementation |
| Swift | swift/ | macOS applications |
| iOS | ios/ | Native iOS apps |
| Rust | rust/ | Memory-safe systems |
| Flutter | flutter/ | Cross-platform apps |
For detailed usage instructions, please refer to the README.md in each language directory.
Supertonic is designed to handle complex, real-world text inputs that contain natural prose, punctuation, abbreviations, and proper nouns.
🎧 View audio samples more easily: Check out our Interactive Demo for a better viewing experience of all audio examples
Overview of Test Cases:
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|---|---|---|---|---|---|---|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |
Text:
"The startup secured $5.2M in venture capital, a huge leap from their initial $450K seed round."
Challenges: decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Text:
"You can reach the hotel front desk at (212) 555-0142 ext. 402 anytime."
Challenges: area codes, hyphens, extensions (ext.)
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Text:
"Our drone battery lasts 2.3h when flying at 30kph with full camera payload."
Challenges: decimal numbers with units, abbreviated technical notations
Audio Samples:
| System | Result | Audio Sample |
|---|---|---|
| Supertonic | ✅ | 🎧 Play Audio |
| ElevenLabs Flash v2.5 | ❌ | 🎧 Play Audio |
| OpenAI TTS-1 | ❌ | 🎧 Play Audio |
| Gemini 2.5 Flash TTS | ❌ | 🎧 Play Audio |
| VibeVoice Realtime 0.5B | ❌ | 🎧 Play Audio |
Note: These samples demonstrate how each system handles text normalization and pronunciation of complex expressions without requiring pre-processing or phonetic annotations.
| Project | Description | Links |
|---|---|---|
| TLDRL | Free, on-device TTS extension for reading any webpage | Chrome |
| Read Aloud | Open-source TTS browser extension | Chrome · Edge · GitHub |
| PageEcho | E-Book reader app for iOS | App Store |
| VoiceChat | On-device voice-to-voice LLM chatbot in the browser | Demo · GitHub |
| OmniAvatar | Talking avatar video generator from photo + speech | Demo |
| CopiloTTS | Kotlin Multiplatform TTS SDK via ONNX Runtime | GitHub |
| Aftertone | Local post-reply TTS for Cursor & Claude Code (Supertonic 3 ONNX, on-device daemon) | GitHub · Demo |
| Voice Mixer | PyQt5 tool for mixing and modifying voice styles | GitHub |
| Supertonic MNN | Lightweight library based on MNN (fp32/fp16/int8) | GitHub · PyPI |
| Transformers.js | Hugging Face's JS library with Supertonic support | GitHub PR · Demo |
| Pinokio | 1-click localhost cloud for Mac, Windows, and Linux | Pinokio · GitHub |
| | Supertonic 3 | Supertonic 2 | Supertonic 1 |
|---|---|---|---|
| Status | 🟢 Latest | Stable | Legacy |
| Parameters | ~99M | ~66M | ~66M |
| Languages | 31 | 5 | 1 (en) |
| Expression Tags | ✅ 10 tags | — | — |
| Code | main | release/supertonic-2 | — |
| Weights | 🤗 HF | 🤗 HF | 🤗 HF |
| Interactive Demo | 🤗 Space | 🤗 Space | 🤗 Space |
| Audio Samples | DemoPage | — | DemoPage |
The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:
This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.
@article{kim2025supertonic,
title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
journal={arXiv preprint arXiv:2503.23108},
year={2025},
url={https://arxiv.org/abs/2503.23108}
}
This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.
@article{kim2025larope,
title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
journal={arXiv preprint arXiv:2509.11084},
year={2025},
url={https://arxiv.org/abs/2509.11084}
}
This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.
@article{kim2025spfm,
title={Training Flow Matching Models with Reliable Labels via Self-Purification},
author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
journal={arXiv preprint arXiv:2509.19091},
year={2025},
url={https://arxiv.org/abs/2509.19091}
}
This paper describes the RobustSpeechFlow technique, which improves the robustness and quality of text-to-speech generation by optimizing flow-matching trajectories against data variability.
@misc{yang2026robustspeechflowlearningrobusttexttospeech,
title={RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching},
author={Jinhyeok Yang and Hyeongju Kim and Yechan Yu and Joon Byun and Frederik Bous and Juheon Lee},
year={2026},
eprint={2605.22083},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2605.22083},
}
This project's sample code is released under the MIT License; see the LICENSE file for details.
The accompanying model is released under the OpenRAIL-M License; see the LICENSE file for details.
This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the LICENSE file for details.
Copyright (c) 2026 Supertone Inc.