The multilingual release of Pocket TTS reaches about 100ms latency and 2.5x real-time generation speed on the mid-range MediaTek Helio G99 mobile chip, a sign that open-source text-to-speech has crossed the threshold of mobile usability.

What this is

Pocket TTS is an open-source text-to-speech (TTS) project. This week it released multilingual models covering six languages: English, French, Spanish, German, Italian, and Portuguese, with an independent model for each language.

What stands out is the engineering work community developers did immediately afterward. Building on KevinAHM's ONNX exporter (ONNX is a cross-platform model format) and VolgaGerm's C++ optimization, they applied selective int8 quantization to the model's nodes, converting some computations from high precision to 8-bit integers in exchange for speed. The reported benchmarks are strong: on a desktop AMD Ryzen 9 7950X, latency is about 30ms with 13x real-time generation; on a mobile MediaTek Helio G99, latency is about 100ms with 2.5x real-time generation. The developers also provided a sample runner for the Unity engine and an Android beta.
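To make the quantization step concrete, here is a minimal sketch of what converting a node's weights to int8 involves. It is not the community developers' code: the node names, the weights, and the choice of symmetric per-tensor scaling are all assumptions made for illustration.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # float value of one integer step
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]


# Hypothetical node names and weights, for illustration only.
nodes = {
    "attention_proj": [0.5, -1.27, 0.003, 1.0],
    "output_head": [0.02, -0.015, 0.001, 0.03],
}
keep_float = {"output_head"}  # "selective": precision-sensitive nodes stay in float

for name, weights in nodes.items():
    if name in keep_float:
        print(f"{name}: kept in float")
        continue
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{name}: int8 values {q}, worst rounding error {worst:.4f}")
```

Each quantized value can be off by up to half an integer step, which is why small weights such as 0.003 round to zero. Leaving sensitive nodes in float is one way to limit that loss, at the cost of some speed.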

Industry view

We note two signals. First, the inference speed of open-source TTS has entered the practical range: a delay of about 100ms is barely noticeable to a listener. Second, the combination of ONNX export and int8 quantization shows that simply getting a model to run no longer depends on a high-end GPU; a mid-range mobile chip can handle the task.
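The arithmetic behind these figures can be sketched as follows. The latency and speed numbers are the reported benchmarks; the 10-second clip length is an arbitrary example, and treating "latency" as a fixed startup delay before generation begins is an assumption, since the benchmark does not define the term.

```python
def seconds_to_generate(audio_seconds, speed_multiple):
    """Compute time needed to synthesize a clip at N-times real-time speed."""
    return audio_seconds / speed_multiple


def seconds_until_done(audio_seconds, speed_multiple, latency_seconds):
    """Startup latency plus generation time (assumes latency is a one-off delay)."""
    return latency_seconds + seconds_to_generate(audio_seconds, speed_multiple)


# (speed multiple, latency in seconds) as reported in the benchmark
devices = {
    "MediaTek Helio G99": (2.5, 0.100),
    "AMD Ryzen 9 7950X": (13.0, 0.030),
}

clip_seconds = 10.0  # arbitrary example length
for name, (speed, latency) in devices.items():
    total = seconds_until_done(clip_seconds, speed, latency)
    print(f"{name}: {clip_seconds:.0f} s of speech ready in about {total:.2f} s")
```

Any speed multiple above 1 means audio is produced faster than it plays back, so under these assumptions streamed playback would not stall once it has started.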

However, this does not mean cloud TTS will be replaced quickly. Because each language needs its own model, general multilingual capability is still limited, and languages such as Chinese and Japanese are not yet covered. There is also still a gap in timbre expressiveness and naturalness compared with commercial solutions like ElevenLabs. Some in the Reddit community pointed out that although selective quantization is fast, the precision loss in certain nodes may cause perceptible quality degradation when generating long texts. This is a trade-off that small local models cannot easily avoid.

Impact on regular people

For enterprise IT: Running TTS locally reduces the compliance risk of uploading voice data to the cloud, which matters for sensitive industries such as finance and healthcare. Coverage of only six languages is still narrow, though, so in the short term it is better suited to Western markets.

For individual professionals: Content creators gain a zero-cost local voiceover toolchain, which further lowers the post-production barrier for short videos and podcasts, but the single timbre remains a hard limitation.

For the consumer market: 100ms latency on mobile means offline voice assistants are technically feasible; the next step is seeing who integrates this capability into a product first.