Developing a Voice Assistant on Rockchip. Part 2
I continue to develop a DIY voice assistant on the Rockchip SOC platform.
In this second part, we will talk about improvements to speech synthesis. We will teach our AI assistant to pronounce text containing entities that are difficult for models, and make its speech smoother.
Text Normalization for Synthesis
Modern state-of-the-art speech synthesis systems increasingly learn everything from data. By processing hundreds of thousands of hours of audio with the corresponding text in any format, they try to learn the mapping between the two, including in quite ambiguous cases.
Usually, the following types of incoming text pose difficulties for them:
Homographs with ambiguous, context-dependent stress (e.g., зАмок, "castle", vs. замОк, "lock")
Switching between languages in one sentence (I want a Samsung TV with HDTV support)
Formulas and other symbolic notations (In the 18th century, people didn’t know the formula e=mc2)
Ambiguous abbreviations (Arriving in Cherepovets, Mr. Ivanov met Miss Petrova, born in 1985; in Russian, a single abbreviation such as "г." can stand for "city", "year", or "Mr." depending on context)
Declined abbreviations (I had 3 r., they gave me another 2 r., and it became 5 r. Here, "r.", Russian "р.", must be expanded to a different case form of "ruble" depending on the number: "3 рубля", but "5 рублей")
Numbers in different contexts (Owner of the car a100aa111, you hit my scooter. Transfer 5000 r. to card 0000-0000-0000-0000 or call phone number 8-800-888-88-88). There are many ways to read numbers: one digit at a time, grouped, or as one big number (eight, eight hundred... or eighty-eight billion eight million...)
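To make the last point concrete, here is a minimal sketch of the three reading strategies applied to the phone number from the example. It spells numbers in English only to keep the code short; the function names are hypothetical and are not part of the assistant's actual code, and a real Russian normalizer would also have to handle grammatical case and gender.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = [(1_000_000_000, "billion"), (1_000_000, "million"), (1_000, "thousand")]


def number_to_words(n: int) -> str:
    """Spell out a non-negative integer in English (illustration only)."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    for value, name in SCALES:
        if n >= value:
            head, rest = divmod(n, value)
            tail = " " + number_to_words(rest) if rest else ""
            return number_to_words(head) + " " + name + tail


def read_digit_by_digit(text: str) -> str:
    """Read every ASCII digit separately, ignoring separators."""
    return " ".join(ONES[int(d)] for d in re.findall(r"[0-9]", text))


def read_grouped(text: str) -> str:
    """Read each run of digits between separators as its own number."""
    return ", ".join(number_to_words(int(g)) for g in re.findall(r"[0-9]+", text))


def read_as_one_number(text: str) -> str:
    """Glue all digits together and read them as one large number."""
    return number_to_words(int("".join(re.findall(r"[0-9]", text))))


phone = "8-800-888-88-88"
print(read_digit_by_digit(phone))
print(read_grouped(phone))
print(read_as_one_number(phone))
```

All three outputs are valid readings of the same string; choosing the right one depends on context that simple rules only partly capture.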
These subtleties are difficult to learn automatically with high reliability, so almost any synthesis system includes text preprocessing. Typically, this is some combination of hand-written rules and ML models for natural language processing.
At MWS AI, we have a separate team that trains specialized transformer-based neural networks and writes high-performance Rust code, which makes it possible to apply complex logic to each incoming line with minimal delay.
For our DIY solution, we are implementing the most basic version of preprocessing: we will spell out numbers in words, transliterate text, remove unsupported characters, and cut the content of tags, including the language model's own "thoughts".
We cannot rely on a local LLM for this task. The models on the device are too lightweight to reliably follow prompts with output formatting in Russian. Unfortunately, mms-tts also does not support user-defined stress marks (where we could explicitly indicate that a word should be read as "z`amok"). Adding separate natural language processing models to the pipeline is not ideal either, because it would slow down request processing that is already not very fast. Therefore, our text processing will be rule-based.
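The basic steps listed above can be sketched with plain rules. This is an illustration under stated assumptions, not the final implementation: the function name `preprocess` is hypothetical, tagged spans are assumed to look like `<think>...</think>`, the transliteration table is a deliberately crude made-up mapping, numbers are read digit by digit as the simplest fallback, and the set of supported characters is a guess about what the synthesis model accepts.

```python
import re

# Assumption: tagged spans look like <think>...</think>; the whole span is dropped.
TAGGED_SPAN = re.compile(r"<(\w+)[^>]*>.*?</\1>", re.DOTALL)

# Simplest possible number handling: every digit is read separately.
DIGITS = ["ноль", "один", "два", "три", "четыре", "пять",
          "шесть", "семь", "восемь", "девять"]

# Assumption: a tiny, hypothetical Latin-to-Cyrillic table (not a standard).
TRANSLIT = str.maketrans({
    "a": "а", "b": "б", "c": "к", "d": "д", "e": "е", "f": "ф", "g": "г",
    "h": "х", "i": "и", "j": "дж", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "q": "к", "r": "р", "s": "с", "t": "т", "u": "у",
    "v": "в", "w": "в", "x": "кс", "y": "й", "z": "з",
})

# Assumption: the model accepts lowercase Cyrillic and basic punctuation only.
UNSUPPORTED = re.compile(r"[^а-яё .,!?-]")


def preprocess(text: str) -> str:
    """Apply the rule-based steps in a fixed order."""
    text = TAGGED_SPAN.sub(" ", text)                    # cut tagged spans
    text = text.lower()
    text = re.sub(r"[0-9]", lambda m: f" {DIGITS[int(m.group())]} ", text)
    text = text.translate(TRANSLIT)                      # Latin -> Cyrillic
    text = UNSUPPORTED.sub(" ", text)                    # drop everything else
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    return re.sub(r" ([.,!?])", r"\1", text)             # no space before punctuation


print(preprocess("<think>надо ответить</think>Привет, HDTV № 5!"))
# prints: привет, хдтв пять!
```

The order matters: digits are expanded and Latin letters transliterated before unsupported characters are removed, otherwise both would simply disappear from the text.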
Our task is further complicated by the need to respond quickly to user queries. As soon as a new portion of text arrives (not exceeding the maximum input sequence length of the model), it needs to be "cut" and sent for speech synthesis. But we cannot just trim the original text to the length limit, because normalization changes its length. For example, the phrase “1985 г. р.” takes up only about ten characters in the original text, but after being expanded to its spoken form (“thousand nine hundred eighty-five year of birth” in translation) it takes up about 50.
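One way to respect the limit is to measure chunks after normalization rather than before. The sketch below is a simplified illustration, not the processor described later in the ADR: the class name `StreamChunker` and its methods are hypothetical, the normalizer is passed in as a plain function, and normalization is applied word by word, which loses cross-word context such as “1985 г. р.”. A single normalized word longer than the limit is emitted as is, and any trailing period is treated as a sentence end, so abbreviations would trigger an early cut.

```python
import re
from typing import Callable, Iterator, List


class StreamChunker:
    """Cuts a text stream into chunks whose length is measured after normalization."""

    def __init__(self, normalize: Callable[[str], str], max_len: int) -> None:
        self.normalize = normalize
        self.max_len = max_len
        self.pending = ""            # raw tail that may be an unfinished word
        self.words: List[str] = []   # normalized words of the current chunk
        self.size = 0                # length of the current chunk with spaces

    def _take_chunk(self) -> str:
        chunk = " ".join(self.words)
        self.words, self.size = [], 0
        return chunk

    def _add(self, raw: str) -> Iterator[str]:
        for word in self.normalize(raw).split():
            # Close the chunk before it would exceed the limit.
            if self.words and self.size + 1 + len(word) > self.max_len:
                yield self._take_chunk()
            self.size += len(word) + (1 if self.words else 0)
            self.words.append(word)
        # Also cut at what looks like a sentence end, to keep latency low.
        if self.words and raw.endswith((".", "!", "?")):
            yield self._take_chunk()

    def feed(self, piece: str) -> Iterator[str]:
        """Accept the next portion of text and yield every chunk that is ready."""
        self.pending += piece
        *complete, self.pending = re.split(r"\s", self.pending)
        for raw in complete:
            yield from self._add(raw)

    def flush(self) -> Iterator[str]:
        """Emit whatever is left once the stream has ended."""
        raw, self.pending = self.pending, ""
        yield from self._add(raw)
        if self.words:
            yield self._take_chunk()


# Toy normalizer (an assumption for the demo): expands a single known number.
def toy_normalize(word: str) -> str:
    return word.replace("1985", "nineteen eighty-five")


chunker = StreamChunker(toy_normalize, max_len=20)
chunks: List[str] = []
for piece in ["born in 19", "85 in Che", "repovets. Hi"]:
    chunks.extend(chunker.feed(piece))
chunks.extend(chunker.flush())
print(chunks)
# prints: ['born in nineteen', 'eighty-five in', 'Cherepovets.', 'Hi']
```

Note that the number "1985" arrives split across two portions of the stream; keeping the unfinished tail in `pending` is what prevents it from being read as "19" and "85".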
Implementing even the most basic set of rules for this task quickly runs into a huge set of corner cases, and processing each of them significantly complicates the code.
We will approach the task systematically, using the help of our wise Chinese colleagues who can work without sleep or rest — in our case, it is Qwen.
However, if we simply send a request to it with a description like the one above, the final code will be full of bugs that will only be discovered at runtime and will constantly haunt us. Therefore, to implement something non-trivial that also requires reliability, I have developed the following scheme:
First, we discuss and approve an ADR (Architecture Decision Record) document.
Then we ask the model to write the most complete set of unit tests it can.
Additionally, we write an integration test or a demo application.
Only then do we ask the model to implement the code, and we check its correctness against the tests.
Here’s the ADR we ended up with after long discussions and several attempts at implementation:
Architecture Decision Record: StreamTextProcessor for Embedded TTS
Status: Accepted
Date: February 3, 2026
Author: [Your name]
Version: 1.0