Developing a Voice Assistant on Rockchip. Part 2

09:43
03.03.2026
vzaguskin
135

Continuing the development of a DIY voice assistant on the Rockchip SoC platform.

In this second part, we talk about improving speech synthesis. We will teach our AI assistant to pronounce text containing entities that are hard for models to read, and make its speech smoother.

Text Normalization for Synthesis

Modern state-of-the-art speech synthesis systems learn more and more on their own. By processing hundreds of thousands of hours of audio paired with text in arbitrary formats, they try to learn the mapping from text to speech, including for quite ambiguous cases.

Usually, the following types of incoming text pose difficulties for them:

Homographs with context-dependent stress (e.g., зАмок "castle" vs. замОк "lock")
Switching between languages in one sentence (I want a Samsung TV with HDTV support)
Formulas and other symbolic notations (In the 18th century, people didn’t know the formula e=mc2)
Ambiguous abbreviations (Arriving in Cherepovets, Mr. Ivanov met Miss Petrova, born in 1985)
Declined abbreviations (I had 3 r., they gave me another 2 r., and it became 5 r. Here, "r." (Russian "р.") should be expanded to "рубля" after 3 and 2, but to "рублей" after 5)
Numbers in different contexts (Owner of the car a100aa111, you hit my scooter. Transfer 5000 r. to card 0000-0000-0000-0000 or call phone number 8-800-888-88-88). There are many ways to read the same digits: one digit at a time, in groups, or as one big number ("eight, eight hundred..." or "eight billion eight hundred million...")
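The last two cases can be illustrated with a small sketch. It is not taken from the assistant's code: the function names `plural_form` and `spell_digits` are hypothetical, and the only thing assumed is the standard Russian agreement rule for nouns after integers.

```python
# Hypothetical helpers for illustration; not the author's implementation.

def plural_form(n: int, one: str, few: str, many: str) -> str:
    """Pick the Russian noun form that agrees with the integer n."""
    n = abs(n)
    # 11..14 always take the "many" form (11 рублей, 112 рублей).
    if 11 <= n % 100 <= 14:
        return many
    last = n % 10
    if last == 1:
        return one   # 1 рубль, 21 рубль
    if 2 <= last <= 4:
        return few   # 2 рубля, 3 рубля
    return many      # 5 рублей, 20 рублей

DIGIT_WORDS = dict(zip(
    "0123456789",
    ["ноль", "один", "два", "три", "четыре",
     "пять", "шесть", "семь", "восемь", "девять"],
))

def spell_digits(text: str) -> str:
    """Read a string one digit at a time, skipping separators such as '-'."""
    return " ".join(DIGIT_WORDS[ch] for ch in text if ch in DIGIT_WORDS)

for amount in (3, 2, 5):
    print(amount, plural_form(amount, "рубль", "рубля", "рублей"))
print(spell_digits("8-800"))
```

Choosing between digit-by-digit, grouped, and whole-number reading still requires knowing what the digits mean (a phone number, a card number, a sum of money), and that decision is not covered here.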

These subtleties are difficult to learn automatically with high reliability, so almost every synthesis system includes text preprocessing. Typically, this is some combination of hand-written rules implemented in code and ML models for natural language processing.

At MWS AI, we have a separate team that trains specialized transformer-based neural networks and writes high-performance Rust code, which makes it possible to apply complex logic to each incoming line with minimal delay.

For our DIY solution, we implement the most basic version of preprocessing: we spell out numbers, apply transliteration, remove unsupported characters, and cut out the content of tags, including the language model's own thoughts.
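A minimal sketch of the cleanup part of such preprocessing follows. It rests on assumptions that are not in the article: the model's reasoning is taken to be wrapped in a paired tag such as `<think>...</think>`, the set of supported characters is a guess, and the name `clean_for_tts` is hypothetical.

```python
import re

# Assumption: reasoning and other markup arrive as paired tags, e.g. <think>...</think>.
TAG_BLOCK = re.compile(r"<(\w+)[^>]*>.*?</\1>", re.DOTALL)
# Any leftover opening or closing tag without a pair.
STRAY_TAG = re.compile(r"</?\w+[^>]*>")
# Assumption: the synthesizer accepts letters, digits, and basic punctuation only.
UNSUPPORTED = re.compile(r"[^а-яёА-ЯЁ0-9a-zA-Z\s.,!?;:\-]")
SPACES = re.compile(r"\s+")

def clean_for_tts(text: str) -> str:
    """Drop tagged blocks and unsupported characters, then collapse whitespace."""
    text = TAG_BLOCK.sub(" ", text)    # remove tags together with their content
    text = STRAY_TAG.sub(" ", text)    # remove unpaired tags, keep surrounding text
    text = UNSUPPORTED.sub(" ", text)  # emoji, markdown symbols, and so on
    return SPACES.sub(" ", text).strip()

print(clean_for_tts("<think>план ответа</think>Привет, мир! 🙂"))
```

This version expects the whole text at once. In a streaming setting an opening tag can arrive before its closing tag, so the tag state would have to be tracked across portions of text.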

We cannot rely on the local LLM for this task. The models on the device are too lightweight to reliably follow prompts with output formatting for the Russian language. Unfortunately, mms-tts also has no support for user-defined stress marks (where we could explicitly indicate that a word should be read as "z`amok"). Adding separate natural language processing models to the pipeline is not ideal either, since it would slow down request processing that is already not very fast. Therefore, our text processing will be rule-based.

Our task is further complicated by the need to respond quickly to user queries. As soon as a new portion of text arrives (no longer than the model's maximum input sequence length), it has to be cut off and sent to speech synthesis. But we cannot simply trim the original text to the length limit, because normalization changes its length. For example, the phrase “1985 г. р.” takes up only about ten characters in the original text, but once expanded to “one thousand nine hundred eighty-five, year of birth” it takes up about 50.
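The idea of measuring length after normalization rather than before can be sketched as follows. This is an illustration under assumptions, not the processor described in this article: `stream_chunks` and `split_by_words` are hypothetical names, the normalizer is passed in as a plain function, and sentence ends are detected with a naive heuristic (punctuation followed by whitespace and a capital letter).

```python
import re
from typing import Callable, Iterable, Iterator

# Naive heuristic: a sentence ends at . ! ? followed by whitespace and a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-ZА-ЯЁ])")

def split_by_words(text: str, max_len: int) -> Iterator[str]:
    """Greedily pack words into pieces of at most max_len characters.
    A single word longer than max_len is emitted as is."""
    piece = ""
    for word in text.split():
        candidate = f"{piece} {word}" if piece else word
        if len(candidate) <= max_len or not piece:
            piece = candidate
        else:
            yield piece
            piece = word
    if piece:
        yield piece

def stream_chunks(parts: Iterable[str],
                  normalize: Callable[[str], str],
                  max_len: int) -> Iterator[str]:
    """Emit TTS-ready chunks as soon as a complete sentence has arrived."""
    buffer = ""
    for part in parts:
        buffer += part
        # Everything except the last fragment is a complete sentence.
        *complete, buffer = SENTENCE_END.split(buffer)
        for sentence in complete:
            # Normalize first, then enforce the limit on the expanded text.
            yield from split_by_words(normalize(sentence), max_len)
    if buffer.strip():
        yield from split_by_words(normalize(buffer), max_len)

# Toy normalizer for the demo only.
demo_normalize = lambda s: s.replace("5 р.", "пять рублей")
print(list(stream_chunks(["Дай 5 р. сей", "час. Спасибо!"], demo_normalize, 20)))
```

In the demo, the first sentence fits into 20 characters before normalization but not after it, so it is split into two chunks. The capital-letter heuristic also breaks on inputs such as an abbreviation followed by a surname, which is exactly the kind of corner case that rule-based processing has to handle.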

Implementing even the most basic set of rules for this task quickly runs into a huge set of corner cases, and processing each of them significantly complicates the code.

Architecture Decision Record: StreamTextProcessor for Embedded TTS

Status: Accepted
Date: February 3, 2026
Author: [Your name]
Version: 1.0

1. Motivation: Why is this needed
