Writing a personal AI assistant in Python
Modern voice assistants are powerful applications that combine speech processing, machine learning, and integration with external APIs. In this article, we will look at how to create a basic personal assistant project in Python using libraries like whisper, webrtcvad, gTTS, and others. Our assistant will:
Listen to the microphone
Detect the beginning and end of speech using VAD (Voice Activity Detection)
Convert speech to text via the Whisper model
Send requests to a local LLM to generate responses
Read the answer aloud using gTTS
Start/stop recording with the spacebar
The project can serve both as a starting point for experiments and for prototyping real solutions.
🔧 Dependency installation
Before starting, make sure you have all the necessary libraries installed:
pip install numpy sounddevice keyboard openai-whisper torch webrtcvad requests colorama gTTS pygame
You’ll also need a local server with an LLM, for example LM Studio, listening at http://localhost:1234.
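Before wiring the server into the assistant, it is worth checking that it responds. A minimal sketch, assuming LM Studio is running with its default OpenAI-compatible endpoint on port 1234:
import requests

# Send a single user message to the local chat completions endpoint
data = {"messages": [{"role": "user", "content": "Hello!"}]}
response = requests.post("http://localhost:1234/v1/chat/completions", json=data)
print(response.json()['choices'][0]['message']['content'])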
🎤 Audio processing and voice recording
The sounddevice library is used for working with audio. We create a recording stream with a sample rate of 16 kHz and wait for the spacebar to be pressed — this is our trigger for starting/stopping recording.
def record_audio():
    global recording
    print("Press Space to start recording...")
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=CHANNELS, dtype=DTYPE, callback=callback):
        while True:
            if keyboard.is_pressed('space'):
                toggle_recording()
                while keyboard.is_pressed('space'):
                    pass
            time.sleep(0.1)
Each audio fragment is added to a buffer, which is then analyzed using VAD (webrtcvad) to determine speech presence.
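The idea, in a minimal sketch (the full version appears in the final script below): each 20 ms chunk of int16 samples is converted to raw bytes and passed to webrtcvad, which reports whether it contains speech:
vad = webrtcvad.Vad()
vad.set_mode(3)  # aggressiveness: 3 filters out non-speech most aggressively

def is_speech(frame_bytes):
    try:
        return vad.is_speech(frame_bytes, SAMPLE_RATE)  # SAMPLE_RATE = 16000
    except Exception:
        return False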
🗣️ Speech recognition with Whisper
Whisper is one of the most popular speech recognition models. We use it via the whisper library, loading the medium model and running it on a GPU if one is available.
model = whisper.load_model("medium").to(device)
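The medium model offers a good balance between quality and speed; if you do not have a GPU, you can load a smaller model on the CPU instead, for example:
model = whisper.load_model("small", device="cpu")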
Once speech ends (determined by pauses), the segment is fed to the model:
result = model.transcribe(audio_float, language="ru", verbose=None)
text = result["text"].strip()
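Here audio_float is the buffered int16 audio converted to float32 and normalized to the range [-1, 1], which is what Whisper expects:
audio_float = np.array(speech_segment, dtype=np.float32) / 32768.0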
💬 Generating a response from AI
To generate a response we use a locally installed model, for example google/gemma-3-4b, through LM Studio, which lets us run LLMs on our own machine and exposes an OpenAI API-compatible server.
After loading the google/gemma-3-4b model into LM Studio, you launch it in server mode. The HTTP server accepts JSON requests at http://localhost:1234/v1/chat/completions. Our Python script sends a text request there and receives a ready-made response from the model:
def generate_response(text):
    data = {
        "messages": [{"role": "user", "content": text}],
    }
    response = requests.post("http://localhost:1234/v1/chat/completions", json=data)
    return response.json()['choices'][0]['message']['content']
This approach lets you work with a powerful AI model without sending your queries to the cloud, keeping your data private and providing acceptable speed (depending on your CPU and GPU). Make sure you have selected the correct model in LM Studio and clicked Run locally or Start server so the script can interact with it.
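If you want to steer the assistant's tone or answer length, you can prepend a system message to the request. A hypothetical variation of the payload shown above:
data = {
    "messages": [
        # The system message is an optional addition, not part of the original script
        {"role": "system", "content": "You are a concise voice assistant. Answer in one or two sentences."},
        {"role": "user", "content": text}
    ]
}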
🎧 Text-to-speech conversion (TTS)
The gTTS (Google Text-to-Speech) library is used to voice the response. It is simple to use and well suited for beginners, though note that it calls Google's online TTS service, so it needs an internet connection (module gTTS_module2.py):
import io, os, contextlib
from gtts import gTTS
import pygame
from threading import Thread
import keyboard  # To track key presses

# Global variable to stop playback
_playing = False

def text_to_speech_withEsc(text: str, lang: str = 'ru'):
    """
    Converts text to speech and plays it.
    Playback can be stopped by pressing the Esc key.
    """
    try:
        # Generate audio in memory
        tts = gTTS(text=text, lang=lang)
        fp = io.BytesIO()
        tts.write_to_fp(fp)
        fp.seek(0)

        # Initialize pygame and load the audio from memory
        pygame.mixer.init()
        pygame.mixer.music.load(fp)
        pygame.mixer.music.play()

        # Play until finished or Esc is pressed
        while pygame.mixer.music.get_busy():
            if keyboard.is_pressed('esc'):
                pygame.mixer.music.stop()
                print("Playback stopped (Esc)")
                break

        pygame.mixer.quit()
    except Exception as e:
        print(f"Error during speech: {e}")
    finally:
        pass
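A quick standalone check of the module; the lang parameter takes standard gTTS language codes such as 'en' or 'ru':
if __name__ == "__main__":
    text_to_speech_withEsc("Hello! This is a speech synthesis test.", lang='en')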
🖌️ Color scheme and interface
To make the output more readable, we apply colors using colorama. You can choose between light and dark themes:
THEMES = {
    "light": {
        "user": Fore.BLUE,
        "assistant": Fore.LIGHTBLACK_EX,
        ...
    },
    "dark": {
        "user": Fore.CYAN,
        "assistant": Fore.LIGHTGREEN_EX,
        ...
    }
}
An animated “thinking” indicator is also shown while the response is being generated:
loading_animation(duration=1, text="Generating response...")
🚀 Launch and operation
Start the LLM server, run the main script, press Space, and ask your question. The assistant will recognize it, send it to the model, receive a response, and read it aloud (file pers_assist.py).
import numpy as np
import sounddevice as sd
import keyboard
import whisper
import threading
import time
import torch
import webrtcvad
import requests
import re
from colorama import Fore, Style, init

import gTTS_module2

# Colorama initialization
init(autoreset=True)
# === Color schemes ===
THEMES = {
    "light": {
        "user": Fore.BLUE,
        "assistant": Fore.LIGHTBLACK_EX,
        "thinking": Fore.MAGENTA,
        "background": Style.BRIGHT,
        "prompt": "Light"
    },
    "dark": {
        "user": Fore.CYAN,
        "assistant": Fore.LIGHTGREEN_EX,
        "thinking": Fore.YELLOW,
        "background": Style.DIM,
        "prompt": "Dark"
    }
}
THEME = THEMES["light"]
#print(f"\n✅ {THEME['prompt']} theme set\n")
# --- Settings ---
SAMPLE_RATE = 16000
CHANNELS = 1
DTYPE = np.int16
SEGMENT_DURATION = 0.02  # 20 ms per frame for VAD
SEGMENT_SAMPLES = int(SAMPLE_RATE * SEGMENT_DURATION)
MIN_SPEECH_CHUNKS = 10   # minimum number of consecutive speech fragments
SILENCE_TIMEOUT = 1.5    # seconds of silence before the phrase is considered finished

# Available Whisper models:
# ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium',
#  'large-v1', 'large-v2', 'large-v3', 'large', 'large-v3-turbo', 'turbo']
# --- Whisper model initialization with CUDA support ---
device = "cuda" if torch.cuda.is_available() else "cpu"
#print(f"[Device used]: {device.upper()}")
model = whisper.load_model("medium").to(device)  # for example: model = whisper.load_model("small", device="cpu")

# --- VAD initialization ---
vad = webrtcvad.Vad()
vad.set_mode(3)  # aggressiveness: 0 = least aggressive, 3 = most aggressive filtering of non-speech
def is_speech(frame_bytes):
    try:
        return vad.is_speech(frame_bytes, SAMPLE_RATE)
    except Exception:
        return False

# --- Global variables ---
recording = False
audio_buffer = []
buffer_index = 0
lock = threading.Lock()
last_speech_time = None
# --- Recording callback ---
def callback(indata, frames, time, status):
    if recording:
        with lock:
            audio_buffer.extend(indata.copy().flatten())

# --- Recording management ---
def record_audio():
    global recording
    print("Press Space to start recording...")
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=CHANNELS, dtype=DTYPE, callback=callback):
        while True:
            if keyboard.is_pressed('space'):
                toggle_recording()
                while keyboard.is_pressed('space'):
                    pass
            time.sleep(0.1)
def toggle_recording():
    global recording, audio_buffer, buffer_index
    global speech_segment, speech_started, new_line_pending, current_pause, last_speech_time
    recording = not recording
    if recording:
        print("\n[Recording started...]")
        audio_buffer.clear()
        buffer_index = 0
        # Reset VAD state
        speech_segment = []
        speech_started = False
        new_line_pending = False
        current_pause = 0.0
        last_speech_time = time.time()  # update the start time
    else:
        print("[Recording stopped.]")
def generate_response(text):
    data = {
        "messages": [
            {"role": "user", "content": text}
        ],
        #"temperature": 0.0,  # minimal randomness
        #"max_tokens": 10,    # limit the reply length in tokens
        #"stream": False,     # disable streaming
        #"stop": ["\n"]       # stop after the first line
    }
    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json=data
    )
    assist_reply = response.json()['choices'][0]['message']['content']
    # Remove tags and the content between them (e.g. <think>...</think> blocks)
    #cleaned_text = re.sub(r'<think>.*?</think>', '', assist_reply, flags=re.DOTALL)
    #print("Assistant's reply:", assist_reply)
    return assist_reply
# === Loading animation ===
def loading_animation(duration=1, text="Thinking"):
    symbols = ['⣷', '⣯', '⣟', '⡿', '⢿', '⣻', '⣽', '⣾']
    end_time = time.time() + duration
    idx = 0
    while time.time() < end_time:
        print(f"\r{THEME['thinking']}[{symbols[idx % len(symbols)]}] {text}{Style.RESET_ALL}", end="")
        idx += 1
        time.sleep(0.1)
    print(" " * (len(text) + 6), end="\r")  # Clear the line
def process_stream():
    global last_speech_time, buffer_index
    global speech_segment, speech_started, new_line_pending, current_pause
    global recording
    while True:
        if not recording:
            time.sleep(0.5)
            continue
        question_text = ""
        with lock:
            available = len(audio_buffer)
        while buffer_index + SEGMENT_SAMPLES <= available:
            segment = audio_buffer[buffer_index:buffer_index + SEGMENT_SAMPLES]
            buffer_index += SEGMENT_SAMPLES
            segment_np = np.array(segment, dtype=np.int16)
            frame_bytes = segment_np.tobytes()
            try:
                is_silence = not is_speech(frame_bytes)
                if not is_silence:
                    speech_segment.extend(segment)
                    speech_started = True
                    new_line_pending = False
                    last_speech_time = time.time()  # update the time of the last speech
                elif speech_started:
                    current_pause = time.time() - last_speech_time
                    if current_pause > SILENCE_TIMEOUT:
                        if speech_segment:
                            # Recognize and print
                            audio_float = np.array(speech_segment, dtype=np.float32) / 32768.0
                            result = model.transcribe(audio_float, language="ru", verbose=None)
                            text = result["text"].strip()
                            if text.startswith("Subtitle Editor"):  # Whisper quirk: reaction to noise
                                text = ""
                                continue
                            question_text += " " + text
                            if text:
                                print(f"{THEME['user']}You: {Style.RESET_ALL}{text}", end=" ", flush=True)
                            speech_segment = []
                        print()  # new line
                        speech_segment = []
                        speech_started = False
                        new_line_pending = False
                        # Generate a response
                        loading_animation(text="Generating response...")
                        response = generate_response(question_text)
                        print(f"{THEME['assistant']}Assistant: {response}{Style.RESET_ALL}")
                        question_text = ""
                        recording = False
                        gTTS_module2.text_to_speech_withEsc(response)
                        recording = True
            except Exception as e:
                print(f"[Error]: {e}")
        time.sleep(0.05)
# --- Entry point ---
if __name__ == "__main__":
    print("[Voice-assistant app started.]")
    threading.Thread(target=record_audio, daemon=True).start()
    threading.Thread(target=process_stream, daemon=True).start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("\nExit.")
✅ Conclusion
The full project code is available on GitHub. The voice assistant we have created is a pilot project that can be developed into a full-fledged AI assistant for home or office use. It combines several technologies: audio processing, machine learning models, and API integration. The project can serve as a foundation for anyone interested in building personal assistants.