Writing a personal AI assistant in Python

Modern voice assistants are powerful applications that combine speech processing, machine learning, and integration with external APIs. In this article, we will look at how to create a basic personal assistant project in Python using libraries like whisper, webrtcvad, gTTS, and others. Our assistant will:

  • Listen to the microphone

  • Detect the beginning and end of speech using VAD (Voice Activity Detection)

  • Convert speech to text via the Whisper model

  • Send requests to a local LLM to generate responses

  • Read the answer aloud using gTTS

  • Start/stop recording with the spacebar

The project can serve both as a starting point for experiments and for prototyping real solutions.

🔧 Dependency installation

Before starting, make sure all the necessary libraries are installed (note that the Whisper package is published on PyPI as openai-whisper; pygame is needed by the TTS module for playback):

pip install numpy sounddevice keyboard openai-whisper torch webrtcvad requests colorama gTTS pygame

You’ll also need a local server with an LLM, for example LM Studio, listening at http://localhost:1234.
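
To make sure the server is actually reachable before you start talking to the assistant, you can send a quick test request. This is an optional sketch (the check_llm_server helper is not part of the project); it assumes LM Studio's OpenAI-compatible API, including the /v1/models endpoint, is listening on the default port 1234:

import requests

def check_llm_server(base_url="http://localhost:1234"):
    """Return True if the local LLM server answers on its OpenAI-compatible API."""
    try:
        # /v1/models is a standard endpoint of the OpenAI-compatible API
        response = requests.get(f"{base_url}/v1/models", timeout=3)
        return response.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("LLM server is up" if check_llm_server() else "LLM server is not reachable")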

🎤 Audio processing and voice recording

The sounddevice library is used for working with audio. We create a recording stream with a sample rate of 16 kHz and wait for the spacebar to be pressed — this is our trigger for starting/stopping recording.

def record_audio():
    global recording
    print("Press Space to start recording...")
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=CHANNELS, dtype=DTYPE, callback=callback):
        while True:
            if keyboard.is_pressed('space'):
                toggle_recording()
                while keyboard.is_pressed('space'):
                    pass
            time.sleep(0.1)
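
The constants and the callback referenced by this snippet are defined at the top of the script; for context, here are the relevant pieces (taken from the full listing at the end of the article):

import numpy as np
import threading

SAMPLE_RATE = 16000     # Whisper and webrtcvad both work at 16 kHz
CHANNELS = 1            # mono
DTYPE = np.int16        # 16-bit PCM samples

audio_buffer = []       # raw samples collected while recording
recording = False
lock = threading.Lock()

def callback(indata, frames, time, status):
    # Called by sounddevice for every captured audio block; append samples while recording
    if recording:
        with lock:
            audio_buffer.extend(indata.copy().flatten())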

Each audio fragment is added to a buffer, which is then analyzed using VAD (webrtcvad) to determine speech presence.
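
Taken in isolation, the VAD check is a single call on a 20 ms frame of 16-bit mono PCM. Below is a minimal sketch with a synthetic silent frame (the real project feeds it chunks cut from the recording buffer):

import numpy as np
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 20                                  # webrtcvad accepts 10, 20 or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(3)                         # aggressiveness 0..3

# A silent 16-bit mono PCM frame, just to demonstrate the call
frame = np.zeros(FRAME_SAMPLES, dtype=np.int16)
print(vad.is_speech(frame.tobytes(), SAMPLE_RATE))  # expected: False for a silent frame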

🗣️ Speech recognition with Whisper

Whisper is one of the most popular speech recognition models. We use it via the whisper library, loading the medium model and running it on the GPU if one is available.

model = whisper.load_model("medium").to(device)

Once speech ends (determined by pauses), the segment is fed to the model:

result = model.transcribe(audio_float, language="ru", verbose=None)
text = result["text"].strip()
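
Whisper expects mono float32 audio in the range [-1, 1] at 16 kHz, so the raw int16 samples from the buffer are normalized first, exactly as the full script at the end of the article does. The transcribe_segment wrapper below is only for illustration:

import numpy as np
import whisper

model = whisper.load_model("medium")

def transcribe_segment(speech_segment):
    """speech_segment: list of raw int16 samples recorded at 16 kHz."""
    # Whisper expects float32 audio normalized to [-1, 1]
    audio_float = np.array(speech_segment, dtype=np.float32) / 32768.0
    result = model.transcribe(audio_float, language="ru")
    return result["text"].strip()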

💬 Generating a response from AI

To generate responses, we use a locally installed model, for example google/gemma-3-4b, through LM Studio, which lets us run LLMs on our own machine and exposes an OpenAI API-compatible server.

After loading the google/gemma-3-4b model into LM Studio, you launch it in server mode. The HTTP server accepts JSON requests at http://localhost:1234/v1/chat/completions. Our Python script sends the user's text there and receives the model's response:

def generate_response(text):
    data = {
        "messages": [{"role": "user", "content": text}],
    }
    response = requests.post("http://localhost:1234/v1/chat/completions", json=data)
    return response.json()['choices'][0]['message']['content']

This approach lets you work with a powerful AI model without going online, keeping your data private and providing acceptable speed (depending on your CPU and GPU). Make sure you’ve chosen the correct model in LM Studio and clicked Run locally or Start server so the script can interact with it.
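
If you want the request to be a little more robust, the same call can be wrapped with a timeout, basic error handling, and an optional system prompt. This is a sketch of one possible extension, not part of the original script; the field names follow the OpenAI-compatible chat completions API that LM Studio exposes:

import requests

LLM_URL = "http://localhost:1234/v1/chat/completions"

def generate_response_safe(text, system_prompt="You are a helpful voice assistant."):
    data = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "temperature": 0.7,
    }
    try:
        response = requests.post(LLM_URL, json=data, timeout=120)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except (requests.RequestException, KeyError, IndexError) as e:
        return f"[LLM error: {e}]"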

🎧 Text-to-speech conversion (TTS)

The gTTS (Google Text-to-Speech) library is used to voice the response. It’s simple to use and great for beginners (module gTTS_module2.py):

import io
import time
from gtts import gTTS
import pygame
import keyboard  # To track key presses

def text_to_speech_withEsc(text: str, lang: str = 'ru'):
    """
    Converts text to speech and plays it.
    Playback can be stopped by pressing the Esc key.
    """
    try:
        # Generate audio in memory
        tts = gTTS(text=text, lang=lang)
        fp = io.BytesIO()
        tts.write_to_fp(fp)
        fp.seek(0)

        # Initialize Pygame and load audio from memory
        pygame.mixer.init()
        pygame.mixer.music.load(fp)
        pygame.mixer.music.play()

        # Play until finished or Esc is pressed
        while pygame.mixer.music.get_busy():
            if keyboard.is_pressed('esc'):
                pygame.mixer.music.stop()
                print("Playback stopped (Esc)")
                break
            time.sleep(0.05)  # avoid busy-waiting at full CPU

        pygame.mixer.quit()

    except Exception as e:
        print(f"Error during speech: {e}")
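
You can test the module on its own before wiring it into the assistant (assuming the file is saved as gTTS_module2.py, the name the main script imports):

import gTTS_module2

# Speaks the phrase through the speakers; press Esc to interrupt playback
gTTS_module2.text_to_speech_withEsc("Hello! The voice assistant is ready.", lang="en")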

🖌️ Color scheme and interface

To make the output more readable, we apply colors using colorama. You can choose between light and dark themes:

THEMES = {
    "light": {
        "user": Fore.BLUE,
        "assistant": Fore.LIGHTBLACK_EX,
        ...
    },
    "dark": {
        "user": Fore.CYAN,
        "assistant": Fore.LIGHTGREEN_EX,
        ...
    }
}
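
Switching the theme is a single assignment, and the selected colors are then used directly in print calls. Here is a self-contained sketch with a trimmed-down THEMES dictionary (the full one appears in the complete listing below):

from colorama import Fore, Style, init

init(autoreset=True)

THEMES = {
    "light": {"user": Fore.BLUE, "assistant": Fore.LIGHTBLACK_EX},
    "dark": {"user": Fore.CYAN, "assistant": Fore.LIGHTGREEN_EX},
}

THEME = THEMES["dark"]  # pick a theme once, use it everywhere
print(f"{THEME['user']}You: {Style.RESET_ALL}what's the weather like?")
print(f"{THEME['assistant']}Assistant: Looks sunny today.{Style.RESET_ALL}")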

A “thinking” assistant animation has also been added during response generation:

loading_animation(duration=1, text="Generating response...")

🚀 Launch and operation

Start the LLM server and run the main script (file pers_assist.py), press Space, and ask your question. The assistant will recognize your speech, send it to the model, receive a response, and read it aloud.

import numpy as np
import sounddevice as sd
import keyboard
import whisper
import threading
import time
import torch
import webrtcvad
import requests
from colorama import Fore, Style, init
import re
import gTTS_module2

# Colorama initialization
init(autoreset=True)

# === Color schemes ===
THEMES = {
    "light": {
        "user": Fore.BLUE,
        "assistant": Fore.LIGHTBLACK_EX,
        "thinking": Fore.MAGENTA,
        "background": Style.BRIGHT,
        "prompt": "Light"
    },
    "dark": {
        "user": Fore.CYAN,
        "assistant": Fore.LIGHTGREEN_EX,
        "thinking": Fore.YELLOW,
        "background": Style.DIM,
        "prompt": "Dark"
    }
}

THEME = THEMES["light"]
#print(f"\n✅ {THEME['prompt']} theme set\n")

# --- Settings ---
SAMPLE_RATE = 16000
CHANNELS = 1
DTYPE = np.int16

SEGMENT_DURATION = 0.02  # 20 ms for VAD
SEGMENT_SAMPLES = int(SAMPLE_RATE * SEGMENT_DURATION)

MIN_SPEECH_CHUNKS = 10     # minimum consecutive speech fragments
SILENCE_TIMEOUT = 1.5      # seconds waiting before new line

#['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large', 'large-v3-turbo', 'turbo']  
# --- Whisper model initialization with CUDA support ---
device = "cuda" if torch.cuda.is_available() else "cpu"
#print(f"[Device used]: {device.upper()}")
model = whisper.load_model("medium").to(device)  #for example: model = whisper.load_model("small", device="cpu") 

# --- VAD initialization ---
vad = webrtcvad.Vad()
vad.set_mode(3)  # aggressiveness: 0 = least aggressive, 3 = most aggressive (filters the most noise)

def is_speech(frame_bytes):
    try:
        return vad.is_speech(frame_bytes, SAMPLE_RATE)
    except Exception:
        return False

# --- Global variables ---
recording = False
audio_buffer = []
buffer_index = 0 
lock = threading.Lock()
last_speech_time = None

# --- Recording callback ---
def callback(indata, frames, time, status):
    if recording:
        with lock:
            audio_buffer.extend(indata.copy().flatten())

# --- Recording management ---
def record_audio():
    global recording
    print("Press Space to start recording...")
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=CHANNELS, dtype=DTYPE, callback=callback):
        while True:
            if keyboard.is_pressed('space'):
                toggle_recording()
                while keyboard.is_pressed('space'):
                    pass
            time.sleep(0.1)

def toggle_recording():
    global recording, audio_buffer, buffer_index
    global speech_segment, speech_started, new_line_pending, current_pause, last_speech_time

    recording = not recording
    if recording:
        print("\n[Recording started...]")
        audio_buffer.clear()
        buffer_index = 0

        # Reset VAD state
        speech_segment = []   
        speech_started = False    
        new_line_pending = False
        current_pause = 0.0
        last_speech_time = time.time()  # ← update start time 
    else:
        print("[Recording stopped.]")

def generate_response(text):
    data = { 
    "messages": [
        {"role": "user", "content": text}       
    ],
    #"temperature": 0.0,        # minimal randomness
    #"max_tokens": 10,          # min. reply tokens
    #"stream": False,           # disables stream
    #"stop": ["\n"]             # stop after first line 
    } 
    response = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json=data
    )
    assist_reply = response.json()['choices'][0]['message']['content']
    # Remove tags and contents between them
    #cleaned_text = re.sub(r'\.*?<\', '', assist_reply, flags=re.DOTALL)
    #print("Assistant's reply:", assist_reply) 
    return assist_reply 

# === Loading animation ===
def loading_animation(duration=1, text="Thinking"):
    symbols = [ '⣷', '⣯', '⣟', '⡿', '⢿', '⣻', '⣽','⣾']
    end_time = time.time() + duration
    idx = 0
    while time.time() < end_time:
        print(f"\r{THEME['thinking']}[{symbols[idx % len(symbols)]}] {text}{Style.RESET_ALL}", end="")
        idx += 1
        time.sleep(0.1)
    print(" " * (len(text) + 6), end="\r")  # Clear line

def process_stream():
    global last_speech_time, buffer_index
    global speech_segment, speech_started, new_line_pending, current_pause
    global recording   

    while True:
        if not recording:
            time.sleep(0.5)
            continue
        question_text = ""        
        with lock:
            available = len(audio_buffer)

        while buffer_index + SEGMENT_SAMPLES <= available:
            segment = audio_buffer[buffer_index:buffer_index + SEGMENT_SAMPLES]
            buffer_index += SEGMENT_SAMPLES

            segment_np = np.array(segment, dtype=np.int16)
            frame_bytes = segment_np.tobytes()

            try:
                is_silence = not is_speech(frame_bytes)

                if not is_silence:
                    speech_segment.extend(segment)
                    speech_started = True
                    new_line_pending = False
                    last_speech_time = time.time()  # ← update speech time
                elif speech_started:
                    current_pause = time.time() - last_speech_time

                    if current_pause > SILENCE_TIMEOUT:
                        if speech_segment:
                            # Recognize and print
                            audio_float = np.array(speech_segment, dtype=np.float32) / 32768.0
                            result = model.transcribe(audio_float, language="ru", verbose=None)

                            text = result["text"].strip()
                            if text.startswith("Subtitle Editor"): # whisper bug, noise reaction
                                text = ""
                                continue
                            question_text += " " + text    
                            if text:
                                print(f"{THEME['user']}You: {Style.RESET_ALL}{text}" , end=" ", flush=True)

                            speech_segment = []

                        print()  # new line
                        speech_segment = []
                        speech_started = False
                        new_line_pending = False
                         # Generate response
                        loading_animation(text="Generating response...")
                        #print(f"\r{THEME['thinking']}[{symbols[idx % len(symbols)]}] {text}{Style.RESET_ALL}", end="")
                        #print(f"{THINKING_COLOR}Generating response...{RESET}", end="\r")
                        response = generate_response(question_text) 
                        print(f"{THEME['assistant']}Assistant: {response}{Style.RESET_ALL}")
                        question_text = "" 
                        recording = False
                        gTTS_module2.text_to_speech_withEsc(response)
                        recording = True

            except Exception as e: 
                print(f"[Error]: {e}")

        time.sleep(0.05)

# --- Entry point ---
if __name__ == "__main__":
    print("[Voice-assistant app started.]")
    threading.Thread(target=record_audio, daemon=True).start()
    threading.Thread(target=process_stream, daemon=True).start()

    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("\nExit.")

✅ Conclusion

The full project code is available on GitHub. The voice assistant we built is a pilot project that can grow into a full-fledged AI assistant for home or office. It combines several technologies: audio processing, machine learning models, and API integration, and it can serve as a foundation for anyone interested in building personal assistants.
