HomeCortex

🚀 Building HomeCortex: A Local-First Voice AI for Smart Homes, and How I Got Back to My Engineering Roots

After years working on enterprise software, I wanted to get my hands dirty again. Not just to learn LLMs from the outside, through APIs and abstractions, but to understand them from the inside: how they run, how they're optimized, how they actually behave at the edge.

I also wanted to reconnect with where I started: embedded electronics, C programming, low-level audio pipelines, real hardware that boots and talks back to you.

So I gave myself a constraint that forced both worlds to collide: build a fully local, privacy-first voice assistant for my home. No cloud, no subscriptions, no data leaving the network.

The result is a two-part project: a self-hosted AI backend (HomeCortex) and an embedded voice satellite (ESP-myhome-EchoEar), both linked at the end of this post.


What HomeCortex actually is

HomeCortex is a self-hosted AI backend that connects distributed ESP32-S3 voice satellites to Home Assistant. It processes natural language locally with an LLM and generates natural voice responses, all on a Mac Mini sitting in a closet.

The full pipeline runs on local infrastructure:

  • STT: Whisper MLX, optimized for Apple Silicon (~0.5s transcription)
  • LLM: Ollama running Qwen2.5:3B (configurable)
  • TTS: Piper (local, ~0.8s) or XTTS v2 for voice cloning, with ElevenLabs as an optional cloud fallback
  • Speaker ID: pyannote.audio for personalized interactions
  • Memory: a unified SQLite layer combining semantic facts, episodic history, and adaptive habits

The orchestration layer is FastAPI. The web companion interface is built in Go. Satellites talk to the backend over WiFi.
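
To make the flow concrete, here is a minimal sketch of the orchestration step, with a single /ask endpoint and placeholder helpers standing in for the Whisper MLX, Ollama, and Piper calls (names are illustrative, not the actual HomeCortex API):

    # Sketch of the backend orchestration: audio in, spoken answer out.
    from fastapi import FastAPI, UploadFile
    from fastapi.responses import Response

    app = FastAPI()

    def transcribe(wav: bytes) -> str:
        ...  # placeholder: Whisper MLX transcription (~0.5s)

    def ask_llm(text: str, satellite_id: str) -> str:
        ...  # placeholder: Ollama chat completion (Qwen2.5:3B) with room context

    def synthesize(text: str) -> bytes:
        ...  # placeholder: Piper TTS (~0.8s), returns MP3 bytes

    @app.post("/ask")
    async def ask(satellite_id: str, audio: UploadFile) -> Response:
        text = transcribe(await audio.read())
        answer = ask_llm(text, satellite_id)
        return Response(content=synthesize(answer), media_type="audio/mpeg")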


The memory architecture is what Iโ€™m most proud of

A voice assistant that forgets everything between sessions is just a smarter parser. To feel like an assistant, it needs continuity. I built three memory layers, all local:

  1. Semantic memory: persistent facts about the household ("Emmanuel prefers 19°C at night"), injected into the system prompt under a WHAT YOU KNOW ABOUT THE HOME section. Facts can be explicit, extracted from conversation, or inferred from patterns (sketched just below this list).
  2. Episodic memory: the last N exchanges per satellite, isolated per room. Kitchen conversations don't leak into the bedroom. Continuity survives server restarts.
  3. Adaptive context: a query_stats table tracking recurrent patterns. Once a pattern crosses a threshold, it gets promoted into runtime context as a learned habit ("Favorite weather city: Geneva, 47 requests").
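
As a rough illustration of the first layer, this is roughly how stored facts can be folded into the system prompt; the table and column names here are assumptions, not the actual HomeCortex schema:

    import sqlite3

    def build_system_prompt(db_path: str, base_prompt: str) -> str:
        """Append stored household facts under a dedicated prompt section."""
        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT fact FROM semantic_facts ORDER BY id").fetchall()
        conn.close()
        facts = "\n".join(f"- {fact}" for (fact,) in rows)
        return f"{base_prompt}\n\nWHAT YOU KNOW ABOUT THE HOME:\n{facts}"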

This adaptive layer reduces clarification requests, improves intent prediction, and produces interactions that genuinely feel personalized over time.
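
A sketch of the promotion step, assuming a query_stats layout with a unique pattern column and a hit counter (the real schema and threshold may differ):

    import sqlite3

    PROMOTION_THRESHOLD = 10  # assumed value for illustration

    def record_pattern(conn: sqlite3.Connection, pattern: str) -> bool:
        """Count a recurring query pattern; True once it qualifies as a habit."""
        conn.execute(
            "INSERT INTO query_stats (pattern, hits) VALUES (?, 1) "
            "ON CONFLICT(pattern) DO UPDATE SET hits = hits + 1",
            (pattern,),
        )
        conn.commit()
        (hits,) = conn.execute(
            "SELECT hits FROM query_stats WHERE pattern = ?", (pattern,)
        ).fetchone()
        return hits >= PROMOTION_THRESHOLD

Once a pattern qualifies, its summary string (for example "Favorite weather city: Geneva") joins the runtime context as a learned habit.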


The satellite: back to C and embedded

The satellite firmware (ESP-VoCat AI Voice Satellite) is where I deliberately went back to my roots. ESP-IDF, C, raw flash partitions, custom CMake configuration. No Arduino abstractions: direct work with the ESP32-S3.

What runs on the device:

  • Local wake word ("Hey, Kira") via ESP-SR, with its own flash partition
  • I2S audio pipeline: 16kHz WAV capture, MP3 playback through a built-in codec
  • Patch-based avatar animation on a 360×360 LCD, driven by real-time audio amplitude: no GIFs, no video. Mouth frames are RGB565 raw buffers composited over the base face. Low CPU, low memory, high responsiveness.
  • Pre-recorded audio messages for low-latency confirmations ("Oui ?" / "Yes?", "C'est fait" / "Done", "Erreur" / "Error"); TTS is reserved for content that actually requires generation
  • HTTP API on port 80 with token auth: endpoints for /play, /playtxt, /wake, /volume, /status, /face/set, controllable from Home Assistant automations (a client sketch follows this list)
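
For example, any HTTP client, including a Home Assistant automation, can drive a satellite. A minimal Python sketch, with the auth header name and query parameters assumed (check the firmware README for the exact scheme):

    import requests

    SATELLITE = "http://192.168.1.50"              # satellite address on the local network
    HEADERS = {"Authorization": "Bearer <token>"}  # header name is an assumption

    # Speak a short text on the satellite, then switch the avatar face.
    requests.post(f"{SATELLITE}/playtxt", params={"text": "Dinner is ready"},
                  headers=HEADERS, timeout=5)
    requests.post(f"{SATELLITE}/face/set", params={"face": "happy"},
                  headers=HEADERS, timeout=5)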

The visual assets pipeline is Python: convert_kira_color.py composites mouth variations via chroma key, convert_kira_sprites.py converts PNGs to RGB565 raw buffers for the ESP32. Embedded discipline meets modern tooling.
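
The RGB565 packing step is simple enough to sketch; this is roughly what a conversion script has to do (an illustration, not the actual convert_kira_sprites.py):

    from PIL import Image

    def png_to_rgb565(path: str) -> bytes:
        """Pack a PNG into the raw 16-bit RGB565 buffer blitted to the LCD."""
        img = Image.open(path).convert("RGB")
        out = bytearray()
        for r, g, b in img.getdata():
            value = ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
            out += value.to_bytes(2, "big")  # byte order depends on the display driver
        return bytes(out)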


Phase 1: Neural Optimization

Beyond the working system, I've been exploring model optimization for edge deployment: INT8 dynamic quantization, structured pruning, and ONNX export for cross-platform compatibility. The goal is to prepare the architecture for migration from Apple Silicon to dedicated edge AI hardware (Qualcomm Dragonwing IQ8, Arduino VENTUNO Q) without rewriting the application layer.
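
As a flavour of what those experiments look like, here is a minimal PyTorch sketch of dynamic INT8 quantization and ONNX export on a placeholder module (not one of the actual HomeCortex models):

    import torch

    # Placeholder module standing in for a real speech or language sub-network.
    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 64)
    ).eval()

    # INT8 dynamic quantization of the Linear layers (weights stored as int8,
    # activations quantized on the fly at inference time).
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # ONNX export of the float model for cross-platform runtimes; quantization
    # can also be applied post-export with ONNX Runtime tooling.
    torch.onnx.export(model, torch.randn(1, 256), "model.onnx", opset_version=17)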

The backend is intentionally hardware-agnostic:

  • Whisper MLX → faster-whisper
  • Ollama → llama.cpp with GGUF
  • Piper → Coqui XTTS v2

Same runtime abstraction, different targets.
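
That swap only works because each stage hides behind a narrow interface. A sketch of the idea (the actual abstraction in HomeCortex may be shaped differently):

    from typing import Protocol

    class STTBackend(Protocol):
        def transcribe(self, wav: bytes) -> str: ...

    class WhisperMLXBackend:
        def transcribe(self, wav: bytes) -> str:
            ...  # Apple Silicon path

    class FasterWhisperBackend:
        def transcribe(self, wav: bytes) -> str:
            ...  # CTranslate2 path for non-Apple edge hardware

    def make_stt(name: str) -> STTBackend:
        backends = {"mlx": WhisperMLXBackend, "faster-whisper": FasterWhisperBackend}
        return backends[name]()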


What I learned (and what I wanted to learn)

This project gave me what an LLM API tutorial never could:

  • Real intuition for model behavior at the edge: latency vs. quality vs. memory, on actual constrained hardware
  • A working understanding of the full voice-AI stack (STT, LLM inference, TTS, speaker ID, prompt engineering, tool use), not as black boxes, but as components I had to wire together and debug
  • Renewed comfort with embedded C: flash partitions, I2S, audio DMA, the ESP-IDF build system, real-time constraints
  • The full systems-engineering loop: from microphone capture on an MCU, through WiFi, through inference, through Home Assistant, back to a speaker in another room, in under a couple of seconds end-to-end

Both repositories are MIT licensed and open for anyone curious. Issues, forks, and conversations are very welcome.

🔗 HomeCortex backend: https://github.com/colussim/HomeCortex

🔗 ESP-myhome-EchoEar satellite: https://github.com/colussim/ESP-myhome-EchoEar