HomeCortex

🚀 Building HomeCortex: A Local-First Voice AI for Smart Homes, and How I Got Back to My Engineering Roots

After years working on enterprise software, I wanted to get my hands dirty again. Not just to learn LLMs from the outside, through APIs and abstractions, but to understand them from the inside: how they run, how they're optimized, how they actually behave at the edge.

I also wanted to reconnect with where I started: embedded electronics, C programming, low-level audio pipelines, real hardware that boots and talks back to you.

So I gave myself a constraint that forced both worlds to collide: build a fully local, privacy-first voice assistant for my home. No cloud, no subscriptions, no data leaving the network.

The result is a two-part project: a self-hosted AI backend (HomeCortex) and an embedded voice satellite (ESP-myhome-EchoEar), both linked at the end of this post.


What HomeCortex actually is

HomeCortex is a self-hosted AI backend that connects distributed ESP32-S3 voice satellites to Home Assistant. It processes natural language locally with an LLM and generates natural voice responses, all on a Mac Mini sitting in a closet.

The full pipeline runs on local infrastructure:

  • STT: Whisper MLX, optimized for Apple Silicon (~0.5s transcription)
  • LLM: Ollama running Qwen2.5:3B (configurable)
  • TTS: Piper (local, ~0.8s) or XTTS v2 for voice cloning, with ElevenLabs as an optional cloud fallback
  • Speaker ID: pyannote.audio for personalized interactions
  • Memory: a unified SQLite layer combining semantic facts, episodic history, and adaptive habits

The orchestration layer is FastAPI. The web companion interface is built in Go. Satellites talk to the backend over WiFi.
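
To make the flow concrete, here is a minimal sketch of the orchestration step, with a single /ask endpoint and placeholder helpers standing in for the Whisper MLX, Ollama, and Piper calls (names are illustrative, not the actual HomeCortex API):

    # Sketch of the backend orchestration: audio in, spoken answer out.
    from fastapi import FastAPI, UploadFile
    from fastapi.responses import Response

    app = FastAPI()

    def transcribe(wav: bytes) -> str:
        ...  # placeholder: Whisper MLX transcription (~0.5s)

    def ask_llm(text: str, satellite_id: str) -> str:
        ...  # placeholder: Ollama chat completion (Qwen2.5:3B) with room context

    def synthesize(text: str) -> bytes:
        ...  # placeholder: Piper TTS (~0.8s), returns MP3 bytes

    @app.post("/ask")
    async def ask(satellite_id: str, audio: UploadFile) -> Response:
        text = transcribe(await audio.read())
        answer = ask_llm(text, satellite_id)
        return Response(content=synthesize(answer), media_type="audio/mpeg")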


The memory architecture is what Iโ€™m most proud of

A voice assistant that forgets everything between sessions is just a smarter parser. To feel like an assistant, it needs continuity. I built three memory layers, all local:

  1. Semantic memory: persistent facts about the household ("Emmanuel prefers 19°C at night"), injected into the system prompt under a WHAT YOU KNOW ABOUT THE HOME section. Facts can be explicit, extracted from conversation, or inferred from patterns (sketched just below this list).
  2. Episodic memory: the last N exchanges per satellite, isolated per room. Kitchen conversations don't leak into the bedroom. Continuity survives server restarts.
  3. Adaptive context: a query_stats table tracking recurrent patterns. Once a pattern crosses a threshold, it gets promoted into runtime context as a learned habit ("Favorite weather city: Geneva, 47 requests").
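
As a rough illustration of the first layer, this is roughly how stored facts can be folded into the system prompt; the table and column names here are assumptions, not the actual HomeCortex schema:

    import sqlite3

    def build_system_prompt(db_path: str, base_prompt: str) -> str:
        """Append stored household facts under a dedicated prompt section."""
        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT fact FROM semantic_facts ORDER BY id").fetchall()
        conn.close()
        facts = "\n".join(f"- {fact}" for (fact,) in rows)
        return f"{base_prompt}\n\nWHAT YOU KNOW ABOUT THE HOME:\n{facts}"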

This adaptive layer reduces clarification requests, improves intent prediction, and produces interactions that genuinely feel personalized over time.
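
A sketch of the promotion step, assuming a query_stats layout with a unique pattern column and a hit counter (the real schema and threshold may differ):

    import sqlite3

    PROMOTION_THRESHOLD = 10  # assumed value for illustration

    def record_pattern(conn: sqlite3.Connection, pattern: str) -> bool:
        """Count a recurring query pattern; True once it qualifies as a habit."""
        conn.execute(
            "INSERT INTO query_stats (pattern, hits) VALUES (?, 1) "
            "ON CONFLICT(pattern) DO UPDATE SET hits = hits + 1",
            (pattern,),
        )
        conn.commit()
        (hits,) = conn.execute(
            "SELECT hits FROM query_stats WHERE pattern = ?", (pattern,)
        ).fetchone()
        return hits >= PROMOTION_THRESHOLD

Once a pattern qualifies, its summary string (for example "Favorite weather city: Geneva") joins the runtime context as a learned habit.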


The satellite: back to C and embedded

The satellite firmware (ESP-VoCat AI Voice Satellite) is where I deliberately went back to my roots. ESP-IDF, C, raw flash partitions, custom CMake configuration. No Arduino abstractions: direct work with the ESP32-S3.

What runs on the device:

  • Local wake word ("Hey, Kira") via ESP-SR, with its own flash partition
  • I2S audio pipeline: 16kHz WAV capture, MP3 playback through a built-in codec
  • Patch-based avatar animation on a 360×360 LCD, driven by real-time audio amplitude: no GIFs, no video. Mouth frames are RGB565 raw buffers composited over the base face. Low CPU, low memory, high responsiveness.
  • Pre-recorded audio messages for low-latency confirmations ("Oui ?" / "Yes?", "C'est fait" / "Done", "Erreur" / "Error"); TTS is reserved for content that actually requires generation
  • HTTP API on port 80 with token auth: endpoints for /play, /playtxt, /wake, /volume, /status, /face/set, controllable from Home Assistant automations (a client sketch follows this list)
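
For example, any HTTP client, including a Home Assistant automation, can drive a satellite. A minimal Python sketch, with the auth header name and query parameters assumed (check the firmware README for the exact scheme):

    import requests

    SATELLITE = "http://192.168.1.50"              # satellite address on the local network
    HEADERS = {"Authorization": "Bearer <token>"}  # header name is an assumption

    # Speak a short text on the satellite, then switch the avatar face.
    requests.post(f"{SATELLITE}/playtxt", params={"text": "Dinner is ready"},
                  headers=HEADERS, timeout=5)
    requests.post(f"{SATELLITE}/face/set", params={"face": "happy"},
                  headers=HEADERS, timeout=5)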

The visual assets pipeline is Python: convert_kira_color.py composites mouth variations via chroma key, convert_kira_sprites.py converts PNGs to RGB565 raw buffers for the ESP32. Embedded discipline meets modern tooling.
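
The RGB565 packing step is simple enough to sketch; this is roughly what a conversion script has to do (an illustration, not the actual convert_kira_sprites.py):

    from PIL import Image

    def png_to_rgb565(path: str) -> bytes:
        """Pack a PNG into the raw 16-bit RGB565 buffer blitted to the LCD."""
        img = Image.open(path).convert("RGB")
        out = bytearray()
        for r, g, b in img.getdata():
            value = ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)
            out += value.to_bytes(2, "big")  # byte order depends on the display driver
        return bytes(out)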


Phase 1: Neural Optimization

Beyond the working system, I've been exploring model optimization for edge deployment: INT8 dynamic quantization, structured pruning, and ONNX export for cross-platform compatibility. The goal is to prepare the architecture for migration from Apple Silicon to dedicated edge AI hardware (Qualcomm Dragonwing IQ8, Arduino VENTUNO Q) without rewriting the application layer.
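
As a flavour of what those experiments look like, here is a minimal PyTorch sketch of dynamic INT8 quantization and ONNX export on a placeholder module (not one of the actual HomeCortex models):

    import torch

    # Placeholder module standing in for a real speech or language sub-network.
    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 64)
    ).eval()

    # INT8 dynamic quantization of the Linear layers (weights stored as int8,
    # activations quantized on the fly at inference time).
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    # ONNX export of the float model for cross-platform runtimes; quantization
    # can also be applied post-export with ONNX Runtime tooling.
    torch.onnx.export(model, torch.randn(1, 256), "model.onnx", opset_version=17)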

The backend is intentionally hardware-agnostic:

  • Whisper MLX → faster-whisper
  • Ollama → llama.cpp with GGUF
  • Piper → Coqui XTTS v2

Same runtime abstraction, different targets.
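
That swap only works because each stage hides behind a narrow interface. A sketch of the idea (the actual abstraction in HomeCortex may be shaped differently):

    from typing import Protocol

    class STTBackend(Protocol):
        def transcribe(self, wav: bytes) -> str: ...

    class WhisperMLXBackend:
        def transcribe(self, wav: bytes) -> str:
            ...  # Apple Silicon path

    class FasterWhisperBackend:
        def transcribe(self, wav: bytes) -> str:
            ...  # CTranslate2 path for non-Apple edge hardware

    def make_stt(name: str) -> STTBackend:
        backends = {"mlx": WhisperMLXBackend, "faster-whisper": FasterWhisperBackend}
        return backends[name]()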


What I learned (and what I wanted to learn)

This project gave me what an LLM API tutorial never could:

  • Real intuition for model behavior at the edge: latency vs. quality vs. memory, on actual constrained hardware
  • A working understanding of the full voice-AI stack (STT, LLM inference, TTS, speaker ID, prompt engineering, tool use), not as black boxes, but as components I had to wire together and debug
  • Renewed comfort with embedded C: flash partitions, I2S, audio DMA, the ESP-IDF build system, real-time constraints
  • The full systems-engineering loop: from microphone capture on an MCU, through WiFi, through inference, through Home Assistant, back to a speaker in another room, in under a couple of seconds end-to-end

Both repositories are MIT licensed and open for anyone curious. Issues, forks, and conversations are very welcome.

🔗 HomeCortex backend: https://github.com/colussim/HomeCortex

🔗 ESP-myhome-EchoEar satellite: https://github.com/colussim/ESP-myhome-EchoEar