BUILDING VOICE ASSISTANT CONFIGURATIONS: ADVANCED OVOS SETUPS
After exploring each component of OVOS and Neon assistants, let’s examine how to mix and match these components to create custom configurations. The modular nature of OVOS allows for setups ranging from minimal text-only systems to complex distributed networks.

Text-Only Assistants

The simplest possible configuration requires just two components:

- Message bus
- Core/skills service

This minimal setup can be useful for:

- Development and testing
- Accessibility (vision/hearing impaired users)
- Integration with existing text interfaces
- Command-line or web-based interaction

In this configuration, the message flow is straightforward: text utterances go onto the bus, the core/skills service handles them, and responses come back as bus messages.
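As a rough illustration of that flow, here is a minimal sketch using the ovos-bus-client package, assuming a message bus is already running on the default port; the example utterance is arbitrary:

```python
from ovos_bus_client import MessageBusClient, Message

bus = MessageBusClient()        # defaults to ws://127.0.0.1:8181/core
bus.run_in_thread()
bus.connected_event.wait()

# Print whatever the skills service decides to say in reply
bus.on("speak", lambda msg: print("Assistant:", msg.data["utterance"]))

# Inject a typed "utterance" exactly as the listener would
bus.emit(Message("recognizer_loop:utterance",
                 {"utterances": ["what time is it"]}))
```

No microphone, wake word, or TTS needed - the skills service can’t tell the difference between this and a transcribed voice command.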
THE VOICE ASSISTANT'S BODY: UNDERSTANDING OVOS HARDWARE INTEGRATION
While previous articles covered how your assistant listens, thinks, and speaks, now we’ll explore how it interacts with physical hardware through the Platform Hardware Abstraction Layer (PHAL) system.

Evolution of Hardware Support

The PHAL system’s history helps explain its design. Originally, Mycroft AI’s code was tightly coupled to their Mark 1 device, which included:

- LED panel for eyes and mouth
- Custom Arduino-controlled circuit board
- Specific audio hardware configuration

When Mycroft developed their Mark 2 device, they discovered this tight coupling made supporting new hardware difficult.
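PHAL addresses this by moving hardware support into standalone plugins that talk to the rest of the stack over the message bus. As a rough sketch of the general shape, assuming the PHALPlugin template from ovos-plugin-manager; the plugin name and LED behavior here are hypothetical:

```python
from ovos_plugin_manager.templates.phal import PHALPlugin


class ExampleLedPlugin(PHALPlugin):
    """Hypothetical plugin: light an LED while the assistant is listening."""

    def __init__(self, bus=None, config=None):
        super().__init__(bus=bus, name="phal-plugin-example-led", config=config)
        # React to core events over the bus instead of being wired into core code
        self.bus.on("recognizer_loop:record_begin", self.handle_record_begin)
        self.bus.on("recognizer_loop:record_end", self.handle_record_end)

    def handle_record_begin(self, message):
        pass  # turn the LED on (GPIO calls would go here)

    def handle_record_end(self, message):
        pass  # turn the LED off
```

Because the plugin only touches the bus, the same core software runs unchanged on a Mark 1, a Mark 2, or a plain headless Pi.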
THE VOICE ASSISTANT'S BRAIN: UNDERSTANDING OVOS SKILLS
After exploring how your assistant communicates, speaks, listens, and controls hardware, let’s examine how it processes and responds to commands through the core/skills service.

Core Service Overview

The core service (called either “core” or “skills” depending on your OVOS implementation) coordinates two main components:

- Intent Engine
- Skills System

These components work together to understand user requests and execute appropriate actions.

Intent Engine

The intent engine matches user requests with the appropriate skill.
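To make that relationship concrete, here is a minimal sketch of a skill built on ovos-workshop; the intent and dialog file names are hypothetical placeholders for files that would live in the skill’s locale directory:

```python
from ovos_workshop.skills import OVOSSkill
from ovos_workshop.decorators import intent_handler


class WeatherDemoSkill(OVOSSkill):
    """When the intent engine matches an utterance to weather.intent,
    it routes the request to the decorated handler below."""

    @intent_handler("weather.intent")
    def handle_weather(self, message):
        # speak_dialog ultimately sends a "speak" message to the audio service
        self.speak_dialog("weather")
```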
THE VOICE ASSISTANT'S EARS: UNDERSTANDING OVOS LISTENER SERVICES
Continuing our exploration of OVOS and Neon.AI components, let’s examine how your assistant hears and understands spoken commands. The listener service (ovos-dinkum-listener) is like your assistant’s ears and early speech processing - it handles everything from detecting wake words to converting speech to text.

Listener Architecture

The listener service coordinates four critical components:

- Microphone input
- Wake word detection
- Voice Activity Detection (VAD)
- Speech-to-Text (STT)

These components communicate through the message bus we discussed in the previous article, working together to turn audio into text commands your assistant can process.
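You can watch the listener work by subscribing to the bus events it emits at each stage. A minimal sketch, assuming a running assistant and the ovos-bus-client package:

```python
from ovos_bus_client import MessageBusClient

bus = MessageBusClient()

# Fired when the wake word plugin detects the hotword
bus.on("recognizer_loop:wakeword", lambda msg: print("Wake word detected"))

# Fired after STT finishes transcribing the recorded audio
bus.on("recognizer_loop:utterance",
       lambda msg: print("Heard:", msg.data["utterances"][0]))

bus.run_forever()  # block and keep printing events as they arrive
```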
THE VOICE ASSISTANT'S MOUTH: UNDERSTANDING OVOS AUDIO SERVICES
After exploring how your assistant listens in our previous article, let’s look at how it speaks and plays audio. The audio service (ovos-audio) handles all sound output, from spoken responses to music playback.

Audio Service Overview

Just as the listener service coordinates multiple components for hearing, the audio service manages two main components:

- Text-to-Speech (TTS)
- Audio playback

These components communicate through the message bus we covered in part 1, responding to requests from skills and other services.
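Because the audio service simply listens for bus messages, anything that can reach the bus can make the assistant talk. A minimal sketch, assuming ovos-audio is running with a configured TTS plugin:

```python
from ovos_bus_client import MessageBusClient, Message

bus = MessageBusClient()
bus.run_in_thread()
bus.connected_event.wait()

# The audio service synthesizes this text with the configured
# TTS plugin and plays the result through the speakers
bus.emit(Message("speak", {"utterance": "Hello from the audio service"}))
```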
THE VOICE ASSISTANT'S NERVOUS SYSTEM: UNDERSTANDING THE OVOS MESSAGE BUS
OpenVoiceOS (OVOS) and Neon AI offer powerful options for creating private, local voice assistants. While most users can get started with ovos-installer or a Neon image, understanding how these assistants work internally helps you customize them effectively. Let’s start with the most fundamental component: the message bus.

What is a Message Bus?

Think of the message bus as your assistant’s nervous system - it’s how all the different parts communicate. Just as your nervous system carries signals between your brain, ears, and mouth, the message bus carries messages between your assistant’s components.
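Each signal on that nervous system is a small JSON payload with a type, a data dict, and a context dict. A quick sketch, using a made-up message type for illustration:

```python
from ovos_bus_client.message import Message

msg = Message("example.custom.message",   # made-up type for illustration
              data={"value": 42},
              context={"source": "example"})

# serialize() shows exactly what travels over the websocket
print(msg.serialize())
# -> {"type": "example.custom.message", "data": {"value": 42},
#     "context": {"source": "example"}}
```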
GUNICORN IN CONTAINERS
There is a dearth of information on how to run a Flask/Django app on Kubernetes using gunicorn (mostly for good reason!), and what information is available is often conflicting and confusing. Based on issues I’ve seen with my customers at Defiance Digital in the last year or so, I developed a test repository to experiment with different configurations and see which works best.

tl;dr

The conventional wisdom of using multiple workers in a containerized instance of Flask, Django, or anything else served with gunicorn is incorrect - you should only use one or two workers per container; otherwise, you’re not properly using the resources allocated to your application.
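In config terms, that advice boils down to something like this sketch of a gunicorn.conf.py; the thread count, port, and timeout are placeholders to tune for your workload:

```python
# gunicorn.conf.py
bind = "0.0.0.0:8000"
workers = 1    # one worker per container; scale out with more replicas
threads = 4    # let the single worker serve concurrent I/O-bound requests
timeout = 30
```

Started with something like `gunicorn -c gunicorn.conf.py app:app`, each container stays right-sized, and horizontal scaling is left to the orchestrator rather than to gunicorn.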
FASTERWHISPER STT SERVER SCRIPT
Running a FasterWhisper STT Server on a Raspberry Pi

Over the course of the last year, I’ve spent a considerable amount of time helping Neon and OVOS users customize their voice assistants. OVOS and Neon are both incredibly flexible platforms, which makes them powerful but also complex. The biggest hurdle to getting a voice assistant running on a single small machine, such as a Raspberry Pi, has been Speech-To-Text (STT). STT models are large and require significantly more computing power than other parts of a voice assistant.
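faster-whisper keeps that footprint manageable with quantized models. A minimal sketch of local transcription, assuming the faster-whisper package is installed and you have a WAV file to test with:

```python
from faster_whisper import WhisperModel

# A small English-only model with int8 quantization keeps CPU and
# memory use low enough for a Raspberry Pi
model = WhisperModel("tiny.en", device="cpu", compute_type="int8")

segments, info = model.transcribe("command.wav")
print("Detected language:", info.language)
for segment in segments:
    print(segment.text)
```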
PIPER TTS SERVER SCRIPT
Running a Piper TTS Server on a Raspberry Pi

Over the course of the last year, I’ve spent a considerable amount of time helping Neon and OVOS users customize their voice assistants. OVOS and Neon are both incredibly flexible platforms, which makes them powerful but also complex. The two most frequently requested text-to-speech (TTS) options are Coqui TTS and Piper TTS. Coqui is the spiritual successor to Mozilla’s DeepSpeech and is unfortunately no longer going to be supported.
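Piper itself is easy to drive from Python once the binary and a voice model are downloaded. A minimal sketch; the model filename is an example, so substitute whichever voice you downloaded:

```python
import subprocess


def synthesize(text: str, wav_path: str = "out.wav") -> None:
    """Render text to a WAV file with the piper CLI, which reads
    the text to speak from stdin."""
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx",
         "--output_file", wav_path],
        input=text.encode("utf-8"),
        check=True,
    )


synthesize("Hello from Piper on a Raspberry Pi")
```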
MYCROFT MARK II TEARDOWN AND PI UPGRADE/REPLACEMENT
Note: This is a guest post from Chance Rosenthal with the OVOS Foundation.

Caution: don’t do this! If, for some reason, you need to bust open and disassemble your Mark II in spite of that warning, here are some annotated photographs of just that process. Midway, we’ll swap out the factory Pi for an 8GB model to facilitate offline performance.

Note about screws

There are several different screws in play here.