The Voice Assistant's Ears: Understanding OVOS Listener Services
Continuing our exploration of OVOS and Neon.AI components, let’s examine how your assistant hears and understands spoken commands. The listener service (ovos-dinkum-listener) is like your assistant’s ears and early speech processing - it handles everything from detecting wake words to converting speech to text.
Listener Architecture
The listener service coordinates four critical components:
- Microphone input
- Wake word detection
- Voice Activity Detection (VAD)
- Speech-to-Text (STT)
These components communicate through the message bus we discussed in the previous article, working together to turn audio into text commands your assistant can process.
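All four components are selected in your mycroft.conf configuration file. As a rough sketch (the placeholder module names below are not real plugins, and exact keys can vary between versions), the slots look like this:

// Skeleton of the listener-related sections of mycroft.conf
// (placeholder values; the following sections cover real plugins)
{
  "listener": {
    "microphone": {
      "module": "<microphone plugin>"
    },
    "VAD": {
      "module": "<VAD plugin>"
    },
    "wake_word": "hey_mycroft"
  },
  "hotwords": {
    "hey_mycroft": {
      "module": "<wake word plugin>"
    }
  },
  "stt": {
    "module": "<STT plugin>"
  }
}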
Component Details
1. Microphone Plugins
Microphone plugins capture audio input from your hardware. OVOS supports multiple backends:
- alsa (recommended for Linux)
- pyaudio (most reliable on Windows/macOS; also works on Linux)
- sounddevice (experimental; works on all three major platforms)
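For example, selecting the alsa backend looks roughly like this in mycroft.conf. This is a sketch: plugin-specific options are nested under the module’s own name, and exact keys such as "device" should be verified against the plugin’s README.

// Assumed config keys - verify against the plugin documentation
{
  "listener": {
    "microphone": {
      "module": "ovos-microphone-plugin-alsa",
      "ovos-microphone-plugin-alsa": {
        "device": "default"
      }
    }
  }
}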
Browse available microphone plugins to find one that matches your system.
2. Wake Word Detection
Wake word detection (also called hotword detection) listens for specific trigger phrases like “Hey Mycroft” or “Hey Neon”. Popular options include:
- OpenWakeWord:
  - Modern synthetic training approach
  - Adopted by Home Assistant
  - Pre-trained models available
  - Can create custom wake words
- Precise-Lite:
  - Lightweight Mycroft engine
  - Good for resource-constrained devices (minimum suggested: Raspberry Pi 3B+)
- VOSK:
  - Text-based detection
  - No custom model training needed
  - More flexible but less precise
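As a concrete example, swapping in OpenWakeWord looks roughly like this; the module name and options here are assumptions to verify against the plugin’s README:

// "listen": true makes this hotword trigger active listening
{
  "listener": {
    "wake_word": "hey_mycroft"
  },
  "hotwords": {
    "hey_mycroft": {
      "module": "ovos-ww-plugin-openwakeword",
      "listen": true
    }
  }
}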
Browse wake word plugins for more options.
3. Voice Activity Detection (VAD)
VAD determines when someone is speaking versus background noise. This helps:
- Reduce processing load
- Trim silence from recordings
- Improve STT accuracy
Popular VAD options:
- Silero: Community favorite; works across languages. It detects actual speech rather than just sound levels, making it more accurate than noise-based VAD.
- WebRTC: Lightweight and well-suited to web integration. Noise-based, so it is less accurate but faster.
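Switching VAD engines is a one-line change in the listener block; for example, something like this for Silero (module name assumed from the usual OVOS naming convention - verify against the plugin’s README):

{
  "listener": {
    "VAD": {
      "module": "ovos-vad-plugin-silero"
    }
  }
}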
Browse VAD plugins for more choices.
4. Speech-to-Text (STT)
STT converts spoken audio to text. This is typically the most resource-intensive component. Current leading options:
- NVIDIA NeMo (Citrinet model):
  - Fast with GPU acceleration
  - Community-optimized CPU versions available
  - Neon offers a Raspberry Pi-optimized version
- OpenAI Whisper:
  - Excellent accuracy
  - Multiple model sizes
  - GPU recommended, but CPU versions available
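STT follows the same configuration pattern. Here is a sketch for a local Whisper-family plugin; the module name and the "model" size option are assumptions that depend on which plugin you actually install:

// Assumed plugin name and option - check the plugin README
{
  "stt": {
    "module": "ovos-stt-plugin-fasterwhisper",
    "ovos-stt-plugin-fasterwhisper": {
      "model": "small"
    }
  }
}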
Browse STT plugins for more options.
How It All Works Together
Let’s follow the flow of a voice command:
- The listener connects to the message bus
- The microphone plugin starts capturing audio
- The wake word plugin monitors the audio stream
- When the wake word is detected:
  - VAD starts monitoring for speech
  - STT prepares to receive audio
- When speech is detected:
  - VAD trims silence
  - Audio is sent to STT
- STT converts the speech to text
- The text is sent over the message bus to other components
This message flow looks something like:
// Wake word detected
{
  "type": "recognizer_loop:wakeword",
  "data": {
    "wake_word": "hey_mycroft"
  }
}

// Speech detected and processed
{
  "type": "recognizer_loop:utterance",
  "data": {
    "utterances": ["what's the weather like today"]
  }
}
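To see where these messages go, here is a minimal Python sketch that uses ovos-bus-client (the companion library to the message bus from the previous article) to subscribe to both events; the handler names and print statements are just for illustration:

from ovos_bus_client import MessageBusClient

def handle_wakeword(message):
    # Fired when the wake word plugin triggers
    print("Wake word:", message.data.get("wake_word"))

def handle_utterance(message):
    # "utterances" holds the candidate transcriptions from STT
    print("Heard:", message.data["utterances"][0])

bus = MessageBusClient()  # defaults to ws://127.0.0.1:8181/core
bus.on("recognizer_loop:wakeword", handle_wakeword)
bus.on("recognizer_loop:utterance", handle_utterance)
bus.run_forever()  # block and dispatch incoming messages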
Resource Considerations
STT is usually the most demanding component. You have several options:
- Full Local Processing:
  - Requires a decent CPU or GPU (not recommended on Raspberry Pi boards yet)
  - Complete privacy
  - Fastest response time
- Distributed Processing (see the config sketch after this list):
  - Run STT on a more powerful machine
  - Use lighter models on satellites
  - Requires network connectivity
- Hybrid Approach:
  - Light models for common commands
  - Powerful models for complex speech
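For the distributed option, OVOS provides a server STT plugin that forwards audio to a service running elsewhere on your network. A sketch, with a made-up address and an assumed "url" key to verify against the plugin’s README:

// Point satellites at a more powerful STT host (address is illustrative)
{
  "stt": {
    "module": "ovos-stt-plugin-server",
    "ovos-stt-plugin-server": {
      "url": "http://192.168.1.50:8080/stt"
    }
  }
}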
Conclusion
The listener service represents your assistant’s ability to hear and understand speech. While complex, its modular design lets you choose components that match your hardware capabilities and privacy requirements.
Next in series: How the Voice Assistant Speaks: Understanding OVOS Audio Services
Previous: The Voice Assistant’s Nervous System: Understanding the OVOS Message Bus