The Voice Assistant's Ears: Understanding OVOS Listener Services
Continuing our exploration of OVOS and Neon.AI components, let’s examine how your assistant hears and understands spoken commands. The listener service (ovos-dinkum-listener) is like your assistant’s ears and early speech processing - it handles everything from detecting wake words to converting speech to text.
Listener Architecture
The listener service coordinates four critical components:
- Microphone input
- Wake word detection
- Voice Activity Detection (VAD)
- Speech-to-Text (STT)
These components communicate through the message bus we discussed in the previous article, working together to turn audio into text commands your assistant can process.
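All four components are selected in your mycroft.conf configuration file. As a rough sketch (the placeholder module names below are not real plugins, and exact keys can vary between versions), the slots look like this:

// Skeleton of the listener-related sections of mycroft.conf
// (placeholder values; the following sections cover real plugins)
{
  "listener": {
    "microphone": {
      "module": "<microphone plugin>"
    },
    "VAD": {
      "module": "<VAD plugin>"
    },
    "wake_word": "hey_mycroft"
  },
  "hotwords": {
    "hey_mycroft": {
      "module": "<wake word plugin>"
    }
  },
  "stt": {
    "module": "<STT plugin>"
  }
}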
Component Details
1. Microphone Plugins
Microphone plugins capture audio input from your hardware. OVOS supports multiple backends:
- alsa (recommended for Linux)
- pyaudio (most reliable on Windows/macOS; also works on Linux)
- sounddevice (experimental; works on all three major platforms)
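For example, selecting the alsa backend looks roughly like this in mycroft.conf. This is a sketch: plugin-specific options are nested under the module’s own name, and exact keys such as "device" should be verified against the plugin’s README.

// Assumed config keys - verify against the plugin documentation
{
  "listener": {
    "microphone": {
      "module": "ovos-microphone-plugin-alsa",
      "ovos-microphone-plugin-alsa": {
        "device": "default"
      }
    }
  }
}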
Browse available microphone plugins to find one that matches your system.
2. Wake Word Detection
Wake word detection (also called hotword detection) listens for specific trigger phrases like “Hey Mycroft” or “Hey Neon”. Popular options include:
- OpenWakeWord:
  - Modern synthetic training approach
  - Adopted by Home Assistant
  - Pre-trained models available
  - Can create custom wake words
- Precise-Lite:
  - Lightweight Mycroft engine
  - Good for resource-constrained devices (minimum suggested: Raspberry Pi 3B+)
- VOSK:
  - Text-based detection
  - No custom model training needed
  - More flexible but less precise
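As a concrete example, swapping in OpenWakeWord looks roughly like this; the module name and options here are assumptions to verify against the plugin’s README:

// "listen": true makes this hotword trigger active listening
{
  "listener": {
    "wake_word": "hey_mycroft"
  },
  "hotwords": {
    "hey_mycroft": {
      "module": "ovos-ww-plugin-openwakeword",
      "listen": true
    }
  }
}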
Browse wake word plugins for more options.
3. Voice Activity Detection (VAD)
VAD determines when someone is speaking versus background noise. This helps:
- Reduce processing load
- Trim silence from recordings
- Improve STT accuracy
Popular VAD options:
- Silero: Community favorite; works across languages. It detects actual speech rather than just sound levels, making it more accurate than noise-based VAD.
- WebRTC: Lightweight and well-suited to web integration. Noise-based, so it is less accurate but faster.
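Switching VAD engines is a one-line change in the listener block; for example, something like this for Silero (module name assumed from the usual OVOS naming convention - verify against the plugin’s README):

{
  "listener": {
    "VAD": {
      "module": "ovos-vad-plugin-silero"
    }
  }
}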
Browse VAD plugins for more choices.
4. Speech-to-Text (STT)
STT converts spoken audio to text. This is typically the most resource-intensive component. Current leading options:
- NVIDIA NeMo (Citrinet model):
  - Fast with GPU acceleration
  - Community-optimized CPU versions available
  - Neon offers a Raspberry Pi-optimized version
- OpenAI Whisper:
  - Excellent accuracy
  - Multiple model sizes
  - GPU recommended, but CPU versions available
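STT follows the same configuration pattern. Here is a sketch for a local Whisper-family plugin; the module name and the "model" size option are assumptions that depend on which plugin you actually install:

// Assumed plugin name and option - check the plugin README
{
  "stt": {
    "module": "ovos-stt-plugin-fasterwhisper",
    "ovos-stt-plugin-fasterwhisper": {
      "model": "small"
    }
  }
}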
Browse STT plugins for more options.
How It All Works Together
Let’s follow the flow of a voice command:
- The listener connects to the message bus
- The microphone plugin starts capturing audio
- The wake word plugin monitors the audio stream
- When the wake word is detected:
  - VAD starts monitoring for speech
  - STT prepares to receive audio
- When speech is detected:
  - VAD trims silence
  - Audio is sent to STT
- STT converts the speech to text
- The text is sent over the message bus to other components
This message flow looks something like:
// Wake word detected
{
  "type": "recognizer_loop:wakeword",
  "data": {
    "wake_word": "hey_mycroft"
  }
}

// Speech detected and processed
{
  "type": "recognizer_loop:utterance",
  "data": {
    "utterances": ["what's the weather like today"]
  }
}
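To see where these messages go, here is a minimal Python sketch that uses ovos-bus-client (the companion library to the message bus from the previous article) to subscribe to both events; the handler names and print statements are just for illustration:

from ovos_bus_client import MessageBusClient

def handle_wakeword(message):
    # Fired when the wake word plugin triggers
    print("Wake word:", message.data.get("wake_word"))

def handle_utterance(message):
    # "utterances" holds the candidate transcriptions from STT
    print("Heard:", message.data["utterances"][0])

bus = MessageBusClient()  # defaults to ws://127.0.0.1:8181/core
bus.on("recognizer_loop:wakeword", handle_wakeword)
bus.on("recognizer_loop:utterance", handle_utterance)
bus.run_forever()  # block and dispatch incoming messages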
Resource Considerations
STT is usually the most demanding component. You have several options:
- Full Local Processing:
  - Requires a decent CPU or GPU (not recommended on Raspberry Pi boards yet)
  - Complete privacy
  - Fastest response time
- Distributed Processing (see the config sketch after this list):
  - Run STT on a more powerful machine
  - Use lighter models on satellites
  - Requires network connectivity
- Hybrid Approach:
  - Light models for common commands
  - Powerful models for complex speech
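For the distributed option, OVOS provides a server STT plugin that forwards audio to a service running elsewhere on your network. A sketch, with a made-up address and an assumed "url" key to verify against the plugin’s README:

// Point satellites at a more powerful STT host (address is illustrative)
{
  "stt": {
    "module": "ovos-stt-plugin-server",
    "ovos-stt-plugin-server": {
      "url": "http://192.168.1.50:8080/stt"
    }
  }
}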
Conclusion
The listener service represents your assistant’s ability to hear and understand speech. While complex, its modular design lets you choose components that match your hardware capabilities and privacy requirements.
Next in series: How the Voice Assistant Speaks: Understanding OVOS Audio Services
Previous: The Voice Assistant’s Nervous System: Understanding the OVOS Message Bus