Voice Input
Jarvis supports hands-free voice input using OpenAI Whisper for speech-to-text transcription. Choose between push-to-talk (PTT) and voice activity detection (VAD) modes.
Architecture
Voice input uses a client-server architecture:
Jarvis (Rust)
↓ Audio Capture
↓ Resampling (24kHz → 16kHz)
↓ Unix Socket
Whisper Server (Python)
↓ whisper.cpp
↓ Transcription
↓ JSON Response
Jarvis
↓ Text Input to Terminal
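The resampling step in the pipeline (24 kHz capture down to Whisper's 16 kHz) can be sketched with plain linear interpolation; resample_linear is an illustrative helper, not the resampler Jarvis actually uses:

```python
def resample_linear(samples, src_rate=24000, dst_rate=16000):
    """Resample a sequence of float samples via linear interpolation (sketch only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * ratio          # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

For 24000 → 16000 this keeps two output samples for every three input samples; a production resampler would also low-pass filter before decimating to avoid aliasing.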
- Push-to-Talk: Hold a key to record, release to transcribe (default: Option+Period)
- Voice Activity: Automatic recording when speech is detected
- Whisper Integration: OpenAI Whisper model with technical vocabulary biasing
- Audio Feedback: Configurable beep sounds for recording start/end
Configuration
Voice Config
[voice]
enabled = true
mode = "ptt" # "ptt" or "vad"
input_device = "default"
sample_rate = 24000
whisper_sample_rate = 16000
[voice.ptt]
key = "Option+Period" # PTT trigger key
cooldown = 0.3 # Seconds between recordings
[voice.vad]
silence_threshold = 1.0 # Seconds of silence to stop
energy_threshold = 300 # Audio energy threshold
[voice.sounds]
enabled = true
volume = 0.5
listen_start = true # Beep on recording start
listen_end = true # Beep on recording end
Push-to-Talk (PTT)
How It Works
- User holds PTT key (default: Option+Period)
- Audio capture starts
- Feedback beep plays (if enabled)
- User speaks
- User releases key
- Audio sent to Whisper server
- Transcribed text inserted at cursor
Configuring PTT Key
Default macOS binding:
[voice.ptt]
key = "Option+Period"
Custom binding (any modifier + key):
[voice.ptt]
key = "Ctrl+Space"
Supported modifiers:
- Ctrl, Control
- Alt, Option
- Shift
- Cmd, Command, Super, Win
Cooldown Period
Prevents accidental rapid-fire recordings:
[voice.ptt]
cooldown = 0.3 # 300ms minimum between recordings
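The cooldown is just a timestamp comparison; CooldownGate below is a hypothetical sketch of that logic, not the actual Jarvis implementation:

```python
import time


class CooldownGate:
    """Rejects recordings that start within `cooldown` seconds of the last one."""

    def __init__(self, cooldown=0.3):
        self.cooldown = cooldown
        self._last_end = float("-inf")

    def try_start(self, now=None):
        """Return True if enough time has passed since the previous recording."""
        now = time.monotonic() if now is None else now
        return (now - self._last_end) >= self.cooldown

    def finish(self, now=None):
        """Record when the last recording ended."""
        self._last_end = time.monotonic() if now is None else now
```

A monotonic clock is the right choice here: wall-clock adjustments (NTP, DST) must not shorten or lengthen the cooldown window.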
Voice Activity Detection (VAD)
How It Works
- Continuous audio monitoring
- Speech detected when energy > threshold
- Recording starts automatically
- Stops after silence_threshold seconds of silence
- Transcription sent to terminal
Tuning VAD
Energy Threshold:
Adjust sensitivity to background noise:
[voice.vad]
energy_threshold = 300 # Higher = less sensitive
Recommended values:
- Quiet room: 200-300
- Office: 400-600
- Noisy: 800-1200
Silence Threshold:
How long to wait before stopping:
[voice.vad]
silence_threshold = 1.0 # Seconds
Recommended values:
- Quick commands: 0.5-0.8
- Sentences: 1.0-1.5
- Paragraphs: 2.0-3.0
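Combining the two knobs, an energy-based VAD loop might look like the following sketch (EnergyVAD is illustrative; the frame size and energy estimate are assumptions, not the real implementation):

```python
class EnergyVAD:
    """Start recording when energy exceeds the threshold; stop after
    `silence_threshold` seconds of quiet."""

    def __init__(self, energy_threshold=300, silence_threshold=1.0,
                 frame_seconds=0.05):
        self.energy_threshold = energy_threshold
        self.silence_threshold = silence_threshold
        self.frame_seconds = frame_seconds
        self.recording = False
        self._silent_for = 0.0

    def _energy(self, frame):
        # Mean absolute amplitude as a cheap stand-in for audio energy
        return sum(abs(s) for s in frame) / max(len(frame), 1)

    def feed(self, frame):
        """Process one audio frame; return 'start', 'stop', or None."""
        loud = self._energy(frame) > self.energy_threshold
        if not self.recording:
            if loud:
                self.recording = True
                self._silent_for = 0.0
                return "start"
            return None
        if loud:
            self._silent_for = 0.0   # speech resets the silence timer
            return None
        self._silent_for += self.frame_seconds
        if self._silent_for >= self.silence_threshold:
            self.recording = False
            return "stop"
        return None
```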
Whisper Server
Starting the Server
The Whisper server runs as a background Python process:
cd voice/
python whisper_server.py
Default settings:
- Socket: /tmp/jarvis_whisper.sock
- Model: small (400MB, fast, good accuracy)
- Sample rate: 16kHz
Model Selection
Tiny (75MB): fastest, lowest accuracy
export WHISPER_MODEL=tiny
python whisper_server.py
Speed: ~10x real-time
Accuracy: Good for simple commands
Small (400MB): fast, good accuracy (default)
export WHISPER_MODEL=small
python whisper_server.py
Speed: ~3x real-time
Accuracy: Excellent for technical vocabulary
Medium (1.5GB): slower, better accuracy
export WHISPER_MODEL=medium
python whisper_server.py
Speed: ~1x real-time
Accuracy: Near-perfect transcription
Large (3GB): slowest, best accuracy
export WHISPER_MODEL=large
python whisper_server.py
Speed: ~0.5x real-time
Accuracy: Publication-quality
Technical Vocabulary Biasing
The Whisper server includes a technical prompt to improve recognition of programming terms:
TECH_PROMPT = (
"This is a software engineer dictating code and technical documentation. "
"They frequently discuss: APIs, databases, frontend frameworks, backend services, "
"cloud infrastructure, and AI/ML systems. Use programming terminology and proper "
"capitalization for technical terms."
# ... extensive vocabulary list
)
Recognized terms include:
- Languages: JavaScript, TypeScript, Python, Rust, Go, Java, C++, Swift
- Frameworks: React, Vue, Angular, Next.js, FastAPI, Django
- Databases: PostgreSQL, MongoDB, Redis, SQLite, DynamoDB
- Cloud: AWS, GCP, Azure, Kubernetes, Docker, Terraform
- AI: Claude, GPT, Gemini, Llama, RAG, embeddings
Protocol
JSON over Unix socket:
Request:
{
"type": "transcribe",
"audio_b64": "<base64-encoded float32 array>",
"sample_rate": 16000
}
Response:
{
"type": "result",
"text": "hello world",
"duration_ms": 450
}
Error:
{
"type": "error",
"message": "Model not loaded"
}
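The request/response shapes can be exercised end-to-end with the standard library. The helper names here are illustrative, and the wire framing (newline-delimited vs. length-prefixed JSON) should be confirmed against whisper_server.py:

```python
import base64
import json
import struct


def build_transcribe_request(samples, sample_rate=16000):
    """Encode float32 samples as a base64 JSON request (sketch)."""
    raw = struct.pack(f"<{len(samples)}f", *samples)  # little-endian float32
    return json.dumps({
        "type": "transcribe",
        "audio_b64": base64.b64encode(raw).decode("ascii"),
        "sample_rate": sample_rate,
    })


def decode_request(payload):
    """Inverse of build_transcribe_request, as the server side might do it."""
    msg = json.loads(payload)
    raw = base64.b64decode(msg["audio_b64"])
    samples = list(struct.unpack(f"<{len(raw) // 4}f", raw))
    return samples, msg["sample_rate"]
```

To talk to a running server, a client would connect a socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) to /tmp/jarvis_whisper.sock and send the JSON payload.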
Audio Feedback
Enabling Sounds
[voice.sounds]
enabled = true
volume = 0.5 # 0.0 (mute) to 1.0 (max)
listen_start = true # Beep when recording starts
listen_end = true # Beep when recording ends
Custom Sounds
Replace default beeps by placing audio files in the assets directory:
jarvis-rs/assets/audio/
listen_start.mp3
listen_end.mp3
Supported formats: MP3, WAV, OGG
Rust Integration
WhisperClient
use std::env;
use jarvis_ai::{WhisperClient, WhisperConfig};
let config = WhisperConfig::new(env::var("OPENAI_API_KEY")?);
let client = WhisperClient::new(config);
let audio_data = record_audio().await?;
let transcript = client.transcribe(audio_data, "audio.wav").await?;
println!("Transcription: {}", transcript);
WhisperConfig
pub struct WhisperConfig {
pub api_key: String,
pub model: String, // "whisper-1"
pub language: Option<String>, // e.g., Some("en")
}
With language hint:
let config = WhisperConfig::new(api_key)
.with_language("en");
Supported audio formats:
- MP3 (audio/mpeg)
- M4A (audio/mp4)
- WAV (audio/wav)
- WebM (audio/webm)
- OGG (audio/ogg)
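A client uploading files needs to map extensions to these content types; this lookup table is a small illustrative helper derived from the list above:

```python
# Content types for the supported formats (upload helper sketch)
AUDIO_CONTENT_TYPES = {
    ".mp3": "audio/mpeg",
    ".m4a": "audio/mp4",
    ".wav": "audio/wav",
    ".webm": "audio/webm",
    ".ogg": "audio/ogg",
}


def content_type_for(filename):
    """Return the MIME type for a supported audio file, or None if unsupported."""
    for ext, mime in AUDIO_CONTENT_TYPES.items():
        if filename.lower().endswith(ext):
            return mime
    return None
```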
Local Whisper Server
Installation
pip install pywhispercpp numpy
Running the Server
cd voice/
python whisper_server.py
Output:
INFO: Loading whisper.cpp model 'small'...
INFO: Model loaded in 1.2s
INFO: Listening on /tmp/jarvis_whisper.sock
Server Architecture
class WhisperServer:
    def __init__(self, socket_path, model_name="small"):
        self.socket_path = socket_path
        self._transcriber = WhisperTranscriber(model_name)
        self._lock = threading.Lock()

    def start(self):
        # Unix domain socket server
        # Lazy model loading on first request
        # Thread-safe transcription
        ...
Lazy Loading
Model loads only on first transcription request:
@property
def model(self):
    if self._model is None:
        log.info(f"Loading whisper.cpp model '{self.model_name}'...")
        self._model = Model(self.model_name, print_progress=False)
    return self._model
Artifact Filtering
Removes Whisper hallucination artifacts:
import re

ARTIFACT_RE = re.compile(
    r"\[(?:end|blank_audio|silence|music|applause)\]",
    re.IGNORECASE,
)

text = ARTIFACT_RE.sub("", text)  # Remove [end], [silence], etc.
text = re.sub(r"\s+", " ", text).strip()  # Normalize whitespace
Troubleshooting
No Audio Captured
Check input device:
[voice]
input_device = "default" # or specific device name
List available devices:
python -c "import pyaudio; p = pyaudio.PyAudio(); [print(p.get_device_info_by_index(i)) for i in range(p.get_device_count())]"
Server Not Responding
Check socket:
ls -la /tmp/jarvis_whisper.sock
# Should show: srwxr-xr-x ... jarvis_whisper.sock
Restart server:
rm /tmp/jarvis_whisper.sock
python voice/whisper_server.py
Poor Transcription Quality
Use larger model:
export WHISPER_MODEL=medium
python whisper_server.py
Add custom vocabulary:
Edit TECH_PROMPT in whisper_server.py to include domain-specific terms.
High Latency
Use smaller model:
export WHISPER_MODEL=tiny
Reduce cooldown:
[voice.ptt]
cooldown = 0.1
Source files:
jarvis-rs/crates/jarvis-ai/src/whisper.rs
jarvis-rs/crates/jarvis-config/src/schema/voice.rs
voice/whisper_server.py
voice/whisper_client.py
config.py
The local Whisper server eliminates external API dependencies and provides faster transcription than OpenAI’s cloud service.