Voice Input

Jarvis supports hands-free voice input using OpenAI Whisper for speech-to-text transcription. Choose between push-to-talk (PTT) and voice activity detection (VAD) modes.

Architecture

Voice input uses a client-server architecture:
Jarvis (Rust)
  ↓ Audio Capture
  ↓ Resampling (24kHz → 16kHz)
  ↓ Unix Socket
Whisper Server (Python)
  ↓ whisper.cpp
  ↓ Transcription
  ↓ JSON Response
Jarvis
  ↓ Text Input to Terminal
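The 24 kHz → 16 kHz resampling step in the pipeline above can be sketched with simple linear interpolation. This is an illustrative Python sketch, not the actual Rust resampler; `resample` and its signature are assumptions:

```python
import numpy as np

def resample(samples, src_rate=24000, dst_rate=16000):
    """Linearly interpolate float32 audio from src_rate to dst_rate."""
    samples = np.asarray(samples, dtype=np.float32)
    n_out = int(len(samples) * dst_rate / src_rate)
    src_t = np.arange(len(samples)) / src_rate   # original sample times
    dst_t = np.arange(n_out) / dst_rate          # target sample times
    return np.interp(dst_t, src_t, samples).astype(np.float32)
```

One second of 24 kHz audio (24,000 samples) comes out as 16,000 samples, which is what Whisper expects.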

Key features:
  • Push-to-Talk: hold a key to record, release to transcribe (default: Option+Period).
  • Voice Activity: recording starts automatically when speech is detected.
  • Whisper Integration: OpenAI Whisper model with technical vocabulary biasing.
  • Audio Feedback: configurable beep sounds for recording start/end.

Configuration

Voice Config

[voice]
enabled = true
mode = "ptt"                    # "ptt" or "vad"
input_device = "default"
sample_rate = 24000
whisper_sample_rate = 16000

[voice.ptt]
key = "Option+Period"           # PTT trigger key
cooldown = 0.3                  # Seconds between recordings

[voice.vad]
silence_threshold = 1.0         # Seconds of silence to stop
energy_threshold = 300          # Audio energy threshold

[voice.sounds]
enabled = true
volume = 0.5
listen_start = true             # Beep on recording start
listen_end = true               # Beep on recording end

Push-to-Talk (PTT)

How It Works

  1. User holds PTT key (default: Option+Period)
  2. Audio capture starts
  3. Feedback beep plays (if enabled)
  4. User speaks
  5. User releases key
  6. Audio sent to Whisper server
  7. Transcribed text inserted at cursor

Configuring PTT Key

Default macOS binding:
[voice.ptt]
key = "Option+Period"
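A binding string like "Option+Period" is modifiers joined to a final key with `+`. A minimal parsing sketch (illustrative only; the real parser in jarvis-config may behave differently):

```python
def parse_binding(spec: str):
    """Split "Mod1+Mod2+Key" into (modifiers, key)."""
    *mods, key = spec.split("+")
    return frozenset(mods), key

mods, key = parse_binding("Option+Period")
# mods == frozenset({"Option"}), key == "Period"
```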

Cooldown Period

Prevents accidental rapid-fire recordings:
[voice.ptt]
cooldown = 0.3  # 300ms minimum between recordings
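The cooldown gate amounts to a monotonic-clock check: a new recording is allowed only if enough time has passed since the last one. A sketch of the idea (not the actual Rust implementation):

```python
import time

class CooldownGate:
    """Reject recordings that start within `cooldown` seconds of the last one."""

    def __init__(self, cooldown=0.3):
        self.cooldown = cooldown
        self._last = float("-inf")  # no prior recording yet

    def try_start(self):
        now = time.monotonic()
        if now - self._last < self.cooldown:
            return False      # too soon after the previous recording
        self._last = now
        return True
```

The first press is always allowed; a second press inside the cooldown window is silently dropped.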

Voice Activity Detection (VAD)

How It Works

  1. Continuous audio monitoring
  2. Speech detected when energy > threshold
  3. Recording starts automatically
  4. Stops after silence_threshold seconds of silence
  5. Transcription sent to terminal
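The loop above can be sketched with a per-frame RMS energy check. The frame duration and int16 sample scale here are assumptions; the real thresholding in Jarvis may differ:

```python
import numpy as np

def frame_energy(frame):
    """RMS energy of one audio frame (int16-scale samples)."""
    x = np.asarray(frame, dtype=np.float32)
    return float(np.sqrt(np.mean(x * x)))

def vad_capture(frames, energy_threshold=300, silence_threshold=1.0,
                frame_duration=0.03):
    """Collect frames from the first speech frame until enough trailing silence."""
    captured, recording, silent = [], False, 0.0
    for frame in frames:
        speech = frame_energy(frame) > energy_threshold
        if not recording:
            if speech:
                recording = True        # speech detected: start recording
                captured.append(frame)
            continue
        captured.append(frame)
        silent = 0.0 if speech else silent + frame_duration
        if silent >= silence_threshold:
            break                       # silence_threshold seconds of quiet
    return captured
```

Raising `energy_threshold` makes the detector ignore louder background noise; raising `silence_threshold` tolerates longer pauses mid-sentence.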

Tuning VAD

Energy Threshold: Adjust sensitivity to background noise:
[voice.vad]
energy_threshold = 300  # Higher = less sensitive
Recommended values:
  • Quiet room: 200-300
  • Office: 400-600
  • Noisy: 800-1200
Silence Threshold: How long to wait before stopping:
[voice.vad]
silence_threshold = 1.0  # Seconds
Recommended values:
  • Quick commands: 0.5-0.8
  • Sentences: 1.0-1.5
  • Paragraphs: 2.0-3.0

Whisper Server

Starting the Server

The Whisper server runs as a background Python process:
cd voice/
python whisper_server.py
Default settings:
  • Socket: /tmp/jarvis_whisper.sock
  • Model: small (400MB, fast, good accuracy)
  • Sample rate: 16kHz

Model Selection

Set the WHISPER_MODEL environment variable before starting the server. For example, tiny is the fastest model with the lowest accuracy:
export WHISPER_MODEL=tiny
python whisper_server.py
  • Speed: ~10x real-time
  • Accuracy: good for simple commands

Technical Vocabulary Biasing

The Whisper server includes a technical prompt to improve recognition of programming terms:
TECH_PROMPT = (
    "This is a software engineer dictating code and technical documentation. "
    "They frequently discuss: APIs, databases, frontend frameworks, backend services, "
    "cloud infrastructure, and AI/ML systems. Use programming terminology and proper "
    "capitalization for technical terms."
    # ... extensive vocabulary list
)
Recognized terms include:
  • Languages: JavaScript, TypeScript, Python, Rust, Go, Java, C++, Swift
  • Frameworks: React, Vue, Angular, Next.js, FastAPI, Django
  • Databases: PostgreSQL, MongoDB, Redis, SQLite, DynamoDB
  • Cloud: AWS, GCP, Azure, Kubernetes, Docker, Terraform
  • AI: Claude, GPT, Gemini, Llama, RAG, embeddings

Protocol

Messages are JSON over a Unix domain socket.

Request:
{
  "type": "transcribe",
  "audio_b64": "<base64-encoded float32 array>",
  "sample_rate": 16000
}
Response:
{
  "type": "result",
  "text": "hello world",
  "duration_ms": 450
}
Error:
{
  "type": "error",
  "message": "Model not loaded"
}

Audio Feedback

Enabling Sounds

[voice.sounds]
enabled = true
volume = 0.5                # 0.0 (mute) to 1.0 (max)
listen_start = true         # Beep when recording starts
listen_end = true           # Beep when recording ends

Custom Sounds

Replace default beeps by placing audio files in the assets directory:
jarvis-rs/assets/audio/
  listen_start.mp3
  listen_end.mp3
Supported formats: MP3, WAV, OGG
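If you want to generate replacement beeps rather than record them, a short sine-wave WAV can be written with the Python standard library. The tone parameters below are arbitrary, not the defaults Jarvis ships with:

```python
import math
import struct
import wave

def write_beep(path, freq=880.0, duration=0.12, rate=16000, volume=0.5):
    """Write a mono 16-bit sine-wave beep as a WAV file."""
    n = int(rate * duration)
    frames = b"".join(
        struct.pack("<h", int(volume * 32767 *
                              math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)     # mono
        w.setsampwidth(2)     # 16-bit samples
        w.setframerate(rate)
        w.writeframes(frames)
```

For example, `write_beep("listen_start.wav")` produces a short 880 Hz tone, and WAV is one of the supported formats listed above.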

Rust Integration

WhisperClient

use jarvis_ai::{WhisperClient, WhisperConfig};

let config = WhisperConfig::new(env::var("OPENAI_API_KEY")?);
let client = WhisperClient::new(config);

let audio_data = record_audio().await?;
let transcript = client.transcribe(audio_data, "audio.wav").await?;

println!("Transcription: {}", transcript);

WhisperConfig

pub struct WhisperConfig {
    pub api_key: String,
    pub model: String,           // "whisper-1"
    pub language: Option<String>, // e.g., Some("en")
}
With language hint:
let config = WhisperConfig::new(api_key)
    .with_language("en");

Supported Audio Formats

  • MP3 (audio/mpeg)
  • M4A (audio/mp4)
  • WAV (audio/wav)
  • WebM (audio/webm)
  • OGG (audio/ogg)
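When uploading audio, each format maps to the MIME type listed above. A small lookup sketch (the helper name is illustrative, not part of the Jarvis codebase):

```python
from pathlib import Path

AUDIO_MIME = {
    ".mp3": "audio/mpeg",
    ".m4a": "audio/mp4",
    ".wav": "audio/wav",
    ".webm": "audio/webm",
    ".ogg": "audio/ogg",
}

def audio_content_type(filename):
    """Look up the MIME type for a supported audio file by extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return AUDIO_MIME[suffix]
    except KeyError:
        raise ValueError(f"unsupported audio format: {suffix!r}")
```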

Local Whisper Server

Installation

pip install pywhispercpp numpy

Running the Server

cd voice/
python whisper_server.py
Output:
INFO: Loading whisper.cpp model 'small'...
INFO: Model loaded in 1.2s
INFO: Listening on /tmp/jarvis_whisper.sock

Server Architecture

class WhisperServer:
    def __init__(self, socket_path, model_name="small"):
        self.socket_path = socket_path
        self._transcriber = WhisperTranscriber(model_name)
        self._lock = threading.Lock()
    
    def start(self):
        # Unix domain socket server:
        #   - lazy model loading on first request
        #   - thread-safe transcription via self._lock
        ...

Lazy Loading

Model loads only on first transcription request:
@property
def model(self):
    if self._model is None:
        log.info(f"Loading whisper.cpp model '{self.model_name}'...")
        self._model = Model(self.model_name, print_progress=False)
    return self._model

Artifact Filtering

Removes Whisper hallucination artifacts:
ARTIFACT_RE = re.compile(
    r"\[(?:end|blank_audio|silence|music|applause)\]",
    re.IGNORECASE,
)

text = ARTIFACT_RE.sub("", text)  # Remove [end], [silence], etc.
text = re.sub(r"\s+", " ", text).strip()  # Normalize whitespace

Troubleshooting

No Audio Captured

Check input device:
[voice]
input_device = "default"  # or specific device name
List available devices:
python -c "import pyaudio; p = pyaudio.PyAudio(); [print(p.get_device_info_by_index(i)) for i in range(p.get_device_count())]"

Server Not Responding

Check socket:
ls -la /tmp/jarvis_whisper.sock
# Should show: srwxr-xr-x ... jarvis_whisper.sock
Restart server:
rm /tmp/jarvis_whisper.sock
python voice/whisper_server.py

Poor Transcription Quality

Use larger model:
export WHISPER_MODEL=medium
python whisper_server.py
Add custom vocabulary: Edit TECH_PROMPT in whisper_server.py to include domain-specific terms.

High Latency

Use smaller model:
export WHISPER_MODEL=tiny
Reduce cooldown:
[voice.ptt]
cooldown = 0.1

Source files:
  • jarvis-rs/crates/jarvis-ai/src/whisper.rs
  • jarvis-rs/crates/jarvis-config/src/schema/voice.rs
  • voice/whisper_server.py
  • voice/whisper_client.py
  • config.py
The local Whisper server eliminates external API dependencies and provides faster transcription than OpenAI’s cloud service.