Voice Input

Jarvis supports hands-free voice input using OpenAI Whisper for speech-to-text transcription. Choose between push-to-talk (PTT) and voice activity detection (VAD) modes.

Architecture

Voice input uses a client-server architecture:
Jarvis (Rust)
  ↓ Audio Capture
  ↓ Resampling (24kHz → 16kHz)
  ↓ Unix Socket
Whisper Server (Python)
  ↓ whisper.cpp
  ↓ Transcription
  ↓ JSON Response
Jarvis
  ↓ Text Input to Terminal
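The 24 kHz → 16 kHz resampling step in the pipeline above can be sketched with simple linear interpolation. This is an illustrative Python sketch, not the actual Rust resampler; `resample` and its signature are assumptions:

```python
import numpy as np

def resample(samples, src_rate=24000, dst_rate=16000):
    """Linearly interpolate float32 audio from src_rate to dst_rate."""
    samples = np.asarray(samples, dtype=np.float32)
    n_out = int(len(samples) * dst_rate / src_rate)
    src_t = np.arange(len(samples)) / src_rate   # original sample times
    dst_t = np.arange(n_out) / dst_rate          # target sample times
    return np.interp(dst_t, src_t, samples).astype(np.float32)
```

One second of 24 kHz audio (24,000 samples) comes out as 16,000 samples, which is what Whisper expects.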

Key features:
  • Push-to-Talk: hold a key to record, release to transcribe (default: Option+Period).
  • Voice Activity: recording starts automatically when speech is detected.
  • Whisper Integration: OpenAI Whisper model with technical vocabulary biasing.
  • Audio Feedback: configurable beep sounds for recording start/end.

Configuration

Voice Config

[voice]
enabled = true
mode = "ptt"                    # "ptt" or "vad"
input_device = "default"
sample_rate = 24000
whisper_sample_rate = 16000

[voice.ptt]
key = "Option+Period"           # PTT trigger key
cooldown = 0.3                  # Seconds between recordings

[voice.vad]
silence_threshold = 1.0         # Seconds of silence to stop
energy_threshold = 300          # Audio energy threshold

[voice.sounds]
enabled = true
volume = 0.5
listen_start = true             # Beep on recording start
listen_end = true               # Beep on recording end

Push-to-Talk (PTT)

How It Works

  1. User holds PTT key (default: Option+Period)
  2. Audio capture starts
  3. Feedback beep plays (if enabled)
  4. User speaks
  5. User releases key
  6. Audio sent to Whisper server
  7. Transcribed text inserted at cursor

Configuring PTT Key

Default macOS binding:
[voice.ptt]
key = "Option+Period"
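A binding string like "Option+Period" is modifiers joined to a final key with `+`. A minimal parsing sketch (illustrative only; the real parser in jarvis-config may behave differently):

```python
def parse_binding(spec: str):
    """Split "Mod1+Mod2+Key" into (modifiers, key)."""
    *mods, key = spec.split("+")
    return frozenset(mods), key

mods, key = parse_binding("Option+Period")
# mods == frozenset({"Option"}), key == "Period"
```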

Cooldown Period

Prevents accidental rapid-fire recordings:
[voice.ptt]
cooldown = 0.3  # 300ms minimum between recordings
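The cooldown gate amounts to a monotonic-clock check: a new recording is allowed only if enough time has passed since the last one. A sketch of the idea (not the actual Rust implementation):

```python
import time

class CooldownGate:
    """Reject recordings that start within `cooldown` seconds of the last one."""

    def __init__(self, cooldown=0.3):
        self.cooldown = cooldown
        self._last = float("-inf")  # no prior recording yet

    def try_start(self):
        now = time.monotonic()
        if now - self._last < self.cooldown:
            return False      # too soon after the previous recording
        self._last = now
        return True
```

The first press is always allowed; a second press inside the cooldown window is silently dropped.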

Voice Activity Detection (VAD)

How It Works

  1. Continuous audio monitoring
  2. Speech detected when energy > threshold
  3. Recording starts automatically
  4. Stops after silence_threshold seconds of silence
  5. Transcription sent to terminal
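The loop above can be sketched with a per-frame RMS energy check. The frame duration and int16 sample scale here are assumptions; the real thresholding in Jarvis may differ:

```python
import numpy as np

def frame_energy(frame):
    """RMS energy of one audio frame (int16-scale samples)."""
    x = np.asarray(frame, dtype=np.float32)
    return float(np.sqrt(np.mean(x * x)))

def vad_capture(frames, energy_threshold=300, silence_threshold=1.0,
                frame_duration=0.03):
    """Collect frames from the first speech frame until enough trailing silence."""
    captured, recording, silent = [], False, 0.0
    for frame in frames:
        speech = frame_energy(frame) > energy_threshold
        if not recording:
            if speech:
                recording = True        # speech detected: start recording
                captured.append(frame)
            continue
        captured.append(frame)
        silent = 0.0 if speech else silent + frame_duration
        if silent >= silence_threshold:
            break                       # silence_threshold seconds of quiet
    return captured
```

Raising `energy_threshold` makes the detector ignore louder background noise; raising `silence_threshold` tolerates longer pauses mid-sentence.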

Tuning VAD

Energy Threshold: Adjust sensitivity to background noise:
[voice.vad]
energy_threshold = 300  # Higher = less sensitive
Recommended values:
  • Quiet room: 200-300
  • Office: 400-600
  • Noisy: 800-1200
Silence Threshold: How long to wait before stopping:
[voice.vad]
silence_threshold = 1.0  # Seconds
Recommended values:
  • Quick commands: 0.5-0.8
  • Sentences: 1.0-1.5
  • Paragraphs: 2.0-3.0

Whisper Server

Starting the Server

The Whisper server runs as a background Python process:
cd voice/
python whisper_server.py
Default settings:
  • Socket: /tmp/jarvis_whisper.sock
  • Model: small (400MB, fast, good accuracy)
  • Sample rate: 16kHz

Model Selection

Set the WHISPER_MODEL environment variable before starting the server. For example, tiny is the fastest model with the lowest accuracy:
export WHISPER_MODEL=tiny
python whisper_server.py
  • Speed: ~10x real-time
  • Accuracy: good for simple commands

Technical Vocabulary Biasing

The Whisper server includes a technical prompt to improve recognition of programming terms:
TECH_PROMPT = (
    "This is a software engineer dictating code and technical documentation. "
    "They frequently discuss: APIs, databases, frontend frameworks, backend services, "
    "cloud infrastructure, and AI/ML systems. Use programming terminology and proper "
    "capitalization for technical terms."
    # ... extensive vocabulary list
)
Recognized terms include:
  • Languages: JavaScript, TypeScript, Python, Rust, Go, Java, C++, Swift
  • Frameworks: React, Vue, Angular, Next.js, FastAPI, Django
  • Databases: PostgreSQL, MongoDB, Redis, SQLite, DynamoDB
  • Cloud: AWS, GCP, Azure, Kubernetes, Docker, Terraform
  • AI: Claude, GPT, Gemini, Llama, RAG, embeddings

Protocol

Messages are JSON over a Unix domain socket.

Request:
{
  "type": "transcribe",
  "audio_b64": "<base64-encoded float32 array>",
  "sample_rate": 16000
}
Response:
{
  "type": "result",
  "text": "hello world",
  "duration_ms": 450
}
Error:
{
  "type": "error",
  "message": "Model not loaded"
}

Audio Feedback

Enabling Sounds

[voice.sounds]
enabled = true
volume = 0.5                # 0.0 (mute) to 1.0 (max)
listen_start = true         # Beep when recording starts
listen_end = true           # Beep when recording ends

Custom Sounds

Replace default beeps by placing audio files in the assets directory:
jarvis-rs/assets/audio/
  listen_start.mp3
  listen_end.mp3
Supported formats: MP3, WAV, OGG
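If you want to generate replacement beeps rather than record them, a short sine-wave WAV can be written with the Python standard library. The tone parameters below are arbitrary, not the defaults Jarvis ships with:

```python
import math
import struct
import wave

def write_beep(path, freq=880.0, duration=0.12, rate=16000, volume=0.5):
    """Write a mono 16-bit sine-wave beep as a WAV file."""
    n = int(rate * duration)
    frames = b"".join(
        struct.pack("<h", int(volume * 32767 *
                              math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n)
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)     # mono
        w.setsampwidth(2)     # 16-bit samples
        w.setframerate(rate)
        w.writeframes(frames)
```

For example, `write_beep("listen_start.wav")` produces a short 880 Hz tone, and WAV is one of the supported formats listed above.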

Rust Integration

WhisperClient

use jarvis_ai::{WhisperClient, WhisperConfig};

let config = WhisperConfig::new(env::var("OPENAI_API_KEY")?);
let client = WhisperClient::new(config);

let audio_data = record_audio().await?;
let transcript = client.transcribe(audio_data, "audio.wav").await?;

println!("Transcription: {}", transcript);

WhisperConfig

pub struct WhisperConfig {
    pub api_key: String,
    pub model: String,           // "whisper-1"
    pub language: Option<String>, // e.g., Some("en")
}
With language hint:
let config = WhisperConfig::new(api_key)
    .with_language("en");

Supported Audio Formats

  • MP3 (audio/mpeg)
  • M4A (audio/mp4)
  • WAV (audio/wav)
  • WebM (audio/webm)
  • OGG (audio/ogg)
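When uploading audio, each format maps to the MIME type listed above. A small lookup sketch (the helper name is illustrative, not part of the Jarvis codebase):

```python
from pathlib import Path

AUDIO_MIME = {
    ".mp3": "audio/mpeg",
    ".m4a": "audio/mp4",
    ".wav": "audio/wav",
    ".webm": "audio/webm",
    ".ogg": "audio/ogg",
}

def audio_content_type(filename):
    """Look up the MIME type for a supported audio file by extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return AUDIO_MIME[suffix]
    except KeyError:
        raise ValueError(f"unsupported audio format: {suffix!r}")
```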

Local Whisper Server

Installation

pip install pywhispercpp numpy

Running the Server

cd voice/
python whisper_server.py
Output:
INFO: Loading whisper.cpp model 'small'...
INFO: Model loaded in 1.2s
INFO: Listening on /tmp/jarvis_whisper.sock

Server Architecture

class WhisperServer:
    def __init__(self, socket_path, model_name="small"):
        self.socket_path = socket_path
        self._transcriber = WhisperTranscriber(model_name)
        self._lock = threading.Lock()
    
    def start(self):
        # Unix domain socket server:
        #   - lazy model loading on first request
        #   - thread-safe transcription via self._lock
        ...

Lazy Loading

Model loads only on first transcription request:
@property
def model(self):
    if self._model is None:
        log.info(f"Loading whisper.cpp model '{self.model_name}'...")
        self._model = Model(self.model_name, print_progress=False)
    return self._model

Artifact Filtering

Removes Whisper hallucination artifacts:
ARTIFACT_RE = re.compile(
    r"\[(?:end|blank_audio|silence|music|applause)\]",
    re.IGNORECASE,
)

text = ARTIFACT_RE.sub("", text)  # Remove [end], [silence], etc.
text = re.sub(r"\s+", " ", text).strip()  # Normalize whitespace

Troubleshooting

No Audio Captured

Check input device:
[voice]
input_device = "default"  # or specific device name
List available devices:
python -c "import pyaudio; p = pyaudio.PyAudio(); [print(p.get_device_info_by_index(i)) for i in range(p.get_device_count())]"

Server Not Responding

Check socket:
ls -la /tmp/jarvis_whisper.sock
# Should show: srwxr-xr-x ... jarvis_whisper.sock
Restart server:
rm /tmp/jarvis_whisper.sock
python voice/whisper_server.py

Poor Transcription Quality

Use larger model:
export WHISPER_MODEL=medium
python whisper_server.py
Add custom vocabulary: Edit TECH_PROMPT in whisper_server.py to include domain-specific terms.

High Latency

Use smaller model:
export WHISPER_MODEL=tiny
Reduce cooldown:
[voice.ptt]
cooldown = 0.1

Source files:
  • jarvis-rs/crates/jarvis-ai/src/whisper.rs
  • jarvis-rs/crates/jarvis-config/src/schema/voice.rs
  • voice/whisper_server.py
  • voice/whisper_client.py
  • config.py
The local Whisper server eliminates external API dependencies and provides faster transcription than OpenAI’s cloud service.