Speech to Text — Whisper (local)

Runs OpenAI's Whisper model entirely in your browser. The model downloads once, and your audio never leaves the device.

Truly local. Audio is processed on this page using transformers.js running ONNX models in WebAssembly + WebGPU. No audio is sent to any server. The model itself is downloaded once from Hugging Face's CDN, then cached by your browser.

1. Pick a model and load it

2. Listen

Transcript

The transcript will appear here. Each chunk arrives 1–4 s after the speaker pauses.

How this works

Audio is captured at 16 kHz mono (Whisper's native rate).
A simple voice-activity detector chunks audio at natural pauses (700 ms of silence ends a segment).
Each segment is fed to Whisper running locally. Results append to the transcript.
Latency, measured from the start of an utterance, = segment length + Whisper inference time. On a recent phone, expect inference at 2–5× realtime speed; a 5 s utterance shows up 1–4 s after you stop talking.
The first model load is slow (75–500 MB download). Subsequent loads use the browser cache.
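
The chunking and latency steps above can be sketched in TypeScript. This is an illustration under stated assumptions, not the page's actual code: the 16 kHz rate and the 700 ms silence window come from the list above, while the 20 ms frame length, the RMS energy threshold, and every function name (segmentSpeech, estimateDelaySeconds) are hypothetical. The detector here is a plain energy threshold, which may differ from the detector the page uses.

```typescript
// Illustrative sketch only. Names and thresholds are assumptions unless noted.
const SAMPLE_RATE = 16_000;   // 16 kHz mono, as stated in the list above
const SILENCE_MS = 700;       // silence that ends a segment, as stated above
const FRAME_MS = 20;          // assumed analysis frame length
const RMS_THRESHOLD = 0.01;   // assumed energy level that counts as voice

interface Segment {
  start: number; // first sample of the segment
  end: number;   // one past the last voiced sample
}

// Root-mean-square energy of one frame.
function rms(frame: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

// Split a recording into speech segments, closing a segment once
// SILENCE_MS of consecutive quiet frames has been seen.
function segmentSpeech(samples: Float32Array): Segment[] {
  const frameLen = (SAMPLE_RATE * FRAME_MS) / 1000;       // 320 samples
  const silenceFrames = Math.ceil(SILENCE_MS / FRAME_MS); // 35 frames
  const segments: Segment[] = [];
  let start = -1;      // start of the open segment, or -1 if none is open
  let lastVoiced = -1; // end of the most recent voiced frame
  let quiet = 0;       // consecutive quiet frames inside an open segment

  for (let off = 0; off + frameLen <= samples.length; off += frameLen) {
    const voiced = rms(samples.subarray(off, off + frameLen)) >= RMS_THRESHOLD;
    if (voiced) {
      if (start < 0) start = off;
      lastVoiced = off + frameLen;
      quiet = 0;
    } else if (start >= 0) {
      quiet++;
      if (quiet >= silenceFrames) {
        segments.push({ start, end: lastVoiced });
        start = -1;
        quiet = 0;
      }
    }
  }
  // Flush speech that runs to the end of the buffer.
  if (start >= 0) segments.push({ start, end: lastVoiced });
  return segments;
}

// Rough delay after the speaker stops: the silence window plus inference
// time, where speedup is the realtime factor (2 means twice as fast).
function estimateDelaySeconds(utteranceSeconds: number, speedup: number): number {
  return SILENCE_MS / 1000 + utteranceSeconds / speedup;
}
```

Under these assumptions, a 5 s utterance at a 2× realtime factor gives 0.7 + 2.5, or about 3.2 s of delay, and at 5× about 1.7 s, both inside the 1–4 s range quoted above. A pause shorter than 700 ms stays inside a single segment, so brief hesitations do not split a sentence.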