Speech to Text — Vosk (local)

Streaming offline speech recognition via the Kaldi-based Vosk engine, compiled to WebAssembly.

Truly local. Audio is processed by vosk-browser running entirely in WASM in this tab. No audio is sent to any server. Models download once and are cached.

1. Pick a language and load the model

not loaded

2. Listen

idle

mic level

—

Transcript

Transcript will appear here. Partial results appear in yellow as you speak; final results lock in at pauses.
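The partial/final behaviour described above can be sketched as a small state update. This is a minimal illustration, not this page's actual code: the `RecognizerMessage` shape and the names `applyMessage` and `countWords` are hypothetical, and a real page would adapt whatever events the recognizer emits into this shape. The sketch assumes each partial result replaces the previous one, and each final result is appended and clears the partial.

```typescript
// Hypothetical message shape: one kind for in-progress text, one for locked-in text.
type RecognizerMessage =
  | { kind: "partial"; text: string }
  | { kind: "final"; text: string };

interface TranscriptState {
  finalized: string[]; // segments locked in at pauses
  partial: string;     // in-progress text, overwritten on every partial update
}

const emptyTranscript: TranscriptState = { finalized: [], partial: "" };

// Returns a new state; the input state is never mutated.
function applyMessage(state: TranscriptState, msg: RecognizerMessage): TranscriptState {
  const text = msg.text.trim();
  if (msg.kind === "partial") {
    // A partial result replaces the previous partial instead of appending to it.
    return { finalized: state.finalized, partial: text };
  }
  // A final result is appended (if non-empty) and the partial is cleared.
  return {
    finalized: text === "" ? state.finalized : [...state.finalized, text],
    partial: "",
  };
}

// Counts words across finalized segments plus the current partial (an assumption;
// a page could equally count finalized text only).
function countWords(state: TranscriptState): number {
  const all = [...state.finalized, state.partial].join(" ").trim();
  return all === "" ? 0 : all.split(/\s+/).length;
}
```

A display layer could render `finalized` in the normal text colour and `partial` in yellow, re-rendering after each call to `applyMessage`.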

0 words

Diagnostic log

How this differs from the Whisper version

Streaming. Vosk processes audio as it arrives and emits partial results word-by-word. You see text appear while you're still speaking.
Smaller models. The default small English model is about 40 MB, versus 75–500 MB for Whisper models.
Faster on weak hardware. Vosk runs in real time on phones; Whisper-tiny barely keeps up.
Lower accuracy. Whisper handles accents, background noise, and unfamiliar terms better.
No punctuation. Vosk returns plain lowercase text, with no punctuation or capitalization.
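Because the output is lowercase and unpunctuated, a display layer may want light cosmetic formatting. The sketch below is one naive possibility and is not something Vosk or this page is known to do: the names `formatSegment` and `formatTranscript` are hypothetical, and the approach assumes each finalized segment can be treated as one sentence. It cannot tell questions from statements or recognize proper nouns.

```typescript
// Capitalizes the first letter of a segment and appends a period.
// Purely cosmetic: it does not detect questions, names, or sentence boundaries.
function formatSegment(raw: string): string {
  const text = raw.trim();
  if (text === "") return "";
  return text.charAt(0).toUpperCase() + text.slice(1) + ".";
}

// Formats each finalized segment, drops empty ones, and joins them with spaces.
function formatTranscript(segments: string[]): string {
  return segments
    .map(formatSegment)
    .filter((segment) => segment !== "")
    .join(" ");
}
```

For example, `formatTranscript(["hello world", "how are you"])` yields `Hello world. How are you.`, which shows the limitation: the second segment is a question but still ends with a period.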