The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks
I tested Whisper via ONNX Runtime Web and sherpa-onnx in the browser. Model size tradeoffs, SharedArrayBuffer isolation, and why browser inference works for offline use but not live meeting transcription.
Series: VORA B.LOG
- 1. Why I shipped VORA before writing a single line of backend code
- 2. From Python Server to Pure Browser: The Architecture Pivot That Changed Everything
- 3. The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks ← you are here
- 4. Why We Killed Speaker Identification (And What We Learned from Two Weeks of Failure)
- 5. Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls)
- 6. Building the Priority Queue: How We Stopped Gemini API Chaos, and Why the First Two Designs Both Failed
- 7. Groq Dual-AI Integration: Why I Added a Second AI and What It Actually Fixed
- 8. The Meeting Summary Timer Bug: Why setInterval Isn't Enough for Reliable Scheduling
- 9. Building a Real Meeting Export: From Raw Transcript to a Usable Report
- 10. The Dark Theme Redesign: Building a UI That Looks Like a Professional Tool (After It Looked Like a Hobbyist Project)
- 11. The Branding Journey: From a Functional Name to VORA
- 12. How We Made VORA Bilingual Without a Heavy Localization Stack
- 13. Deploying to Cloudflare Pages: Static Hosting, CORS Headers, and the Sitemap/Robots Incident
- 14. How I Fixed AI Over-correction
- 15. The VORA Overhaul: Dropping Real-Time Q&A, Building Human-in-the-Loop Memos, and a Three-Column Layout
This is part 3 of VORA's architecture series. Part 1: Why I shipped VORA without backend code (the product philosophy). Part 2: From Python Server to Pure Browser (the migration). This post: the Whisper WASM experiment I ran after the migration.
Running speech AI entirely in the browser sounds perfect. No server. Total privacy. Works offline. It's the kind of idea that sounds so good in your head that you skip straight to implementation without questioning it.
I questioned it eventually. After two experiments, several broken deploys, and a phone that got so hot I could've used it as a space heater.
Why I Tried This
After ditching the Python server, VORA ran entirely in the browser. The natural next thought: what if transcription ran there too?
The tech stack looked ready:
- WebAssembly for heavy compute
- ONNX Runtime Web for model inference
- WebGPU support slowly arriving
So I tried it. Spoiler: "technically possible" and "product-ready" aren't even in the same zip code.
Experiment 1: ONNX Runtime Web + Whisper
It worked. Functionally. On my desktop. With a good microphone. In a quiet room.
Then reality showed up.
The Model Size Problem
Large Whisper models produce great transcripts but take forever to download on first load. Smaller models load fast but fumble on real meeting audio. It's like choosing between a restaurant that's amazing but has a 40-minute wait, and one that seats you immediately but serves food that's... fine.
- Accuracy model: too large to load without users closing the tab
- Lightweight model: not accurate enough for anything beyond a demo
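To make the "40-minute wait" concrete, here's a back-of-envelope first-load estimate. The model sizes and link speeds below are illustrative assumptions, not measurements from VORA:

```javascript
// Rough first-load cost for a browser-hosted Whisper model.
// Sizes and bandwidths are illustrative, not benchmarks.
function downloadSeconds(modelMB, mbitPerSec) {
  // Megabytes -> megabits, divided by link speed.
  return (modelMB * 8) / mbitPerSec;
}

const models = [
  { name: "tiny (quantized, ~40MB)", mb: 40 },
  { name: "base (~140MB)", mb: 140 },
  { name: "small (~460MB)", mb: 460 },
];

for (const m of models) {
  const wifi = downloadSeconds(m.mb, 100).toFixed(1); // ~100 Mbit/s wifi
  const lte = downloadSeconds(m.mb, 10).toFixed(1);   // ~10 Mbit/s mobile
  console.log(`${m.name}: ${wifi}s on wifi, ${lte}s on mobile`);
}
```

Even the "base" model costs on the order of two minutes on a mediocre mobile connection before a single word is transcribed, which is exactly the tab-closing territory described above.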
The SharedArrayBuffer Nightmare
To get decent performance with threading, I needed SharedArrayBuffer. Which requires these headers:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Sounds simple. Enabling them broke everything. Third-party assets stopped loading. My static hosting setup needed a complete overhaul. I eventually got it working, but the deployment went from "upload files" to "configure security headers, pray, check three browsers, pray again."
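Once those headers are in place, the browser exposes the isolation state as `self.crossOriginIsolated`, and it's worth checking it at runtime instead of assuming threading will work. A minimal sketch of that decision, written as a pure function (the real page would pass `self.crossOriginIsolated` and `navigator.hardwareConcurrency`; here they're plain parameters so the logic is testable anywhere):

```javascript
// Decide how many WASM threads to request, given the page's
// cross-origin isolation state. Without COOP/COEP headers,
// SharedArrayBuffer is unavailable and we must fall back to one thread.
function wasmThreadCount(crossOriginIsolated, hardwareConcurrency) {
  if (!crossOriginIsolated) return 1;
  // Leave a couple of cores for the UI thread and the browser itself.
  return Math.max(1, (hardwareConcurrency || 4) - 2);
}
```

With ONNX Runtime Web, the result would typically be assigned to `ort.env.wasm.numThreads` before creating a session; the exact wiring depends on the runtime version you're using.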
Experiment 2: Streaming with sherpa-onnx
I also tried sherpa-onnx for streaming transcription. The latency characteristics were genuinely promising: it felt closer to real-time than the batch approach.
But the same walls kept appearing:
- Still a big bundle to download
- Still needed cross-origin isolation
- Still struggled with Korean technical meeting audio
Different library, same constraints.
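For context on what "streaming" means mechanically: a streaming recognizer consumes small fixed-size frames of PCM audio rather than whole clips. A minimal sketch of the framing step (the frame size and the recognizer interface it feeds are assumptions, not sherpa-onnx's actual API):

```javascript
// Split a mono PCM buffer into fixed-length frames for a streaming
// recognizer. frameSize is in samples (e.g. 0.1s at 16 kHz = 1600).
// Returns complete frames plus the leftover tail, which the caller
// carries into the next audio callback.
function frameAudio(samples, frameSize) {
  const frames = [];
  let offset = 0;
  while (offset + frameSize <= samples.length) {
    frames.push(samples.subarray(offset, offset + frameSize));
    offset += frameSize;
  }
  return { frames, remainder: samples.subarray(offset) };
}
```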
The Hardware Reality Check
On desktop? Acceptable for certain model configs.
On mobile? My phone started lagging, heating up, and generally protesting. The base Whisper model clocks in around 140MB compressed, expanding to 400MB+ in memory during inference. On a typical phone with 4GB RAM and a Snapdragon 888, that leaves maybe 2GB for everything else: OS, browser tabs, your actual app. Inference on a 30-second audio clip stretched to 8-12 seconds. Live meeting transcription needs immediate feedback. A transcript that arrives five seconds late isn't "slightly delayed"; it's useless for real-time note-taking.
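The "useless for real-time" claim follows directly from the arithmetic of a chunked batch pipeline. A word spoken at the start of a chunk has to wait for the rest of the chunk to be recorded, then for inference on that chunk:

```javascript
// Worst-case delay between a word being spoken and its transcript
// appearing, for a chunked batch pipeline: record the full chunk,
// then run inference on it. Numbers here match the mobile figures
// above (30s chunks, ~10s inference); they're estimates, not benchmarks.
function worstCaseDelaySec(chunkSec, inferenceSec) {
  return chunkSec + inferenceSec;
}

console.log(worstCaseDelaySec(30, 10)); // ~40s from speech to screen
```

Shrinking the chunk helps the first term but inflates the second in relative terms, since each inference call carries fixed overhead, so mobile batch Whisper never got anywhere near note-taking latency.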
Korean Technical Speech: Extra Hard Mode
Korean meeting audio with mixed English technical terms is basically the boss fight of speech recognition:
- Acronyms (API, SDK, PCR)
- Code and infrastructure terms
- Rapid topic switching
- People talking over each other
Generic ASR models aren't built for this. Domain-aware correction matters as much as raw model quality β sometimes more.
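The cheapest form of domain-aware correction is a glossary pass over the raw transcript, mapping common phonetic misreadings of English terms back to canonical form. This sketch is illustrative only, not VORA's actual correction layer (that's the reranking story in part 5 of this series):

```javascript
// Minimal glossary-based post-correction: replace known ASR
// misreadings of domain terms. Entries map the Korean phonetic
// spelling an ASR model tends to emit back to the canonical acronym.
// The table here is a toy example, not VORA's real glossary.
const GLOSSARY = new Map([
  ["에이피아이", "API"], // phonetic Korean rendering of "API"
  ["에스디케이", "SDK"],
  ["피시알", "PCR"],
]);

function correctTerms(transcript) {
  let out = transcript;
  for (const [wrong, right] of GLOSSARY) {
    out = out.split(wrong).join(right);
  }
  return out;
}
```

A static table like this obviously can't handle context-dependent errors, which is why reranking over N-best hypotheses ends up mattering more than any single substitution list.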
Where Browser Whisper Actually Makes Sense
I didn't throw the whole experiment away. I narrowed the scope.
It's genuinely good for:
- Offline transcription of pre-recorded audio
- Privacy-sensitive local processing (no audio leaves the device)
- Non-real-time workflows where users don't mind waiting
It's not good for:
- Low-latency, always-on live meeting transcription on whatever random hardware your users have
The Product Call: Labs, Not Core
I kept Whisper-in-browser as an experimental feature in VORA's Labs section. The main app's real-time transcription stays on the lower-latency path that actually works across devices.
Early feedback from users who tried the Labs feature was split: power users who regularly work offline genuinely appreciated the ability to transcribe without a server call, even if they waited an extra few seconds. But most users didn't enable it. The cognitive load of "wait, which transcriber am I using right now?" outweighed the privacy win. People want transcription to just work instantly. They don't want to think about infrastructure choices while they're trying to take meeting notes.
I'm still exploring a hybrid approach: live transcript first (fast, good enough), then a higher-accuracy pass afterward (slower, better). Best of both worlds -- in theory. I'll report back when it's more than theory.
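The hybrid idea reduces to a small control flow: emit the fast draft immediately, then silently swap in the accurate result when it lands. A sketch under stated assumptions; the two pass functions and the update callback are injected stand-ins, not VORA code:

```javascript
// Two-pass hybrid transcription sketch. `fastPass` and `accuratePass`
// are async transcriber functions (hypothetical, injected so the flow
// is testable with stubs); `onUpdate` is how the UI hears about drafts
// and final text.
async function hybridTranscribe(audio, fastPass, accuratePass, onUpdate) {
  const draft = await fastPass(audio);
  onUpdate({ text: draft, final: false });   // show something right away
  const refined = await accuratePass(audio);
  onUpdate({ text: refined, final: true });  // upgrade in place later
  return refined;
}
```

The open product question isn't the control flow, it's whether users tolerate the transcript visibly rewriting itself after the fact.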
What I Learned
"Can it run in the browser?" is the wrong question. The right question is: "Can it run in the browser, on your user's actual hardware, fast enough that they don't notice it's running?"
If your product needs immediate output, optimize for latency. If your product needs accuracy on recorded content and privacy matters, browser inference is a real option. Just don't confuse a working demo on your M2 MacBook with a shipping product.
2026.02.05
Written by
Jay
Licensed Pharmacist Β· Senior Researcher
Building production-grade AI tools across medicine, finance, and productivity β without a CS degree. Domain expertise first, code second.