The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks
I tested Whisper via ONNX Runtime Web and sherpa-onnx in the browser. Model size tradeoffs, SharedArrayBuffer isolation, and why browser inference works for offline use but not live meeting transcription.
Series: VORA B.LOG
- 1. Why I shipped VORA before writing a single line of backend code
- 2. From Python Server to Pure Browser: The Architecture Pivot That Changed Everything
- 3. The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks ← you are here
- 4. Why We Killed Speaker Identification (And What We Learned from Two Weeks of Failure)
- 5. Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls)
- 6. Building the Priority Queue: How We Stopped Gemini API Chaos, and Why the First Two Designs Both Failed
- 7. Groq Dual-AI Integration: Why I Added a Second AI and What It Actually Fixed
- 8. The Meeting Summary Timer Bug: Why setInterval Isn't Enough for Reliable Scheduling
- 9. Building a Real Meeting Export: From Raw Transcript to a Usable Report
- 10. The Dark Theme Redesign: Building a UI That Looks Like a Professional Tool (After It Looked Like a Hobbyist Project)
- 11. The Branding Journey: From a Functional Name to VORA
- 12. How We Made VORA Bilingual Without a Heavy Localization Stack
- 13. Deploying to Cloudflare Pages: Static Hosting, CORS Headers, and the Sitemap/Robots Incident
- 14. How I Fixed AI Over-correction
- 15. The VORA Overhaul: Dropping Real-Time Q&A, Building Human-in-the-Loop Memos, and a Three-Column Layout
This is part 3 of VORA's architecture series. Part 1: Why I shipped VORA without backend code (the product philosophy). Part 2: From Python Server to Pure Browser (the migration). This post: the Whisper WASM experiment I ran after the migration.
Running speech AI entirely in the browser sounds perfect. No server. Total privacy. Works offline. It's the kind of idea that sounds so good in your head that you skip straight to implementation without questioning it.
I questioned it eventually. After two experiments, several broken deploys, and a phone that got so hot I could've used it as a space heater.
Why I Tried This
After ditching the Python server, VORA ran entirely in the browser. The natural next thought: what if transcription ran there too?
The tech stack looked ready:
- WebAssembly for heavy compute
- ONNX Runtime Web for model inference
- WebGPU support slowly arriving
So I tried it. Spoiler: "technically possible" and "product-ready" aren't even in the same zip code.
Experiment 1: ONNX Runtime Web + Whisper
It worked. Functionally. On my desktop. With a good microphone. In a quiet room.
Then reality showed up.
The Model Size Problem
Large Whisper models produce great transcripts but take forever to download on first load. Smaller models load fast but fumble on real meeting audio. It's like choosing between a restaurant that's amazing but has a 40-minute wait, and one that seats you immediately but serves food that's... fine.
- Accuracy model: too large to load without users closing the tab
- Lightweight model: not accurate enough for anything beyond a demo
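To make the "40-minute wait" concrete, here's a back-of-envelope first-load estimate. The model sizes and link speeds below are illustrative assumptions, not measurements from VORA:

```javascript
// Rough first-load cost for a browser-hosted Whisper model.
// Sizes and bandwidths are illustrative, not benchmarks.
function downloadSeconds(modelMB, mbitPerSec) {
  // Megabytes -> megabits, divided by link speed.
  return (modelMB * 8) / mbitPerSec;
}

const models = [
  { name: "tiny (quantized, ~40MB)", mb: 40 },
  { name: "base (~140MB)", mb: 140 },
  { name: "small (~460MB)", mb: 460 },
];

for (const m of models) {
  const wifi = downloadSeconds(m.mb, 100).toFixed(1); // ~100 Mbit/s wifi
  const lte = downloadSeconds(m.mb, 10).toFixed(1);   // ~10 Mbit/s mobile
  console.log(`${m.name}: ${wifi}s on wifi, ${lte}s on mobile`);
}
```

Even the "base" model costs on the order of two minutes on a mediocre mobile connection before a single word is transcribed, which is exactly the tab-closing territory described above.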
The SharedArrayBuffer Nightmare
To get decent performance with threading, I needed SharedArrayBuffer. Which requires these headers:
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Sounds simple. Enabling them broke everything. Third-party assets stopped loading. My static hosting setup needed a complete overhaul. I eventually got it working, but the deployment went from "upload files" to "configure security headers, pray, check three browsers, pray again."
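Once those headers are in place, the browser exposes the isolation state as `self.crossOriginIsolated`, and it's worth checking it at runtime instead of assuming threading will work. A minimal sketch of that decision, written as a pure function (the real page would pass `self.crossOriginIsolated` and `navigator.hardwareConcurrency`; here they're plain parameters so the logic is testable anywhere):

```javascript
// Decide how many WASM threads to request, given the page's
// cross-origin isolation state. Without COOP/COEP headers,
// SharedArrayBuffer is unavailable and we must fall back to one thread.
function wasmThreadCount(crossOriginIsolated, hardwareConcurrency) {
  if (!crossOriginIsolated) return 1;
  // Leave a couple of cores for the UI thread and the browser itself.
  return Math.max(1, (hardwareConcurrency || 4) - 2);
}
```

With ONNX Runtime Web, the result would typically be assigned to `ort.env.wasm.numThreads` before creating a session; the exact wiring depends on the runtime version you're using.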
Experiment 2: Streaming with sherpa-onnx
I also tried sherpa-onnx for streaming transcription. The latency characteristics were genuinely promising: it felt closer to real-time than the batch approach.
But the same walls kept appearing:
- Still a big bundle to download
- Still needed cross-origin isolation
- Still struggled with Korean technical meeting audio
Different library, same constraints.
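For context on what "streaming" means mechanically: a streaming recognizer consumes small fixed-size frames of PCM audio rather than whole clips. A minimal sketch of the framing step (the frame size and the recognizer interface it feeds are assumptions, not sherpa-onnx's actual API):

```javascript
// Split a mono PCM buffer into fixed-length frames for a streaming
// recognizer. frameSize is in samples (e.g. 0.1s at 16 kHz = 1600).
// Returns complete frames plus the leftover tail, which the caller
// carries into the next audio callback.
function frameAudio(samples, frameSize) {
  const frames = [];
  let offset = 0;
  while (offset + frameSize <= samples.length) {
    frames.push(samples.subarray(offset, offset + frameSize));
    offset += frameSize;
  }
  return { frames, remainder: samples.subarray(offset) };
}
```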
The Hardware Reality Check
On desktop? Acceptable for certain model configs.
On mobile? My phone started lagging, heating up, and generally protesting. The base Whisper model clocks in around 140MB compressed, expanding to 400MB+ in memory during inference. On a typical phone with 4GB RAM and a Snapdragon 888, that leaves maybe 2GB for everything else: OS, browser tabs, your actual app. Inference on a 30-second audio clip stretched to 8-12 seconds. Live meeting transcription needs immediate feedback. A transcript that arrives five seconds late isn't "slightly delayed"; it's useless for real-time note-taking.
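The "useless for real-time" claim follows directly from the arithmetic of a chunked batch pipeline. A word spoken at the start of a chunk has to wait for the rest of the chunk to be recorded, then for inference on that chunk:

```javascript
// Worst-case delay between a word being spoken and its transcript
// appearing, for a chunked batch pipeline: record the full chunk,
// then run inference on it. Numbers here match the mobile figures
// above (30s chunks, ~10s inference); they're estimates, not benchmarks.
function worstCaseDelaySec(chunkSec, inferenceSec) {
  return chunkSec + inferenceSec;
}

console.log(worstCaseDelaySec(30, 10)); // ~40s from speech to screen
```

Shrinking the chunk helps the first term but inflates the second in relative terms, since each inference call carries fixed overhead, so mobile batch Whisper never got anywhere near note-taking latency.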
Korean Technical Speech: Extra Hard Mode
Korean meeting audio with mixed English technical terms is basically the boss fight of speech recognition:
- Acronyms (API, SDK, PCR)
- Code and infrastructure terms
- Rapid topic switching
- People talking over each other
Generic ASR models aren't built for this. Domain-aware correction matters as much as raw model quality β sometimes more.
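The cheapest form of domain-aware correction is a glossary pass over the raw transcript, mapping common phonetic misreadings of English terms back to canonical form. This sketch is illustrative only, not VORA's actual correction layer (that's the reranking story in part 5 of this series):

```javascript
// Minimal glossary-based post-correction: replace known ASR
// misreadings of domain terms. Entries map the Korean phonetic
// spelling an ASR model tends to emit back to the canonical acronym.
// The table here is a toy example, not VORA's real glossary.
const GLOSSARY = new Map([
  ["에이피아이", "API"], // phonetic Korean rendering of "API"
  ["에스디케이", "SDK"],
  ["피시알", "PCR"],
]);

function correctTerms(transcript) {
  let out = transcript;
  for (const [wrong, right] of GLOSSARY) {
    out = out.split(wrong).join(right);
  }
  return out;
}
```

A static table like this obviously can't handle context-dependent errors, which is why reranking over N-best hypotheses ends up mattering more than any single substitution list.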
Where Browser Whisper Actually Makes Sense
I didn't throw the whole experiment away. I narrowed the scope.
It's genuinely good for:
- Offline transcription of pre-recorded audio
- Privacy-sensitive local processing (no audio leaves the device)
- Non-real-time workflows where users don't mind waiting
It's not good for:
- Low-latency, always-on live meeting transcription on whatever random hardware your users have
The Product Call: Labs, Not Core
I kept Whisper-in-browser as an experimental feature in VORA's Labs section. The main app's real-time transcription stays on the lower-latency path that actually works across devices.
Early feedback from users who tried the Labs feature was split: power users who regularly work offline genuinely appreciated the ability to transcribe without a server call, even if they waited an extra few seconds. But most users didn't enable it. The cognitive load of "wait, which transcriber am I using right now?" outweighed the privacy win. People want transcription to just work instantly. They don't want to think about infrastructure choices while they're trying to take meeting notes.
I'm still exploring a hybrid approach: live transcript first (fast, good enough), then a higher-accuracy pass afterward (slower, better). Best of both worlds -- in theory. I'll report back when it's more than theory.
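The hybrid idea reduces to a small control flow: emit the fast draft immediately, then silently swap in the accurate result when it lands. A sketch under stated assumptions; the two pass functions and the update callback are injected stand-ins, not VORA code:

```javascript
// Two-pass hybrid transcription sketch. `fastPass` and `accuratePass`
// are async transcriber functions (hypothetical, injected so the flow
// is testable with stubs); `onUpdate` is how the UI hears about drafts
// and final text.
async function hybridTranscribe(audio, fastPass, accuratePass, onUpdate) {
  const draft = await fastPass(audio);
  onUpdate({ text: draft, final: false });   // show something right away
  const refined = await accuratePass(audio);
  onUpdate({ text: refined, final: true });  // upgrade in place later
  return refined;
}
```

The open product question isn't the control flow, it's whether users tolerate the transcript visibly rewriting itself after the fact.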
What I Learned
"Can it run in the browser?" is the wrong question. The right question is: "Can it run in the browser, on your user's actual hardware, fast enough that they don't notice it's running?"
If your product needs immediate output, optimize for latency. If your product needs accuracy on recorded content and privacy matters, browser inference is a real option. Just don't confuse a working demo on your M2 MacBook with a shipping product.
2026.02.05
Written by
Jay
Licensed Pharmacist Β· Senior Researcher
Building production-grade AI tools across medicine, finance, and productivity β without a CS degree. Domain expertise first, code second.