The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks
Series: VORA B.LOG
- 1. Why I shipped VORA before writing a single line of backend code
- 2. From Python Server to Pure Browser: The Architecture Pivot That Changed Everything
- 3. The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks ← you are here
- 4. Why We Killed Speaker Identification (And What We Learned from Two Weeks of Failure)
- 5. Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls)
- 6. Building the Priority Queue: How We Stopped Gemini API Chaos — and Why the First Two Designs Both Failed
- 7. Groq Dual-AI Integration: Why We Added a Second AI and What It Actually Fixed
- 8. The Meeting Summary Timer Bug: Why setInterval Isn't Enough for Reliable Scheduling
- 9. Building a Real Meeting Export: From Raw Transcript to a Usable Report
- 10. The Dark Theme Redesign: Building a UI That Looks Like a Professional Tool (After It Looked Like a Hobbyist Project)
- 11. The Branding Journey: From a Functional Name to VORA
- 12. How We Made VORA Bilingual Without a Heavy Localization Stack
- 13. Deploying to Cloudflare Pages: Static Hosting, CORS Headers, and the Sitemap/Robots Incident
- 14. How I Fixed AI Over-correction
Running speech AI fully in the browser sounds perfect on paper:
- no server dependency
- stronger privacy
- offline capability
We tested that path with Whisper + WASM and learned a simple lesson: "technically possible" is not the same as "product-ready."
This post covers what we tried, where it failed, and where browser inference is actually useful today.
Why We Tried Whisper in the Browser
Whisper is a strong speech model, and the modern browser stack is much better than it used to be:
- WebAssembly for CPU-heavy compute
- ONNX Runtime Web for portable model execution
- growing WebGPU support
So the idea made sense: run transcription locally in the client, avoid sending raw audio out, and reduce backend load.
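Before loading any model, it is worth probing what the client can actually do. The sketch below is illustrative, not code from the VORA app: the `Capabilities` shape, the memory threshold, and the plan names are all assumptions made for the example.

```typescript
// Illustrative capability probe: decide whether in-browser inference is
// even worth attempting on this client. Thresholds are example values.
interface Capabilities {
  wasm: boolean;                // WebAssembly available at all
  crossOriginIsolated: boolean; // required for SharedArrayBuffer threading
  webgpu: boolean;              // GPU path via WebGPU
  deviceMemoryGB?: number;      // navigator.deviceMemory, when exposed
}

type InferencePlan = "webgpu" | "wasm-threaded" | "wasm-single" | "server";

function planInference(c: Capabilities): InferencePlan {
  if (!c.wasm) return "server";
  if (c.webgpu) return "webgpu";
  // Threaded WASM needs cross-origin isolation (COOP/COEP headers)
  // and enough memory for the model plus audio buffers.
  if (c.crossOriginIsolated && (c.deviceMemoryGB ?? 0) >= 4) {
    return "wasm-threaded";
  }
  return "wasm-single";
}
```

In a real page, these fields would come from checks like `typeof WebAssembly !== "undefined"`, `self.crossOriginIsolated`, and `"gpu" in navigator`.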
Experiment 1: ONNX Runtime Web + Whisper
Our first implementation worked functionally, but several constraints showed up fast.
1) Model size vs user experience
Large Whisper models are too heavy for first-run web UX. Smaller models are faster, but accuracy drops on real meeting audio.
That creates a hard tradeoff:
- accuracy model: too large to load comfortably
- lightweight model: not accurate enough for production transcription
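The first-run cost of that tradeoff is easy to estimate with back-of-the-envelope arithmetic. The model sizes and bandwidth below are illustrative assumptions, not measurements from our deployment:

```typescript
// How long a user waits for the model download alone, before any
// inference happens. Sizes and bandwidth are example figures.
function downloadSeconds(modelMB: number, mbps: number): number {
  // mbps is megabits per second; 8 bits per byte.
  return (modelMB * 8) / mbps;
}

const models = { tinyMB: 75, smallMB: 460 }; // hypothetical export sizes

// On a 50 Mbit/s connection:
const tinyWait = downloadSeconds(models.tinyMB, 50);   // 12 s
const smallWait = downloadSeconds(models.smallMB, 50); // ~74 s
```

Even under generous assumptions, the more accurate model pushes first load past a minute, which is where web users give up.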
2) SharedArrayBuffer and cross-origin isolation
To get practical performance with threading, we needed SharedArrayBuffer, which requires:
- Cross-Origin-Opener-Policy: same-origin
- Cross-Origin-Embedder-Policy: require-corp
Enabling these headers broke multiple existing assumptions around third-party assets and static hosting setup. We eventually got this working, but the deployment surface became much more complex than expected.
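On static hosting such as Cloudflare Pages (where VORA deploys), the headers can be set with a `_headers` file. This is a sketch of that configuration; note that once COEP is `require-corp`, every embedded third-party asset must itself be served with compatible CORP/CORS headers, which is exactly where our assumptions broke.

```
# _headers — enable cross-origin isolation for SharedArrayBuffer
/*
  Cross-Origin-Opener-Policy: same-origin
  Cross-Origin-Embedder-Policy: require-corp
```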
Experiment 2: Streaming with sherpa-onnx
We also tested sherpa-onnx for streaming behavior. Latency characteristics were promising, but we still faced:
- bundle and model size concerns
- cross-origin isolation requirements
- uneven quality on Korean technical meeting audio
So even though the runtime model was attractive, the end-to-end product constraints remained.
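The streaming loop itself is simple: slice captured audio into fixed-size frames and feed them to the recognizer as they arrive. The sketch below shows only the generic chunking step; it does not use sherpa-onnx's real API, and the frame size is an arbitrary example.

```typescript
// Generic fixed-size chunker for feeding a streaming recognizer.
// The recognizer consuming these frames is out of scope here.
function* chunkAudio(
  samples: Float32Array,
  frameSize: number
): Generator<Float32Array> {
  for (let i = 0; i < samples.length; i += frameSize) {
    // subarray avoids copying; the last frame may be shorter.
    yield samples.subarray(i, Math.min(i + frameSize, samples.length));
  }
}
```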
Performance Reality
In practice, real-time experience depended heavily on hardware class:
- desktop: acceptable for some model sizes and configs
- mobile: often too slow for a smooth live transcript
For our use case (live meeting capture), users care most about immediate feedback. Delayed "better" text is useful, but not enough by itself.
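The desktop/mobile gap is easiest to reason about through the real-time factor (RTF): seconds of compute per second of audio. RTF below 1 means transcription keeps up with live speech; in practice you need headroom below 1, because spikes (GC pauses, tab throttling) eat the margin. The safety margin here is an example value, not a tuned constant from our app:

```typescript
// Real-time factor: processing time divided by audio duration.
function realTimeFactor(processingSec: number, audioSec: number): number {
  return processingSec / audioSec;
}

// Example policy: require headroom, not just RTF < 1.
function keepsUpLive(rtf: number, safetyMargin = 0.7): boolean {
  return rtf <= safetyMargin;
}
```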
Korean Technical Speech Is a Special Case
Korean meeting audio with mixed English technical vocabulary is difficult for generic ASR pipelines:
- acronyms
- code and infra terms
- fast topic shifts
- overlapping speech
In this setting, domain-aware correction matters as much as raw model quality.
Where Browser Whisper Actually Works Well
We did not abandon browser inference. We narrowed the scope.
Great use cases:
- offline transcription of pre-recorded audio
- privacy-sensitive local processing
- non-real-time workflows where users accept delay
Weak use case today:
- low-latency, always-on live meeting transcription on mixed client hardware
Product Decision: Labs, Not Core Path
We kept Whisper in our Labs track as an experimental feature and kept real-time transcription in the main app on a lower-latency path.
We are still exploring hybrid workflows:
- live transcript first
- higher-accuracy pass afterward
This gives users immediate usability and a better final artifact.
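One way to sketch that hybrid merge: keep the provisional live segments, then overwrite any stretch of audio the slower, higher-accuracy pass has already covered. The `Segment` type and the overlap policy are illustrative, not our production logic.

```typescript
interface Segment {
  startSec: number;
  endSec: number;
  text: string;
  source: "live" | "refined";
}

// Prefer refined segments wherever the second pass has reached;
// keep live text for audio it has not processed yet.
function mergeTranscript(live: Segment[], refined: Segment[]): Segment[] {
  const refinedEnd = refined.length
    ? Math.max(...refined.map((s) => s.endSec))
    : 0;
  const tail = live.filter((s) => s.startSec >= refinedEnd);
  return [...refined, ...tail].sort((a, b) => a.startSec - b.startSec);
}
```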
Takeaway
Before committing to browser AI for production, benchmark the exact user path on real user hardware.
If the product needs immediate output, optimize for latency first. If the product needs maximum accuracy and privacy on recorded content, browser inference can be a strong fit.