Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls)
How VORA uses N-best candidates from the Web Speech API — reranking with domain dictionaries and priority terms — to cut word error rate by 8% before any LLM call.
Series: VORA B.LOG
- 1. Why I shipped VORA before writing a single line of backend code
- 2. From Python Server to Pure Browser: The Architecture Pivot That Changed Everything
- 3. The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks
- 4. Why We Killed Speaker Identification (And What We Learned from Two Weeks of Failure)
- 5. Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls) ← you are here
- 6. Building the Priority Queue: How We Stopped Gemini API Chaos — and Why the First Two Designs Both Failed
- 7. Groq Dual-AI Integration: Why I Added a Second AI and What It Actually Fixed
- 8. The Meeting Summary Timer Bug: Why setInterval Isn't Enough for Reliable Scheduling
- 9. Building a Real Meeting Export: From Raw Transcript to a Usable Report
- 10. The Dark Theme Redesign: Building a UI That Looks Like a Professional Tool (After It Looked Like a Hobbyist Project)
- 11. The Branding Journey: From a Functional Name to VORA
- 12. How We Made VORA Bilingual Without a Heavy Localization Stack
- 13. Deploying to Cloudflare Pages: Static Hosting, CORS Headers, and the Sitemap/Robots Incident
- 14. How I Fixed AI Over-correction
- 15. The VORA Overhaul: Dropping Real-Time Q&A, Building Human-in-the-Loop Memos, and a Three-Column Layout
Most speech apps take the top transcript from the Web Speech API and call it a day. I used to do the same thing. Then I noticed the API was consistently transcribing "API" as "A P I" and "LogP" as "log pee," and I decided that maybe the top result wasn't always the best result.
Turns out the Web Speech API gives you multiple transcript candidates with confidence scores. Most apps just ignore the extras. I didn't. And it made a surprisingly big difference -- without a single additional API call.
🎯 What the Web Speech API Actually Gives You
When a final onresult event fires, you get a SpeechRecognitionResult with multiple alternatives:
event.results[0][0] = { transcript: "API integration failed", confidence: 0.87 }
event.results[0][1] = { transcript: "A P I integration failed", confidence: 0.85 }
event.results[0][2] = { transcript: "API integration fail", confidence: 0.83 }

In Korean STT on Chrome, I typically see 2-3 alternatives with very close confidence values. All plausible acoustically. But in a software meeting, the first candidate is usually the one you actually want.
The catch: confidence scores reflect acoustic likelihood, not domain intent. The speech model doesn't know you're in a pharma meeting. It doesn't know that "log pee" should be "LogP." It's just listening to sounds and making its best guess.
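To get those extras at all, you have to ask for them, then walk the array-like result object. Here is a minimal sketch; the `getAlternatives` helper is my illustration, not VORA's exact code, but `maxAlternatives` and the `onresult` shape are the standard Web Speech API:

```javascript
// Collect every alternative from a SpeechRecognitionResult.
// The result object is array-like, not a real Array, so index into it.
function getAlternatives(result) {
  const alternatives = [];
  for (let i = 0; i < result.length; i++) {
    alternatives.push({
      transcript: result[i].transcript,
      confidence: result[i].confidence,
    });
  }
  return alternatives;
}

// In the recognizer setup, ask Chrome for more than one candidate:
// recognition.maxAlternatives = 3;
// recognition.onresult = (event) => {
//   const result = event.results[event.results.length - 1];
//   if (result.isFinal) console.log(getAlternatives(result));
// };
```

Without setting `maxAlternatives`, Chrome defaults to returning a single candidate -- and the whole reranking idea is dead on arrival.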
🐛 Why Top-1 Breaks in Technical Meetings
For casual conversation, the top result is usually fine. But in technical discussions, it falls apart:
- Acronyms: API, SDK, PCR -- the model doesn't know these are single terms
- Mixed-language jargon: Korean sentences with English technical terms sprinkled in
- Product-specific terms: your internal tool names, your company abbreviations
- Proper nouns and model names
The acoustic model isn't wrong. It's just under-informed. It doesn't know your meeting context. So I gave it some.
🔑 The Reranking Strategy
For each candidate in the N-best list, I compute a composite score:
- Start with the API confidence
- Apply local dictionary correction
- Add a bonus for detected technical terms
- Add a stronger bonus for user priority terms
- Add a small bonus if the correction changed the text
// Track the best-scoring candidate across the N-best list
let bestScore = -Infinity;
let bestCandidate = candidates[0];

for (const candidate of candidates) {
  // Start from the API's acoustic confidence
  let score = candidate.confidence || 0;
  // Apply the local dictionary pass first
  const corrected = quickCorrect(candidate.transcript);
  // Bonus for detected technical terms (acronyms, domain vocabulary)
  const technicalTerms = corrected.match(
    /[A-Z][A-Za-z0-9]+|LogP|pKa|IC50|Cmax|PCR|ELISA/g
  );
  if (technicalTerms) score += technicalTerms.length * 0.1;
  // Stronger bonus for user-configured priority terms
  for (const term of priorityTerms) {
    if (corrected.includes(term)) score += 0.3;
  }
  // Small bonus if the dictionary actually changed something
  if (corrected !== candidate.transcript) score += 0.05;
  if (score > bestScore) {
    bestScore = score;
    bestCandidate = candidate;
  }
}

Cheap, deterministic, and fast enough to run on every single final STT segment. No API call required.
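To see the scoring end to end, here is a self-contained version wrapped as a function. The `quickCorrect` here is a toy stand-in for the real dictionary pass -- the two substitutions are taken from the examples earlier in this post, nothing more:

```javascript
// Toy dictionary pass: fixes the two mis-transcriptions from the intro.
// The real quickCorrect is a full local dictionary, not two regexes.
function quickCorrect(text) {
  return text.replace(/\bA P I\b/g, "API").replace(/\blog pee\b/gi, "LogP");
}

// Rerank the N-best list with the composite score described above.
function rerank(candidates, priorityTerms) {
  let bestScore = -Infinity;
  let bestCandidate = null;
  for (const candidate of candidates) {
    let score = candidate.confidence || 0;
    const corrected = quickCorrect(candidate.transcript);
    const technicalTerms = corrected.match(
      /[A-Z][A-Za-z0-9]+|LogP|pKa|IC50|Cmax|PCR|ELISA/g
    );
    if (technicalTerms) score += technicalTerms.length * 0.1;
    for (const term of priorityTerms) {
      if (corrected.includes(term)) score += 0.3;
    }
    if (corrected !== candidate.transcript) score += 0.05;
    if (score > bestScore) {
      bestScore = score;
      bestCandidate = { ...candidate, corrected, score };
    }
  }
  return bestCandidate;
}

const best = rerank(
  [
    { transcript: "A P I integration failed", confidence: 0.87 },
    { transcript: "API integration failed", confidence: 0.85 },
  ],
  ["API"]
);
console.log(best.corrected); // "API integration failed"
```

Note that both candidates converge to the same corrected text -- the reranker's job is just to pick the one whose evidence (confidence, term hits, correction bonus) is strongest.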
🧠 Session Dictionary: The System That Learns During Meetings
I also built a session dictionary that picks up patterns as the meeting goes on.
If the system confirms the same correction multiple times (raw → corrected), it stores that pair locally. After a confidence threshold, it starts applying the correction instantly -- no LLM needed.
This helps enormously with repeated terms. If someone says "CYP3A4" twenty times in a pharma meeting, the first one might get mangled. By the third or fourth, the session dictionary catches it automatically.
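The mechanism above can be sketched as a small class. The name `SessionDictionary` and the threshold of three confirmations are illustrative choices, not VORA's exact values:

```javascript
// Session dictionary: learns raw -> corrected pairs during a meeting
// and applies them instantly once confirmed often enough.
class SessionDictionary {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.counts = new Map();  // "raw->corrected" key -> confirmation count
    this.learned = new Map(); // raw -> corrected, once past the threshold
  }

  // Called each time the pipeline confirms a correction.
  confirm(raw, corrected) {
    if (raw === corrected) return;
    const key = `${raw}->${corrected}`;
    const count = (this.counts.get(key) || 0) + 1;
    this.counts.set(key, count);
    if (count >= this.threshold) this.learned.set(raw, corrected);
  }

  // Apply every learned correction instantly -- no LLM involved.
  apply(text) {
    let out = text;
    for (const [raw, corrected] of this.learned) {
      out = out.split(raw).join(corrected);
    }
    return out;
  }
}

const dict = new SessionDictionary();
for (let i = 0; i < 3; i++) dict.confirm("log pee", "LogP");
console.log(dict.apply("the log pee value looks high"));
// "the LogP value looks high"
```

Before the third confirmation, `apply` is a no-op for that pair -- which is exactly the "first one might get mangled" behavior described above.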
💡 Why Not Just Send Everything to an LLM?
Two words: latency and quota.
Real-time meeting transcription generates a lot of final segments per minute. If every segment triggers a remote LLM call, I'd blow through rate limits in minutes and introduce visible lag that makes the app feel broken.
My approach:
- Local reranking + dictionary first (free, instant)
- External AI correction only when needed (expensive, slower)
This cuts unnecessary API traffic while keeping quality high where semantic reasoning actually matters. It's like having a spell checker handle the obvious stuff so the editor can focus on the hard problems.
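The "only when needed" gate can be as simple as a heuristic like the one below. The function name, thresholds, and `dictionaryHit` flag are my assumptions for illustration, not VORA's production values:

```javascript
// Illustrative gate: escalate a segment to the LLM only when the
// local layer looks unsure. All thresholds here are assumptions.
function needsLLMCorrection(segment) {
  // High acoustic confidence plus a dictionary hit: local result is fine.
  if (segment.confidence >= 0.85 && segment.dictionaryHit) return false;
  // Very short utterances rarely benefit from semantic correction.
  if (segment.transcript.trim().split(/\s+/).length < 3) return false;
  // Otherwise: low confidence or unrecognized terms -> escalate.
  return segment.confidence < 0.7 || !segment.dictionaryHit;
}

console.log(needsLLMCorrection({
  transcript: "ok", confidence: 0.5, dictionaryHit: false,
})); // false -- too short to be worth a call

console.log(needsLLMCorrection({
  transcript: "deploy the API today", confidence: 0.6, dictionaryHit: false,
})); // true -- low confidence, no local evidence
```

The point is that the expensive path is opt-in per segment, so quota burn scales with uncertainty rather than with meeting length.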
🎯 Domain Personas Make AI Corrections Smarter
When I do call an LLM, the prompt context is domain-specific. A pharma meeting prompt looks different from an IT/software prompt. Same tokens can mean very different things depending on the field.
I also include a short rolling context window from recent corrected utterances. This helps the model understand what the meeting is about, not just what the current sentence sounds like.
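A rolling window like that is just a bounded buffer of recent corrected utterances rendered into the prompt. This sketch is my illustration -- the window size of five and the prompt wording are assumptions:

```javascript
// Rolling context window: keep the last N corrected utterances and
// render them as a prompt preamble so the LLM knows the meeting topic.
function makeContextWindow(maxUtterances = 5) {
  const recent = [];
  return {
    push(utterance) {
      recent.push(utterance);
      if (recent.length > maxUtterances) recent.shift(); // drop the oldest
    },
    toPrompt() {
      return recent.length
        ? `Recent meeting context:\n${recent.join("\n")}`
        : "";
    },
  };
}

const context = makeContextWindow(2);
context.push("We measured LogP for the lead compound.");
context.push("IC50 values came back this morning.");
context.push("PCR results are still pending.");
console.log(context.toPrompt());
// Only the last two utterances survive in the window.
```

Keeping the window short matters: it bounds prompt size (and therefore cost) while still anchoring the model to the current topic.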
📊 The Numbers
In internal tests on mixed Korean-English technical meeting audio:
- N-best reranking only: ~8% relative WER reduction vs top-1
- Local dictionary: additional ~15% reduction in domain term errors
- LLM correction: additional ~12% reduction in semantic errors
- Session learning (after warm-up): additional ~6% reduction
The biggest wins came from domain-heavy meetings. In casual conversation, the improvement was smaller -- which makes sense. If everyone's talking about lunch plans, you don't need domain reranking.
💡 What I Took Away
If your speech pipeline supports alternatives, use them. Don't throw away the extra candidates just because nobody else uses them.
A small amount of domain configuration -- a dictionary, some priority terms, a quick reranking pass -- improves accuracy more than most people expect. And it's infinitely cheaper than adding more model calls.
N-best reranking became the first gate in my correction pipeline: fast, local, and surprisingly effective for such a simple idea.
2026.02.08
This post covers the pre-AI correction layer. For what happens when the AI correction itself goes rogue -- rewriting normal speech, doubling prompts, and matching too broadly -- see How I Fixed AI Over-correction.
Written by
Jay
Licensed Pharmacist · Senior Researcher
Building production-grade AI tools across medicine, finance, and productivity — without a CS degree. Domain expertise first, code second.