Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls)
Series: VORA B.LOG
- 1. Why I shipped VORA before writing a single line of backend code
- 2. From Python Server to Pure Browser: The Architecture Pivot That Changed Everything
- 3. The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks
- 4. Why We Killed Speaker Identification (And What We Learned from Two Weeks of Failure)
- 5. Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls) ← you are here
- 6. Building the Priority Queue: How We Stopped Gemini API Chaos — and Why the First Two Designs Both Failed
- 7. Groq Dual-AI Integration: Why We Added a Second AI and What It Actually Fixed
- 8. The Meeting Summary Timer Bug: Why setInterval Isn't Enough for Reliable Scheduling
- 9. Building a Real Meeting Export: From Raw Transcript to a Usable Report
- 10. The Dark Theme Redesign: Building a UI That Looks Like a Professional Tool (After It Looked Like a Hobbyist Project)
- 11. The Branding Journey: From a Functional Name to VORA
- 12. How We Made VORA Bilingual Without a Heavy Localization Stack
- 13. Deploying to Cloudflare Pages: Static Hosting, CORS Headers, and the Sitemap/Robots Incident
- 14. How I Fixed AI Over-correction
Most speech apps only use the top transcript returned by the Web Speech API. We decided to use the full N-best list instead: multiple transcript candidates with confidence scores.
For domain-heavy meetings (software, biotech, pharma), that simple decision made a real difference. By reranking candidates locally with domain context, we improved recognition quality without increasing API usage.
This post explains what we built and why it worked.
What the Web Speech API Actually Gives You
When an onresult event delivers a final result, you get a SpeechRecognitionResult that may contain multiple alternatives, each with:
- transcript: recognized text
- confidence: score from 0.0 to 1.0
In Korean STT on Chrome, we often see 2-3 alternatives with very close confidence values.
Example:
event.results[0][0] = { transcript: "API integration failed", confidence: 0.87 }
event.results[0][1] = { transcript: "A P I integration failed", confidence: 0.85 }
event.results[0][2] = { transcript: "API integration fail", confidence: 0.83 }
All three are plausible acoustically. But in a software meeting, the first candidate is usually the intended form.
The key issue: confidence scores reflect acoustic likelihood, not domain intent.
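To get those alternatives at all, you have to ask for them up front. Here is a minimal sketch: the browser wiring is commented out because SpeechRecognition only exists in a browser, and the helper name extractNBest is ours, not part of the API.

```javascript
// Pure helper: flatten a SpeechRecognitionResult (array-like of
// alternatives) into a plain array of { transcript, confidence }.
function extractNBest(result) {
  const alternatives = [];
  for (let i = 0; i < result.length; i++) {
    alternatives.push({
      transcript: result[i].transcript,
      confidence: result[i].confidence,
    });
  }
  return alternatives;
}

// Browser wiring (illustrative, not runnable outside a browser):
// const rec = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
// rec.maxAlternatives = 3;   // ask for up to 3 candidates per result
// rec.lang = "ko-KR";
// rec.onresult = (event) => {
//   const result = event.results[event.results.length - 1];
//   if (result.isFinal) console.log(extractNBest(result));
// };
```

Note that maxAlternatives defaults to 1, so without setting it you never see the N-best list in the first place.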
Why Confidence Alone Fails in Technical Meetings
For casual speech, top-1 is often fine. In technical discussions, it breaks more often:
- acronyms (API, SDK, PCR)
- mixed-language jargon
- product-specific terms
- proper nouns and model names
Acoustic models are not wrong here; they are just under-informed. They do not know your meeting context.
So we added a local reranking step that injects domain knowledge before we call any external LLM.
Local Reranking Strategy
For each candidate in the N-best list, we compute a composite score:
- start with API confidence
- apply local dictionary correction
- add bonus for detected technical terms
- add stronger bonus for user priority terms
- add a small bonus if correction changed the text
let bestScore = -Infinity;
let bestCandidate = null;

for (const candidate of candidates) {
  let score = candidate.confidence || 0;

  // Run the local dictionary pass first, so bonuses apply to corrected text
  const corrected = quickCorrect(candidate.transcript);

  // Bonus for detected technical terms (acronyms, domain tokens)
  const technicalTerms = corrected.match(
    /[A-Z][A-Za-z0-9]+|LogP|pKa|IC50|Cmax|PCR|ELISA/g
  );
  if (technicalTerms) score += technicalTerms.length * 0.1;

  // Stronger bonus for user-defined priority terms
  for (const term of priorityTerms) {
    if (corrected.includes(term)) score += 0.3;
  }

  // Small bonus if the dictionary correction actually changed the text
  if (corrected !== candidate.transcript) score += 0.05;

  if (score > bestScore) {
    bestScore = score;
    bestCandidate = candidate;
  }
}
This is cheap, deterministic, and fast enough to run on every final STT segment.
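The quickCorrect call in the loop above is the local dictionary pass. A minimal sketch might look like this; the dictionary entries here are illustrative stand-ins, not VORA's actual dictionary.

```javascript
// Sketch of a local dictionary correction pass. Entries map common
// misrecognitions (including Korean phonetic spellings) to canonical forms.
const LOCAL_DICTIONARY = {
  "A P I": "API",       // letter-by-letter recognition of "API"
  "에이피아이": "API",   // Korean phonetic spelling of "API"
  "피씨알": "PCR",       // Korean phonetic spelling of "PCR"
};

function quickCorrect(text) {
  let out = text;
  for (const [raw, fixed] of Object.entries(LOCAL_DICTIONARY)) {
    // split/join replaces every occurrence and avoids regex escaping issues
    out = out.split(raw).join(fixed);
  }
  return out;
}
```

A plain substring replacement like this is deliberately dumb: it is deterministic, runs in microseconds, and never hallucinates, which is exactly what you want before any model-based correction.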
Session Dictionary: Lightweight Learning During Meetings
We also added a session dictionary that learns corrections over time.
If the system repeatedly confirms the same correction pattern, it starts applying that correction locally for the rest of the meeting.
Flow:
- observe a correction pair (raw -> corrected)
- increment its confidence counter
- once a threshold is reached, store it in the session dictionary
- apply it instantly in future utterances without another API call
This helps a lot with repeated technical terms and proper nouns.
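The flow above can be sketched as a small class. The threshold value and the class shape are assumptions for illustration; the real implementation may differ.

```javascript
// Session-scoped learning: promote a correction pair into the session
// dictionary once it has been confirmed enough times.
const CONFIRM_THRESHOLD = 3; // assumed value; tune per product

class SessionDictionary {
  constructor() {
    this.counts = new Map();   // "raw=>corrected" -> confirmations seen
    this.entries = new Map();  // raw -> corrected (promoted pairs)
  }

  // Record one observed correction pair.
  observe(raw, corrected) {
    if (raw === corrected) return;
    const key = `${raw}=>${corrected}`;
    const n = (this.counts.get(key) || 0) + 1;
    this.counts.set(key, n);
    if (n >= CONFIRM_THRESHOLD) this.entries.set(raw, corrected);
  }

  // Apply all promoted corrections to a new utterance, locally.
  apply(text) {
    let out = text;
    for (const [raw, corrected] of this.entries) {
      out = out.split(raw).join(corrected);
    }
    return out;
  }
}
```

Because the dictionary is session-scoped, a wrong promotion can only pollute one meeting, which keeps the risk of this kind of online learning low.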
Why Not Send Everything to an LLM?
Two reasons: latency and quota.
Real-time meeting transcription generates many final segments per minute. If every segment triggers a remote correction call, you quickly hit rate limits and introduce visible lag.
Our approach:
- local reranking + dictionary first
- external AI correction only when needed
In practice, this cuts unnecessary API traffic while preserving quality where semantic reasoning is truly required.
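The "only when needed" gate can be as simple as two checks on the locally scored candidates. The thresholds below are illustrative assumptions, not VORA's tuned values.

```javascript
// Decide whether a segment is worth a remote LLM correction call.
// `scored` is the N-best list after local reranking, sorted by score
// descending: [{ transcript, score }, ...].
function needsLlmCorrection(scored) {
  if (scored.length === 0) return false;
  const top = scored[0];
  // Weak winner: even the best candidate looks unreliable.
  if (top.score < 0.75) return true;
  // Near tie: the local reranker could not separate the top candidates,
  // so semantic reasoning is actually needed to pick between them.
  if (scored.length > 1 && top.score - scored[1].score < 0.05) return true;
  return false;
}
```

Everything that passes this gate goes straight to the transcript from the local reranker; only the ambiguous remainder spends API quota.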
Domain Personas Improve AI Corrections
When we do call an LLM, prompt context is domain-specific (e.g., pharma, biotech, IT/software, general).
Why this matters:
- identical tokens can mean different things across fields
- abbreviations are interpreted differently by domain
- recent context changes how ambiguous phrases should be resolved
We also include a short rolling context window from recent corrected utterances to improve continuity.
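Putting the persona and the rolling context together, the prompt builder can be sketched as below. The persona texts, the window size, and the function name are all illustrative assumptions.

```javascript
// Sketch of a domain-aware correction prompt. Persona texts are examples.
const PERSONAS = {
  software: "You correct Korean/English software-meeting transcripts. " +
            "Prefer canonical spellings of APIs, SDKs, and product names.",
  pharma:   "You correct Korean/English pharma-meeting transcripts. " +
            "Prefer canonical assay and compound names (PCR, ELISA, IC50).",
};

const CONTEXT_WINDOW = 5; // last N corrected utterances; assumed size

function buildCorrectionPrompt(domain, recentUtterances, rawText) {
  const persona = PERSONAS[domain] || "You correct meeting transcripts.";
  const context = recentUtterances.slice(-CONTEXT_WINDOW).join("\n");
  return `${persona}\n\nRecent context:\n${context}\n\n` +
         `Correct this utterance, changing only misrecognized terms:\n${rawText}`;
}
```

The "changing only misrecognized terms" instruction matters: without it, LLMs tend to paraphrase the whole utterance instead of fixing recognition errors.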
Measured Impact
In internal tests on mixed Korean-English technical meeting audio:
- N-best reranking only: ~8% relative WER reduction vs top-1 only
- local dictionary: additional ~15% reduction in domain term errors
- LLM correction: additional ~12% reduction in semantic errors
- session learning (after warm-up): additional ~6% reduction
The biggest win from reranking appeared in domain-heavy meetings. In casual speech, improvement was smaller, which is expected.
Practical Takeaway
If your speech pipeline supports alternatives or domain settings, use them.
A small amount of domain configuration usually improves accuracy more than people expect, and it is much cheaper than adding more model calls.
For us, N-best reranking became the first gate in the correction pipeline: fast, local, and surprisingly effective.