
Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls)

How VORA uses N-best candidates from the Web Speech API — reranking with domain dictionaries and priority terms — to cut word error rate by 8% before any LLM call.

by Jay · 5 min read · VORA B.LOG

Most speech apps take the top transcript from the Web Speech API and call it a day. I used to do the same thing. Then I noticed the API was consistently transcribing "API" as "A P I" and "LogP" as "log pee," and I decided that maybe the top result wasn't always the best result.

Turns out the Web Speech API gives you multiple transcript candidates with confidence scores. Most apps just ignore the extras. I didn't. And it made a surprisingly big difference -- without a single additional API call.

🎯 What the Web Speech API Actually Gives You

When a final onresult event fires, you get a SpeechRecognitionResult with multiple alternatives:

event.results[0][0] = { transcript: "API integration failed", confidence: 0.87 }
event.results[0][1] = { transcript: "A P I integration failed", confidence: 0.85 }
event.results[0][2] = { transcript: "API integration fail", confidence: 0.83 }
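Pulling those alternatives out of a final result takes only a few lines. Here's a minimal sketch: `webkitSpeechRecognition`, `maxAlternatives`, `resultIndex`, and `isFinal` are the real Web Speech API surface (Chrome), while `rerank` is a placeholder for the reranking pass described below.

```javascript
// Convert an array-like SpeechRecognitionResult into a plain N-best list.
function toNBestList(result) {
  return Array.from(result).map((alt) => ({
    transcript: alt.transcript,
    confidence: alt.confidence,
  }));
}

// Browser wiring (Chrome): ask for up to 3 alternatives per result.
// const recognition = new webkitSpeechRecognition();
// recognition.lang = 'ko-KR';
// recognition.maxAlternatives = 3;
// recognition.onresult = (event) => {
//   const result = event.results[event.resultIndex];
//   if (result.isFinal) rerank(toNBestList(result)); // placeholder next step
// };
```

Note that `maxAlternatives` is a request, not a guarantee: Chrome may still return fewer candidates than you asked for.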

In Korean STT on Chrome, I typically see 2-3 alternatives with very close confidence values. All plausible acoustically. But in a software meeting, the first candidate is usually the one you actually want.

The catch: confidence scores reflect acoustic likelihood, not domain intent. The speech model doesn't know you're in a pharma meeting. It doesn't know that "log pee" should be "LogP." It's just listening to sounds and making its best guess.

🐛 Why Top-1 Breaks in Technical Meetings

For casual conversation, the top result is usually fine. But in technical discussions, it falls apart:

  • Acronyms: API, SDK, PCR -- the model doesn't know these are single terms
  • Mixed-language jargon: Korean sentences with English technical terms sprinkled in
  • Product-specific terms: your internal tool names, your company abbreviations
  • Proper nouns and model names

The acoustic model isn't wrong. It's just under-informed. It doesn't know your meeting context. So I gave it some.

🔑 The Reranking Strategy


For each candidate in the N-best list, I compute a composite score:

  1. Start with the API confidence
  2. Apply local dictionary correction
  3. Add a bonus for detected technical terms
  4. Add a stronger bonus for user priority terms
  5. Add a small bonus if the correction changed the text

let bestCandidate = candidates[0];
let bestScore = -Infinity;

for (const candidate of candidates) {
  // 1. Start with the API confidence
  let score = candidate.confidence || 0;

  // 2. Apply local dictionary correction
  const corrected = quickCorrect(candidate.transcript);

  // 3. Bonus for each detected technical term
  const technicalTerms = corrected.match(
    /[A-Z][A-Za-z0-9]+|LogP|pKa|IC50|Cmax|PCR|ELISA/g
  );
  if (technicalTerms) score += technicalTerms.length * 0.1;

  // 4. Stronger bonus for user priority terms
  for (const term of priorityTerms) {
    if (corrected.includes(term)) score += 0.3;
  }

  // 5. Small bonus if the correction changed the text
  if (corrected !== candidate.transcript) score += 0.05;

  if (score > bestScore) {
    bestScore = score;
    bestCandidate = candidate;
  }
}

Cheap, deterministic, and fast enough to run on every single final STT segment. No API call required.
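The loop above leans on `quickCorrect`, the local dictionary pass. The post doesn't show its body, so here's a minimal sketch under the assumption that it's a plain substitution table; the entries below are illustrative examples drawn from earlier in the post, and the hangul spelling is my own illustrative guess.

```javascript
// Assumed sketch of the local dictionary pass: a flat table of
// known-bad STT renderings mapped to the intended technical term.
const DOMAIN_DICTIONARY = {
  "A P I": "API",
  "log pee": "LogP",
  "에이피아이": "API", // hangul sounding-out of "API" (illustrative)
};

function quickCorrect(text) {
  let out = text;
  for (const [wrong, right] of Object.entries(DOMAIN_DICTIONARY)) {
    out = out.split(wrong).join(right); // replace every occurrence
  }
  return out;
}
```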

🧠 Session Dictionary: The System That Learns During Meetings

I also built a session dictionary that picks up patterns as the meeting goes on.

If the system confirms the same correction multiple times (raw → corrected), it stores that pair locally. After a confidence threshold, it starts applying the correction instantly -- no LLM needed.

This helps enormously with repeated terms. If someone says "CYP3A4" twenty times in a pharma meeting, the first one might get mangled. By the third or fourth, the session dictionary catches it automatically.
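The mechanics could look something like this. The class name, internal structure, and the confirmation threshold of 3 are my assumptions for the sketch; the post only describes the behavior (confirm a raw → corrected pair repeatedly, then apply it instantly).

```javascript
// Sketch of a session dictionary: once the same raw → corrected pair
// is confirmed enough times, apply it instantly with no model call.
class SessionDictionary {
  constructor(confirmThreshold = 3) {
    this.confirmThreshold = confirmThreshold;
    this.counts = new Map();  // "raw\u0000corrected" -> confirmation count
    this.learned = new Map(); // raw -> corrected, once past the threshold
  }

  // Record a correction the pipeline has confirmed (e.g. via LLM output).
  confirm(raw, corrected) {
    const key = raw + "\u0000" + corrected;
    const n = (this.counts.get(key) || 0) + 1;
    this.counts.set(key, n);
    if (n >= this.confirmThreshold) this.learned.set(raw, corrected);
  }

  // Apply every learned replacement to a new transcript instantly.
  apply(text) {
    let out = text;
    for (const [raw, corrected] of this.learned) {
      out = out.split(raw).join(corrected);
    }
    return out;
  }
}
```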

💡 Why Not Just Send Everything to an LLM?

Two reasons: latency and quota.

Real-time meeting transcription generates a lot of final segments per minute. If every segment triggers a remote LLM call, I'd blow through rate limits in minutes and introduce visible lag that makes the app feel broken.

My approach:

  • Local reranking + dictionary first (free, instant)
  • External AI correction only when needed (expensive, slower)

This cuts unnecessary API traffic while keeping quality high where semantic reasoning actually matters. It's like having a spell checker handle the obvious stuff so the editor can focus on the hard problems.
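A hypothetical sketch of that gate, to make the split concrete. The 0.6 threshold and the two signals are assumptions for illustration, not VORA's actual escalation logic:

```javascript
// Escalate a segment to the LLM only when the cheap local pass looks
// unsure: low acoustic confidence AND the dictionary found nothing to
// fix. That is where semantic reasoning can still add value.
function needsLlmCorrection(candidate, correctedText) {
  const lowConfidence = (candidate.confidence || 0) < 0.6;
  const localPassChangedNothing = correctedText === candidate.transcript;
  return lowConfidence && localPassChangedNothing;
}
```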

🎯 Domain Personas Make AI Corrections Smarter

When I do call an LLM, the prompt context is domain-specific. A pharma meeting prompt looks different from an IT/software prompt. Same tokens can mean very different things depending on the field.

I also include a short rolling context window from recent corrected utterances. This helps the model understand what the meeting is about, not just what the current sentence sounds like.
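Assembling the persona and rolling context into a prompt could look roughly like this. The persona texts, the window size of 5, and the prompt wording are all illustrative assumptions:

```javascript
// Hypothetical domain personas; the real prompts would be richer.
const PERSONAS = {
  pharma: "You fix Korean STT errors in pharmaceutical meetings.",
  software: "You fix Korean STT errors in software engineering meetings.",
};

// Combine persona + a short rolling window of recent corrected
// utterances + the segment to correct into one prompt string.
function buildCorrectionPrompt(domain, recentUtterances, segment) {
  const context = recentUtterances.slice(-5).join("\n");
  return `${PERSONAS[domain]}\n\nRecent context:\n${context}\n\nCorrect: ${segment}`;
}
```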

📊 The Numbers

In internal tests on mixed Korean-English technical meeting audio:

  • N-best reranking only: ~8% relative WER reduction vs top-1
  • Local dictionary: additional ~15% reduction in domain term errors
  • LLM correction: additional ~12% reduction in semantic errors
  • Session learning (after warm-up): additional ~6% reduction

The biggest wins came from domain-heavy meetings. In casual conversation, the improvement was smaller -- which makes sense. If everyone's talking about lunch plans, you don't need domain reranking.

💡 What I Took Away

If your speech pipeline supports alternatives, use them. Don't throw away the extra candidates just because nobody else uses them.

A small amount of domain configuration -- a dictionary, some priority terms, a quick reranking pass -- improves accuracy more than most people expect. And it's infinitely cheaper than adding more model calls.

N-best reranking became the first gate in my correction pipeline: fast, local, and surprisingly effective for such a simple idea.

2026.02.08


This post covers the pre-AI correction layer. For what happens when the AI correction itself goes rogue -- rewriting normal speech, doubling prompts, and matching too broadly -- see How I Fixed AI Over-correction.

Written by

Jay

Licensed Pharmacist · Senior Researcher

Building production-grade AI tools across medicine, finance, and productivity — without a CS degree. Domain expertise first, code second.

About the author →