Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls)
How VORA uses N-best candidates from the Web Speech API — reranking with domain dictionaries and priority terms — to cut word error rate by 8% before any LLM call.
Series: VORA B.LOG
- 1. Why I shipped VORA before writing a single line of backend code
- 2. From Python Server to Pure Browser: The Architecture Pivot That Changed Everything
- 3. The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks
- 4. Why We Killed Speaker Identification (And What We Learned from Two Weeks of Failure)
- 5. Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls) ← you are here
- 6. Building the Priority Queue: How We Stopped Gemini API Chaos — and Why the First Two Designs Both Failed
- 7. Groq Dual-AI Integration: Why I Added a Second AI and What It Actually Fixed
- 8. The Meeting Summary Timer Bug: Why setInterval Isn't Enough for Reliable Scheduling
- 9. Building a Real Meeting Export: From Raw Transcript to a Usable Report
- 10. The Dark Theme Redesign: Building a UI That Looks Like a Professional Tool (After It Looked Like a Hobbyist Project)
- 11. The Branding Journey: From a Functional Name to VORA
- 12. How We Made VORA Bilingual Without a Heavy Localization Stack
- 13. Deploying to Cloudflare Pages: Static Hosting, CORS Headers, and the Sitemap/Robots Incident
- 14. How I Fixed AI Over-correction
- 15. The VORA Overhaul: Dropping Real-Time Q&A, Building Human-in-the-Loop Memos, and a Three-Column Layout
Most speech apps take the top transcript from the Web Speech API and call it a day. I used to do the same thing. Then I noticed the API was consistently transcribing "API" as "A P I" and "LogP" as "log pee," and I decided that maybe the top result wasn't always the best result.
Turns out the Web Speech API gives you multiple transcript candidates with confidence scores. Most apps just ignore the extras. I didn't. And it made a surprisingly big difference -- without a single additional API call.
🎯 What the Web Speech API Actually Gives You
When a final onresult event fires, you get a SpeechRecognitionResult with multiple alternatives:
event.results[0][0] = { transcript: "API integration failed", confidence: 0.87 }
event.results[0][1] = { transcript: "A P I integration failed", confidence: 0.85 }
event.results[0][2] = { transcript: "API integration fail", confidence: 0.83 }

In Korean STT on Chrome, I typically see 2-3 alternatives with very close confidence values. All plausible acoustically. But in a software meeting, the first candidate is usually the one you actually want.
The catch: confidence scores reflect acoustic likelihood, not domain intent. The speech model doesn't know you're in a pharma meeting. It doesn't know that "log pee" should be "LogP." It's just listening to sounds and making its best guess.
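To get those extras at all, you have to ask for them, then walk the array-like result object. Here is a minimal sketch; the `getAlternatives` helper is my illustration, not VORA's exact code, but `maxAlternatives` and the `onresult` shape are the standard Web Speech API:

```javascript
// Collect every alternative from a SpeechRecognitionResult.
// The result object is array-like, not a real Array, so index into it.
function getAlternatives(result) {
  const alternatives = [];
  for (let i = 0; i < result.length; i++) {
    alternatives.push({
      transcript: result[i].transcript,
      confidence: result[i].confidence,
    });
  }
  return alternatives;
}

// In the recognizer setup, ask Chrome for more than one candidate:
// recognition.maxAlternatives = 3;
// recognition.onresult = (event) => {
//   const result = event.results[event.results.length - 1];
//   if (result.isFinal) console.log(getAlternatives(result));
// };
```

Without setting `maxAlternatives`, Chrome defaults to returning a single candidate -- and the whole reranking idea is dead on arrival.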
🐛 Why Top-1 Breaks in Technical Meetings
For casual conversation, the top result is usually fine. But in technical discussions, it falls apart:
- Acronyms: API, SDK, PCR -- the model doesn't know these are single terms
- Mixed-language jargon: Korean sentences with English technical terms sprinkled in
- Product-specific terms: your internal tool names, your company abbreviations
- Proper nouns and model names
The acoustic model isn't wrong. It's just under-informed. It doesn't know your meeting context. So I gave it some.
🔑 The Reranking Strategy
For each candidate in the N-best list, I compute a composite score:
- Start with the API confidence
- Apply local dictionary correction
- Add a bonus for detected technical terms
- Add a stronger bonus for user priority terms
- Add a small bonus if the correction changed the text
// Track the best-scoring candidate across the N-best list
let bestScore = -Infinity;
let bestCandidate = candidates[0];

for (const candidate of candidates) {
  // Start from the API's acoustic confidence
  let score = candidate.confidence || 0;
  // Apply the local dictionary pass first
  const corrected = quickCorrect(candidate.transcript);
  // Bonus for detected technical terms (acronyms, domain vocabulary)
  const technicalTerms = corrected.match(
    /[A-Z][A-Za-z0-9]+|LogP|pKa|IC50|Cmax|PCR|ELISA/g
  );
  if (technicalTerms) score += technicalTerms.length * 0.1;
  // Stronger bonus for user-configured priority terms
  for (const term of priorityTerms) {
    if (corrected.includes(term)) score += 0.3;
  }
  // Small bonus if the dictionary actually changed something
  if (corrected !== candidate.transcript) score += 0.05;
  if (score > bestScore) {
    bestScore = score;
    bestCandidate = candidate;
  }
}

Cheap, deterministic, and fast enough to run on every single final STT segment. No API call required.
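To see the scoring end to end, here is a self-contained version wrapped as a function. The `quickCorrect` here is a toy stand-in for the real dictionary pass -- the two substitutions are taken from the examples earlier in this post, nothing more:

```javascript
// Toy dictionary pass: fixes the two mis-transcriptions from the intro.
// The real quickCorrect is a full local dictionary, not two regexes.
function quickCorrect(text) {
  return text.replace(/\bA P I\b/g, "API").replace(/\blog pee\b/gi, "LogP");
}

// Rerank the N-best list with the composite score described above.
function rerank(candidates, priorityTerms) {
  let bestScore = -Infinity;
  let bestCandidate = null;
  for (const candidate of candidates) {
    let score = candidate.confidence || 0;
    const corrected = quickCorrect(candidate.transcript);
    const technicalTerms = corrected.match(
      /[A-Z][A-Za-z0-9]+|LogP|pKa|IC50|Cmax|PCR|ELISA/g
    );
    if (technicalTerms) score += technicalTerms.length * 0.1;
    for (const term of priorityTerms) {
      if (corrected.includes(term)) score += 0.3;
    }
    if (corrected !== candidate.transcript) score += 0.05;
    if (score > bestScore) {
      bestScore = score;
      bestCandidate = { ...candidate, corrected, score };
    }
  }
  return bestCandidate;
}

const best = rerank(
  [
    { transcript: "A P I integration failed", confidence: 0.87 },
    { transcript: "API integration failed", confidence: 0.85 },
  ],
  ["API"]
);
console.log(best.corrected); // "API integration failed"
```

Note that both candidates converge to the same corrected text -- the reranker's job is just to pick the one whose evidence (confidence, term hits, correction bonus) is strongest.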
🧠 Session Dictionary: The System That Learns During Meetings
I also built a session dictionary that picks up patterns as the meeting goes on.
If the system confirms the same correction multiple times (raw → corrected), it stores that pair locally. After a confidence threshold, it starts applying the correction instantly -- no LLM needed.
This helps enormously with repeated terms. If someone says "CYP3A4" twenty times in a pharma meeting, the first one might get mangled. By the third or fourth, the session dictionary catches it automatically.
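The mechanism above can be sketched as a small class. The name `SessionDictionary` and the threshold of three confirmations are illustrative choices, not VORA's exact values:

```javascript
// Session dictionary: learns raw -> corrected pairs during a meeting
// and applies them instantly once confirmed often enough.
class SessionDictionary {
  constructor(threshold = 3) {
    this.threshold = threshold;
    this.counts = new Map();  // "raw->corrected" key -> confirmation count
    this.learned = new Map(); // raw -> corrected, once past the threshold
  }

  // Called each time the pipeline confirms a correction.
  confirm(raw, corrected) {
    if (raw === corrected) return;
    const key = `${raw}->${corrected}`;
    const count = (this.counts.get(key) || 0) + 1;
    this.counts.set(key, count);
    if (count >= this.threshold) this.learned.set(raw, corrected);
  }

  // Apply every learned correction instantly -- no LLM involved.
  apply(text) {
    let out = text;
    for (const [raw, corrected] of this.learned) {
      out = out.split(raw).join(corrected);
    }
    return out;
  }
}

const dict = new SessionDictionary();
for (let i = 0; i < 3; i++) dict.confirm("log pee", "LogP");
console.log(dict.apply("the log pee value looks high"));
// "the LogP value looks high"
```

Before the third confirmation, `apply` is a no-op for that pair -- which is exactly the "first one might get mangled" behavior described above.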
💡 Why Not Just Send Everything to an LLM?
Two words: latency and quota.
Real-time meeting transcription generates a lot of final segments per minute. If every segment triggers a remote LLM call, I'd blow through rate limits in minutes and introduce visible lag that makes the app feel broken.
My approach:
- Local reranking + dictionary first (free, instant)
- External AI correction only when needed (expensive, slower)
This cuts unnecessary API traffic while keeping quality high where semantic reasoning actually matters. It's like having a spell checker handle the obvious stuff so the editor can focus on the hard problems.
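The "only when needed" gate can be as simple as a heuristic like the one below. The function name, thresholds, and `dictionaryHit` flag are my assumptions for illustration, not VORA's production values:

```javascript
// Illustrative gate: escalate a segment to the LLM only when the
// local layer looks unsure. All thresholds here are assumptions.
function needsLLMCorrection(segment) {
  // High acoustic confidence plus a dictionary hit: local result is fine.
  if (segment.confidence >= 0.85 && segment.dictionaryHit) return false;
  // Very short utterances rarely benefit from semantic correction.
  if (segment.transcript.trim().split(/\s+/).length < 3) return false;
  // Otherwise: low confidence or unrecognized terms -> escalate.
  return segment.confidence < 0.7 || !segment.dictionaryHit;
}

console.log(needsLLMCorrection({
  transcript: "ok", confidence: 0.5, dictionaryHit: false,
})); // false -- too short to be worth a call

console.log(needsLLMCorrection({
  transcript: "deploy the API today", confidence: 0.6, dictionaryHit: false,
})); // true -- low confidence, no local evidence
```

The point is that the expensive path is opt-in per segment, so quota burn scales with uncertainty rather than with meeting length.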
🎯 Domain Personas Make AI Corrections Smarter
When I do call an LLM, the prompt context is domain-specific. A pharma meeting prompt looks different from an IT/software prompt. Same tokens can mean very different things depending on the field.
I also include a short rolling context window from recent corrected utterances. This helps the model understand what the meeting is about, not just what the current sentence sounds like.
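A rolling window like that is just a bounded buffer of recent corrected utterances rendered into the prompt. This sketch is my illustration -- the window size of five and the prompt wording are assumptions:

```javascript
// Rolling context window: keep the last N corrected utterances and
// render them as a prompt preamble so the LLM knows the meeting topic.
function makeContextWindow(maxUtterances = 5) {
  const recent = [];
  return {
    push(utterance) {
      recent.push(utterance);
      if (recent.length > maxUtterances) recent.shift(); // drop the oldest
    },
    toPrompt() {
      return recent.length
        ? `Recent meeting context:\n${recent.join("\n")}`
        : "";
    },
  };
}

const context = makeContextWindow(2);
context.push("We measured LogP for the lead compound.");
context.push("IC50 values came back this morning.");
context.push("PCR results are still pending.");
console.log(context.toPrompt());
// Only the last two utterances survive in the window.
```

Keeping the window short matters: it bounds prompt size (and therefore cost) while still anchoring the model to the current topic.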
📊 The Numbers
In internal tests on mixed Korean-English technical meeting audio:
- N-best reranking only: ~8% relative WER reduction vs top-1
- Local dictionary: additional ~15% reduction in domain term errors
- LLM correction: additional ~12% reduction in semantic errors
- Session learning (after warm-up): additional ~6% reduction
The biggest wins came from domain-heavy meetings. In casual conversation, the improvement was smaller -- which makes sense. If everyone's talking about lunch plans, you don't need domain reranking.
💡 What I Took Away
If your speech pipeline supports alternatives, use them. Don't throw away the extra candidates just because nobody else uses them.
A small amount of domain configuration -- a dictionary, some priority terms, a quick reranking pass -- improves accuracy more than most people expect. And it's infinitely cheaper than adding more model calls.
N-best reranking became the first gate in my correction pipeline: fast, local, and surprisingly effective for such a simple idea.
2026.02.08
This post covers the pre-AI correction layer. For what happens when the AI correction itself goes rogue -- rewriting normal speech, doubling prompts, and matching too broadly -- see How I Fixed AI Over-correction.
Written by
Jay
Licensed Pharmacist · Senior Researcher
Building production-grade AI tools across medicine, finance, and productivity — without a CS degree. Domain expertise first, code second.