The VORA Overhaul: Dropping Real-Time Q&A, Building Human-in-the-Loop Memos, and a Three-Column Layout
Seven commits, nineteen bugs fixed, one layout that doubled AI chat space. How VORA went from automated Q&A answering to a human-in-the-loop memo system — and why that was the right call.
Series: VORA B.LOG
- 1. Why I shipped VORA before writing a single line of backend code
- 2. From Python Server to Pure Browser: The Architecture Pivot That Changed Everything
- 3. The Whisper WASM Experiment: Why Browser AI Is Harder Than It Looks
- 4. Why We Killed Speaker Identification (And What We Learned from Two Weeks of Failure)
- 5. Building an N-Best Reranking Layer for Better Korean STT (Without Extra API Calls)
- 6. Building the Priority Queue: How We Stopped Gemini API Chaos — and Why the First Two Designs Both Failed
- 7. Groq Dual-AI Integration: Why I Added a Second AI and What It Actually Fixed
- 8. The Meeting Summary Timer Bug: Why setInterval Isn't Enough for Reliable Scheduling
- 9. Building a Real Meeting Export: From Raw Transcript to a Usable Report
- 10. The Dark Theme Redesign: Building a UI That Looks Like a Professional Tool (After It Looked Like a Hobbyist Project)
- 11. The Branding Journey: From a Functional Name to VORA
- 12. How We Made VORA Bilingual Without a Heavy Localization Stack
- 13. Deploying to Cloudflare Pages: Static Hosting, CORS Headers, and the Sitemap/Robots Incident
- 14. How I Fixed AI Over-correction
- 15. The VORA Overhaul: Dropping Real-Time Q&A, Building Human-in-the-Loop Memos, and a Three-Column Layout ← you are here
VORA shipped with a panel that was supposed to detect questions in real-time and send them to Gemini for an answer. It was one of the first things I designed. It was also one of the first things that had to go — not because the idea was wrong, but because the technical foundations it required weren't there.
This is the story of what replaced it, and why the original concept is still worth finishing someday.
❌ Removing Real-Time Q&A
The Q&A panel lived in VORA's center column: question detection on top, Gemini answers below. Every detected question was supposed to trigger a Gemini call and surface an answer before the meeting moved on. Gemini's RPM (requests-per-minute) limits were one problem. The deeper problem was that the whole thing ran on assumptions that the real world doesn't satisfy.
The concept was sound. The implementation wasn't — at least not yet.
Detecting whether a sentence is a question requires processing transcription in near-real-time. Two constraints hit at once: the latency of the STT pipeline, and the hardware setup in the room. The second one is the harder problem.
There's also a tolerance gap that makes Q&A fundamentally different from transcription. Live transcription can afford to get words wrong — the continuous feedback loop preserves the overall flow, context fills the gaps. Q&A recognition can't. A single misrecognized word flips the entire meaning. "When do we ship?" and "Why don't we ship?" are nearly the same phoneme sequence under enough noise. They produce opposite answers. The quality bar is orders of magnitude higher, and it has to be cleared in real-time.
This feature only makes sense if every participant is wearing their own microphone — clean, isolated audio per speaker, no bleed, no ambiguity about who's talking. In a typical meeting with a shared mic or a laptop sitting in the middle of a table, the transcription is too messy to classify individual utterances reliably. By the time a sentence is transcribed, segmented, classified, and a Gemini call completes, the conversation has moved on. The panel existed in the UI but the infrastructure assumptions underneath it didn't hold.
The removal was surprisingly clean. All Q&A DOM access in the JavaScript was already written with optional chaining, so deleting the HTML elements caused zero errors. I'd like to take credit for that foresight. Realistically, I write everything with optional chaining because I'm moving fast and it's just habit. The CSS collapsed from three equal columns to two.
/* Before */
.main-content { grid-template-columns: repeat(3, 1fr); }
/* After: transcription gets more space */
.main-content {
grid-template-columns: 1.45fr 1fr;
gap: 24px;
}

🏷️ Human-in-the-Loop Memo System
The freed-up space got replaced with something better: a system where humans make the judgment calls, and the AI acts on them.
Quick Tags. Every transcript line gets five tag buttons: 🔥 Important, ✅ Decision, 📋 Action, ⚠️ Issue, 💬 Memo. One click tags a line. Shift+click filters the transcript to show only that tag.
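The tag-and-filter mechanics can be sketched as two pure functions. This is a minimal illustration, not VORA's actual code — the function and field names (`toggleTag`, `filterByTag`, `lineId`) are hypothetical:

```javascript
// Hypothetical sketch: tag state as a Map from transcript-line id to a tag.
const TAGS = ['important', 'decision', 'action', 'issue', 'memo'];

function toggleTag(tagMap, lineId, tag) {
  // One click tags a line; clicking the same tag again clears it.
  const next = new Map(tagMap);
  if (next.get(lineId) === tag) next.delete(lineId);
  else next.set(lineId, tag);
  return next;
}

function filterByTag(entries, tagMap, tag) {
  // Shift+click: show only transcript entries carrying this tag.
  return entries.filter((e) => tagMap.get(e.id) === tag);
}
```

Keeping the state in a plain Map and the filter pure means the UI layer only has to re-render a filtered list, which is also what makes shift+click cheap to wire up.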
Memo-Aware Summaries. Tagged data flows into the Gemini summary prompt — not just as raw text, but with explicit rules for what to do with each tag type:
tagSection = `
[User Memos & Tags — participant's real-time judgment, apply first]
${taggedData.text}
---Rules---
✅ Decision → include in "Decisions" section
📋 Action → include in "Action Items" section
🔥 Important → prioritize at top of summary
⚠️ Issue → call out in "Open Issues / Risks"
💬 Memo → use to enrich relevant discussion points`;

The model doesn't just see your tags. It has a brief for what each one means.
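The `${taggedData.text}` interpolation implies the tagged lines get flattened into a plain-text block before they reach the prompt. A hedged guess at what that assembly could look like — `buildTaggedText` and the entry field names are assumptions, not VORA's real code:

```javascript
// Hypothetical helper: flatten tagged transcript lines into the text block
// interpolated into the summary prompt.
const TAG_LABELS = {
  important: '🔥 Important',
  decision: '✅ Decision',
  action: '📋 Action',
  issue: '⚠️ Issue',
  memo: '💬 Memo',
};

function buildTaggedText(entries) {
  return entries
    .filter((e) => e.tag) // untagged lines stay out of the prompt
    .map((e) => `[${TAG_LABELS[e.tag]}] (${e.time}) ${e.text}`)
    .join('\n');
}
```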
Audio Timestamp Sync. Every transcript entry gets a ▶ button that scrubs to that exact point in the recorded audio. Pause history is accumulated to compute the correct offset even after multiple pause/resume cycles:
getPauseAwareOffset() {
const totalPausedMs = this.pauseHistory.reduce((sum, p) => {
const end = p.end || Date.now();
return sum + (end - p.start);
}, 0);
return Math.max(0, (Date.now() - this.recordingStartTime) - totalPausedMs);
}

📝 Free Memo Input
Tagging works, but it requires finding the right transcript line first. Sometimes you just want to type a thought immediately. A persistent input bar handles that — Enter saves it with a timestamp and audio offset, and the memo gets injected into the next Gemini summary automatically. No searching, no clicking transcript items.
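Conceptually, saving a free memo just captures the text plus two timestamps: wall-clock time and the pause-aware audio offset at the moment Enter was pressed. A minimal sketch, with a hypothetical `makeMemo` name and the clock injected for testability:

```javascript
// Hypothetical sketch: a free memo records its creation time and the
// pause-aware audio offset at the moment Enter was pressed.
function makeMemo(text, audioOffsetMs, now = Date.now()) {
  const trimmed = text.trim();
  if (!trimmed) return null; // reject empty memos (one of the nineteen bugs)
  return { text: trimmed, audioOffsetMs, createdAt: now };
}
```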
✨ Markdown Renderer
Gemini's summaries use headers, bold text, and bullet points. Previously those were escaped as plain text and rendered literally — ## and ** visible on screen. The fix was a custom renderMarkdown() function: XSS-escape everything first, then parse line by line for headings, lists, and inline formatting.
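The escape-first ordering is the important part. A stripped-down sketch of the idea — VORA's actual `renderMarkdown()` handles more syntax than this, and `renderMarkdownSketch` is my name, not the app's:

```javascript
// Minimal sketch of the escape-then-parse approach. Escaping happens FIRST,
// so user-supplied text can never inject live HTML tags.
function escapeHtml(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function renderMarkdownSketch(src) {
  return escapeHtml(src)
    .split('\n')
    .map((line) => {
      if (line.startsWith('## ')) return `<h3>${line.slice(3)}</h3>`;
      if (line.startsWith('- ')) return `<li>${line.slice(2)}</li>`;
      // inline bold: **text** → <strong>text</strong>
      return `<p>${line.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')}</p>`;
    })
    .join('');
}
```

Parsing after escaping means the only HTML in the output is the HTML the renderer itself emits.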
Applied only to the meeting summary panel. The AI chat sidebar stays plain text — markdown in a narrow column looks cluttered, not helpful.
🐛 Nineteen Bugs
New features, new bugs. Nineteen of them. I'm choosing to call this thorough documentation rather than a rough week.
The critical ones:
- Audio player didn't play immediately after stopping a recording
- Shift+click tag filter fired inconsistently
- Memo richness score reset on every new summary generation
- Pause/resume audio offset accumulated incorrectly
- Summary status badge got stuck on "Generating"
- Chat input duplicated Enter key submissions
- Groq fast-correction mode overwrote original text with an empty result on API failure
Plus twelve more in the medium/low range: animation flicker on memo list, badge styling artifacts, empty memos being saved without validation.
📐 Three-Column Layout
The two-column layout had a space problem. The right column stacked Meeting Summary above AI Chat vertically. Summary took 60–70% of that column. AI Chat got what was left.
Before:
Right column (43% of screen)
├── Meeting Summary 60–70%
└── AI Chat 30–40% ← too small

That put AI Chat at roughly 15% of total screen area. Answers appeared in a narrow strip that required constant scrolling.
The fix: three independent columns. Transcription left, Meeting Summary center, Memo+Chat combined right.
/* Before */
.main-content { grid-template-columns: 1.3fr 1fr; }
/* After */
.main-content {
grid-template-columns: 1fr 1.6fr 1.2fr;
gap: 16px;
}

Inside the third column, the memo section is fixed-height at the top — input bar plus two lines of recent memos. The rest goes entirely to AI Chat. Result: Chat went from ~15% to ~26% of screen area. In practice the visible response area roughly doubled.
Zero JavaScript changes. Every DOM lookup in VORA uses document.getElementById(), so moving elements around in HTML doesn't break anything — same IDs, same references. The one removed button was already wrapped in a null check.
// already null-safe before the refactor
if (el.chatToggleBtn) {
el.chatToggleBtn.addEventListener('click', () => { ... });
}

What Actually Mattered
Replacing the Q&A panel with Quick Tags — one-click human judgment that feeds directly into the AI's summarization context — made the summaries noticeably better. The model was working from explicit signals instead of guessing what mattered from a wall of transcript text.
Layout turns out to matter more than expected. Going from 15% to 26% screen area sounds like a modest stat. Sitting in front of it feels like a completely different application.
The Lesson Nobody Tells You About SaaS
Building a usable SaaS product is harder than it looks. Not in the "writing code is hard" sense — that part is manageable. Hard in the sense that the feature you most want to build is often the one you can't ship yet.
Real-time Q&A detection was the feature I was most excited about when I started VORA. AI listens to the meeting, detects a question, answers it before anyone has to pause. The concept still makes complete sense. But it rests on an assumption most meetings don't satisfy: every participant with their own microphone. Without that, you get mixed audio that's hard enough to transcribe correctly, let alone parse for intent in real-time. Stack the STT latency on top, and there's no path to making it feel instantaneous.
So it got cut. Not because the idea was wrong. Because the tech isn't there yet.
The Q&A engine isn't gone. It's in the backroom. The plan is to rebuild it in the lab — better hardware assumptions, smarter buffering, a different approach to latency — until it earns its place back in the main app. That might take a while. The work doesn't stop.
2026.03.04
For how VORA corrects STT transcription errors before any of this reaches the UI, see How I Fixed AI Over-correction.
Written by
Jay
Licensed Pharmacist · Senior Researcher
Building production-grade AI tools across medicine, finance, and productivity — without a CS degree. Domain expertise first, code second.
About the author →