A Build-Pipeline Checklist for Catching AI-Fabricated Research Citations

AI pulls literature data fast. AI also pulls literature data confidently when none of it exists. Both behaviors come from the same place: the model has no built-in way to be sure that the PMID it just typed actually points to a paper that contains the sentence it just quoted.

For a quick read on a coffee break, this is fine — you eyeball the output and move on. For anything that lands in a report (a regulatory filing, an internal memo, a literature review that someone is going to defend in a meeting), "eyeball and trust" is malpractice. You need a check that runs every time the data ships, costs almost nothing, and refuses to ship if the quote isn't where it claims to be.

That check exists. It's a handful of lines of pseudo-code. Here's the recipe, plus the two extra layers that cover the cases where byte-match alone won't save you.

What this guide assumes: you're asking an LLM (Claude, GPT, Gemini, doesn't matter which) to gather data with citations — typically { value, source_url, excerpt } triples — and stash them in a JSON or YAML file. If your AI workflow doesn't produce structured citations, half the win here doesn't apply, but the framing in §3 still does.
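For concreteness, here is a minimal sketch of what one such record might look like in TypeScript, with a hand-rolled guard for data parsed from JSON or YAML. The names `Source`, `CitedField`, `isSource`, and `isCitedField` are illustrative assumptions, and the field names follow the snippets later in this post (`url` rather than `source_url`); rename them to match your own files.

```typescript
// Hypothetical shape of one AI-collected field. Field names are assumptions.
interface Source {
  url: string      // where the claim supposedly comes from
  excerpt: string  // verbatim quote that should appear at that URL
}

interface CitedField {
  value: number | string
  unit?: string
  sources: Source[]
}

// Narrowing check for parsed JSON/YAML, where everything starts as `unknown`.
function isSource(x: unknown): x is Source {
  if (typeof x !== "object" || x === null) return false
  const { url, excerpt } = x as Record<string, unknown>
  return (
    typeof url === "string" && url.length > 0 &&
    typeof excerpt === "string" && excerpt.trim().length > 0
  )
}

function isCitedField(x: unknown): x is CitedField {
  if (typeof x !== "object" || x === null) return false
  const { value, unit, sources } = x as Record<string, unknown>
  const valueOk = typeof value === "number" || typeof value === "string"
  const unitOk = unit === undefined || typeof unit === "string"
  // A field with no sources, or with a blank excerpt, has nothing to verify.
  return valueOk && unitOk && Array.isArray(sources) &&
    sources.length > 0 && sources.every(isSource)
}
```

A guard like this only checks shape. It says nothing about whether the excerpt is real, which is the job of the check in the next section.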

1. The trick: byte-match the excerpt against the live source URL

The whole insight in one sentence: before you ship a dataset, fetch every cited URL and check that the quoted excerpt is literally a substring of the response body.

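// stripHtml is a helper you supply: it removes tags and collapses whitespace
for (const source of dataset.sources) {
  const res = await fetch(source.url)
  // fetch() does not reject on 404/500, so check the status explicitly
  if (!res.ok) throw new Error(`Provenance fail (HTTP ${res.status}): ${source.url}`)
  const stripped = stripHtml(await res.text())
  // the excerpt must appear verbatim in the tag-stripped body
  if (!stripped.includes(source.excerpt)) {
    throw new Error(`Provenance fail: ${source.url}`)
  }
}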

That's it. If the AI hallucinated the PMID, the URL will 404 or return a "no results" page that doesn't contain your excerpt. If it hallucinated the excerpt (URL real, quote invented), the substring check fails. Either way the build dies before the data reaches anyone who can be misled by it.
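The loop above calls `stripHtml` without defining it. A minimal sketch of that helper follows, under some loud assumptions: it is a regex pass rather than a real HTML parser, it decodes only a handful of entities, and `containsExcerpt` (a hypothetical name) relaxes the strict byte-match to a whitespace-normalized match so that line wrapping in the source does not cause false failures.

```typescript
// Small entity table; a real page may need many more (assumption: these cover the common cases).
const ENTITIES: Record<string, string> = {
  "&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'", "&nbsp;": " ",
}

function stripHtml(html: string): string {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ") // drop non-visible blocks
    .replace(/<!--[\s\S]*?-->/g, " ")                          // drop comments
    .replace(/<[^>]+>/g, " ")                                  // drop remaining tags
    // Decode entities only after tags are gone, so a decoded "<" is never mistaken for a tag.
    .replace(/&(amp|lt|gt|quot|#39|nbsp);/g, (m) => ENTITIES[m] ?? m)
    .replace(/\s+/g, " ")
    .trim()
}

function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim()
}

// True when the excerpt appears in the visible text of the page.
function containsExcerpt(body: string, excerpt: string): boolean {
  const needle = normalize(excerpt)
  // An empty needle would match every page, so treat it as a failure.
  return needle.length > 0 && stripHtml(body).includes(needle)
}
```

One trade-off to be aware of: replacing tags with a space keeps words in adjacent blocks from being glued together, but it also splits text around inline tags (`H<sub>2</sub>O` becomes `H 2 O`). If your sources lean on inline markup, a proper parser is the safer choice.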

[AI writes data]            [Build pipeline]
 │                           │
 ├─ value: 410.5             ├─ for each source:
 ├─ url:   pubmed/35298711   │     fetch URL
 └─ excerpt: "IDR <0.1..."   │     strip HTML
                             │     check excerpt ∈ body
                             │
                             ├─ all match? → ship it
                             └─ any miss?  → block deploy
                                              ↑
                                     ← report-grade trust

There's nothing clever here. It's the most obvious check in the world. The reason most AI data pipelines don't do it is that the test happens at the wrong layer — inside the prompt instead of inside the build.

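Programmatic fetch tip: for PubMed-cited fields, the official Entrez E-utilities API returns structured XML (or JSON on some endpoints, such as ESummary) with no JavaScript rendering involved. That's faster and more reliable than scraping the user-facing PubMed page. For DOI-based citations, the Crossref REST API returns canonical bibliographic metadata (title, authors, journal, year) that you can match against the AI's bibliographic claims directly. It does not serve the article's full text, so checking a quoted excerpt generally still requires the abstract or the landing page.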
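As a rough sketch of that tip, the helpers below turn a cited PubMed page URL or a DOI into the corresponding API URL. The function names are hypothetical; the endpoints are the public EFetch and Crossref `works` routes. Check NCBI's and Crossref's usage guidelines for rate limits and API keys before running this in CI.

```typescript
// Extracts the PMID from a PubMed page URL such as https://pubmed.ncbi.nlm.nih.gov/35298711/
function pmidFromUrl(url: string): string | null {
  const m = /pubmed\.ncbi\.nlm\.nih\.gov\/(\d+)/.exec(url)
  return m ? m[1] : null
}

// EFetch URL that returns the PubMed record (including the abstract, when present) as XML.
function efetchUrl(pmid: string): string {
  const id = encodeURIComponent(pmid)
  return `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=${id}&retmode=xml`
}

// Crossref metadata URL for a DOI; the DOI is URL-encoded because it contains a slash.
function crossrefUrl(doi: string): string {
  return `https://api.crossref.org/works/${encodeURIComponent(doi)}`
}
```

The provenance check can then fetch the API URL instead of the page URL and run the same substring test against the response body.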

Why prompt-level checking is weaker: asking the model "are you sure this PMID is real?" gets you a confident "yes" because that's what the model is best at. Byte-match doesn't ask. It looks.

2. Run it where humans aren't watching: the build pipeline

The check has to be automatic and unskippable. The easiest place to wire that is the build that produces your deploy artifact.

For a static site, that's npm run build. For a serverless function, it's the deploy hook. For a notebook, it's the cell that exports to PDF. The principle is the same: no excerpt verification, no artifact.

A practical sequence:

git commit
   ↓
git push
   ↓
CI runs `npm run validate`     ← schema check (Zod, pydantic, etc.)
CI runs `npm run provenance`   ← byte-match every source
CI runs `npm run build`        ← only if both pass
   ↓
Deploy

Two practical concerns surface immediately, and both have answers.

"Upstream is flaky: PubMed 503'd once and broke my deploy." Cache the response body keyed by URL hash, with a staleness window (30 days is sane for academic citations). Inside the window, verify against the cached body. Once an entry goes stale, refetch and treat the fresh response as authoritative; if that fetch fails, fall back to the stale cache only when an explicit VALIDATION_OVERRIDE=1 env flag is set. Commit the cache to git so the build is reproducible from a checkout months later, even if the URL goes dead.

"Build minutes are expensive." Cap concurrency (8 parallel fetches is plenty), cache aggressively, and accept that cold builds run a couple of minutes longer than warm ones. The minutes you spend on this are cheaper than the meeting where someone realizes a cited paper doesn't exist.

Real example: in the LAI scoring tool I shipped at lai.vibed-lab.com, one candidate had ~22 cited URLs across 18 axes. Cold build with all-fresh fetches: under 30 seconds at concurrency 8. Warm build hitting the cache: under 3 seconds. The cost is invisible.
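The two mitigations in this section, a cache with a staleness window and a concurrency cap, can be sketched as a few small functions. This is one reading of the policy described above, not the code from the tool itself: `CacheEntry`, `isFresh`, `chooseBody`, and `mapWithConcurrency` are hypothetical names, the cache is left abstract (no file I/O or URL hashing shown), and the override flag is passed in as a boolean rather than read from the environment.

```typescript
interface CacheEntry {
  body: string
  fetchedAt: number // epoch milliseconds when the body was fetched
}

const DAY_MS = 24 * 60 * 60 * 1000

// True when the entry is inside the staleness window and can be used without a fetch.
function isFresh(entry: CacheEntry | undefined, now: number, maxAgeDays = 30): boolean {
  return entry !== undefined && now - entry.fetchedAt <= maxAgeDays * DAY_MS
}

// Picks the body to verify against once a fetch attempt has finished.
// `fetched` is null when the upstream request failed (timeout, 5xx, and so on).
function chooseBody(fetched: string | null, entry: CacheEntry | undefined, allowStale: boolean): string {
  if (fetched !== null) return fetched                      // a fresh fetch is authoritative
  if (entry !== undefined && allowStale) return entry.body  // explicit, opt-in fallback
  throw new Error("Upstream unavailable and stale-cache fallback is not enabled")
}

// Runs `worker` over `items` with at most `limit` calls in flight; results keep input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new Error("limit must be at least 1")
  const results: R[] = new Array(items.length)
  let next = 0
  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++
      results[i] = await worker(items[i], i)
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, () => lane()))
  return results
}
```

In a provenance script, a call along the lines of `mapWithConcurrency(sources, 8, verifyOne)` would apply the cap, where `verifyOne` is a hypothetical function that consults `isFresh`, fetches if needed, and passes the outcome through `chooseBody` before running the excerpt check.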

3. What byte-match catches and what it doesn't

This is the part where most guides stop and you walk away thinking the problem is solved. It isn't, fully. Byte-match has a known coverage map:

Failure mode	Byte-match catches it?
Hallucinated PMID / URL	✅ URL 404 or "not found" page → excerpt not in body
Hallucinated excerpt (URL real, quote invented)	✅ exact match fails
Paraphrased excerpt instead of verbatim	❌ build fails because the excerpt is not literally in the source, even when the paraphrase is accurate
Source URL silently mutated (paywall added, content rewritten)	⚠ catches if excerpt now missing, misses if still present in altered context
Source is a JS-rendered SPA (excerpt in DOM, not raw HTML)	❌ unless you render with a headless browser
Citation context is wrong (right quote, wrong meaning)	❌ can't catch — semantic, not literal
Domain-specific judgment ("is this study population relevant?")	❌ never going to catch — that's a human call

The honest framing is: byte-match is the cheapest layer that catches the dumbest failures. It removes fabricated URLs and invented quotes, which is most of what hurts AI-collected data. It does not eliminate misinterpretation, which is the next problem up the food chain.

For the gaps, you need two more layers.

4. Layer 2: cross-source verification

Force the AI to cite at least two independent sources for every numeric field. If they disagree, list both values and lower the confidence score on that field.

{
  value: 410.5,
  unit: "g/mol",
  sources: [
    { url: "pubchem.../5073", excerpt: "Molecular Weight 410.5 g/mol" },
    { url: "drugbank.../DB00734", excerpt: "Molecular Weight: 410.5" }
  ],
  confidence: 0.99   // high because 2+ sources agree; single-source fields stay below 0.7, disagreeing sources at or below 0.5
}

This is cheap (just a prompt instruction) and it catches a different class of error than byte-match: the case where one source is wrong and the AI happens to pick that one. With two sources you'd see the disagreement, lower the confidence, and surface it for human review.

How to prompt the AI: "For every numeric field, query at least two independent sources. Quote each verbatim in excerpt. If the values disagree, list all sources and set confidence ≤ 0.5. Do not synthesize a 'best estimate' across conflicting sources without flagging it."
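The prompt asks the model to set the confidence itself, but the same rule is easy to recompute at build time so the model's own number is never trusted. A minimal sketch, assuming each source carries the numeric value it reports: `SourcedValue` and `scoreAgreement` are hypothetical names, and the levels 0.99, 0.6, and 0.5 are placeholders chosen only to respect the rules stated in this post (disagreement at or below 0.5, single-source below 0.7, agreement high as in the example record).

```typescript
interface SourcedValue {
  url: string
  value: number // the number this particular source reports
}

interface Agreement {
  confidence: number
  values: number[] // distinct values seen, so disagreements can be listed side by side
}

function scoreAgreement(sources: SourcedValue[], relTol = 1e-6): Agreement {
  if (sources.length === 0) throw new Error("A numeric field needs at least one source")

  // Collapse values that are equal within a relative tolerance.
  const distinct: number[] = []
  for (const { value } of sources) {
    const seen = distinct.some(
      (v) => Math.abs(v - value) <= relTol * Math.max(Math.abs(v), Math.abs(value)),
    )
    if (!seen) distinct.push(value)
  }

  // Two entries with the same URL count as one source.
  const distinctUrls = new Set(sources.map((s) => s.url)).size

  if (distinct.length > 1) return { confidence: 0.5, values: distinct }  // sources disagree
  if (distinctUrls < 2) return { confidence: 0.6, values: distinct }     // effectively single-source
  return { confidence: 0.99, values: distinct }                          // 2+ sources agree
}
```

Distinct URLs are only a weak proxy for independence: two pages that copy the same upstream database will agree with each other and still be wrong together. That judgment stays with the human reviewer.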

5. Layer 3: human verification queue (semantic gate)

Some fields will never be machine-verifiable: qualitative scores, "is this drug indication chronic?", "is this IP claim still active?". For those, the AI's job is to flag, not decide.

The pattern is a needs_human_review[] array on every record. The AI populates it automatically when:

The field is qualitative (no numeric ground truth exists)
confidence < 0.7 (single source or disagreeing sources)
The field would trigger a downstream gate or threshold (deserves a sanity check before it filters anything out)

The human-facing UI shows these in an inbox view, lets a person spot-check the cited URL in a new tab, and toggles verified / disputed on the record. The point is the AI never claims a field is verified — only the human can do that.
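The three triggers and the human-only toggle described above can be sketched as follows. All type and function names here (`FieldRecord`, `flagForReview`, `humanToggle`) are hypothetical, and the `feedsGate` flag is an assumed way of marking fields that a downstream threshold reads.

```typescript
type ReviewStatus = "unreviewed" | "verified" | "disputed"

interface FieldRecord {
  name: string
  kind: "numeric" | "qualitative"
  confidence: number
  feedsGate: boolean            // true if a downstream gate or threshold reads this field
  needs_human_review: string[]  // reasons, filled in automatically
  status: ReviewStatus          // changed only by humanToggle below
}

// Applies the three triggers from the list above and starts every record as "unreviewed".
function flagForReview(f: Omit<FieldRecord, "needs_human_review" | "status">): FieldRecord {
  const reasons: string[] = []
  if (f.kind === "qualitative") reasons.push("qualitative field")
  if (f.confidence < 0.7) reasons.push("confidence below 0.7")
  if (f.feedsGate) reasons.push("feeds a downstream gate")
  return { ...f, needs_human_review: reasons, status: "unreviewed" }
}

// The only way out of "unreviewed"; it requires a named human reviewer.
function humanToggle(
  f: FieldRecord,
  status: "verified" | "disputed",
  reviewer: string,
): FieldRecord & { reviewer: string } {
  if (reviewer.trim() === "") throw new Error("A human reviewer must be named")
  return { ...f, status, reviewer }
}
```

Keeping `status` out of `flagForReview`'s input type is the design point: the collector can add reasons to the queue, but nothing it produces can arrive already marked as verified.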

The three layers cascade like this:

[Build pipeline]         [Authoring time]         [Browse/review time]
 ↓                         ↓                         ↓
Byte-match                Cross-source              Human inbox
"is the quote real?"      "do sources agree?"       "is the meaning right?"
 ↓                         ↓                         ↓
Catches fabrication        Catches single-source     Catches misinterpretation
                           error

Each layer protects against a different failure class. Skipping any one means a different category of bad data slips through. Skipping all three means you have AI-flavored vibes, not data.

6. Cheatsheet

You want...	Easiest tool
Block hallucinated URLs/PMIDs	Byte-match excerpt vs fetched body in build
Survive flaky upstream sources	Cache response, 30-day staleness, override flag
Catch single-source errors	Force ≥2 independent citations per numeric field
Catch domain-judgment errors	Human verification inbox; AI flags but doesn't decide
Avoid noisy false positives in CI	Concurrency cap + cache + commit cache to git
Audit a published report later	Tag git SHA + bundle hash in the export footer

7. The 3 takeaways

Don't ask the model "are you sure?". Look. Byte-match the cited excerpt against the live URL at build time. The check is mechanical, the model has no influence over it, and most fabrication dies here.
Stack three layers, because each catches a different failure class. Byte-match → cross-source → human inbox. Skip any layer and a different category of bad data ships.
The AI is a fast collector, not a verifier. Wire the system so the only path from "Claude wrote it" to "this is in the report" passes through a human toggle. The AI's job is to make that toggle as fast as possible, not to remove it.

The full implementation of this pattern (3-layer cascade, build-time provenance with cache, inbox UI) is open source as part of the LAI Score druggability tool. If you want to copy the structure rather than rebuild it, the relevant files are scripts/provenance-check.ts, scripts/CLAUDE.md (the externalized agent protocol), and src/components/VerifyToggle.tsx.


2026.05.08