
Designing Frontends Claude Can Actually Use — A 7-Step Field Guide From the Day My Scoring App Got Audited by Its Own AI

Last Wednesday, Claude navigated my multi-criteria scoring web app, evaluated a sample candidate, flagged a potential methodology bias, worked through a fix with me, ran 305 tests, and shipped the patch to production — all in one afternoon. Here is what made that possible (and what would have made it impossible).

by Jay Lee · 13 min read · Guides
Disclaimer: Educational content about software design and AI-driven frontend usage. The scoring tool referenced is a personal research aid; no specific molecules, products, or proprietary technologies are discussed.


The Day My App Got Audited By a Robot I Invited In

I built a personal scoring web app to evaluate candidate molecules against five formulation platforms. Eighteen axes, five platforms, ninety desirability curves, a Derringer-Suich D aggregator, a Methodology page that admits where the science is squishy. The tool is mine. I am the user.

Then, last Wednesday, I told Claude Code: evaluate Candidate X against Platform B.

What happened next was not "use the website." Claude opened my own browser via Playwright MCP, navigated to the live URL, found that the simulator accepted URL parameters for all 18 axes, hydrated state with one navigation, read the platform score (mid-range, a middling fit), then — and this is where my coffee got cold — opened the local repo, read the platform spec YAML, found the citation files, audited my methodology, flagged that one of the high-weight curves was calibrated to historical-cohort statistics rather than to the platform's actual physicochemical mechanism, and proposed a fix.

I said go. Twenty-eight minutes later, 305 tests pass, schema extended, ninety axis cells labeled with a provenance taxonomy, three UI surfaces updated, commit pushed to main, Cloudflare auto-deploy green.

The scoring app I built to help me think had, in one afternoon, become a tool another mind could use without asking me anything. That mind happened to be silicon. The same affordances would help any human power user, but the AI exposed which decisions were load-bearing and which were just decoration.

So this is a field guide. Seven design choices that turned my app from "scrollable" into "scriptable by an LLM." Every step has a worked example, a what-not-to-do, and a self-test.

Who this is for: anyone shipping a web tool that they want Claude (Chrome extension, Claude Code, Codex, whichever flavor) to use, not just describe. Side-project builders, internal-tool authors, devs who want a real-life QA partner. If you ever wished your power user could read the code, this is the world you build for them.


The Inflection Moment: One URL, Eighteen Values

Before I list the steps, here is the single biggest realization, in one paragraph.

The simulator page accepts URL parameters for every input axis. ?mw=427.8&logp=5.5&solubility=0.01&t_half=96… — eighteen of them, parsed by a Zod schema on mount, hydrated into reactive state. That one feature collapsed an interaction that would have required Claude to find, click, type, blur eighteen separate fields into a single browser_navigate call. Eighteen DOM hunts became one URL. The difference between Claude-as-toddler (poke each button, hope) and Claude-as-pianist (one chord, full state) was a searchParams.get() and a 30-line schema.
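From the agent's side, the "one chord" is just a URL built from the values it already holds. A minimal sketch, assuming a placeholder host (`example.test`) and a hypothetical helper name (`buildSimulatorUrl`); the axis names are the four shown in the example query string above.

```typescript
// Axis values as an agent might hold them before navigating.
type AxisValue = number | string;

// Build one URL that carries the full simulator state.
// `base` is a placeholder; the real host is not part of this sketch.
function buildSimulatorUrl(base: string, axes: Record<string, AxisValue>): string {
  const url = new URL(base);
  for (const [axisName, value] of Object.entries(axes)) {
    // searchParams.set encodes any character that is not URL-safe.
    url.searchParams.set(axisName, String(value));
  }
  return url.toString();
}

const simulatorUrl = buildSimulatorUrl("https://example.test/calculator", {
  mw: 427.8,
  logp: 5.5,
  solubility: 0.01,
  t_half: 96,
});
console.log(simulatorUrl);
// https://example.test/calculator?mw=427.8&logp=5.5&solubility=0.01&t_half=96
```

An agent can pass that string to a single navigation call instead of locating and filling each field in turn.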

Hold that image. Now the steps.


Step 1 — Make Every State Reachable by URL

Rule: every meaningful UI state your tool can be in should be reachable by URL alone. No clicks, no localStorage, no logged-in session, no toast banner that says "please re-enter X."

Why: AI agents (and your future self three weeks later) cannot navigate a stateful flow reliably. They can navigate a URL.

Example (worked): my /calculator route reads ?candidate=demo&mw=427.8&logp=5.5&… via a Zod schema. Each axis has a typed parser. Discrete enums (ip=active|expired|<=5y) are URL-encoded carefully — and I learned the hard way that <=5y is a terrible discrete value choice because %3C%3D5y makes URL parsers cry. Rename to active | expiring | expired. Spare yourself an afternoon.

Anti-pattern: "Click to apply" buttons whose effect is only stored in component state. From outside the app it looks like nothing happened.

Self-test: can a stranger, given just a URL, see the exact same screen you see? If no, fix it before Step 2.
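The hydration side of Step 1 can be sketched without any library. This is a hand-rolled stand-in for a Zod schema, not the app's actual parser: the three axes (`mw`, `logp`, `ip`) come from the example URL, while the defaults, the type names, and `parseSimulatorUrl` are assumptions made for illustration. The one design choice worth copying is that rejected parameters are reported rather than silently dropped.

```typescript
// Hypothetical subset of the simulator state: two numeric axes and one enum.
type IpStatus = "active" | "expiring" | "expired";

interface SimulatorState {
  mw: number;
  logp: number;
  ip: IpStatus;
}

interface ParseResult {
  state: SimulatorState;
  errors: string[]; // rejected params, reported instead of silently dropped
}

// Assumed defaults, for illustration only.
const DEFAULT_STATE: SimulatorState = { mw: 300, logp: 2, ip: "active" };
const IP_VALUES: readonly IpStatus[] = ["active", "expiring", "expired"];

function parseNumberParam(raw: string | null, fallback: number, label: string, errors: string[]): number {
  if (raw === null) return fallback; // param absent: use the default
  const value = Number(raw);
  if (raw.trim() === "" || !Number.isFinite(value)) {
    errors.push(`${label}: "${raw}" is not a finite number`);
    return fallback;
  }
  return value;
}

function parseIpParam(raw: string | null, errors: string[]): IpStatus {
  if (raw === null) return DEFAULT_STATE.ip;
  const match = IP_VALUES.find((v) => v === raw);
  if (match === undefined) {
    errors.push(`ip: "${raw}" is not one of ${IP_VALUES.join(", ")}`);
    return DEFAULT_STATE.ip;
  }
  return match;
}

// Turn a full URL into typed state plus a list of everything that was rejected.
function parseSimulatorUrl(url: string): ParseResult {
  const params = new URL(url).searchParams;
  const errors: string[] = [];
  const state: SimulatorState = {
    mw: parseNumberParam(params.get("mw"), DEFAULT_STATE.mw, "mw", errors),
    logp: parseNumberParam(params.get("logp"), DEFAULT_STATE.logp, "logp", errors),
    ip: parseIpParam(params.get("ip"), errors),
  };
  return { state, errors };
}
```

Surfacing `errors` somewhere visible (a banner, a console warning, a test id) gives an agent a way to notice that its URL was only partly understood.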


Step 2 — Tag Every Interactive Element with a Stable data-testid

Rule: anything Claude (or Cypress, or you-in-three-weeks) might want to find — buttons, inputs, score cells, tabs — gets a data-testid="kebab-case-thing". Stable across releases. Not a Tailwind class, not a generated CSS hash, not "the third <div> under the second card."

Why: AI screen-scraping a brittle DOM is sad. AI selecting [data-testid="desirability-cell"] is cheerful. The DOM is your tool's API for non-human users.

Example (worked): my Methodology page has [data-testid="provenance-breakdown"], [data-testid="provenance-matrix"], [data-testid="axis-refs-cell"]. When I asked Claude to verify the change rendered, the verification was three document.querySelector calls. Twelve seconds.

Anti-pattern: "We use semantic HTML, so we don't need testids." Semantic HTML is great for screen readers, awful for surgical robot precision. You need both.

Self-test: open DevTools. document.querySelectorAll('[data-testid]').length — is it big? Bigger than 30 on a complex page? Good.
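The "three querySelector calls" check from the worked example can be wrapped in a small helper. A sketch under stated assumptions: `byTestId`, `missingTestIds`, and the `QueryRoot` type are hypothetical names, and only the three test ids are taken from the post.

```typescript
// Minimal structural type so the check works with `document` in a browser
// or with a fake root in a unit test.
interface QueryRoot {
  querySelector(selector: string): unknown;
}

// Build the attribute selector for a test id.
function byTestId(id: string): string {
  return `[data-testid="${id}"]`;
}

// Return the ids that are NOT present under the given root.
function missingTestIds(root: QueryRoot, ids: readonly string[]): string[] {
  return ids.filter((id) => root.querySelector(byTestId(id)) === null);
}

// The three ids named in the Methodology example above.
const METHODOLOGY_IDS: readonly string[] = [
  "provenance-breakdown",
  "provenance-matrix",
  "axis-refs-cell",
];
```

In a browser console, `missingTestIds(document, METHODOLOGY_IDS)` returning an empty array would mean all three surfaces rendered.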


Step 3 — Keep the Computation Engine Pure

Rule: the function that turns inputs into outputs — your scoring engine, your renderer, your transformer — is pure. No fetch, no Date.now(), no random, no DOM access. Inputs in, outputs out.

Why: Claude can read pure functions and reason about them. It can hand-verify outputs. It cannot reason about a function that secretly calls an API and gets different results depending on the wind direction.

Example (worked): my aggregate(d, w, excluded, totalAxes) is 25 lines, no side effects. Claude computed the candidate's score by hand from my YAMLs (D = 0.589) and matched it to the displayed 0.58. That confirmed the math was reliable. We could then focus on the methodology question.

Anti-pattern: scoring engines tangled with React state, network fetches, or "if user is logged in, use different weights." Now your "score" is a quantum superposition of context.

Self-test: can the engine run in node --eval? If no, you have a hidden dependency.
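For readers who have not met a Derringer-Suich aggregator, here is what a pure one can look like. This is not the app's `aggregate` function, which is not shown in this post; it is a sketch of the standard weighted geometric mean, D = exp(Σ wᵢ ln dᵢ / Σ wᵢ), with the `totalAxes` argument left out because its role is unknown. The function name, the zero-weight handling, and the "return 0 when nothing is scored" rule are all assumptions.

```typescript
// Weighted geometric mean of per-axis desirabilities d_i in [0, 1].
// Axes whose index is in `excluded` are skipped.
// Any included axis with d_i = 0 and positive weight forces D = 0.
function aggregateSketch(
  d: readonly number[],
  w: readonly number[],
  excluded: ReadonlySet<number> = new Set<number>(),
): number {
  if (d.length !== w.length) throw new Error("d and w must have the same length");
  let weightedLogSum = 0;
  let weightSum = 0;
  for (let i = 0; i < d.length; i++) {
    if (excluded.has(i)) continue;
    const di = d[i];
    const wi = w[i];
    if (di === undefined || wi === undefined) continue; // keeps strict index checks happy
    if (di < 0 || di > 1 || wi < 0) throw new Error(`axis ${i}: need 0 <= d <= 1 and w >= 0`);
    if (wi === 0) continue; // assumption: zero-weight axes do not contribute
    if (di === 0) return 0;
    weightedLogSum += wi * Math.log(di);
    weightSum += wi;
  }
  if (weightSum === 0) return 0; // assumption: nothing left to score
  return Math.exp(weightedLogSum / weightSum);
}
```

Because the function touches nothing but its arguments, the same inputs give the same D every time, which is what makes a hand-check against the YAML values meaningful.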


Step 4 — Make Configuration a File, Not a Database

Rule: the knobs that change your tool's behavior (weights, thresholds, taxonomies, lookup tables) live in version-controlled files — YAML, JSON, TOML. Not in a database, not in environment variables, not in a CMS.

Why: files are grep-able, cat-able, git diff-able. Claude can read them, modify them, and propose a diff. A database is a fortress with a key card.

Example (worked): my five data/platforms/*.yaml files are 130 lines each, fully transparent. When I asked Claude to add a basis: label to all 90 axis cells, it read the files, classified each, wrote five new files. Done. If those values had been in Postgres, the same task would have required an admin UI or raw SQL — and the version history would live in nobody's git log.

Anti-pattern: "We'll make it editable in production, so it goes in the database." Now your config is no longer reproducible, and your AI collaborator is locked out.

Self-test: can your config be cloned, branched, and PR'd? If no, why not?
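Config in files pays off most when the file is validated on load. The platform YAML schema is not shown in this post, so the shape below (`weight`, `basis`, `refs` per axis) is an assumption, as are the underscore spellings of the four basis labels and the name `validatePlatform`. The input is whatever object a YAML or JSON parser hands back.

```typescript
// Assumed taxonomy: the four basis labels named in this post.
const BASIS_LABELS = ["mechanism", "platform_clinical", "class_statistics", "consensus"] as const;
type BasisLabel = (typeof BASIS_LABELS)[number];

interface AxisCell {
  weight: number;
  basis: BasisLabel;
  refs: string[];
}

function isBasisLabel(value: unknown): value is BasisLabel {
  return typeof value === "string" && BASIS_LABELS.some((b) => b === value);
}

// Validate one parsed platform file. Returns the typed cells plus a list of
// problems; it never guesses a label for a cell that lacks one.
function validatePlatform(raw: Record<string, unknown>): { cells: Record<string, AxisCell>; problems: string[] } {
  const cells: Record<string, AxisCell> = {};
  const problems: string[] = [];
  for (const [axis, value] of Object.entries(raw)) {
    if (typeof value !== "object" || value === null) {
      problems.push(`${axis}: expected an object`);
      continue;
    }
    const cell = value as Record<string, unknown>;
    const weight = cell["weight"];
    const basis = cell["basis"];
    const refs = cell["refs"];
    if (typeof weight !== "number" || !Number.isFinite(weight) || weight < 0) {
      problems.push(`${axis}: weight must be a non-negative number`);
      continue;
    }
    if (!isBasisLabel(basis)) {
      problems.push(`${axis}: basis must be one of ${BASIS_LABELS.join(", ")}`);
      continue;
    }
    const refList = Array.isArray(refs) ? refs.filter((r): r is string => typeof r === "string") : [];
    cells[axis] = { weight, basis, refs: refList };
  }
  return { cells, problems };
}
```

A check like this can run in CI, so a mass edit that leaves a cell unlabeled or invents a fifth category fails before it reaches main.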


Step 5 — Provide JSON Export with the Same Shape as Input

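Rule: if your tool computes something, it should be able to export it, and the export format should be importable. Round-trip identity: import(export(state)) equals state, value for value (a literal === on two JavaScript objects compares references, so compare by content).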

Why: Claude can read the export. Claude can construct an import. Now your tool is composable with everything else Claude touches.

Example (worked): my simulator has an Export JSON button that dumps the full FieldEnvelope shape — the same shape my candidate YAMLs use. So "Claude, take this simulator output and create a candidate file" is one transformation, not a translation problem.

Anti-pattern: "Export to PDF" as the only output. PDFs are write-only. You have given the AI a postcard.

Self-test: can you copy the export, modify three values, and paste-import it back? If you have to fight the format, your users do too.
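The round-trip property is easy to state in code. The real FieldEnvelope shape is not shown in this post, so the `Envelope` interface below is a hypothetical stand-in, and `exportState` / `importState` are illustrative names rather than the app's functions.

```typescript
// Hypothetical envelope; not the app's FieldEnvelope.
interface Envelope {
  version: number;
  candidate: string;
  axes: Record<string, number | string>;
}

// Serialize with sorted axis keys so equal states always produce equal text.
function exportState(state: Envelope): string {
  const sortedAxes: Record<string, number | string> = {};
  for (const key of Object.keys(state.axes).sort()) {
    const value = state.axes[key];
    if (value !== undefined) sortedAxes[key] = value;
  }
  return JSON.stringify({ version: state.version, candidate: state.candidate, axes: sortedAxes }, null, 2);
}

// Parse and validate; throw on anything that is not a well-formed envelope.
function importState(text: string): Envelope {
  const raw: unknown = JSON.parse(text);
  if (typeof raw !== "object" || raw === null) throw new Error("expected an object");
  const obj = raw as Record<string, unknown>;
  const version = obj["version"];
  const candidate = obj["candidate"];
  const axes = obj["axes"];
  if (typeof version !== "number") throw new Error("version must be a number");
  if (typeof candidate !== "string") throw new Error("candidate must be a string");
  if (typeof axes !== "object" || axes === null || Array.isArray(axes)) {
    throw new Error("axes must be an object");
  }
  const parsedAxes: Record<string, number | string> = {};
  for (const [key, value] of Object.entries(axes as Record<string, unknown>)) {
    if (typeof value !== "number" && typeof value !== "string") {
      throw new Error(`axis ${key}: unsupported value`);
    }
    parsedAxes[key] = value;
  }
  return { version, candidate, axes: parsedAxes };
}
```

With sorted keys, the round-trip check reduces to a string comparison: `exportState(importState(text)) === text` for any text the exporter produced.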


Step 6 — Build a Methodology Page (Your Own Paper Trail)

Rule: somewhere in your app, list every assumption, threshold, and weight, with the citation or the honest admission of "expert consensus." Yes, this feels excessive. Do it anyway.

Why: when Claude (or you, in three months, or the auditor) wants to challenge a result, the first question is where did this number come from? If the answer is in the app, the conversation is grounded. If the answer is in someone's head, the conversation is a séance.

Example (worked): my /methodology page enumerates all 18×5 = 90 cells, with refs or a consensus: true admission. After this session, each cell also wears a basis: badge — mechanism, platform-clinical, class-statistics, or consensus. When Claude told me "this platform's logP curve is class-statistics, not mechanism, and that's why the candidate looks worse than the underlying chemistry would predict," I could see that on screen. The methodology page is not a documentation page. It is a debate-enabling surface.

Anti-pattern: "Trust me, the numbers are right." You are one personnel change away from a black box.

Self-test: can a reviewer disagree with a specific parameter and point to where it lives in the UI? If no, you don't have a methodology page, you have a marketing page with charts.
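A provenance breakdown is, at bottom, a count of cells per basis label. A minimal sketch, assuming a flat list of cells and underscore spellings for the four labels; `MatrixCell`, `provenanceBreakdown`, and `cellsWithBasis` are hypothetical names, not the app's code.

```typescript
type ProvenanceBasis = "mechanism" | "platform_clinical" | "class_statistics" | "consensus";

interface MatrixCell {
  platform: string;
  axis: string;
  basis: ProvenanceBasis;
}

// Count cells per basis so the page can show how much of the matrix rests on
// mechanism, clinical data, class statistics, or consensus.
function provenanceBreakdown(cells: readonly MatrixCell[]): Record<ProvenanceBasis, number> {
  const counts: Record<ProvenanceBasis, number> = {
    mechanism: 0,
    platform_clinical: 0,
    class_statistics: 0,
    consensus: 0,
  };
  for (const cell of cells) counts[cell.basis] += 1;
  return counts;
}

// List the cells behind one label, e.g. everything a reviewer might challenge
// as "statistics, not mechanism".
function cellsWithBasis(cells: readonly MatrixCell[], basis: ProvenanceBasis): MatrixCell[] {
  return cells.filter((cell) => cell.basis === basis);
}
```

Rendering both the counts and the per-label lists gives a reviewer a specific cell to point at when they disagree.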


Step 7 — Treat Tests as the I/O Contract

Rule: your test suite is the executable specification of how your tool behaves. Claude reads tests faster than it reads docs. Tests don't lie when code drifts. Documentation does.

Why: when Claude proposes a change, the first guard rail is "do the existing tests still pass?" 305 green checks is a stronger statement than "the README still describes it accurately."

Example (worked): my ConsensusFooter test asserts consensus-count reads "54 cells." When I added the new Provenance breakdown section, I had to preserve that contract — meaning the new feature lived above the old section rather than replacing it. The test was the constraint that kept backward compatibility. The contract was machine-readable.

Anti-pattern: "We'll add tests later." Later is a country no one visits. Without tests, every AI change is a roll of the dice.

Self-test: if you delete a feature and run tests, do tests fail loudly? If they pass silently, the feature wasn't actually covered.
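The idea of a test as a machine-readable contract can be shown without a test framework. Everything here is illustrative: `consensusCountLabel` is a hypothetical formatter behind the consensus-count element, and the tiny runner stands in for whatever test tool the project actually uses.

```typescript
// Hypothetical formatter for the text shown in the consensus-count element.
function consensusCountLabel(consensusCells: number): string {
  return `${consensusCells} cells`;
}

interface ContractCheck {
  name: string;
  run: () => boolean;
}

// Run every check and return the names of those that failed or threw.
function failedContracts(checks: readonly ContractCheck[]): string[] {
  return checks
    .filter((check) => {
      try {
        return !check.run();
      } catch {
        return true; // a check that throws counts as a failure
      }
    })
    .map((check) => check.name);
}

// The one contract quoted in the example above.
const UI_CONTRACTS: readonly ContractCheck[] = [
  { name: "consensus-count reads '54 cells'", run: () => consensusCountLabel(54) === "54 cells" },
];

console.log(failedContracts(UI_CONTRACTS)); // []
```

If a later change reworded the label, the list would stop being empty, which is the loud failure the self-test asks for.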


Quick Round-Up Table

| Step | One-line rule | One-question check |
|------|---------------|--------------------|
| 1 | URL hydrates state | Can a stranger reach this screen with a URL? |
| 2 | data-testid everywhere | Is every interactive element findable by stable selector? |
| 3 | Pure engine | Does it run in node --eval? |
| 4 | Config in files | Can it be cloned and PR'd? |
| 5 | Symmetric JSON I/O | Can I round-trip export → import? |
| 6 | Methodology page | Can a reviewer disagree with a specific parameter and find it? |
| 7 | Tests as contract | Does deleting a feature break a test? |

Print it. Tape it next to the standing desk. We'll meet again here in six months.


Things That Did NOT Work (the Greatest Hits)

The <=5y URL-param fiasco. Picking discrete values that look great in YAML but become %3C%3D5y in a URL was a special kind of self-own. My URL parser rejected it, silently dropped the param, and Claude scored the candidate under a different IP-state bucket for an entire iteration before I noticed. Lesson: if it has special characters, it does not belong in a URL discrete enum. Rename.
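This class of mistake is cheap to catch before it ships. A minimal sketch, assuming the enum values are available as a list at build or test time; `isUrlSafe` and `unsafeEnumValues` are hypothetical helper names.

```typescript
// True when the value survives URL encoding unchanged.
function isUrlSafe(value: string): boolean {
  return encodeURIComponent(value) === value;
}

// Return the enum values that would be rewritten when placed in a query string.
function unsafeEnumValues(values: readonly string[]): string[] {
  return values.filter((value) => !isUrlSafe(value));
}

console.log(encodeURIComponent("<=5y")); // %3C%3D5y
console.log(unsafeEnumValues(["active", "expired", "<=5y"])); // [ '<=5y' ]
console.log(unsafeEnumValues(["active", "expiring", "expired"])); // []
```

Running `unsafeEnumValues` over every discrete axis in a unit test turns the rename lesson into a guard rail.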

The d= values displayed in the simulator turned out to be those of the best-fit platform only, not of whichever platform was under evaluation. Claude spent an extra ten minutes confused about why varying t_half did not seem to change the displayed desirability: the displayed d= was always the one for Oil, the best-fit platform. The fix is on my backlog: hover-reveal all five platform d= values. The lesson is more general: if a single number on screen secretly refers to one of several things, your AI collaborator will make defensible-but-wrong inferences. Disambiguate.

The five-yaml mass edit nearly fabricated values. I asked Claude to label all 90 axis cells. If I had not pre-classified each ref by hand (mechanism / platform_clinical / class_statistics / consensus), the agent would have plausibly invented categorizations for unfamiliar references. Specifying the taxonomy before delegation is the difference between data labeling and creative writing. I learned that one the hard way on a different project. This time, taxonomy first.


Chrome-Extension Claude vs. Claude Code — Quick Notes

Chrome-extension Claude (the in-browser side panel) can see your page, click buttons, fill forms. It is great for one-off "evaluate this candidate" tasks where the user is right there. It cannot read your local repo. So Steps 1–5 matter intensely, Steps 6–7 matter for your sanity but not the AI's flow.

Claude Code (terminal/CLI) can read your repo, propose schema changes, run tests, push commits. For Claude Code, Steps 6–7 unlock the loop where the AI is not just a user of your tool but a co-author of it. (If you want the dual-AI version of that loop — Claude planning, Codex reviewing — there is a separate post on the FRIDAY workflow I wrote a month back.) If your tool only ever needs to be used, design for Chrome. If your tool needs to evolve, design for Code.

The same web frontend can serve both — and after this session I am convinced that designing for Claude Code is the harder, more rewarding target. Get it right, and a Chrome-extension Claude (or any future AI assistant) inherits the affordances for free.


Closing — The Affordance Compound Interest

Here is the punch line. The seven steps above are not "AI-specific." Every one of them is good engineering hygiene that humans have wanted for years. URL-driven state, stable test IDs, pure functions, files-not-databases, JSON round-trips, methodology pages, executable tests. We have always known these were better. We just rarely got around to all seven.

What changed is the cost-of-not-doing. With humans, skipping Step 6 means slow onboarding for your next teammate. Skipping Step 1 means your support team gets more "how do I get back to that screen" tickets. Annoying, but survivable.

With AI agents as users, skipping any of these turns "Claude, evaluate this candidate" from a one-call task into a forty-call DOM-archaeology expedition. Each missing affordance compounds into wasted tokens, wasted minutes, wasted trust. Conversely, each affordance compounds the other way: once you have URL hydration and testids and pure scoring, Claude does not just navigate your tool — it audits your tool, fixes your tool, deploys your tool, and writes the blog post about doing it. (Ahem.)

Build the tool you would want to hand to your most capable, most literal, most patient collaborator. Then hand it to one.


The provenance breakdown described in Step 6 is publicly viewable at lai.vibed-lab.com/methodology. The repository for this scoring app is private; the architectural patterns above are the transferable part.

2026.05.14

Written by

Jay Lee

Korea-Licensed Pharmacist (#68652) · Senior Researcher

Korea University, College of Pharmacy (B.S. + M.S., drug delivery systems & industrial pharmacy). Building production-grade AI tools across medicine, finance, and productivity — without a CS degree. Domain expertise first, code second.
