I Combined Two Open-Source Repos Into an AI That Plans, Builds, and Reviews Its Own Code
I merged the best parts of two GitHub repos — a forced review loop and a smart routing engine — into a custom Claude Code plugin called FRIDAY. It plans with Codex audit, routes implementation between Claude and Codex by size, enforces mandatory review, and does TDD. All in plain markdown files.
Series: VIBE.LOG
- 1. The Layout Vocabulary Cheat Sheet: What to Call That Thing on Your Screen
- 2. I Spent 3 Hours Trying to Proxy a Blog Subdomain. Here's My Descent Into Madness.
- 3. The Complete SEO Guide: How to Make Google Actually Notice Your Website
- 4. Why Your Next.js Favicon Isn't Showing (And the Three Ways to Actually Fix It)
- 5. GitHub Keeps Telling Me My Branch Is Fine. And Also Not Fine. At the Same Time.
- 6. Mobile-First Playground: Making an Astrology Grid Actually Work on a Phone (And Go Viral While Doing It)
- 7. Playground Is Live: The Destiny Grid, Real Astrology, and Why I'm Shipping a Toy Every Month
- 8. The Interactive Component Cheat Sheet: What to Call That Clickable Thing
- 9. Google Rejected My Site for 'Low-Value Content.' Here's What I Actually Fixed.
- 10. I Actually Fixed Everything. Here's What That Looked Like.
- 11. I Hired 131 AI Employees Today. Here's How.
- 12. I Let My AI Run 72 Backtests While I Watched. It Picked the Winner.
- 13. I Taught My AI to Stop Asking Questions. It Took Five Rewrites.
- 14. Obsidian Turned My Scattered Notes Into a Second Brain. Here's How to Set It Up.
- 15. The Destiny Grid Gets Its East Wing: I Rebuilt Saju (四柱八字) in TypeScript
- 16. Molecule Me: Your Personality, Encoded in Chemistry
- 17. OpenAI Just Built a Plugin for Their Competitor's Tool. I Installed It.
- 18. I Combined Two Open-Source Repos Into an AI That Plans, Builds, and Reviews Its Own Code ← you are here
- 19. I Built a Weekly Directory for Claude Code Agents (Because My Brain Couldn't Keep Up)
After installing OpenAI's Codex plugin for Claude Code (that's the previous post), I wanted more. The official plugin gives you review and task delegation. But the workflow — when to review, when to delegate, how to route, what to enforce — that's on you.
So I went looking for someone who had already figured this out. Found two repos. Liked different parts of each. Combined them. Built FRIDAY.
This is the story of a three-hour Frankenstein session that produced an autonomous development agent from 19 markdown files.
🔍 The Two Repos I Found
1. claude-review-loop (640 stars)
By hamelsmu. The killer feature: a stop hook that physically prevents Claude from exiting until a Codex review exists.
Here's how it works. Claude Code has a "Stop" hook — a script that runs when the agent tries to finish. This plugin hijacks that hook. When Claude says "I'm done," the hook checks: is there a review file on disk? No? Then you're not done. The hook returns {"decision": "block"} and Claude is forced to run the review first.
It's the AI equivalent of a bouncer checking your stamp before you leave. You're not going anywhere until the review is done.
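The check itself is small enough to sketch in a few lines of bash. This is not the plugin's actual script, just the shape of the decision; the directory and file names are assumptions for illustration:

```shell
# Sketch of the stop-hook decision: approve exit only if a review file
# exists on disk; otherwise block and tell Claude what to do next.
# Directory and glob are illustrative, not the plugin's real paths.
check_review() {
  local review_dir="$1"
  if ls "$review_dir"/review-*.md >/dev/null 2>&1; then
    printf '{"decision":"approve"}\n'
  else
    printf '{"decision":"block","reason":"Run the Codex review first."}\n'
  fi
}
```

The only state it needs is the filesystem: no database, no session tracking, just "does the file exist yet."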
The hook also spawns up to 4 parallel Codex agents — diff review, holistic review, framework-specific review, UX review — and consolidates their findings into a single file. Aggressive. I liked it.
What I didn't like: it's review-only. No planning phase. No smart routing. No TDD.
2. claude-codex (7 stars)
By ching-kuo. Almost nobody had found this one yet, but the architecture was surprisingly complete. It has:
- Smart routing: changes ≤2 files and ≤30 lines → Claude implements directly. Anything bigger → Codex implements via MCP.
- Plan → Implement → Review cycle: Codex audits the plan before you build. Then reviews the implementation after.
- TDD workflow: write failing tests (RED), Codex audits test quality, implement (GREEN), refactor, Codex reviews.
- Critical Evaluation Protocol: don't blindly accept all Codex feedback. Challenge incorrect findings. Discussion doesn't count against the iteration limit.
What I didn't like: no enforcement mechanism. The review is "mandatory" in the instructions, but there's nothing stopping Claude from skipping it. It's a rule, not a lock.
The Obvious Move
Take the enforcement from repo 1. Take the workflow from repo 2. Combine them into something that does everything.
🏗️ What I Built: The dx Plugin
I created a Claude Code plugin called dx (for "dual-cross") with these commands:
| Command | What It Does |
|---|---|
| /dx:plan | Claude creates a plan → Codex audits it (max 3 rounds) → saves to .claude/plan/ |
| /dx:build | Smart routing implementation → mandatory Codex review (stop hook enforced) |
| /dx:review | Standalone Codex review with --adversarial option |
| /dx:tdd | RED → Codex audit → GREEN → Refactor → Codex review |
| /dx:friday | Autonomous agent that chains all of the above |
The whole thing is 19 files. Zero lines of executable code except one bash script (the stop hook). Everything else is markdown instructions.
plugins/dx/
├── .mcp.json ← Codex MCP server auto-start
├── AGENTS.md ← Agent guidelines
├── agents/friday.md ← The autonomous orchestrator
├── commands/ ← 6 slash commands
│ ├── plan.md
│ ├── build.md
│ ├── review.md
│ ├── tdd.md
│ ├── friday.md
│ └── activate.md
├── hooks/
│ ├── hooks.json ← Stop hook registration
│ └── stop-hook.sh ← The bouncer
├── prompts/ ← Codex role injection
│ ├── analyzer.md ← Plan auditor persona
│ ├── architect.md ← Implementation persona
│ └── reviewer.md ← Code reviewer persona
└── skills/ ← Auto-loaded knowledge
├── codex-mcp/SKILL.md ← How to talk to Codex
├── smart-routing/SKILL.md ← When Claude vs. Codex
    └── critical-eval/SKILL.md ← How to evaluate feedback

Let me walk through the interesting design decisions.
🧠 Smart Routing: Who Implements What?
This was the idea I liked most from the claude-codex repo. Not everything needs Codex. If you're adding a CSS class to one file, then spinning up an MCP server, making an API call, and waiting for Codex to write the file is overkill. Claude can do that faster with a direct Edit tool call.
The rule is simple:
| Condition | Route |
|---|---|
| ≤ 2 files AND ≤ 30 lines AND no new patterns | Route A: Claude implements directly |
| > 2 files OR > 30 lines OR new architecture | Route B: Codex implements via MCP |
In TDD mode, test files are excluded from the count. Only production code matters for routing. And Codex is never allowed to modify test files — only Claude writes tests. This prevents the two AIs from fighting over the same files.
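The routing gate above reduces to a single threshold check. A minimal sketch, with function and argument names of my own invention (the repo expresses this as markdown instructions, not code):

```shell
# Illustrative routing check using the thresholds from the table above.
# Inputs: changed production files, changed lines, "yes"/"no" for new patterns.
pick_route() {
  local files="$1" lines="$2" new_patterns="$3"
  if [ "$files" -le 2 ] && [ "$lines" -le 30 ] && [ "$new_patterns" = "no" ]; then
    echo "A"   # Claude implements directly
  else
    echo "B"   # Codex implements via MCP
  fi
}
```

Note the AND/OR asymmetry: Route A requires all three conditions, so failing any one of them tips the change to Route B.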
Route B works through the Codex MCP server. Claude reads the architect.md prompt, injects it as developer-instructions, tells Codex "here's the plan, you have write access, go." Claude's job in Route B is context retrieval and review. It doesn't touch Edit or Write. Clear separation of concerns.
🔒 The Stop Hook: You Cannot Skip Review
This is the part I'm most proud of stealing — I mean, being inspired by.
The stop hook creates a two-phase lock:
Phase 1 (task): Claude finishes implementation. Tries to exit. Hook fires. Checks: is dx-workflow-state.md active? Yes? Then block exit. Tell Claude: "Run the Codex review. Save results to reviews/dx-review-<id>.md. Then you may leave."
Phase 2 (addressing): Claude runs the review, tries to exit again. Hook fires again. Checks: does the review file exist on disk? Yes → approve exit. No → block again (max 2 retries, then fail-open).
The fail-open part is critical. If the hook script crashes — bad JSON, missing jq, whatever — it always defaults to {"decision": "approve"}. Never trap the user. A broken hook that locks you in a loop is worse than no hook at all.
# ERR trap: always fail-open
trap 'printf "{\"decision\":\"approve\"}\n"; exit 0' ERR

One line. Worth everything.
I also added path traversal prevention on the review ID. The ID format is YYYYMMDD-HHMMSS-hexhex and gets validated with a regex before it's used in any file path. Paranoid? Maybe. But the review ID comes from a shell command, and shell commands plus file paths are where injection bugs live.
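The validation itself can be one grep. A sketch under stated assumptions: the post only says the format is YYYYMMDD-HHMMSS-hexhex, so the hex length here is a guess and the pattern is illustrative:

```shell
# Sketch: accept only IDs shaped like YYYYMMDD-HHMMSS-<hex>, so the ID can
# never smuggle "../" or "/" into a file path. Hex length is an assumption.
valid_review_id() {
  printf '%s' "$1" | grep -Eq '^[0-9]{8}-[0-9]{6}-[0-9a-f]+$'
}
```

Because the whitelist only allows digits, lowercase hex, and hyphens, every path-traversal payload fails the match before it ever reaches a file operation.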
🤖 FRIDAY: The Autonomous Orchestrator
Here's where things get fun.
Each command — plan, build, review, tdd — works independently. But the real workflow chains them:
/dx:plan → user approves → /dx:build → tests → review → commit

Doing this manually means typing four commands and making decisions at each step. So I built an agent that does it automatically. I named it FRIDAY, because JARVIS was taken (by my own previous post, ironically).
FRIDAY's pipeline:
INTAKE → PLAN → GATE → BUILD → VERIFY → REVIEW → DELIVER

The agent auto-detects task type (feature/bugfix/refactor), picks the right mode (standard vs. TDD — it uses TDD if you mention tests or if the project already has >60% test coverage), runs the whole pipeline, and only stops twice to ask you anything:
- GATE — "Here's the plan. Proceed?" (You must approve before implementation starts.)
- DELIVER — "Here are the results. Commit?" (You must approve before anything is committed.)
Everything between those two gates is autonomous. FRIDAY picks the route, implements, runs tests, fixes failures, triggers Codex review, evaluates findings, fixes valid issues, challenges invalid ones, saves the review file — all without asking.
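The mode auto-detection described above is one condition. A hypothetical sketch (the function name and inputs are mine; FRIDAY expresses this rule in markdown):

```shell
# Illustrative mode selection: TDD if the request mentions tests, or if
# existing coverage is above 60%. Inputs are assumptions for the sketch.
pick_mode() {
  local mentions_tests="$1" coverage_pct="$2"
  if [ "$mentions_tests" = "yes" ] || [ "$coverage_pct" -gt 60 ]; then
    echo "tdd"
  else
    echo "standard"
  fi
}
```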
The decision framework is explicit:
| FRIDAY decides alone | FRIDAY asks you |
|---|---|
| Code style (matches existing) | Deleting files |
| Internal implementation details | Changing public APIs |
| Trivial Codex findings (LOW) | Adding dependencies |
| Test organization | Security decisions |
| Borderline finding, fix is <5 lines | BLOCKED after 3 iterations |
It's the balance I've been chasing since Jarvis Mode v1. Full autonomy where it's safe. Human judgment where it matters.
🧬 Critical Evaluation: Don't Trust Everything Codex Says
This is the thing most people miss about dual-AI workflows. Codex is a reviewer, not an oracle. It makes mistakes. It flags things that aren't real problems. It misses context that Claude has from reading your project.
The Critical Evaluation Protocol is a skill file that loads automatically:
For each Codex finding:
- Is it technically correct?
- Does Codex have enough context to judge?
- Does fixing it actually improve the code?
If incorrect: challenge via codex-reply with evidence. Do NOT fix. If ambiguous: ask the user. If correct: fix it.
The key insight: challenging a finding doesn't count as a review iteration. Only fix-and-resubmit cycles count toward the 3-iteration limit. This prevents the situation where Codex says something wrong, Claude wastes an iteration "fixing" a non-problem, and now you've burned 1 of 3 chances on nothing.
It also prevents the opposite failure: Claude blindly accepting every Codex suggestion, including the bad ones, and the code getting worse through over-compliance. Claude is the architect. Codex is the reviewer. The architect listens to reviews but doesn't obey them unconditionally.
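The accounting rule is easy to get wrong, so here is a sketch of how it counts (entirely illustrative; the real protocol is prose instructions, not code):

```shell
# Sketch of the iteration accounting: only fix-and-resubmit rounds count
# toward the 3-iteration limit; challenges and discussion are free.
count_iterations() {
  local limit=3 used=0 action
  for action in "$@"; do          # actions: fix | challenge | discuss
    [ "$action" = "fix" ] && used=$((used + 1))
  done
  if [ "$used" -ge "$limit" ]; then echo "BLOCKED"; else echo "$used/$limit used"; fi
}
```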
📦 How to Install It
The plugin lives at D:\coding\claude-codex-workflow. To use it in your own setup:

- Clone or copy the directory somewhere permanent.
- Register it in ~/.claude/plugins/known_marketplaces.json:

  { "dx-workflow": { "source": { "source": "local", "path": "/path/to/claude-codex-workflow" }, "installLocation": "/path/to/claude-codex-workflow" } }

- Enable it in ~/.claude/settings.json:

  { "enabledPlugins": { "dx@dx-workflow": true } }

- Restart Claude Code or run /reload-plugins.
- Make sure the Codex CLI is installed and authenticated:

  npm install -g @openai/codex
  codex login status
The .mcp.json file in the plugin auto-starts the Codex MCP server when the plugin loads. No extra claude mcp add needed.
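A plugin-level .mcp.json follows Claude Code's standard mcpServers shape. Something like the following, where the exact command and args depend on your Codex CLI version and are assumptions here, not a copy of the plugin's file:

```json
{
  "mcpServers": {
    "codex": {
      "command": "codex",
      "args": ["mcp"]
    }
  }
}
```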
📊 What I Took From Where
| Feature | Source Repo | What I Changed |
|---|---|---|
| Stop hook enforcement | claude-review-loop | Added Windows/Git Bash compat, jq-optional fallback |
| Fail-open on errors | claude-review-loop | Same principle, new implementation |
| Smart routing | claude-codex | Extracted as standalone skill, shared across commands |
| Plan → Build → Review | claude-codex | Split into independent commands + combined in FRIDAY |
| TDD flow | claude-codex | Added stop hook enforcement |
| Critical Evaluation | claude-codex | Extracted as shared skill |
| Adversarial review | codex-plugin-cc (OpenAI) | Added as flag on /dx:review |
| Autonomous orchestrator | My own Jarvis Mode v5 | Rebuilt as FRIDAY with dual-model gates |
Nothing here is original. Every piece came from someone else's repo. I just stitched them together in a way that made sense to me and saved myself from typing four commands every time I want to ship something.
🧠 The Realization
The most powerful thing about Claude Code's plugin system isn't that it lets you code new features. It's that it lets you describe workflows in plain text and the AI follows them.
FRIDAY is 19 files. One bash script. The rest is markdown. There are more words in this blog post than in the entire plugin. And yet it orchestrates a pipeline that:
- Plans with codebase awareness
- Gets audited by a different AI model
- Routes implementation by change size
- Enforces mandatory cross-model review with a filesystem lock
- Runs TDD with test ownership rules
- Makes 90% of decisions autonomously
- Escalates the right 10% to the human
All of that. In markdown files. In folders.
I keep saying this, but it keeps being true: the most complex AI systems I build contain zero lines of real code. Just very specific instructions about what to do, what not to do, and what to prove before claiming you're done.
🔗 Resources
- openai/codex-plugin-cc — OpenAI's official Codex plugin for Claude Code
- hamelsmu/claude-review-loop — Stop hook forced review (640 stars, the enforcement inspiration)
- ching-kuo/claude-codex — Smart routing + TDD workflow (7 stars, the workflow inspiration)
- My Jarvis Mode post — The previous version of this idea, single-model
- Claude Code Plugin Docs — Official documentation
If you build something better than FRIDAY using these pieces, tell me. I'll steal that too.
2026.04.10
Written by
Jay
Licensed Pharmacist · Senior Researcher
Building production-grade AI tools across medicine, finance, and productivity — without a CS degree. Domain expertise first, code second.