vibed lab · leaderboard

Coding LLM leaderboard.

Two public benchmarks, ten models, scored verbatim from the source. Each refresh is date-stamped below. The numbers are not editorially adjusted; only the vibe notes are human-written.

Generated: 2026-07-12

What we measure

Artificial Analysis

metric: Intelligence Index

What: Artificial Analysis Intelligence Index — a composite of reasoning, math, science, and knowledge evals across the whole model field.
High =: Smart across the board, and it tracks brand-new frontier models fast — often before any coding benchmark has measured them.
In code: A general quality baseline, not coding-specific. It's the only column that covers the newest models (GPT-5.5, Grok 4.3), so use it to place models the coding column can't score yet.

SWE-bench Verified

metric: Resolved (%)

What: Real GitHub bug fixes — the model gets a repo + issue and must produce a patch that passes the project's tests. Numbers here are vendor-reported.
High =: Can navigate a large codebase, find the bug, and write a minimal correct patch — the hardest agentic coding task. 80%+ is frontier.
In code: The headline coding signal. Caveat: SWE-bench Verified is gameable (Berkeley RDI, 2026), so read it as a vendor-reported claim, not ground truth. The newest models show "—" until aggregators add them.

Frontier (closed-weight)

Model	Provider	Artificial Analysis (Intelligence Index)	SWE-bench Verified (Resolved %)
GPT-5.5	OpenAI	55.0	—
Claude Opus 4.7	Anthropic	54.0	87.6
Gemini 3.1 Pro	Google	46.0	80.6
Grok 4.3	xAI	38.0	—
Claude Sonnet 4.6	Anthropic	43.0	79.6

Open-weight (OSS)

Model	Provider	Artificial Analysis (Intelligence Index)	SWE-bench Verified (Resolved %)
DeepSeek-V4-Pro	DeepSeek	61.0	80.6
Kimi K2.6	Moonshot AI	44.0	80.2
Qwen3.6 27B	Alibaba	37.0	77.2
MiMo-V2-Flash	Xiaomi	25.0	73.4
DeepSeek-V4-Flash	DeepSeek	100.0	79.0

How this page is built

Scores are pulled verbatim from each benchmark's public leaderboard — Artificial Analysis (Intelligence Index) and SWE-bench Verified — rendered in a headless browser and read mechanically from the table, never transcribed by hand.
A refresh job fetches both sources, runs a fabrication-check gate (every number must trace back to the source row), and commits a fresh leaderboard.json — the “Generated” date above is the timestamp of the latest run. The numbers reflect the 2026-07-12 snapshot; refreshes are run manually, not on an automated schedule.
Vibe notes (when present) are the only editorial layer — short operator commentary, kept across refreshes.
Missing scores (—) mean the model wasn't yet evaluated by that benchmark on its most recent run.
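
The fabrication-check gate described above could be sketched roughly as follows. This is a minimal illustration in Python, not the page's actual refresh job: the function names `parse_score` and `fabrication_check`, the entry fields, and the shape of `source_rows` are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

MISSING = "\u2014"  # the dash placeholder the page shows for an unscored model


def parse_score(cell: str) -> Optional[float]:
    """Turn a scraped table cell into a float, or None for a missing score."""
    cell = cell.strip()
    if cell in ("", MISSING):
        return None
    return float(cell)


def fabrication_check(
    entries: List[dict],
    source_rows: Dict[Tuple[str, str], str],
) -> List[str]:
    """Return a list of problems; an empty list means the gate passes.

    entries:      hypothetical leaderboard.json rows, each with
                  "model", "benchmark", and "score" (float or None).
    source_rows:  hypothetical map of (benchmark, model) -> raw cell text
                  read from the source table.
    """
    problems = []
    for entry in entries:
        key = (entry["benchmark"], entry["model"])
        if entry["score"] is None:
            # Nothing is being claimed for this cell, so nothing can be fabricated.
            continue
        raw = source_rows.get(key)
        if raw is None:
            problems.append(f"{key}: published a score but no source row exists")
        elif parse_score(raw) != entry["score"]:
            problems.append(f"{key}: published {entry['score']}, source says {raw!r}")
    return problems


if __name__ == "__main__":
    source_rows = {("SWE-bench Verified", "Claude Opus 4.7"): "87.6"}
    entries = [
        {"model": "Claude Opus 4.7", "benchmark": "SWE-bench Verified", "score": 87.6},
        {"model": "GPT-5.5", "benchmark": "SWE-bench Verified", "score": None},
    ]
    issues = fabrication_check(entries, source_rows)
    print("gate passed" if not issues else "\n".join(issues))
```

In this sketch a refresh would commit only when the returned list is empty. Comparing against the parsed source cell, rather than a hand-typed value, is what keeps each published number traceable to a source row.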