vibed lab · leaderboard
Coding LLM leaderboard.
Two public benchmarks, ten models, scores copied verbatim from the source. Refreshed weekly. The numbers get no editorial adjustment; only the vibe notes below are human-written.
Generated: 2026-05-30
What we measure
Artificial Analysis
metric: Intelligence Index
- What: Artificial Analysis Intelligence Index, a composite of reasoning, math, science, and knowledge evals across the whole model field.
- High = Smart across the board. The index also picks up brand-new frontier models fast, often before any coding benchmark has measured them.
- In code: A general quality baseline, not coding-specific. It's the only column that covers the newest models (GPT-5.5, Grok 4.3), so use it to place models the coding column can't score yet.
SWE-bench Verified
metric: Resolved (%)
- What: Real GitHub bug fixes. The model gets a repo + issue and must produce a patch that passes the project's tests. Numbers here are vendor-reported.
- High = Can navigate a large codebase, find the bug, and write a minimal correct patch, which is the hardest agentic coding task. 80%+ is frontier.
- In code: The headline coding signal. Caveat: SWE-bench Verified is gameable (Berkeley RDI, 2026), so read it as a vendor-reported claim, not ground truth. The newest models show "—" until aggregators add them.
Frontier (closed-weight)
| Model | Provider | Artificial Analysis | SWE-bench Verified | Note |
|---|---|---|---|---|
| GPT-5.5 | OpenAI | 60.0 | — | |
| Claude Opus 4.7 | Anthropic | 57.0 | 87.6 | |
| Gemini 3.1 Pro | Google | 57.0 | 80.6 | |
| Grok 4.3 | xAI | 53.0 | — | |
| Claude Sonnet 4.6 | Anthropic | 52.0 | 79.6 | |
Open-weight (OSS)
| Model | Provider | Artificial Analysis | SWE-bench Verified | Note |
|---|---|---|---|---|
| DeepSeek-V4-Pro | DeepSeek | 52.0 | 80.6 | |
| Kimi K2.6 | Moonshot AI | 54.0 | 80.2 | |
| Qwen3.6 27B | Alibaba | 46.0 | 77.2 | |
| MiMo-V2-Flash | Xiaomi | 41.0 | 73.4 | |
| DeepSeek-V4-Flash | DeepSeek | 47.0 | 79.0 | |
How this page is built
- Scores are pulled verbatim from each benchmark's public leaderboard — Artificial Analysis (Intelligence Index) and SWE-bench Verified (via llm-stats) — rendered in a headless browser and read mechanically from the table, never transcribed by hand.
- A weekly cron job fetches both sources, runs a fabrication-check gate (every number must trace back to the source row), and commits a fresh leaderboard.json.
- Vibe notes (when present) are the only editorial layer: short operator commentary, never overwritten by the cron.
- Missing scores (—) mean the model wasn't yet evaluated by that benchmark on its most recent run.
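The fabrication-check gate and the missing-score rule described above could be sketched roughly as follows. This is an illustration only, not the page's actual pipeline: the function names `fabrication_check` and `render_score`, the entry shape (`model`, `scores`), and the `(benchmark, model)` key for scraped rows are all hypothetical.

```python
from typing import Optional

# Hypothetical shape: (benchmark, model) -> score exactly as read from the source table.
SourceRows = dict[tuple[str, str], float]


def fabrication_check(entries: list[dict], source_rows: SourceRows) -> list[str]:
    """Return a list of problems; an empty list means every number traces to a source row."""
    problems: list[str] = []
    for entry in entries:
        model = entry["model"]
        for benchmark, score in entry["scores"].items():
            if score is None:
                # Missing scores are rendered as a placeholder, so there is nothing to trace.
                continue
            key = (benchmark, model)
            if key not in source_rows:
                problems.append(f"{model} / {benchmark}: no source row for {score}")
            elif source_rows[key] != score:
                # Exact comparison is intentional: scores are copied verbatim, never recomputed.
                problems.append(
                    f"{model} / {benchmark}: entry says {score}, source says {source_rows[key]}"
                )
    return problems


def render_score(score: Optional[float]) -> str:
    """Render a score for the table, using the page's placeholder for missing values."""
    return "—" if score is None else f"{score:.1f}"


# Example rows taken from the tables on this page.
source_rows: SourceRows = {
    ("Artificial Analysis", "GPT-5.5"): 60.0,
    ("Artificial Analysis", "Claude Opus 4.7"): 57.0,
    ("SWE-bench Verified", "Claude Opus 4.7"): 87.6,
}
entries = [
    {"model": "GPT-5.5",
     "scores": {"Artificial Analysis": 60.0, "SWE-bench Verified": None}},
    {"model": "Claude Opus 4.7",
     "scores": {"Artificial Analysis": 57.0, "SWE-bench Verified": 87.6}},
]

problems = fabrication_check(entries, source_rows)
if problems:
    # In a cron job, a non-empty list would block the commit of leaderboard.json.
    raise SystemExit("\n".join(problems))
```

Returning a list of problems rather than failing on the first mismatch lets a single run report every untraceable number before the commit is blocked.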