vibed lab · leaderboard

Coding LLM leaderboard.

Two public benchmarks, ten models, scores copied verbatim from the source. Refreshed weekly. The numbers are not editorially adjusted; only the vibe notes below are human-written.

Generated: 2026-05-30

What we measure

Artificial Analysis

metric: Intelligence Index

What: Artificial Analysis Intelligence Index, a composite of reasoning, math, science, and knowledge evals across the whole model field.
High = Smart across the board. The index also tracks brand-new frontier models fast, often before any coding benchmark has measured them.
In code: A general quality baseline, not coding-specific. It's the only column that covers the newest models (GPT-5.5, Grok 4.3), so use it to place models the coding column can't score yet.

SWE-bench Verified

metric: Resolved (%)

What: Real GitHub bug fixes. The model gets a repo and an issue and must produce a patch that passes the project's tests. Numbers here are vendor-reported.
High = Can navigate a large codebase, find the bug, and write a minimal correct patch, the hardest agentic coding task. 80%+ is frontier.
In code: The headline coding signal. Caveat: SWE-bench Verified is gameable (Berkeley RDI, 2026), so read it as a vendor-reported claim, not ground truth. The newest models show "—" in this column until aggregators add them.
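
A minimal sketch of how a missing score and the 80% bar could be handled when reading this column. The function names and the `FRONTIER_THRESHOLD` constant are hypothetical, chosen for illustration; this is not the code that builds the page.

```python
from typing import Optional

# "80%+ is frontier", taken from the column description above.
FRONTIER_THRESHOLD = 80.0


def format_swe_bench(resolved: Optional[float]) -> str:
    """Render a SWE-bench Verified cell; None means no score is listed yet."""
    return "—" if resolved is None else f"{resolved:.1f}"


def is_frontier(resolved: Optional[float]) -> bool:
    """True only when a score is listed and it meets the 80% bar."""
    return resolved is not None and resolved >= FRONTIER_THRESHOLD


if __name__ == "__main__":
    for score in (87.6, 79.6, None):
        print(format_swe_bench(score), is_frontier(score))
```

Treating a missing score as "not frontier" rather than as zero keeps an unmeasured model from looking like a failed one.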

Frontier (closed-weight)

Model              Provider     Artificial Analysis  SWE-bench Verified  Note
GPT-5.5            OpenAI       60.0                 —
Claude Opus 4.7    Anthropic    57.0                 87.6
Gemini 3.1 Pro     Google       57.0                 80.6
Grok 4.3           xAI          53.0                 —
Claude Sonnet 4.6  Anthropic    52.0                 79.6

Open-weight (OSS)

Model              Provider     Artificial Analysis  SWE-bench Verified  Note
DeepSeek-V4-Pro    DeepSeek     52.0                 80.6
Kimi K2.6          Moonshot AI  54.0                 80.2
Qwen3.6 27B        Alibaba      46.0                 77.2
MiMo-V2-Flash      Xiaomi       41.0                 73.4
DeepSeek-V4-Flash  DeepSeek     47.0                 79.0
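
One way to read the two columns together, sketched under an assumed ordering rule: models with a SWE-bench Verified score come first, highest first, ties broken by Intelligence Index; models with no coding score yet follow, ordered by Intelligence Index. The rule, the `Row` type, and `rank` are hypothetical and not how the page itself sorts. The numbers are the ones in the two tables above.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    model: str
    intelligence_index: float
    swe_bench: Optional[float]  # None where the table lists no score


ROWS = [
    Row("GPT-5.5", 60.0, None),
    Row("Claude Opus 4.7", 57.0, 87.6),
    Row("Gemini 3.1 Pro", 57.0, 80.6),
    Row("Grok 4.3", 53.0, None),
    Row("Claude Sonnet 4.6", 52.0, 79.6),
    Row("DeepSeek-V4-Pro", 52.0, 80.6),
    Row("Kimi K2.6", 54.0, 80.2),
    Row("Qwen3.6 27B", 46.0, 77.2),
    Row("MiMo-V2-Flash", 41.0, 73.4),
    Row("DeepSeek-V4-Flash", 47.0, 79.0),
]


def rank(rows: List[Row]) -> List[Row]:
    """Scored models first (SWE-bench desc), then unscored, index as tiebreak."""
    return sorted(
        rows,
        key=lambda r: (
            r.swe_bench is None,      # False sorts before True
            -(r.swe_bench or 0.0),    # higher coding score first
            -r.intelligence_index,    # then higher index first
        ),
    )


if __name__ == "__main__":
    for row in rank(ROWS):
        swe = "—" if row.swe_bench is None else f"{row.swe_bench:.1f}"
        print(f"{row.model:<18} {row.intelligence_index:>5.1f}  {swe}")
```

Putting unscored models last avoids mixing a general index with a coding score on one scale; the index only orders models the coding column can't place yet.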

How this page is built