Benchmark library

What AI builds actually cost

Every estimate here is calibrated against real, documented builds — not vendor marketing. These are the anchors: the optimized $1,100 framework rebuild, the $30,983 runaway month, and the messy middle. The gap between them is the whole point.

vinext

~$1,100

Framework rebuild · ~1 week · Claude API (Opus 4.5/4.6) · LOW anchor

94% of the Next.js API surface, rebuilt on Vite; ~800 sessions

The LOW regime. A comprehensive public test suite gave the agent an automated verify loop, and an expert operator kept it off unviable directions — the two things that separate a $1,100 build from a $30,000 one.

Tokenflex

$30,983

Web app (runaway) · 1 month · Claude Code, Codex, Synthetic · HIGH anchor

Leaderboard web app; 51,414 API events, ~17B tokens in 30 days

The HIGH regime. Continuous sessions with no context resets let verbose terminal output stream back into the prompt, so every turn re-sent a larger context and the cost curve steepened rapidly. The cautionary tale every High band is calibrated against.
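To see why a session without resets gets expensive, here is a minimal sketch under simplified assumptions: every turn re-sends the whole accumulated context as input, and each turn appends a fixed amount of terminal output. The function name, the per-turn token count, and the reset interval are hypothetical illustrations, not Tokenflex measurements. In this simple model, cumulative input grows roughly with the square of the turn count, while periodic resets keep it close to linear.

```python
def cumulative_input_tokens(turns, tokens_added_per_turn, reset_every=None):
    """Total input tokens billed when each turn re-sends the full context.

    Assumption (hypothetical model): context grows by a fixed amount per turn
    and is cleared to zero every `reset_every` turns when that is set.
    """
    context = 0
    total = 0
    for turn in range(1, turns + 1):
        context += tokens_added_per_turn  # new terminal output joins the context
        total += context                  # the whole context is sent as input
        if reset_every and turn % reset_every == 0:
            context = 0                   # a reset drops the accumulated output
    return total


# Illustrative numbers only: 200 turns, 5,000 tokens of output per turn.
no_reset = cumulative_input_tokens(200, 5_000)
with_reset = cumulative_input_tokens(200, 5_000, reset_every=20)
print(no_reset, with_reset)  # 100500000 10500000
```

With these made-up inputs, resetting every 20 turns cuts billed input by roughly a factor of ten. Real sessions will differ, since caching, summarization, and uneven output sizes are all ignored here.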

Memo iOS app

~340M tokens (~$1,020)

Mobile app · 5 months · Claude API · EXPECTED anchor

Native memo app; core utilities + UI components

Context decay over a long build: even minor changes made the agent re-analyze the codebase, inflating cumulative spend. Anchors the mobile Expected band.

FeatureBench

~9.7M input tokens / feature

Academic benchmark · per task · Claude 4.5 Opus + others

End-to-end, feature-level coding tasks

State-of-the-art models solved only ~11% of feature-level tasks despite high bug-fix scores. That is evidence that building new features burns tokens on retries, and it is the anchor for feature-count mode.
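As a rough illustration of how a per-feature anchor could turn into an estimate, the sketch below multiplies a feature count by the ~9.7M input tokens per feature quoted above. This is an assumption about how such a mode might work, not the estimator's actual method. The function name is hypothetical, the price is supplied by the caller, and output tokens are ignored.

```python
INPUT_TOKENS_PER_FEATURE = 9_700_000  # FeatureBench anchor quoted above


def feature_count_estimate(n_features, usd_per_million_input_tokens):
    """Return (input_tokens, usd) for a build of n_features.

    The price is a caller-supplied assumption; output tokens are not counted.
    """
    tokens = n_features * INPUT_TOKENS_PER_FEATURE
    usd = tokens / 1_000_000 * usd_per_million_input_tokens
    return tokens, usd


# Hypothetical example: 12 features at an assumed $3.00 per million input tokens.
tokens, usd = feature_count_estimate(12, 3.00)
print(f"{tokens:,} input tokens, about ${usd:,.2f}")
```

The example prints 116,400,000 input tokens and about $349.20. Both the feature count and the price are placeholders to swap for your own figures.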

Prototype feature

~200k tokens (~$4.50)

Isolated component · 3 weeks · Blended models

Single feature addition / isolated component build

The floor: a well-scoped, isolated change against a small surface costs almost nothing. Most of the spread above this is context size and retry loops, not the code itself.
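Three of the anchors above quote both a token count and a dollar figure, so their effective price per token can be compared directly. The sketch below does that arithmetic. The token counts are the approximate values quoted on this page, so the resulting rates are rough, and the helper name is hypothetical.

```python
# (tokens, usd) pairs as quoted in the anchors above; token counts are approximate.
anchors = {
    "Tokenflex": (17_000_000_000, 30_983),
    "Memo iOS app": (340_000_000, 1_020),
    "Prototype feature": (200_000, 4.50),
}


def usd_per_million_tokens(tokens, usd):
    """Effective blended price: dollars per one million tokens."""
    return usd / (tokens / 1_000_000)


rates = {name: usd_per_million_tokens(t, c) for name, (t, c) in anchors.items()}
for name, rate in rates.items():
    print(f"{name}: ${rate:.2f} per million tokens")
```

This works out to about $1.82, $3.00, and $22.50 per million tokens respectively. The page does not explain the spread. Differences in model mix, caching, or the ratio of input to output tokens are plausible causes, but that is an assumption.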

Have a documented build to add? The plan is a public, growing library — and eventually a leaderboard where you upload your own /usage export to see where your burn rate lands.
