Methodology

How the number is built

Estimating the cost of an AI-agent build is not like counting lines of code. This page shows exactly how we go from a project description to a dollar range — and, just as importantly, why we refuse to give you a single number.

Why it's always a range

Agentic token spend can vary by 10× or more from one run to the next. Give an agent the same task twice and it can take wildly different paths; academic work puts pre-build predictability at r<0.15, effectively noise. A single number would be false precision, and false precision destroys trust. So every estimate is three numbers: Low, Expected, and High.

Scope → tokens → dollars

The bridge nobody else builds. Token calculators only price a pasted string; agency calculators only give hours × rate. We connect project scope to token volume to dollars:

tokens = Σ_features [ loop_mult × (base_context × density + overhead_per_turn) ]
         adjusted by cache_hit_rate
cost   = tokens × blended_price(model)
→ Low / Expected / High

loop_mult: plan→edit→verify→retry cycles per feature. Reliability falls as tasks get harder, so retries (and tokens) climb non-linearly with complexity.
density: architectural coupling, from ~1.0 (isolated frontend) to ~2.5 (deeply coupled enterprise).
overhead_per_turn: system prompt, tool schemas, and connected MCP servers (~18k tokens per turn for each connected server). Non-productive but unavoidable.
cache_hit_rate: the share of input tokens served from cache. Cache reads are over 97% of agentic token volume, so this single assumption is the biggest swing between Low and High.
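
As a rough illustration, the per-feature term of the formula above can be sketched in Python. The `Feature` class, the `raw_tokens` helper, and all sample values are hypothetical; the cache adjustment is left out because its exact form is not spelled out here.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Inputs for one feature; every value used below is illustrative."""
    loop_mult: float          # plan/edit/verify/retry cycles for this feature
    base_context: float       # tokens of context read per cycle (assumed meaning)
    density: float            # coupling factor, roughly 1.0 to 2.5
    overhead_per_turn: float  # system prompt, tool schemas, MCP servers


def feature_tokens(f: Feature) -> float:
    # loop_mult x (base_context x density + overhead_per_turn)
    return f.loop_mult * (f.base_context * f.density + f.overhead_per_turn)


def raw_tokens(features: list[Feature]) -> float:
    # Sum over features, before any cache or tokenizer adjustment.
    return sum(feature_tokens(f) for f in features)


# Hypothetical project: one isolated UI feature and one tightly coupled one.
project = [
    Feature(loop_mult=10, base_context=20_000, density=1.0, overhead_per_turn=18_000),
    Feature(loop_mult=40, base_context=20_000, density=2.5, overhead_per_turn=18_000),
]
total = raw_tokens(project)
print(f"{total:,.0f} raw tokens")
```

In this made-up project the coupled feature dominates the total: both its retry count and its density multiply the same base context.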

What each band assumes

Low — a senior operator, high test coverage feeding an automated verify loop, strict session hygiene, high cache hit rate. The optimized regime (how vinext hit $1,100).

Expected — standard interactive development with moderate cache. Where most real projects land.

High — unoptimized debug loops, verbose logging streamed back into context, no resets. The runaway regime (how one build hit $30,983 in a month).

Token bands by project type

Raw agentic token volume per band, before the model's tokenizer multiplier. Anchored to the benchmark library below.

Project type	Main token drivers	Low	Expected	High
Landing page	CSS / layout retries	1M	5M	15M
Small CRUD app	DB migrations + API route mapping	10M	35M	100M
SaaS MVP	Auth loops, webhooks, multi-file synchronization	50M	150M	500M
Mobile app (Expo)	Device-emulation + multi-OS compile checks	100M	340M	800M
Framework-scale	Test-suite feedback, deep dependency graph	200M	600M	1.5B
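
For readers who want to work with these bands programmatically, here is a minimal sketch that copies the table into a lookup and reports how wide each range is. The `TOKEN_BANDS` name and the `band_spread` helper are illustrative, not part of the estimator.

```python
# (low, expected, high) raw tokens per project type, copied from the table above.
TOKEN_BANDS = {
    "Landing page": (1_000_000, 5_000_000, 15_000_000),
    "Small CRUD app": (10_000_000, 35_000_000, 100_000_000),
    "SaaS MVP": (50_000_000, 150_000_000, 500_000_000),
    "Mobile app (Expo)": (100_000_000, 340_000_000, 800_000_000),
    "Framework-scale": (200_000_000, 600_000_000, 1_500_000_000),
}


def band_spread(project_type: str) -> float:
    """Ratio of the High band to the Low band for one project type."""
    low, _expected, high = TOKEN_BANDS[project_type]
    return high / low


for name in TOKEN_BANDS:
    print(f"{name}: High is {band_spread(name):g}x Low")
```

The High band is between 7.5x and 15x the Low band in every row, which is the same order of magnitude as the run-to-run variance described earlier.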

Model pricing

USD per 1M tokens, verified against provider documentation. We keep this current — it's the one figure that dates fastest.

Model	Input	Output	Cache read	Tokenizer
Claude Opus 4.8	$5	$25	$0.5	×1.35
Claude Sonnet 4.6	$3	$15	$0.3	×1
Claude Haiku 4.5	$1	$5	$0.1	×1
OpenAI GPT-5.4	$3	$15	$0.25	×1
Gemini 3.1 Pro	$2	$12	$0.2	×1

Tokenizer note: Claude Opus 4.7 and later use a new tokenizer that consumes up to ~35% more tokens for the same text. That's a token-count multiplier, not just a price change — the same scope genuinely costs more tokens on the latest Opus, so we model it explicitly.
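
One way to turn the pricing table into a blended rate is sketched below. The prices and tokenizer multipliers are copied from the table; the token mix (`cache_read_share`, `output_share`) is a caller-supplied assumption, and applying the tokenizer multiplier uniformly to every token class is also an assumption made for simplicity.

```python
# USD per 1M tokens and tokenizer multiplier, copied from the pricing table.
PRICING = {
    "Claude Opus 4.8": {"input": 5.0, "output": 25.0, "cache_read": 0.5, "tokenizer": 1.35},
    "Claude Sonnet 4.6": {"input": 3.0, "output": 15.0, "cache_read": 0.3, "tokenizer": 1.0},
    "Claude Haiku 4.5": {"input": 1.0, "output": 5.0, "cache_read": 0.1, "tokenizer": 1.0},
    "OpenAI GPT-5.4": {"input": 3.0, "output": 15.0, "cache_read": 0.25, "tokenizer": 1.0},
    "Gemini 3.1 Pro": {"input": 2.0, "output": 12.0, "cache_read": 0.2, "tokenizer": 1.0},
}


def blended_price(model: str, cache_read_share: float, output_share: float) -> float:
    """Weighted USD per 1M tokens; whatever is left of the mix is fresh input."""
    p = PRICING[model]
    input_share = 1.0 - cache_read_share - output_share
    if input_share < -1e-9:
        raise ValueError("cache_read_share + output_share must not exceed 1")
    input_share = max(0.0, input_share)
    return (cache_read_share * p["cache_read"]
            + input_share * p["input"]
            + output_share * p["output"])


def cost_usd(raw_tokens: float, model: str,
             cache_read_share: float, output_share: float) -> float:
    """Dollar cost of raw_tokens after the model's tokenizer multiplier."""
    p = PRICING[model]
    billed_millions = raw_tokens * p["tokenizer"] / 1_000_000
    return billed_millions * blended_price(model, cache_read_share, output_share)


# Illustrative mix only: 97% cache reads, 1% output, the rest fresh input.
mix = {"cache_read_share": 0.97, "output_share": 0.01}
for model in ("Claude Sonnet 4.6", "Claude Opus 4.8"):
    print(model, round(cost_usd(1_000_000, model, **mix), 4), "USD per 1M raw tokens")
```

Because the tokenizer multiplier scales the token count before pricing, the gap between two models in this sketch is wider than their list prices alone suggest.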

What we anchor to

Estimates are only as honest as the real builds behind them. These are the documented outcomes our bands are calibrated against.

vinext

~$1,100

94% of the Next.js API surface, rebuilt on Vite; ~800 sessions

The LOW regime. A comprehensive public test suite gave the agent an automated verify loop, and an expert operator kept it off unviable directions — the two things that separate a $1,100 build from a $30,000 one.

Tokenflex

$30,983

Leaderboard web app; 51,414 API events, ~17B tokens in 30 days

The HIGH regime. Continuous sessions with no context resets let verbose terminal output stream back into the prompt, creating a geometric cost curve. The cautionary tale every High band is calibrated against.

Memo iOS app

~340M tokens (~$1,020)

Native memo app; core utilities + UI components

Context decay over a long build: even minor changes made the agent re-analyze the codebase, inflating cumulative spend. Anchors the mobile Expected band.
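
As a quick sanity check, the two benchmarks above that report both tokens and dollars imply an effective price per million tokens. The sketch below only divides the published figures; the token counts are rounded in the source, so the results are approximate, and `usd_per_million` is an illustrative helper name.

```python
def usd_per_million(total_usd: float, total_tokens: float) -> float:
    """Effective blended price implied by a reported spend and token count."""
    return total_usd / (total_tokens / 1_000_000)


tokenflex_rate = usd_per_million(30_983, 17e9)  # ~17B tokens in 30 days
memo_rate = usd_per_million(1_020, 340e6)       # ~340M tokens
tokens_per_event = 17e9 / 51_414                # Tokenflex API events

print(f"Tokenflex: ${tokenflex_rate:.2f} per 1M tokens")
print(f"Memo iOS app: ${memo_rate:.2f} per 1M tokens")
print(f"Tokenflex: about {tokens_per_event:,.0f} tokens per API event")
```

The Memo figure works out to $3.00 per 1M tokens, which coincides with the $3 input list price shown in the pricing table, though the source does not say the dollar figure was derived that way. Tokenflex comes to roughly $1.82 per 1M tokens across about 330k tokens per API event.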

FeatureBench

~9.7M input tokens / feature

End-to-end, feature-level coding tasks

SOTA models solved only ~11% of feature-level tasks despite high bug-fix scores. Proof that building new features burns tokens on retries — and the anchor for feature-count mode.
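
How feature-count mode uses this anchor is not specified on this page. The simplest reading, shown here purely as an assumption, is linear scaling of the per-feature figure; `feature_count_input_tokens` is a hypothetical helper.

```python
# ~9.7M input tokens per feature, the FeatureBench figure quoted above.
INPUT_TOKENS_PER_FEATURE = 9_700_000


def feature_count_input_tokens(n_features: int) -> int:
    """Input tokens for n features, assuming plain linear scaling."""
    if n_features < 0:
        raise ValueError("n_features must be non-negative")
    return n_features * INPUT_TOKENS_PER_FEATURE


print(f"{feature_count_input_tokens(12):,} input tokens for 12 features")
```

A linear model ignores coupling between features, so it is best read as a starting point that the density and loop multipliers would then adjust.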

An honest caveat. These are industry-composite estimates for planning, not quotes. Your actual spend depends on your operator skill, your test coverage, your session hygiene, and a measure of luck. We'd rather give you an honest range you can plan around than a precise number you can't trust.
