Methodology
How the number is built
Estimating the cost of an AI-agent build is not like counting lines of code. This page shows exactly how we go from a project description to a dollar range — and, just as importantly, why we refuse to give you a single number.
Why it's always a range
Agentic token spend has 10×+ run-to-run variance. Give an agent the same task twice and it can take wildly different paths — academic work puts the pre-build predictability at r<0.15, effectively noise. A single number would be false precision, and false precision destroys trust. So every estimate is three numbers: Low, Expected, and High.
Scope → tokens → dollars
The bridge nobody else builds. Token calculators only price a pasted string; agency calculators only give hours × rate. We connect project scope to token volume to dollars:
tokens = Σ_features [ loop_mult × (base_context × density + overhead_per_turn) ]
adjusted by cache_hit_rate
cost = tokens × blended_price(model)
→ Low / Expected / High

- loop_mult: plan→edit→verify→retry cycles per feature. Reliability falls as tasks get harder, so retries (and tokens) climb non-linearly with complexity.
- density: architectural coupling, from ~1.0 (isolated frontend) to ~2.5 (deeply coupled enterprise).
- overhead_per_turn: system prompt, tool schemas, and connected MCP servers (~18k tokens per turn, each). Non-productive but unavoidable.
- cache_hit_rate: cache reads are over 97% of agentic token volume. This single assumption is the biggest swing between Low and High.
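As a rough illustration, the formula above can be sketched in Python. This is a minimal sketch under stated assumptions, not the estimator's actual implementation: the `Feature` fields, the example values, and the way `cache_hit_rate` enters (as a weight between the cache-read and input prices, with output tokens ignored) are all hypothetical. Only the ~18k per-turn overhead, the density range, and the overall shape of the formula come from the text.

```python
from dataclasses import dataclass

# From the text: ~18k tokens of system prompt, tool schemas, and MCP overhead
# per turn. Treating it as one flat total is a simplification.
OVERHEAD_PER_TURN = 18_000


@dataclass
class Feature:
    """Hypothetical per-feature inputs; names mirror the formula's terms."""
    name: str
    loop_mult: float     # plan -> edit -> verify -> retry cycles
    base_context: float  # tokens of project context read per cycle
    density: float       # ~1.0 (isolated) to ~2.5 (deeply coupled)


def feature_tokens(f: Feature, overhead: float = OVERHEAD_PER_TURN) -> float:
    # loop_mult x (base_context x density + overhead_per_turn)
    return f.loop_mult * (f.base_context * f.density + overhead)


def total_tokens(features: list[Feature]) -> float:
    # Sum over all features in scope.
    return sum(feature_tokens(f) for f in features)


def cost_usd(tokens: float, cache_hit_rate: float,
             input_price: float, cache_read_price: float) -> float:
    # ASSUMPTION: cached tokens are billed at the cache-read price and the
    # rest at the input price (USD per 1M tokens). Output tokens are ignored.
    blended = (cache_hit_rate * cache_read_price
               + (1 - cache_hit_rate) * input_price)
    return tokens / 1_000_000 * blended


# Made-up example scope, priced with the Sonnet 4.6 input and cache-read rates.
features = [
    Feature("auth", loop_mult=40, base_context=60_000, density=1.5),
    Feature("billing webhooks", loop_mult=60, base_context=80_000, density=2.0),
]
tokens = total_tokens(features)
well_cached = cost_usd(tokens, 0.9, input_price=3.0, cache_read_price=0.3)
poorly_cached = cost_usd(tokens, 0.5, input_price=3.0, cache_read_price=0.3)
```

With these made-up inputs the two features total 15M tokens, and dropping the cache hit rate from 0.9 to 0.5 nearly triples the cost. That sensitivity is why the cache assumption dominates the spread between bands.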
What each band assumes
Low — a senior operator, high test coverage feeding an automated verify loop, strict session hygiene, high cache hit rate. The optimized regime (how vinext hit $1,100).
Expected — standard interactive development with moderate cache. Where most real projects land.
High — unoptimized debug loops, verbose logging streamed back into context, no resets. The runaway regime (how one build hit $30,983 in a month).
Token bands by project type
Raw agentic token volume per band, before the model's tokenizer multiplier. Anchored to the benchmark library below.
| Project type | Main token drivers | Low | Expected | High |
|---|---|---|---|---|
| Landing page | CSS / layout retries | 1M | 5M | 15M |
| Small CRUD app | DB migrations + API route mapping | 10M | 35M | 100M |
| SaaS MVP | Auth loops, webhooks, multi-file synchronization | 50M | 150M | 500M |
| Mobile app (Expo) | Device-emulation + multi-OS compile checks | 100M | 340M | 800M |
| Framework-scale | Test-suite feedback, deep dependency graph | 200M | 600M | 1.5B |
Model pricing
USD per 1M tokens, verified against provider documentation. We keep this current — it's the one figure that dates fastest.
| Model | Input | Output | Cache read | Tokenizer |
|---|---|---|---|---|
| Claude Opus 4.8 | $5 | $25 | $0.5 | ×1.35 |
| Claude Sonnet 4.6 | $3 | $15 | $0.3 | ×1 |
| Claude Haiku 4.5 | $1 | $5 | $0.1 | ×1 |
| OpenAI GPT-5.4 | $3 | $15 | $0.25 | ×1 |
| Gemini 3.1 Pro | $2 | $12 | $0.2 | ×1 |
Tokenizer note: Claude Opus 4.7 and later use a new tokenizer that consumes up to ~35% more tokens for the same text. That's a token-count multiplier, not just a price change — the same scope genuinely costs more tokens on the latest Opus, so we model it explicitly.
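To make the tokenizer multiplier concrete, here is a small sketch that turns one row of the token-band table into dollar bands for a given model. The band and price figures are copied from the tables above; the token mix in `EXAMPLE_MIX` is a hypothetical placeholder (the page does not publish its blend), and applying one mix to all three bands is a simplification, since the bands assume different cache hit rates.

```python
# Raw token bands, copied from the project-type table (one row shown).
TOKEN_BANDS = {
    "SaaS MVP": {"low": 50e6, "expected": 150e6, "high": 500e6},
}

# USD per 1M tokens and tokenizer multiplier, copied from the pricing table.
PRICING = {
    "Claude Opus 4.8": {"input": 5.0, "output": 25.0,
                        "cache_read": 0.5, "tokenizer": 1.35},
    "Claude Sonnet 4.6": {"input": 3.0, "output": 15.0,
                          "cache_read": 0.3, "tokenizer": 1.0},
}

# ASSUMPTION: hypothetical share of tokens by billing kind. Only the
# "cache reads are over 97%" figure is from the text; the rest is made up.
EXAMPLE_MIX = {"cache_read": 0.97, "input": 0.02, "output": 0.01}


def adjusted_tokens(raw_tokens: float, model: str) -> float:
    # The tokenizer multiplier scales the token count itself, not the price.
    return raw_tokens * PRICING[model]["tokenizer"]


def blended_price(model: str, mix: dict[str, float]) -> float:
    # Weighted average USD per 1M tokens under the given mix.
    prices = PRICING[model]
    return sum(share * prices[kind] for kind, share in mix.items())


def band_costs(project: str, model: str, mix: dict[str, float]) -> dict[str, float]:
    # Low / Expected / High dollar figures for one project type and model.
    rate = blended_price(model, mix)
    return {
        band: adjusted_tokens(raw, model) / 1e6 * rate
        for band, raw in TOKEN_BANDS[project].items()
    }


opus = band_costs("SaaS MVP", "Claude Opus 4.8", EXAMPLE_MIX)
sonnet = band_costs("SaaS MVP", "Claude Sonnet 4.6", EXAMPLE_MIX)
```

Under this sketch the Opus figure is higher for two separate reasons: a higher price per token and 1.35× as many tokens for the same scope. Because the mix is a placeholder, the absolute dollar values are only illustrative.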
What we anchor to
Estimates are only as honest as the real builds behind them. These are the documented outcomes our bands are calibrated against.
vinext
~$1,100
94% of the Next.js API surface, rebuilt on Vite; ~800 sessions
The LOW regime. A comprehensive public test suite gave the agent an automated verify loop, and an expert operator kept it off unviable directions — the two things that separate a $1,100 build from a $30,000 one.
Tokenflex
$30,983
Leaderboard web app; 51,414 API events, ~17B tokens in 30 days
The HIGH regime. Continuous sessions with no context resets let verbose terminal output stream back into the prompt, creating a geometric cost curve. The cautionary tale every High band is calibrated against.
Memo iOS app
~340M tokens (~$1,020)
Native memo app; core utilities + UI components
Context decay over a long build: even minor changes made the agent re-analyze the codebase, inflating cumulative spend. Anchors the mobile Expected band.
FeatureBench
~9.7M input tokens / feature
End-to-end, feature-level coding tasks
SOTA models solved only ~11% of feature-level tasks despite high bug-fix scores. Proof that building new features burns tokens on retries — and the anchor for feature-count mode.
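One way to sanity-check any blended-price assumption against these anchors is to divide each documented spend by its documented token count. The sketch below does only that arithmetic for the two anchors that report both figures. The token counts are approximate in the source, so the resulting rates are approximate too; vinext and FeatureBench are left out because the page does not give both a dollar and a token figure for them.

```python
# Spend and token volume as reported above (token counts are approximate).
ANCHORS = {
    "Tokenflex": {"tokens": 17e9, "usd": 30_983},
    "Memo iOS app": {"tokens": 340e6, "usd": 1_020},
}


def effective_rate(tokens: float, usd: float) -> float:
    # Effective blended price in USD per 1M tokens.
    return usd / (tokens / 1e6)


rates = {name: effective_rate(**figures) for name, figures in ANCHORS.items()}

for name, rate in rates.items():
    print(f"{name}: ${rate:.2f} per 1M tokens")
# Tokenflex comes out near $1.82 per 1M tokens, the Memo app at $3.00.
```

If a modelled blend lands far outside this range, that suggests the assumed token mix deserves a second look before the estimate is trusted.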
An honest caveat. These are industry-composite estimates for planning, not quotes. Your actual spend depends on your operator skill, your test coverage, your session hygiene, and a measure of luck. We'd rather give you an honest range you can plan around than a precise number you can't trust.