Rosa Del Mar

Daily Brief

Issue 64 2026-03-05

Pricing Mechanics For Long-Context Usage

Sources: 1 • Confidence: Medium • Updated: 2026-04-13 03:56

Key takeaways

  • GPT-5.4 pricing is slightly higher than that of the GPT-5.2 family, and both GPT-5.4 models cost more once usage exceeds 272,000 tokens.
  • GPT-5.4 outperforms GPT-5.3-Codex on relevant coding benchmarks.
  • On an internal benchmark of spreadsheet modeling tasks resembling junior investment banking analyst work, GPT-5.4 scored 87.3% versus 68.4% for GPT-5.2.
  • In one reported instance, generating an image with GPT-5.4 Pro took 4 minutes 45 seconds and cost $1.55.
  • It is currently unclear whether a GPT-5.4-Codex variant will be released or whether the Codex line has been merged into the main model family.

Sections

Pricing Mechanics For Long-Context Usage

  • GPT-5.4 pricing is slightly higher than that of the GPT-5.2 family, and both GPT-5.4 models cost more once usage exceeds 272,000 tokens.
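
How the 272,000-token threshold is billed has not been published (see Unknowns below). Assuming the simplest reading, that the higher rate applies only to tokens past the threshold within a single request, the mechanics can be sketched as follows; the per-token rates here are placeholders, not published GPT-5.4 prices:

```python
def estimate_input_cost(input_tokens: int,
                        base_rate: float = 1.25e-6,   # hypothetical $/token at or below threshold
                        long_rate: float = 2.50e-6,   # hypothetical $/token above threshold
                        threshold: int = 272_000) -> float:
    """Estimate input cost under a two-tier, threshold-based price.

    Assumes marginal pricing: only tokens beyond the threshold are
    billed at the higher rate. Rates and billing unit are placeholders.
    """
    below = min(input_tokens, threshold)
    above = max(input_tokens - threshold, 0)
    return below * base_rate + above * long_rate
```

An alternative interpretation, also consistent with the source, is that the entire request is billed at the higher rate once it crosses the threshold; which reading applies, and over what unit (request, day, billing interval), is an open question.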

Capability Crossover: General Model Vs Coding Specialist

  • GPT-5.4 outperforms GPT-5.3-Codex on relevant coding benchmarks.

Business Spreadsheet Modeling Performance Jump

  • On an internal benchmark of spreadsheet modeling tasks resembling junior investment banking analyst work, GPT-5.4 scored 87.3% versus 68.4% for GPT-5.2.

Image Generation Cost And Latency Constraints

  • In one reported instance, generating an image with GPT-5.4 Pro took 4 minutes 45 seconds and cost $1.55.

Product-Line Uncertainty: Codex SKU Vs Consolidation

  • It is currently unclear whether a GPT-5.4-Codex variant will be released or whether the Codex line has been merged into the main model family.

Watchlist

  • It is currently unclear whether a GPT-5.4-Codex variant will be released or whether the Codex line has been merged into the main model family.

Unknowns

  • What are the published per-token rates for GPT-5.4 at or below the 272,000-token threshold versus above it, and how exactly is the threshold applied (per request, per day, per billing interval, or another unit)?
  • Which specific coding benchmarks support the claim that GPT-5.4 outperforms GPT-5.3-Codex, and what are the evaluation details (task mix, constraints, scoring, variance)?
  • Will there be a GPT-5.4-Codex (or equivalent coding-specialized) SKU, and if so, how will it differ in capability, price, and limits from GPT-5.4?
  • Is the spreadsheet modeling benchmark result reproducible in third-party evaluations, and how does it translate into real-world spreadsheet/model error rates and correction burden?
  • What is the typical latency and cost distribution for GPT-5.4 Pro image generation across prompts, times, and regions, and how much variability should systems expect?

Investor overlay

Read-throughs

  • Threshold-based pricing above 272,000 tokens could raise effective costs for long-context workloads, shifting demand toward shorter-context designs or selective context strategies and influencing vendor margin mix if long-context adoption grows.
  • If a general model consistently beats a coding-specialist SKU on coding benchmarks, enterprise buyers may consolidate toolchains onto fewer SKUs, affecting attach rates and pricing power of specialized developer offerings.
  • Large internal gains on spreadsheet modeling tasks suggest potential productivity improvements for financial modeling workflows, but investability depends on third-party reproducibility and measurable reductions in model error and correction time.
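
A selective-context strategy of the kind threshold pricing could encourage can be sketched minimally: keep only the most recent messages that fit a token budget set just below the surcharge threshold. The token counter here is a whitespace stand-in for a real tokenizer, and the budget value is illustrative:

```python
def trim_context(messages: list[str], token_budget: int,
                 count_tokens=lambda s: len(s.split())) -> list[str]:
    """Drop oldest messages until the estimated token count fits the budget.

    A naive selective-context policy: retain the most recent messages.
    count_tokens is a stand-in; real systems would use the model's tokenizer.
    """
    kept: list[str] = []
    total = 0
    for msg in reversed(messages):          # newest first
        cost = count_tokens(msg)
        if total + cost > token_budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))             # restore chronological order
```

In practice a caller would set the budget below the pricing threshold, e.g. `trim_context(history, token_budget=270_000)`, trading recall of older context for a lower effective rate.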

What would confirm

  • Published per-token rates for GPT-5.4 at and above the 272,000-token threshold, plus clear definition of how the threshold is applied and billed; customer disclosures showing material spend impact on long-context usage.
  • Independent benchmark results showing GPT-5.4 outperforming GPT-5.3-Codex on clearly described coding tasks with variance and constraints; customer migration from coding-specialist SKUs to the general model.
  • Third-party evaluations reproducing the spreadsheet modeling improvement and demonstrating lower real-world spreadsheet or model error rates and reduced correction burden in analyst-like workflows.

What would kill

  • Pricing clarification showing the 272,000-token threshold rarely applies in practice or the effective uplift is minimal, limiting any unit economics or demand impact from long-context pricing mechanics.
  • Independent coding evaluations failing to show GPT-5.4 superiority over GPT-5.3-Codex, or a new coding-specialized SKU that re-establishes a performance gap and preserves the need for specialized developer models.
  • External tests showing the spreadsheet modeling benchmark does not generalize to real workflows, or that error rates and correction times are similar to prior models, reducing the practical productivity read-through.

Sources

  1. 2026-03-05 simonwillison.net