Rosa Del Mar

Daily Brief

Issue 64 2026-03-05

Model Capability Positioning For Coding

Issue 64 Edition 2026-03-05 5 min read
Not accepted General
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:22

Key takeaways

  • It is uncertain whether a GPT-5.4 Codex variant will be released or whether the Codex line has been merged into the main model family.
  • GPT-5.4 pricing is slightly higher than the GPT-5.2 family, and both GPT-5.4 models cost more when usage exceeds 272,000 tokens.
  • On an internal benchmark of spreadsheet modeling tasks resembling junior investment banking analyst work, GPT-5.4 scored 87.3% versus 68.4% for GPT-5.2.
  • In one reported run, generating an image with GPT-5.4 Pro took 4 minutes 45 seconds and cost $1.55.
  • GPT-5.4 outperforms the coding-specialist GPT-5.3-Codex on relevant benchmarks.

Sections

Model Capability Positioning For Coding

  • It is uncertain whether a GPT-5.4 Codex variant will be released or whether the Codex line has been merged into the main model family.
  • GPT-5.4 outperforms the coding-specialist GPT-5.3-Codex on relevant benchmarks.

Pricing And Long Context Cost Structure

  • GPT-5.4 pricing is slightly higher than the GPT-5.2 family, and both GPT-5.4 models cost more when usage exceeds 272,000 tokens.

Spreadsheet Analytic Task Performance

  • On an internal benchmark of spreadsheet modeling tasks resembling junior investment banking analyst work, GPT-5.4 scored 87.3% versus 68.4% for GPT-5.2.

Image Generation Latency And Task Cost

  • In one reported run, generating an image with GPT-5.4 Pro took 4 minutes 45 seconds and cost $1.55.

Watchlist

  • It is uncertain whether a GPT-5.4 Codex variant will be released or whether the Codex line has been merged into the main model family.

Unknowns

  • What are the published per-token rates for GPT-5.4 at or below 272,000 tokens versus above 272,000 tokens, and how is the threshold applied in billing?
  • Which specific coding benchmarks support the claim that GPT-5.4 outperforms GPT-5.3-Codex, and what are the exact results and evaluation conditions?
  • Will there be a distinct GPT-5.4 Codex variant, and if not, what explicit product-line change (merge, rename, or deprecation) is announced?
  • What is the design of the internal spreadsheet modeling benchmark (task set, grading rubric, error tolerance), and are there third-party replications comparing GPT-5.4 to GPT-5.2?
  • What is the typical (median/p95) latency and cost distribution for GPT-5.4 Pro image generation across prompts and times, and what factors drive variance?

Investor overlay

Read-throughs

  • If GPT-5.4 general models outperform a coding specialist SKU, product positioning may shift toward a single flagship for coding and analytics, potentially reducing the need for separate Codex branding.
  • A 272000 token step up in costs could meaningfully change economics for long context workloads, influencing customer usage patterns, architecture choices, and willingness to adopt GPT-5.4 for heavy context tasks.
  • Large reported gains on internal spreadsheet modeling tasks suggest improved fit for finance and analytics workflows, which could increase demand from business users if results generalize beyond the internal benchmark.

What would confirm

  • Clear product announcement on whether GPT-5.4 Codex exists or whether Codex is merged, renamed, or deprecated, including how coding SKUs are positioned relative to the main model family.
  • Published pricing tables detailing GPT-5.4 per token rates below and above 272000 tokens and explicit billing mechanics for how the threshold is applied.
  • Third party or reproducible benchmark disclosures for the spreadsheet and coding claims, including task sets, grading rubric, and results comparing GPT-5.4 with GPT-5.2 and GPT-5.3-Codex.

What would kill

  • Evidence that GPT-5.4 does not consistently beat GPT-5.3-Codex on credible coding benchmarks or that evaluation conditions favor GPT-5.4 in ways not reflective of real usage.
  • Clarification that the 272000 token threshold rarely triggers or is applied differently than implied, limiting any practical change in effective unit cost for long context workloads.
  • External tests showing the internal spreadsheet modeling delta does not replicate or yields materially smaller gains versus GPT-5.2, reducing the case for improved finance analytics capability.

Sources

  1. 2026-03-05 simonwillison.net