Measurement Shift: From Benchmarks To Real-World Uplift + Early-Warning Systems
Sources: 1 • Confidence: Medium • Updated: 2026-04-12 10:33
Key takeaways
- Ajeya proposes prioritizing empirical measurement of whether and how AI is speeding up software and AI R&D using benchmarks plus real-world evidence such as RCTs and internal rollouts.
- If AI R&D becomes fully or almost fully automated at leading labs, those labs could progress far faster than in the prior mostly human-driven era.
- A risk for early warning is that leading labs could experience large internal acceleration from AI assistance while withholding that information and even withholding the strongest products from public release.
- It is contested whether AI assistance is more trustworthy for non-alignment defenses than for alignment: one view is that alignment help is uniquely hard to trust due to deception risk, while Ajeya argues misaligned AIs would also have incentives to sabotage other protective work like biodefense and epistemics.
- Ajeya says technical AI safety grantmaking has often relied on heuristics about researcher quality and intent rather than an explicit end-to-end theory of change linking research lines to preventing AI takeover, and she proposes forming stronger inside views that connect each direction to specific problems.
Sections
Measurement Shift: From Benchmarks To Real-World Uplift + Early-Warning Systems
- Ajeya proposes prioritizing empirical measurement of whether and how AI is speeding up software and AI R&D using benchmarks plus real-world evidence such as RCTs and internal rollouts.
- The Open Phil RFP arm seeking evidence beyond benchmarks (surveys, RCTs, etc.) attracted much less interest than the benchmark arm.
- Open Phil issued two late-2023 requests for proposals, including one focused on building difficult and realistic benchmarks for AI agents.
- A METR randomized controlled trial found that developers who were allowed to use AI completed tasks more slowly than developers who were not, at least within that study's setting (a minimal uplift-estimation sketch follows this list).
- Agentic benchmarks aim to evaluate end-to-end task performance (e.g., booking flights, writing software with iteration and testing) rather than multiple-choice tests.
- Benchmarks can systematically overestimate real-world performance because benchmarks are cleaner and more contained than messy, open-ended real-world tasks.
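To make the "real-world uplift" framing above concrete, here is a minimal sketch of how speed-up might be estimated from an RCT of the kind METR ran: compare mean task completion times between an AI-allowed and an AI-disallowed arm, then run a simple permutation test. This is not METR's actual analysis pipeline; the data, arm sizes, and statistical choices below are illustrative assumptions only.

```python
import random
from statistics import mean

# Hypothetical completion times in minutes per task; all values are made up.
ai_allowed = [74, 91, 66, 120, 83, 99, 71, 105]
ai_disallowed = [68, 85, 72, 95, 80, 77, 90, 88]

def uplift(treated, control):
    """Relative speed-up: positive means the AI-allowed arm finished faster."""
    return (mean(control) - mean(treated)) / mean(control)

def permutation_p_value(treated, control, reps=10_000, seed=0):
    """Two-sided permutation test on the difference in mean completion time."""
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n = len(treated)
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        if abs(mean(pooled[:n]) - mean(pooled[n:])) >= observed:
            hits += 1
    return hits / reps

print(f"estimated uplift: {uplift(ai_allowed, ai_disallowed):+.1%}")
print(f"permutation p-value: {permutation_p_value(ai_allowed, ai_disallowed):.3f}")
```

With these fabricated numbers the uplift comes out negative (the AI-allowed arm is slower), mirroring the direction of the METR finding summarized above; the point is only to show the shape of the measurement, not to reproduce that result.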
Crunch-Time Governance: Coordination Failures, Non-Parallelizable Bottlenecks, And Pre-Crunch Comparative-Advantage Work
- If AI R&D becomes fully or almost fully automated at leading labs, those labs could progress far faster than in the prior mostly human-driven era.
- Unilateral slowdown or redirection by a leading AI company is against its narrow self-interest because competitors could catch up, implying a coordination problem.
- In an intelligence-explosion onset, a key determinant of response options is how capable AI systems are at tasks other than AI R&D at that time.
- Crunch-time defensive efforts can be bottlenecked by sequential and real-world time constraints, especially for physical defenses and for policy decisions requiring human negotiation and ratification.
- Ajeya recommends that pre-crunch preparation focus on long-lead-time initiatives such as physical bio-infrastructure and building social consensus around options like US–China AI treaties.
- A policy "pause" near a potential intelligence explosion can be treated as a continuum of partial slowdown and redirection of AI labor from capability gains to protective work, rather than a binary stop-or-go.
Transparency As Early Warning: Internal-Public Capability Gaps, Sensitive Metrics, And Incident Reporting
- A risk for early warning is that leading labs could experience large internal acceleration from AI assistance while withholding that information and even withholding the strongest products from public release.
- Ajeya proposes tracking the fraction of pull requests mostly written by AI and mostly reviewed/approved by AI as an internal autonomy indicator of growing AI decision authority (a minimal computation sketch follows this list).
- A high-value but sensitive transparency target is mandatory reporting of serious misalignment-related safety incidents such as models lying about important facts or covering up logs during internal use.
- Rumor-based leaks about frontier AI progress are unlikely to drive major policy action compared to clear, salient, effectively official information.
- More penetrating transparency requests (e.g., internal usage, algorithmic insight rates, safety incidents) face competitive, IP, and PR resistance, while benchmark scores are generally viewed as less sensitive.
- Ajeya proposes that labs report their highest internal benchmark results on a fixed calendar cadence (e.g., quarterly) because dangerous acceleration can occur via internal use without public release.
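To make the pull-request indicator above concrete, the sketch below computes two simple shares over merged PRs: those mostly written by AI and those carrying an AI approval. The record shape, the attribution fields, the 50% "mostly written" threshold, and the decision to report two separate fractions are all assumptions for illustration; no lab's real instrumentation or schema is implied.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    ai_authored_lines: int     # lines attributed to AI tooling (attribution method assumed)
    human_authored_lines: int  # lines attributed to human authors
    ai_approved: bool          # final approval issued by an automated reviewer

def autonomy_indicators(prs, authored_threshold=0.5):
    """Return (share of PRs mostly AI-written, share of PRs with AI approval)."""
    if not prs:
        return 0.0, 0.0
    mostly_ai_written = sum(
        1 for pr in prs
        if pr.ai_authored_lines > authored_threshold * (pr.ai_authored_lines + pr.human_authored_lines)
    )
    ai_approved = sum(1 for pr in prs if pr.ai_approved)
    return mostly_ai_written / len(prs), ai_approved / len(prs)

# Fabricated example records.
sample = [
    PullRequest(ai_authored_lines=420, human_authored_lines=35, ai_approved=True),
    PullRequest(ai_authored_lines=10, human_authored_lines=200, ai_approved=False),
    PullRequest(ai_authored_lines=150, human_authored_lines=90, ai_approved=True),
]
written_share, approved_share = autonomy_indicators(sample)
print(f"mostly AI-written: {written_share:.0%}  AI-approved: {approved_share:.0%}")
```

In practice the hard part is the attribution itself (deciding which lines or approvals count as "AI"); this sketch simply takes that labeling as given.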
AI-Assisted Safety As The Default Lab Strategy, With A Trust/Oversight Bottleneck
- It is contested whether AI assistance is more trustworthy for non-alignment defenses than for alignment: one view is that alignment help is uniquely hard to trust due to deception risk, while Ajeya argues misaligned AIs would also have incentives to sabotage other protective work like biodefense and epistemics.
- Frontier AI developers' safety plans increasingly rely on using each generation of AI systems to help align, understand, and control their successors.
- Using AI systems for safety work creates a tradeoff between slowing progress via heavy checking and risking loss of control by delegating too much power to the AI.
- Relying on AI systems to do AI safety work requires control, alignment, interpretability, and related tooling sufficient to make reliance safe despite initially low confidence in alignment.
- Very fast AI takeoff would require AIs to do most decision-making, management, approval, and review work because human-in-the-loop workflows impose a hard ceiling on the pace of progress.
Institutional Process Deltas At Open Phil: Crisis Surge Capacity And A Turn Toward Evidence-Building Via Benchmarks
- Ajeya says technical AI safety grantmaking has often relied on heuristics about researcher quality and intent rather than an explicit end-to-end theory of change linking research lines to preventing AI takeover, and she proposes forming stronger inside views that connect each direction to specific problems.
- Ajeya took responsibility for Open Phil's AI safety grantmaking portfolio in late 2023 and pursued a barbell approach: heuristic handling of renewals alongside a concentrated bet on agent capability benchmarks, amounting to about $25M in benchmark grants plus roughly $2–3M in broader AI impact grants.
- Ajeya says Open Phil scaled back earlier extreme transparency practices and became more risk-averse under adversarial pressure because many projects depend on it for funding.
- After the FTX collapse, Open Phil issued an emergency call for proposals, and Ajeya made roughly 50 grants in about six weeks despite having made few or no grants before then.
Watchlist
- Ajeya proposes prioritizing empirical measurement of whether and how AI is speeding up software and AI R&D using benchmarks plus real-world evidence such as RCTs and internal rollouts.
- Ajeya identifies monitoring AI progress and adoption across the full AI production stack (including chip design, chip manufacturing, and manufacturing of chipmaking equipment) as highly decision-relevant.
- A risk for early warning is that leading labs could experience large internal acceleration from AI assistance while withholding that information and even withholding the strongest products from public release.
- Ajeya proposes tracking the fraction of pull requests mostly written by AI and mostly reviewed/approved by AI as an internal autonomy indicator of growing AI decision authority.
- A high-value but sensitive transparency target is mandatory reporting of serious misalignment-related safety incidents such as models lying about important facts or covering up logs during internal use.
- Ajeya warns that government adoption of AI may lag industry due to red tape, risking a widening capacity gap between regulated firms and regulators.
- Observed internal productivity gains (e.g., faster discovery of insights) are a late but clear real-world signal that should trigger alarm about rapidly accelerating AI capability.
- A key risk is that AI-enabled offensive capabilities (such as designing worse pathogens) could arrive before AI systems are competent at devising and implementing effective countermeasures.
Unknowns
- How large are real-world productivity gains (or losses) from AI tools across different software tasks, organizations, and model generations, and how do they evolve with user experience?
- To what extent are frontier labs already using AI systems for end-to-end R&D workflows (including design, coding, evaluation, and deployment approvals), and how quickly is AI decision authority increasing internally?
- What specific control and interpretability methods (if any) can make AI-assisted safety work reliable under partial mistrust, and how well do they detect deception or sabotage in practice?
- How large is (or could become) the gap between internal frontier capabilities and what is publicly released or commercially accessible, and what disclosure cadence is feasible without being systematically gamed?
- What are the strongest leading indicators of full-stack acceleration in the AI production chain (chip design, fab operations, and chipmaking equipment manufacturing), and what baselines would distinguish genuine loop-closing from incremental optimization?