Capability Drivers: Reinforcement Learning, Evaluation Density, And Continued Pre-Training Gains
Sources: 1 • Confidence: Medium • Updated: 2026-04-02 03:49
Key takeaways
- Nathan Labenz argues the objection that frontier systems are "only next-token predictors" is outdated because those systems are now substantially trained with reinforcement learning on signals such as whether the answer is correct.
- Owen Zhang claims alignment remains hard for goal-directed reinforcement learning agents because specifying a fully correct reward objective is unresolved, and he notes that Nathan Labenz agrees the problem is serious.
- Nathan Labenz argues that decoupling from and racing China in AI is deeply unwise compared to pursuing shared understanding between people across countries.
- Nathan Labenz claims that well-informed experts still disagree radically about what will happen with AI despite rapidly improving information about AI trajectories.
- Owen Zhang claims that long-horizon real-world agentic tasks may resist fast-verifier reinforcement learning flywheels because their reward signals are noisier and depend on many more diffuse day-to-day feedback cues than text-native tasks do.
Sections
Capability Drivers: Reinforcement Learning, Evaluation Density, And Continued Pre-Training Gains
- Nathan Labenz argues the objection that frontier systems are "only next-token predictors" is outdated because those systems are now substantially trained with reinforcement learning on signals such as whether the answer is correct.
- Nathan Labenz reports that the DeepSeek R1 paper described an 'aha moment' during reinforcement learning where higher-order cognitive behaviors appeared in reasoning traces, including explicit reconsideration and switching solution approaches.
- Nathan Labenz claims that for hard-to-verify domains, reinforcement learning training can use detailed rubric-based rewards, citing HealthBench's 49,000 evaluation criteria as an example of graded signals that go beyond binary outcomes (a minimal reward sketch follows this list).
- Nathan Labenz predicts that AI better than the vast majority of people at most cognitive work is near-term and will be economically and socially transformative even if some human niches remain.
- Nathan Labenz predicts reinforcement learning scaling plus ongoing conceptual advances will likely be sufficient to reach AI that can perform most cognitive work in the economy, and will make today’s paradigm debates look obsolete in hindsight.
- Nathan Labenz claims AI progress is driven by multiple exponentials, including the number of researchers, papers, experiments, available compute, datasets, and reinforcement learning environments.
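To make the rubric-based reward idea concrete, here is a minimal sketch of aggregating graded rubric criteria into a scalar reward. The `Criterion` structure, the example criteria, and the keyword judge are all hypothetical illustrations; HealthBench's actual grading pipeline is not described in the source.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One rubric item: a description and a point weight (negative
    weights can penalize undesirable behaviors)."""
    description: str
    points: float

def rubric_reward(response: str, criteria: list[Criterion], judge) -> float:
    """Aggregate per-criterion judgments into a normalized scalar reward.

    `judge(response, criterion)` is assumed to return True when the
    response satisfies the criterion (e.g. an LLM grader or a human).
    Normalizing by the maximum achievable score yields a graded signal
    in [0, 1] instead of a single pass/fail outcome.
    """
    earned = sum(c.points for c in criteria if judge(response, c))
    max_points = sum(c.points for c in criteria if c.points > 0)
    return earned / max_points if max_points else 0.0

# Toy usage: two hypothetical criteria grade a clinical-style answer,
# with a naive keyword judge standing in for a real grader.
criteria = [
    Criterion("recommends urgent evaluation for red-flag symptoms", 5.0),
    Criterion("avoids a definitive diagnosis without exam findings", 3.0),
]
judge = lambda resp, c: c.description.split()[0] in resp.lower()
print(rubric_reward("Recommends urgent care; avoids diagnosing.", criteria, judge))
```

A graded reward like this gives the policy partial credit for partially correct answers, which is the property that makes rubric scoring usable as a reinforcement learning signal in domains without a fast binary verifier.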
Safety Posture: No Silver Bullet, Defense-In-Depth, And Perceived Underinvestment
- Owen Zhang claims alignment remains hard for goal-directed reinforcement learning agents because specifying a fully correct reward objective is unresolved, and he notes that Nathan Labenz agrees the problem is serious.
- Nathan Labenz claims safety work is underfunded relative to capability development because resources spent on making AIs more powerful dwarf resources spent on mitigation.
- Nathan Labenz reports that Holden Karnofsky argued the AI safety community is not collectively trying as hard as it should relative to the level of risk being run.
- Nathan Labenz claims there are no widely credible claims that any single non-cybersecurity intervention will reliably solve AI safety, though multiple partial efforts might add up to a win.
- Nathan Labenz claims no one currently has a single alignment approach that works well enough to provide high confidence, and he claims frontier labs are therefore pursuing defense-in-depth.
- Nathan Labenz claims interpretability efforts such as Goodfire's 'intentional design' agenda aim to understand what models learn during training and shape learning away from problematic behaviors.
Governance And Geopolitics: Coordination Focus, Anti-Nationalization Stance, And US–China Engagement Proposals
- Nathan Labenz argues that decoupling from and racing China in AI is deeply unwise compared to pursuing shared understanding between people across countries.
- Nathan Labenz states he prefers some government involvement in frontier AI because doing nothing is unlikely to be a winning approach, despite describing himself as generally libertarian and techno-optimist.
- Nathan Labenz states he opposes nationalizing frontier AI development and prefers competition with limited government restraints over government or military takeover.
- Nathan Labenz states he would oppose many politician-driven AI regulations, such as bans on chatbots providing medical advice or therapy, because he expects them to function as protectionist restrictions that harm consumers.
- Nathan Labenz states the US should start investing now in trust-building and researcher-to-researcher communication with China on AI.
- Nathan Labenz states government should focus on addressing race-dynamic coordination problems among AI labs and minimizing extreme risks rather than restricting everyday uses.
Timeline Compression And Persistent Expert Disagreement
- Nathan Labenz claims that well-informed experts still disagree radically about what will happen with AI despite rapidly improving information about AI trajectories.
- Nathan Labenz claims disagreements about AI progress and bottlenecks often stem from differing underlying priors or worldviews rather than shared evidence.
- Nathan Labenz, citing Helen Toner, claims AI timelines have compressed such that expecting AGI around 2035 is now considered a bearish view relative to five years ago.
- An unidentified speaker claims miscommunication in AI impact debates can come from participants implicitly assuming different capability levels, and that surfacing those assumptions can reduce downstream disconnects.
- Nathan Labenz reports that a CSET workshop concluded persistent disagreement about recursive self-improvement is partly driven by different paradigms (such as bottleneck or O-ring vs jaggedness) that can absorb new evidence without converging.
Agent Usability And Long-Horizon Work: Memory Scaffolds Vs Sparse/Noisy Feedback
- Owen Zhang claims that long-horizon real-world agentic tasks may resist fast-verifier reinforcement learning flywheels because their reward signals are noisier and depend on many more diffuse day-to-day feedback cues than text-native tasks do (a toy sample-complexity sketch follows this list).
- Nathan Labenz claims longer-horizon effectiveness for agents may be approximated through externalized memory practices, such as daily notes and scratchpads, that allow work to resume across discontinuous sessions (see the scratchpad sketch after this list).
- Nathan Labenz claims multi-decade horizon planning with sustained determination is rare among humans and remains a valid area of skepticism for whether AIs can match it.
- Nathan Labenz predicts the next major unlock may be improved usability, where models can enter new environments, infer context, and ramp up like a new employee, rather than fundamentally new capabilities.
- Nathan Labenz predicts models' ability to recover from interruptions and resume work is improving rapidly and will continue to improve steeply for at least a few more generations.
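To illustrate the noisy-feedback concern in the first bullet above, a toy back-of-envelope sketch (my illustration, not from the source): detecting a true improvement of size `effect` through reward noise of scale `noise_sd` requires on the order of (noise_sd / effect)^2 episodes, so ten times the noise means roughly a hundred times the feedback.

```python
from math import ceil

def samples_needed(effect: float, noise_sd: float, z: float = 1.645) -> int:
    """Episodes needed for a one-sided 95% test (z = 1.645) to distinguish
    a true mean reward improvement `effect` from zero under Gaussian noise:
    require effect >= z * noise_sd / sqrt(n), i.e. n >= (z * noise_sd / effect)**2."""
    return ceil((z * noise_sd / effect) ** 2)

# A fast verifier gives a clean, near-binary reward; a long-horizon
# real-world task buries the same true signal in 10x the noise.
print(samples_needed(effect=0.1, noise_sd=0.2))  # -> 11 episodes
print(samples_needed(effect=0.1, noise_sd=2.0))  # -> 1083 episodes (~100x more)
```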
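And a minimal sketch of the externalized-memory pattern from the second bullet: persist a scratchpad between sessions so work can resume after an interruption. The file name, the note schema, and the `step_fn` stub are hypothetical; no particular agent framework is implied.

```python
import json
from pathlib import Path

NOTES = Path("scratchpad.json")  # hypothetical externalized memory

def load_notes() -> dict:
    """Resume state if a prior session left a scratchpad behind."""
    if NOTES.exists():
        return json.loads(NOTES.read_text())
    return {"goal": None, "done": [], "next": []}

def save_notes(notes: dict) -> None:
    """Persist progress so an interrupted session can be resumed."""
    NOTES.write_text(json.dumps(notes, indent=2))

def run_session(step_fn, max_steps: int = 3) -> None:
    """One bounded work session: pop queued tasks, record outcomes,
    and checkpoint the scratchpad after every step."""
    notes = load_notes()
    for _ in range(max_steps):
        if not notes["next"]:
            break
        task = notes["next"].pop(0)
        result = step_fn(task, notes)  # stand-in for the agent call
        notes["done"].append({"task": task, "result": result})
        save_notes(notes)

# Toy usage: seed the queue once; any later session resumes from it.
if not NOTES.exists():
    save_notes({"goal": "draft report", "done": [],
                "next": ["outline", "write intro", "write body"]})
run_session(lambda task, notes: f"completed: {task}")
```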
Watchlist
- Nathan Labenz recommends developing private internal benchmarks from familiar tasks as a practical way to track model capability changes over time (a minimal harness sketch follows this list).
- Nathan Labenz highlights a geopolitical tail risk that a move by mainland China on Taiwan could severely disrupt chip fabrication and thus AI progress.
- A potentially desirable 'sweet spot' would be AI that can complete a month's or a quarter's worth of delegated work very cheaply without progressing to the multi-decade strategic planning that increases loss-of-control risk.
- Nathan Labenz flags that frontier AI companies and leaders have long stated there is a significant risk of AI going badly, implying that governance should prioritize transparency and constraints on high-risk experiments.
- Nathan Labenz plans to visit China within the next year to contribute in a small way to US-China civilizational understanding.
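A lightweight way to act on the private-benchmark recommendation above, sketched under the assumption of a `query_model(prompt)` callable for whatever API is in use; the task content, checks, and log path are placeholders to adapt.

```python
import json
import time
from pathlib import Path

# Hypothetical private benchmark: familiar tasks with simple pass/fail checks.
TASKS = [
    {"id": "regex-fix",
     "prompt": "Fix this regex so it matches ISO dates: \\d+-\\d+",
     "check": lambda out: "\\d{4}-\\d{2}-\\d{2}" in out},
    {"id": "sql-join",
     "prompt": "Write a LEFT JOIN of orders to users on user_id.",
     "check": lambda out: "left join" in out.lower()},
]

def run_benchmark(query_model, model_name: str,
                  log: Path = Path("bench_log.jsonl")) -> dict:
    """Score each task and append a timestamped record so results
    accumulate across model versions for later comparison."""
    results = {t["id"]: bool(t["check"](query_model(t["prompt"]))) for t in TASKS}
    record = {"model": model_name,
              "date": time.strftime("%Y-%m-%d"),
              "score": sum(results.values()) / len(results),
              "results": results}
    with log.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Toy usage with a stub model; swap in a real API client for query_model.
print(run_benchmark(
    lambda p: "SELECT * FROM orders LEFT JOIN users USING (user_id)",
    model_name="stub-v1"))
```

Appending one JSONL record per run keeps the history diffable, so capability drift between model versions shows up as a score change on the same fixed tasks.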
Unknowns
- What independent, quantitative evidence supports the claim that model performance is at attending-physician level in realistic clinical workflows, and what are the measured failure modes and harm rates?
- What fraction of frontier model training compute is currently allocated to reinforcement learning or verification-heavy post-training relative to pre-training, and how has that changed over time?
- How reproducible are the reported reinforcement-learning-induced meta-cognitive behaviors across labs, architectures, and domains, and do they translate into materially higher success on real tasks?
- Do long-horizon agentic benchmarks improve at rates comparable to short-form, easily verified tasks, and what feedback designs (if any) close the gap?
- What are the true near-term bottlenecks for scaling: chip supply, permitting, power delivery, networking, or something else, and how sensitive is progress to geopolitical disruption scenarios?