Rosa Del Mar

Daily Brief

Issue 76 • 2026-03-17

Core Capability Bottlenecks: Persistent Plasticity And Causal Reasoning

General
Sources: 1 • Confidence: Medium • Updated: 2026-04-11 18:05

Key takeaways

  • Misra claims that LLMs do Bayesian-style updating during an interaction but do not retain learning across sessions because their weights are frozen after training.
  • Misra models an LLM as an implicit, extremely large but sparse mapping from prompts to next-token probability distributions, approximated via compression rather than explicit storage.
  • Misra reports that in wind-tunnel experiments, transformers matched the Bayesian posterior to within about 1e-3 bits.
  • Misra argues that neither passing the Turing test nor doing economically useful work is a sufficient definition of AGI, because neither implies autonomous performance without human intervention.
  • Misra rejects claims that current LLMs are conscious or have an inner monologue.

Sections

Core Capability Bottlenecks: Persistent Plasticity And Causal Reasoning

  • Misra claims that LLMs do Bayesian-style updating during an interaction but do not retain learning across sessions because their weights are frozen after training.
  • Misra claims AGI-level progress requires solving two core problems: robust plasticity via continual learning and the ability to build causal models from data efficiently.
  • A speaker identifies Pearl’s causal hierarchy and do-calculus as an appropriate theoretical framework for advancing from association to intervention and counterfactual reasoning, enabling grounded simulation.
  • Misra claims current deep learning primarily captures correlation (association) rather than causal reasoning that supports intervention and counterfactual simulation (as in Pearl’s causal hierarchy).
  • Misra frames deep learning as closer to Shannon-entropy-style correlation learning, while human-level insight is linked to low Kolmogorov-complexity representations (short programs) that explain data.
  • Misra claims progress toward intelligence requires moving from correlation to causation and that this shift should change how intelligence is conceptualized and engineered.

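The association-versus-intervention distinction in Pearl’s hierarchy can be made concrete with a toy structural causal model (illustrative only; the variables and probabilities below are assumptions, not from the talk). When a hidden confounder Z drives both X and Y, the observed conditional P(Y | X=1) overstates the interventional P(Y | do(X=1)):

```python
import random

random.seed(0)

def sample(do_x=None):
    # Toy structural causal model: Z -> X, Z -> Y, X -> Y.
    z = random.random() < 0.5                        # hidden confounder
    if do_x is None:
        x = z if random.random() < 0.9 else (not z)  # X mostly follows Z
    else:
        x = do_x                                     # do(X): cut the Z -> X edge
    y = (x and random.random() < 0.3) or (z and random.random() < 0.6)
    return z, x, y

N = 100_000
obs = [sample() for _ in range(N)]
# Association (rung 1): P(Y=1 | X=1), conditioning on observed X.
p_assoc = sum(y for _, x, y in obs if x) / sum(1 for _, x, _ in obs if x)
# Intervention (rung 2): P(Y=1 | do(X=1)), forcing X and breaking confounding.
interv = [sample(do_x=True) for _ in range(N)]
p_do = sum(y for _, _, y in interv) / N
print(f"P(Y=1 | X=1)     = {p_assoc:.3f}")
print(f"P(Y=1 | do(X=1)) = {p_do:.3f}")  # smaller: the confounder inflated the association
```

A model that only captures the correlational statistics reproduces `p_assoc`; answering the interventional question requires the causal structure, which is the gap Misra highlights.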
Bayesian In-Context Learning And Diagnostic Tooling

  • Misra models an LLM as an implicit, extremely large but sparse mapping from prompts to next-token probability distributions, approximated via compression rather than explicit storage.
  • Misra says that after OpenAI removed token-probability visibility in its interface, his group built TokenProbe (tokenprobe.cs.columbia.edu) to inspect next-token probabilities and entropy for open-weight models.
  • Misra characterizes in-context learning as Bayesian-style belief updating in which token probabilities shift toward the demonstrated output format as examples are added to the prompt.
  • Misra proposes a controlled 'Bayesian wind tunnel' methodology using tasks that are too combinatorial to memorize but have analytically computable posteriors to test whether architectures perform Bayesian inference.
  • Misra claims that geometric signatures associated with Bayesian updating in small controlled models also appear in larger open-weight production LLMs, though messier.
  • Misra describes a Google Research paper as using an RLHF-like approach to teach LLMs to perform Bayesian updating more faithfully.
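The core readouts described for TokenProbe, next-token probabilities and their entropy, amount to a softmax over logits followed by a Shannon-entropy computation. A minimal sketch (the logits and vocabulary are hypothetical; TokenProbe’s actual implementation is not described in the source):

```python
import math

def softmax(logits):
    # Convert raw logits to a probability distribution (max-subtracted for stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_bits(probs):
    # Shannon entropy of the next-token distribution, in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token logits over a tiny 4-token vocabulary.
logits = [3.0, 1.0, 0.5, -2.0]
probs = softmax(logits)
print(f"top token prob: {max(probs):.3f}")
print(f"entropy: {entropy_bits(probs):.3f} bits")
```

Under the Bayesian-updating picture Misra describes, adding in-context examples should concentrate this distribution on the demonstrated format and drive the entropy down, which is what such a tool would let one observe directly.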

Architecture Comparisons Under Controlled Inference Tests

  • Misra reports that in wind-tunnel experiments, transformers matched the Bayesian posterior to within about 1e-3 bits.
  • Misra reports that in the same wind-tunnel experiments, Mamba performed well on most tasks.
  • Misra reports that in the same wind-tunnel experiments, LSTMs only partially matched the Bayesian posterior behavior.
  • Misra reports that in the same wind-tunnel experiments, MLPs failed the Bayesian-inference tasks.
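The reported posterior-matching numbers presuppose a task whose Bayesian posterior is computable in closed form, so a model’s predictive distribution can be scored against the exact answer in bits. A minimal sketch of that comparison using a Beta-Bernoulli task and KL divergence (the task, prior, and model probability below are assumptions for illustration, not the experiments’ actual setup):

```python
import math

def beta_posterior_predictive(heads, tails, a=1.0, b=1.0):
    # Exact Bayesian posterior predictive for a Bernoulli coin with a
    # Beta(a, b) prior: P(next = heads | data).
    return (heads + a) / (heads + tails + a + b)

def kl_bits(p, q):
    # KL divergence D(p || q) between two Bernoulli distributions, in bits.
    out = 0.0
    for pi, qi in zip((p, 1 - p), (q, 1 - q)):
        if pi > 0:
            out += pi * math.log2(pi / qi)
    return out

# Suppose a model, shown 7 heads and 3 tails in context, assigns
# probability 0.66 to "heads" as the next token (hypothetical number).
model_p = 0.66
exact_p = beta_posterior_predictive(7, 3)  # (7 + 1) / (10 + 2) = 2/3
gap = kl_bits(exact_p, model_p)
print(f"exact posterior predictive: {exact_p:.4f}")
print(f"mismatch: {gap:.2e} bits")
```

A gap on the order of 1e-3 bits or below across such tasks would correspond to the "matches the Bayesian posterior" claim; larger or task-dependent gaps would correspond to the partial or failed matches attributed to LSTMs and MLPs.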

AGI Evaluation: Autonomy Emphasis And Discovery-Style Tests

  • Misra argues that neither passing the Turing test nor doing economically useful work is a sufficient definition of AGI, because neither implies autonomous performance without human intervention.
  • Misra proposed an AGI criterion: an LLM trained only on pre-1916 physics should be able to derive the theory of relativity.
  • Misra reports that Demis Hassabis has publicly mentioned a similar 'Einstein/relativity' style AGI test at an India AI Summit.
  • Misra predicts that in the near term, frontier models may complete well-defined, well-scoped coding tasks without human intervention.

Disputes About Consciousness, Agency, And Interpretation Of Behaviors

  • Misra rejects claims that current LLMs are conscious or have an inner monologue.
  • Misra attributes apparently deceptive or self-preserving behaviors in LLM outputs to training-data content rather than to the architecture having intrinsic goals.
  • Misra reports that Dario Amodei has said one cannot rule out LLM consciousness and that Misra explicitly disagrees with that assessment.
  • Casado argues that recent viral examples suggesting LLM generality (including Donald Knuth’s experience) do not demonstrate true general intelligence.

Unknowns

  • What are the exact task designs, datasets, training procedures, and evaluation metrics used in the 'Bayesian wind tunnel' experiments, and are they independently replicated?
  • What specific evidence would adjudicate the consciousness dispute (e.g., agreed operational criteria or tests), and do leading labs converge on any such criteria?
  • What concrete, testable scoring rubric would make the 'pre-1916 to relativity' AGI test operational (acceptable derivations, verification against predictions, contamination checks)?
  • What are the measured business outcomes of the ESPN deployment (accuracy, latency, user adoption, maintenance burden, error recovery) and how do they compare to later RAG or fine-tuned approaches?
  • How robust are context/memory-update 'pseudo-plasticity' approaches over long horizons (drift, compounding errors, scaling of memory, adversarial susceptibility) compared to true continual learning?

Investor overlay

Read-throughs

  • Continual learning and durable memory could become key differentiators for AI products and infrastructure, since current adaptation is described as context-bound rather than weight-plasticity.
  • Diagnostic tooling that inspects token probabilities and entropy could see adoption for auditing and education, reflecting the emphasis on mechanistic evaluation and tools like TokenProbe.
  • Controlled inference tests such as Bayesian wind-tunnel style evaluations could influence model selection and architecture bets if replication confirms cross-architecture ranking claims.

What would confirm

  • Independent replication with disclosed tasks, datasets, training procedures, and metrics showing transformers match Bayesian posteriors near the reported accuracy and clarifying where other architectures fall short.
  • Public, repeatable benchmarks or rubrics for autonomy and discovery-style tests, including scoring and contamination checks, becoming adopted by multiple labs or used in product evaluation.
  • Measured business outcomes from deployments cited in the summary, including accuracy, latency, adoption, maintenance burden, and recovery behavior, enabling comparison to newer RAG or fine-tuned systems.

What would kill

  • Replications fail to reproduce the Bayesian posterior matching results or show the effect is benchmark-specific, undermining the proposed controlled diagnostic paradigm.
  • Pseudo-plasticity approaches show poor long-horizon robustness, with drift, compounding errors, memory scaling issues, or adversarial susceptibility, reducing their usefulness versus true continual learning.
  • No convergence emerges on operational criteria for autonomy or consciousness-related disputes, and proposed discovery-style tests remain non-operational, limiting their impact on evaluation and procurement.

Sources