Rosa Del Mar

Daily Brief

Issue 76 • 2026-03-17

Core Capability Bottlenecks: Persistent Plasticity And Causal Reasoning

General
Sources: 1 • Confidence: Medium • Updated: 2026-04-11 18:05

Key takeaways

  • Misra claims that LLMs do Bayesian-style updating during an interaction but do not retain learning across sessions because their weights are frozen after training.
  • Misra models an LLM as an implicit, extremely large but sparse mapping from prompts to next-token probability distributions, approximated via compression rather than explicit storage.
  • Misra reports that in wind-tunnel experiments, transformers matched the Bayesian posterior to within about 1e-3 bits.
  • Misra argues that neither passing the Turing test nor doing economically useful work is a sufficient definition of AGI, because neither implies autonomous performance without human intervention.
  • Misra rejects claims that current LLMs are conscious or have an inner monologue.

Sections

Core Capability Bottlenecks: Persistent Plasticity And Causal Reasoning

  • Misra claims that LLMs do Bayesian-style updating during an interaction but do not retain learning across sessions because their weights are frozen after training.
  • Misra claims AGI-level progress requires solving two core problems: robust plasticity via continual learning and the ability to build causal models from data efficiently.
  • A speaker identifies Pearl’s causal hierarchy and do-calculus as an appropriate theoretical framework for advancing from association to intervention and counterfactual reasoning, enabling grounded simulation.
  • Misra claims current deep learning primarily captures correlation (association) rather than causal reasoning that supports intervention and counterfactual simulation (as in Pearl’s causal hierarchy).
  • Misra frames deep learning as closer to Shannon-entropy-style correlation learning, while human-level insight is linked to low Kolmogorov-complexity representations (short programs) that explain data.
  • Misra claims progress toward intelligence requires moving from correlation to causation and that this shift should change how intelligence is conceptualized and engineered.

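The association-versus-intervention distinction in Pearl’s hierarchy can be made concrete with a toy structural causal model (illustrative only; the variables and probabilities below are assumptions, not from the talk). When a hidden confounder Z drives both X and Y, the observed conditional P(Y | X=1) overstates the interventional P(Y | do(X=1)):

```python
import random

random.seed(0)

def sample(do_x=None):
    # Toy structural causal model: Z -> X, Z -> Y, X -> Y.
    z = random.random() < 0.5                        # hidden confounder
    if do_x is None:
        x = z if random.random() < 0.9 else (not z)  # X mostly follows Z
    else:
        x = do_x                                     # do(X): cut the Z -> X edge
    y = (x and random.random() < 0.3) or (z and random.random() < 0.6)
    return z, x, y

N = 100_000
obs = [sample() for _ in range(N)]
# Association (rung 1): P(Y=1 | X=1), conditioning on observed X.
p_assoc = sum(y for _, x, y in obs if x) / sum(1 for _, x, _ in obs if x)
# Intervention (rung 2): P(Y=1 | do(X=1)), forcing X and breaking confounding.
interv = [sample(do_x=True) for _ in range(N)]
p_do = sum(y for _, _, y in interv) / N
print(f"P(Y=1 | X=1)     = {p_assoc:.3f}")
print(f"P(Y=1 | do(X=1)) = {p_do:.3f}")  # smaller: the confounder inflated the association
```

A model that only captures the correlational statistics reproduces `p_assoc`; answering the interventional question requires the causal structure, which is the gap Misra highlights.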
Bayesian In-Context Learning And Diagnostic Tooling

  • Misra models an LLM as an implicit, extremely large but sparse mapping from prompts to next-token probability distributions, approximated via compression rather than explicit storage.
  • Misra says that after OpenAI removed token-probability visibility in its interface, his group built TokenProbe (tokenprobe.cs.columbia.edu) to inspect next-token probabilities and entropy for open-weight models.
  • Misra characterizes in-context learning as Bayesian-style belief updating in which token probabilities shift toward the demonstrated output format as examples are added to the prompt.
  • Misra proposes a controlled 'Bayesian wind tunnel' methodology using tasks that are too combinatorial to memorize but have analytically computable posteriors to test whether architectures perform Bayesian inference.
  • Misra claims that geometric signatures associated with Bayesian updating in small controlled models also appear in larger open-weight production LLMs, though messier.
  • Misra describes a Google Research paper as using an RLHF-like approach to teach LLMs to perform Bayesian updating more faithfully.
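The core readouts described for TokenProbe, next-token probabilities and their entropy, amount to a softmax over logits followed by a Shannon-entropy computation. A minimal sketch (the logits and vocabulary are hypothetical; TokenProbe’s actual implementation is not described in the source):

```python
import math

def softmax(logits):
    # Convert raw logits to a probability distribution (max-subtracted for stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_bits(probs):
    # Shannon entropy of the next-token distribution, in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token logits over a tiny 4-token vocabulary.
logits = [3.0, 1.0, 0.5, -2.0]
probs = softmax(logits)
print(f"top token prob: {max(probs):.3f}")
print(f"entropy: {entropy_bits(probs):.3f} bits")
```

Under the Bayesian-updating picture Misra describes, adding in-context examples should concentrate this distribution on the demonstrated format and drive the entropy down, which is what such a tool would let one observe directly.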

Architecture Comparisons Under Controlled Inference Tests

  • Misra reports that in wind-tunnel experiments, transformers matched the Bayesian posterior to within about 1e-3 bits.
  • Misra reports that in the same wind-tunnel experiments, Mamba performed well on most tasks.
  • Misra reports that in the same wind-tunnel experiments, LSTMs only partially matched the Bayesian posterior behavior.
  • Misra reports that in the same wind-tunnel experiments, MLPs failed the Bayesian-inference tasks.
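The reported posterior-matching numbers presuppose a task whose Bayesian posterior is computable in closed form, so a model’s predictive distribution can be scored against the exact answer in bits. A minimal sketch of that comparison using a Beta-Bernoulli task and KL divergence (the task, prior, and model probability below are assumptions for illustration, not the experiments’ actual setup):

```python
import math

def beta_posterior_predictive(heads, tails, a=1.0, b=1.0):
    # Exact Bayesian posterior predictive for a Bernoulli coin with a
    # Beta(a, b) prior: P(next = heads | data).
    return (heads + a) / (heads + tails + a + b)

def kl_bits(p, q):
    # KL divergence D(p || q) between two Bernoulli distributions, in bits.
    out = 0.0
    for pi, qi in zip((p, 1 - p), (q, 1 - q)):
        if pi > 0:
            out += pi * math.log2(pi / qi)
    return out

# Suppose a model, shown 7 heads and 3 tails in context, assigns
# probability 0.66 to "heads" as the next token (hypothetical number).
model_p = 0.66
exact_p = beta_posterior_predictive(7, 3)  # (7 + 1) / (10 + 2) = 2/3
gap = kl_bits(exact_p, model_p)
print(f"exact posterior predictive: {exact_p:.4f}")
print(f"mismatch: {gap:.2e} bits")
```

A gap on the order of 1e-3 bits or below across such tasks would correspond to the "matches the Bayesian posterior" claim; larger or task-dependent gaps would correspond to the partial or failed matches attributed to LSTMs and MLPs.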

AGI Evaluation: Autonomy Emphasis And Discovery-Style Tests

  • Misra argues that neither passing the Turing test nor doing economically useful work is a sufficient definition of AGI, because neither implies autonomous performance without human intervention.
  • Misra proposed an AGI criterion: an LLM trained only on pre-1916 physics should be able to derive the theory of relativity.
  • Misra reports that Demis Hassabis has publicly mentioned a similar 'Einstein/relativity' style AGI test at an India AI Summit.
  • Misra predicts that in the near term, frontier models may complete well-defined, well-scoped coding tasks without human intervention.

Disputes About Consciousness, Agency, And Interpretation Of Behaviors

  • Misra rejects claims that current LLMs are conscious or have an inner monologue.
  • Misra attributes apparently deceptive or self-preserving behaviors in LLM outputs to training-data content rather than to the architecture having intrinsic goals.
  • Misra reports that Dario Amodei has said one cannot rule out LLM consciousness and that Misra explicitly disagrees with that assessment.
  • Casado argues that recent viral examples suggesting LLM generality (including Donald Knuth’s experience) do not demonstrate true general intelligence.

Unknowns

  • What are the exact task designs, datasets, training procedures, and evaluation metrics used in the 'Bayesian wind tunnel' experiments, and are they independently replicated?
  • What specific evidence would adjudicate the consciousness dispute (e.g., agreed operational criteria or tests), and do leading labs converge on any such criteria?
  • What concrete, testable scoring rubric would make the 'pre-1916 to relativity' AGI test operational (acceptable derivations, verification against predictions, contamination checks)?
  • What are the measured business outcomes of the ESPN deployment (accuracy, latency, user adoption, maintenance burden, error recovery) and how do they compare to later RAG or fine-tuned approaches?
  • How robust are context/memory-update 'pseudo-plasticity' approaches over long horizons (drift, compounding errors, scaling of memory, adversarial susceptibility) compared to true continual learning?

Investor overlay

Read-throughs

  • Continual learning and durable memory could become key differentiators for AI products and infrastructure, since current adaptation is described as context-bound rather than weight-plasticity.
  • Diagnostic tooling that inspects token probabilities and entropy could see adoption for auditing and education, reflecting the emphasis on mechanistic evaluation and tools like TokenProbe.
  • Controlled inference tests such as Bayesian wind-tunnel style evaluations could influence model selection and architecture bets if replication confirms cross-architecture ranking claims.

What would confirm

  • Independent replication with disclosed tasks, datasets, training procedures, and metrics showing transformers match Bayesian posteriors near the reported accuracy and clarifying where other architectures fall short.
  • Public, repeatable benchmarks or rubrics for autonomy and discovery-style tests, including scoring and contamination checks, becoming adopted by multiple labs or used in product evaluation.
  • Measured business outcomes from deployments cited in the summary, including accuracy, latency, adoption, maintenance burden, and recovery behavior, enabling comparison to newer RAG or fine-tuned systems.

What would kill

  • Replications fail to reproduce the Bayesian posterior matching results or show the effect is benchmark-specific, undermining the proposed controlled diagnostic paradigm.
  • Pseudo-plasticity approaches show poor long-horizon robustness, with drift, compounding errors, memory scaling issues, or adversarial susceptibility, reducing their usefulness versus true continual learning.
  • No convergence emerges on operational criteria for autonomy or consciousness-related disputes, and proposed discovery-style tests remain non-operational, limiting their impact on evaluation and procurement.

Sources