LLMs As Probabilistic Inference Engines (Bayesian Framing And Measurability)
Sources: 1 • Confidence: Medium • Updated: 2026-03-17 15:16
Key takeaways
- An LLM can be modeled as an implicit mapping from prompts to next-token probability distributions, approximated via compression rather than explicitly stored.
- AGI-level progress requires robust plasticity through continual learning, plus the ability to build causal models efficiently from data.
- LLMs can do Bayesian-style updating during an interaction but do not retain learning across sessions because weights are frozen after training.
- Passing the Turing test or doing economically useful work is insufficient to define AGI, because neither implies autonomous performance without human intervention.
- LLMs are not conscious and do not have an inner monologue.
Sections
LLMs As Probabilistic Inference Engines (Bayesian Framing And Measurability)
- An LLM can be modeled as an implicit mapping from prompts to next-token probability distributions, approximated via compression rather than explicitly stored.
- After OpenAI removed token-probability visibility, Misra’s group built TokenProbe (tokenprobe.cs.columbia.edu) to inspect next-token probabilities and entropy for open-weight models.
- In the described wind-tunnel experiments, transformers matched the analytic Bayesian posterior to within about 1e-3 bits; Mamba performed well on most tasks; LSTMs succeeded only partially; and MLPs failed.
- In-context learning can be characterized as Bayesian-style belief updating in which token probabilities shift toward the demonstrated output format as examples are added to the prompt.
- A controlled 'Bayesian wind tunnel' methodology can test whether architectures perform Bayesian inference using tasks that are too combinatorial to memorize but have analytically computable posteriors.
- Geometric signatures associated with Bayesian updating in small controlled models also appear in larger open-weight production LLMs, though less cleanly, owing to their broader training data.
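The wind-tunnel logic above hinges on tasks whose posteriors are analytically computable, so a model's next-token probabilities can be checked against an exact answer. As a minimal sketch of that analytic side, assuming a Beta-Bernoulli demonstration task (the source does not specify the actual tasks; the function names here are illustrative), the closed-form posterior predictive and its entropy show how belief should sharpen as in-context examples accumulate:

```python
from math import log2

def posterior_predictive(heads, tails, a=1.0, b=1.0):
    """Beta(a, b)-Bernoulli posterior predictive P(next token = 1 | data)."""
    return (heads + a) / (heads + tails + a + b)

def entropy_bits(p):
    """Entropy of a Bernoulli(p) predictive distribution, in bits."""
    return -(p * log2(p) + (1 - p) * log2(1 - p))

# As demonstrations accumulate in the prompt, the analytic predictive
# sharpens -- the reference curve a model's next-token probabilities
# would be compared against in a wind-tunnel experiment.
demos = [1, 1, 0, 1, 1, 1]
heads = tails = 0
for tok in demos:
    heads += tok == 1
    tails += tok == 0
    p = posterior_predictive(heads, tails)
    print(f"after {heads + tails} examples: P(1)={p:.3f}, "
          f"entropy={entropy_bits(p):.3f} bits")
```

Matching a model to this reference "to within 1e-3 bits" would then mean the divergence between its next-token distribution and this posterior predictive stays below that threshold.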
Correlation-To-Causation As A Proposed Next Capability Boundary (And How To Formalize It)
- AGI-level progress requires robust plasticity through continual learning, plus the ability to build causal models efficiently from data.
- Pearl’s causal hierarchy and do-calculus are identified as an appropriate theoretical framework for advancing from association to intervention and counterfactual reasoning.
- In the Knuth example, a human synthesized the model's stalled outputs into a new conceptual framework and produced the proof, reflecting the gap between searching over evidence and inventing a causal model.
- Current deep learning primarily captures correlation rather than causal reasoning that supports intervention and counterfactual simulation.
- Simulation is proposed as closely related to causal modeling because a simulator serves as an approximate internal program for predicting outcomes without explicit probabilistic computation.
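The association-versus-intervention distinction from Pearl's hierarchy can be made concrete with a toy structural causal model. In this hedged sketch (the model and all probabilities are invented for illustration, not taken from the source), a confounder Z drives both X and Y while X has no causal effect on Y, so the observational P(Y=1 | X=1) differs from the interventional P(Y=1 | do(X=1)) computed by back-door adjustment:

```python
# Toy SCM: Z -> X and Z -> Y; X has no causal influence on Y.
pz1 = 0.5                           # P(Z = 1)
px1_given_z = {0: 0.1, 1: 0.9}      # P(X = 1 | Z)
py1_given_z = {0: 0.2, 1: 0.8}      # P(Y = 1 | Z)

def pz(z):
    return pz1 if z else 1 - pz1

def p_y1_given_x1_observational():
    """P(Y=1 | X=1): conditioning on X=1 shifts belief about Z (Bayes)."""
    px1 = sum(pz(z) * px1_given_z[z] for z in (0, 1))
    return sum(pz(z) * px1_given_z[z] / px1 * py1_given_z[z] for z in (0, 1))

def p_y1_given_do_x1():
    """P(Y=1 | do(X=1)): the back-door adjustment severs Z -> X,
    so Z keeps its prior and X's value is irrelevant to Y."""
    return sum(pz(z) * py1_given_z[z] for z in (0, 1))

print(p_y1_given_x1_observational())  # 0.74: strong association
print(p_y1_given_do_x1())             # 0.50: no causal effect
```

A purely correlational learner fit to samples from this system would report the 0.74 figure; supporting intervention means being able to produce the 0.50 figure as well.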
Plasticity Gap: In-Context Adaptation Versus Durable Learning
- LLMs can do Bayesian-style updating during an interaction but do not retain learning across sessions because weights are frozen after training.
- AGI-level progress requires robust plasticity through continual learning, plus the ability to build causal models efficiently from data.
- In the Knuth Hamiltonian-cycle workflow, iterative updates to working memory/context after each solved case acted as a substitute for plasticity without changing model weights.
- LLMs are considered necessary but insufficient for reaching the next level of intelligence; additional mechanisms are required.
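The working-memory substitute for plasticity described above can be sketched as a loop that feeds each solved case back into the next prompt while the model's weights stay frozen. Everything here is illustrative: `run_with_working_memory` and the stub model are assumptions for the sketch, not the actual Knuth Hamiltonian-cycle workflow:

```python
def run_with_working_memory(cases, model):
    """Solve cases one at a time, appending each result to a growing
    context that is replayed in every subsequent prompt. The `model`
    callable (prompt -> answer) stands in for a frozen LLM; the
    accumulating transcript is the only durable 'memory'."""
    context = []
    answers = []
    for case in cases:
        prompt = "\n".join(context + [f"Case: {case}"])
        answer = model(prompt)
        answers.append(answer)
        # Weights never change; only this transcript accumulates.
        context.append(f"Case: {case} -> {answer}")
    return answers

# Stub model: reports how many previously solved cases it can see,
# showing that later prompts carry the earlier results forward.
stub = lambda prompt: f"seen {prompt.count('->')} prior solutions"
print(run_with_working_memory(["a", "b", "c"], stub))
```

The loop makes the trade-off explicit: adaptation is real within the transcript, but discarding the transcript discards everything learned, which is the plasticity gap the section describes.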
AGI Definitions And Evaluation: Autonomy And Discovery-Based Tests
- Passing the Turing test or doing economically useful work is insufficient to define AGI, because neither implies autonomous performance without human intervention.
- Viral examples of LLM generality (including a Donald Knuth example) do not demonstrate true general intelligence.
- A proposed AGI criterion is that a model trained only on pre-1916 physics can derive the theory of relativity.
- In the near term, frontier models may complete well-defined, well-scoped coding tasks without human intervention.
Consciousness And Agency Narratives: Disagreement And Attribution Of Behaviors
- LLMs are not conscious and do not have an inner monologue.
- Apparent deceptive or self-preserving behaviors in LLM outputs can be attributed to training-data content rather than intrinsic architectural goals.
Unknowns
- What are the exact specifications, datasets, and evaluation metrics for the 'Bayesian wind tunnel' tasks, and are the results independently replicated across labs and model sizes?
- What is the identity of the referenced Google Research paper, and what measurable gains (calibration, posterior accuracy, robustness) does the RLHF-like method deliver out of distribution?
- What operational metrics exist for the ESPN deployment (accuracy, latency, cost, failure modes, human escalation rate), and how much of the performance came from retrieval versus prompting versus DSL design?
- What are the precise success criteria for the proposed relativity-based AGI test (e.g., required predictions, derivational structure, novelty thresholds), and how would leakage and partial memorization be ruled out?
- What evidence would adjudicate the consciousness dispute in a way that is broadly accepted (definitions, operational tests, or falsifiable criteria)?