Rosa Del Mar

Daily Brief

Issue 89 2026-03-30

Public-Domain-Only Training As A Legal-Risk Pathway With Capability Uncertainty

7 min read
General
Sources: 1 • Confidence: High • Updated: 2026-03-31 04:42

Key takeaways

  • The author remains optimistic that a useful model can be trained entirely on public-domain data and views this project as a promising start, given that it reached 2.93B training tokens using nanochat.
  • The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
  • The author ran Mr. Chatterbox locally by integrating it with the author's LLM framework and documented the process.
  • The Mr. Chatterbox model file is about 2.05GB on disk and is available via a Hugging Face Spaces demo.
  • Mr. Chatterbox was trained from scratch on British texts published between 1837 and 1899, with no training inputs from after 1899.

Sections

Public-Domain-Only Training As A Legal-Risk Pathway With Capability Uncertainty

  • The author remains optimistic that a useful model can be trained entirely on public-domain data and views this project as a promising start, given that it reached 2.93B training tokens using nanochat.
  • Mr. Chatterbox was trained from scratch on British texts published between 1837 and 1899, with no training inputs from after 1899.
  • The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
  • Trip Venturella released a language model called Mr. Chatterbox that was trained on out-of-copyright British Library texts.
  • The source article frames it as an expectation that a model trained only on out-of-copyright text may be difficult to make useful compared with models trained on large scraped modern corpora.

Data Scale And Compute-Optimality Framing (Chinchilla Heuristic) As An Explanation For Quality Limits

  • The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
  • The source reports that the 2022 Chinchilla paper suggests roughly 20 training tokens per parameter for compute-optimal training.
  • Applying the Chinchilla heuristic as described in the corpus implies that a 340M-parameter model would target roughly 7B training tokens, which is more than twice the 2.93B tokens used here.
  • In the author's testing, Mr. Chatterbox produces responses with Victorian flavor but often fails to answer questions usefully, and the author describes it as feeling more like a Markov chain than an LLM.
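
The Chinchilla arithmetic above can be checked directly. A minimal sketch, using the 340M parameter count and 2.93B token figures cited in this section (the variable names are illustrative):

```shell
# Chinchilla heuristic: compute-optimal training tokens ~= 20 x parameter count
PARAMS=340000000            # 340M parameters, as cited above
AVAILABLE=2930000000        # 2.93B public-domain tokens after filtering
OPTIMAL=$(( PARAMS * 20 ))
echo "compute-optimal tokens: ${OPTIMAL}"     # 6800000000, i.e. roughly 7B
echo "tokens available:       ${AVAILABLE}"   # less than half the target
```

Since 6.8B is more than twice the 2.93B tokens available, the heuristic suggests the model is substantially undertrained for its size, consistent with the quality limits described above.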

Tooling Pattern: Turning Research Weights Into Usable Local Software Via AI-Assisted Coding And Plugins

  • The author ran Mr. Chatterbox locally by integrating it with the author's LLM framework and documented the process.
  • Trip Venturella trained Mr. Chatterbox using Andrej Karpathy's nanochat, and the author used Claude Code to create a Python runner and then an LLM plugin, using details from the Spaces demo source code.
  • The author published an LLM plugin named llm-mrchatterbox that can be installed with the command "llm install llm-mrchatterbox".
  • The author reports that having Claude Code build a full LLM model plugin from scratch worked well and expects to use this approach again.
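
Per the source, installing the published plugin into an existing LLM CLI setup is a single command:

```shell
# Install the llm-mrchatterbox plugin (assumes the LLM CLI is already installed)
llm install llm-mrchatterbox
```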

Operational Constraints For Evaluation/Adoption: Artifact Size, Download-On-First-Use, And Cleanup

  • The Mr. Chatterbox model file is about 2.05GB on disk and is available via a Hugging Face Spaces demo.
  • On first prompt, the llm-mrchatterbox plugin fetches the 2.05GB model file from Hugging Face before responding.
  • Users can run a one-off prompt using "llm -m mrchatterbox", start an interactive session using "llm chat -m mrchatterbox", or run it via "uvx" without installing LLM first.
  • The cached Mr. Chatterbox model file can be removed using the command "llm mrchatterbox delete-model".
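
The commands above, collected as a usage sketch. The prompt strings are placeholders, and the exact `uvx` invocation is an assumption (the source only states that the model can be run via `uvx` without installing LLM first):

```shell
# One-off prompt; on first use this downloads the ~2.05GB model file from Hugging Face
llm -m mrchatterbox 'Describe a morning in London'

# Interactive chat session
llm chat -m mrchatterbox

# Try it without installing LLM first (assumed uvx pattern, not confirmed by the source)
uvx --with llm-mrchatterbox llm -m mrchatterbox 'Describe a morning in London'

# Remove the cached model file when finished
llm mrchatterbox delete-model
```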

Watchlist

  • The author remains optimistic that a useful model can be trained entirely on public-domain data and views this project as a promising start, given that it reached 2.93B training tokens using nanochat.

Unknowns

  • What objective evaluations (benchmarks or structured task suites) does Mr. Chatterbox achieve, beyond the author's subjective testing?
  • What are the exact model architecture details and parameter count, and how were they chosen relative to the available public-domain token budget?
  • What training hyperparameters and compute budget were used (steps, batch size, optimizer, learning rate schedule), and what evidence exists that training converged appropriately?
  • How much additional public-domain text is realistically available in the same or adjacent domains, and what token scale increases are feasible for future runs?
  • Does increasing public-domain token count (e.g., doubling or quadrupling) materially improve conversational usefulness for this model family under the same tooling stack?

Investor overlay

Read-throughs

  • Public-domain-only LLM training can be operationalized as a clean-data pipeline, potentially reducing legal-risk exposure versus mixed-license corpora, but capability remains uncertain at current scale and domain.
  • Model quality limits may be driven by token scarcity under public-domain constraints, aligning with compute-optimality framing that suggests undertraining relative to parameter count.
  • Packaging patterns that turn research weights into locally usable software and plugins may lower adoption friction and speed experimentation, independent of model quality.

What would confirm

  • Release of objective evaluations showing Mr. Chatterbox performance on benchmarks or structured task suites, enabling comparison to subjective testing claims.
  • Future training runs that materially increase public-domain token count and report clear improvements in conversational usefulness under the same tooling stack.
  • More complete disclosures of architecture, parameter count, hyperparameters, compute budget, and convergence evidence supporting that limitations are data-bound rather than training-process issues.

What would kill

  • Objective evaluations indicate performance remains weak for conversational use despite increased token scale within similar public-domain constraints.
  • Disclosed architecture and training details show misconfiguration or non-convergence, weakening the interpretation that token scarcity is the primary bottleneck.
  • Public-domain text availability proves insufficient to scale tokens meaningfully beyond current runs in adjacent domains, limiting headroom for capability improvement.

Sources