Public-Domain-Only Training As A Legal-Risk Pathway With Capability Uncertainty
Sources: 1 • Confidence: High • Updated: 2026-03-31 04:42
Key takeaways
- The author remains optimistic that a useful model can be trained entirely on public domain data and views this project as a promising start given it reached 2.93B tokens using nanochat.
- The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
- The author ran Mr. Chatterbox locally by integrating it with the author's LLM framework and documented the process.
- The Mr. Chatterbox model file is about 2.05GB on disk and is available via a Hugging Face Spaces demo.
- Mr. Chatterbox was trained from scratch on British texts published between 1837 and 1899, with no training inputs from after 1899.
Sections
Public-Domain-Only Training As A Legal-Risk Pathway With Capability Uncertainty
- The author remains optimistic that a useful model can be trained entirely on public domain data and views this project as a promising start given it reached 2.93B tokens using nanochat.
- Mr. Chatterbox was trained from scratch on British texts published between 1837 and 1899, with no training inputs from after 1899.
- The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
- Trip Venturella released a language model called Mr. Chatterbox that was trained on out-of-copyright British Library texts.
- The source expects that a model trained only on out-of-copyright text may be difficult to make useful compared to models trained on large scraped modern corpora.
Data Scale And Compute-Optimality Framing (Chinchilla Heuristic) As An Explanation For Quality Limits
- The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
- The source reports that the 2022 Chinchilla paper suggests an approximate 20-to-1 ratio of training tokens to parameter count for compute-optimal training.
- Applying the Chinchilla heuristic as described in the source implies that a 340M-parameter model would target roughly 7B training tokens, more than twice the 2.93B tokens used here.
- In the author's testing, Mr. Chatterbox produces responses with Victorian flavor but often fails to answer questions usefully, and the author describes it as feeling more like a Markov chain than an LLM.
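The Chinchilla arithmetic above can be sketched directly. This is a minimal illustration: the 20:1 tokens-to-parameters ratio and the 340M-parameter figure are as reported in the source; the helper function itself is hypothetical.

```python
# Back-of-envelope Chinchilla-style token budget check.
CHINCHILLA_RATIO = 20  # approximate training tokens per model parameter

def optimal_tokens(params: int, ratio: int = CHINCHILLA_RATIO) -> int:
    """Approximate compute-optimal training-token count for a given model size."""
    return params * ratio

params = 340_000_000        # 340M parameters (scale discussed in the source)
available = 2_930_000_000   # ~2.93B tokens in the filtered corpus

target = optimal_tokens(params)   # 6.8B tokens, i.e. roughly 7B
gap = target / available          # how far the corpus falls short

print(f"target tokens:    {target:,}")
print(f"available tokens: {available:,}")
print(f"gap: {gap:.1f}x")  # ≈ 2.3x, consistent with "more than twice"
```

The roughly 2.3x shortfall is one plausible explanation for the Markov-chain-like output quality the author describes.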
Tooling Pattern: Turning Research Weights Into Usable Local Software Via AI-Assisted Coding And Plugins
- The author ran Mr. Chatterbox locally by integrating it with the author's LLM framework and documented the process.
- Trip Venturella trained Mr. Chatterbox using Andrej Karpathy's nanochat, and the author used Claude Code to create a Python runner and then an LLM plugin, using details from the Spaces demo source code.
- The author published an LLM plugin named llm-mrchatterbox that can be installed with the command "llm install llm-mrchatterbox".
- The author reports that having Claude Code build a full LLM model plugin from scratch worked well and expects to use this approach again.
Operational Constraints For Evaluation/Adoption: Artifact Size, Download-On-First-Use, And Cleanup
- The Mr. Chatterbox model file is about 2.05GB on disk and is available via a Hugging Face Spaces demo.
- On first prompt, the llm-mrchatterbox plugin fetches the 2.05GB model file from Hugging Face before responding.
- Users can run a one-off prompt using "llm -m mrchatterbox", start an interactive session using "llm chat -m mrchatterbox", or run it via "uvx" without installing LLM first.
- The cached Mr. Chatterbox model file can be removed using the command "llm mrchatterbox delete-model".
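The commands above, collected into one session. This is a sketch based on the commands quoted in the source; the prompt text is illustrative, and download and cleanup behavior is as the source describes it rather than independently verified.

```shell
# Install the plugin into an existing LLM installation
llm install llm-mrchatterbox

# First prompt triggers a ~2.05GB model download from Hugging Face
llm -m mrchatterbox 'Pray, what is the electric telegraph?'

# Start an interactive chat session
llm chat -m mrchatterbox

# Remove the cached model file when finished evaluating
llm mrchatterbox delete-model
```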
Watchlist
- The author remains optimistic that a useful model can be trained entirely on public domain data and views this project as a promising start given it reached 2.93B tokens using nanochat.
Unknowns
- What objective evaluations (benchmarks or structured task suites) does Mr. Chatterbox achieve, beyond the author's subjective testing?
- What are the exact model architecture details and parameter count, and how were they chosen relative to the available public-domain token budget?
- What training hyperparameters and compute budget were used (steps, batch size, optimizer, learning rate schedule), and what evidence exists that training converged appropriately?
- How much additional public-domain text is realistically available in the same or adjacent domains, and what token scale increases are feasible for future runs?
- Does increasing public-domain token count (e.g., doubling or quadrupling) materially improve conversational usefulness for this model family under the same tooling stack?