Public-Domain-Only Training As A Legal-Risk Pathway With Capability Uncertainty
Sources: 1 • Confidence: High • Updated: 2026-03-31 04:42
Key takeaways
- The author remains optimistic that a useful model can be trained entirely on public domain data and views this project as a promising start given it reached 2.93B tokens using nanochat.
- The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
- The author ran Mr. Chatterbox locally by integrating it with the author's LLM framework and documented the process.
- The Mr. Chatterbox model file is about 2.05GB on disk and is available via a Hugging Face Spaces demo.
- Mr. Chatterbox was trained from scratch on British texts published between 1837 and 1899, with no training inputs from after 1899.
Sections
Public-Domain-Only Training As A Legal-Risk Pathway With Capability Uncertainty
- The author remains optimistic that a useful model can be trained entirely on public domain data and views this project as a promising start given it reached 2.93B tokens using nanochat.
- Mr. Chatterbox was trained from scratch on British texts published between 1837 and 1899, with no training inputs from after 1899.
- The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
- Trip Venturella released a language model called Mr. Chatterbox that was trained on out-of-copyright British Library texts.
- The source expects that a model trained only on out-of-copyright text may be difficult to make useful compared to models trained on large scraped modern corpora.
Data Scale And Compute-Optimality Framing (Chinchilla Heuristic) As An Explanation For Quality Limits
- The training corpus used for Mr. Chatterbox comprised 28,035 books and approximately 2.93 billion input tokens after filtering.
- The source reports that the 2022 Chinchilla paper suggests an approximate 20-to-1 ratio of training tokens to parameter count for compute-optimal training.
- Applying the Chinchilla heuristic as described in the source implies that a 340M-parameter model would target roughly 7B training tokens, more than twice the 2.93B tokens used here.
- In the author's testing, Mr. Chatterbox produces responses with Victorian flavor but often fails to answer questions usefully, and the author describes it as feeling more like a Markov chain than an LLM.
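The Chinchilla arithmetic above can be sketched directly. This is a minimal illustration: the 20:1 tokens-to-parameters ratio and the 340M-parameter figure are as reported in the source; the helper function itself is hypothetical.

```python
# Back-of-envelope Chinchilla-style token budget check.
CHINCHILLA_RATIO = 20  # approximate training tokens per model parameter

def optimal_tokens(params: int, ratio: int = CHINCHILLA_RATIO) -> int:
    """Approximate compute-optimal training-token count for a given model size."""
    return params * ratio

params = 340_000_000        # 340M parameters (scale discussed in the source)
available = 2_930_000_000   # ~2.93B tokens in the filtered corpus

target = optimal_tokens(params)   # 6.8B tokens, i.e. roughly 7B
gap = target / available          # how far the corpus falls short

print(f"target tokens:    {target:,}")
print(f"available tokens: {available:,}")
print(f"gap: {gap:.1f}x")  # ≈ 2.3x, consistent with "more than twice"
```

The roughly 2.3x shortfall is one plausible explanation for the Markov-chain-like output quality the author describes.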
Tooling Pattern: Turning Research Weights Into Usable Local Software Via AI-Assisted Coding And Plugins
- The author ran Mr. Chatterbox locally by integrating it with the author's LLM framework and documented the process.
- Trip Venturella trained Mr. Chatterbox using Andrej Karpathy's nanochat, and the author used Claude Code to create a Python runner and then an LLM plugin, using details from the Spaces demo source code.
- The author published an LLM plugin named llm-mrchatterbox that can be installed with the command "llm install llm-mrchatterbox".
- The author reports that having Claude Code build a full LLM model plugin from scratch worked well and expects to use this approach again.
Operational Constraints For Evaluation/Adoption: Artifact Size, Download-On-First-Use, And Cleanup
- The Mr. Chatterbox model file is about 2.05GB on disk and is available via a Hugging Face Spaces demo.
- On first prompt, the llm-mrchatterbox plugin fetches the 2.05GB model file from Hugging Face before responding.
- Users can run a one-off prompt using "llm -m mrchatterbox", start an interactive session using "llm chat -m mrchatterbox", or run it via "uvx" without installing LLM first.
- The cached Mr. Chatterbox model file can be removed using the command "llm mrchatterbox delete-model".
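The commands above, collected into one session. This is a sketch based on the commands quoted in the source; the prompt text is illustrative, and download and cleanup behavior is as the source describes it rather than independently verified.

```shell
# Install the plugin into an existing LLM installation
llm install llm-mrchatterbox

# First prompt triggers a ~2.05GB model download from Hugging Face
llm -m mrchatterbox 'Pray, what is the electric telegraph?'

# Start an interactive chat session
llm chat -m mrchatterbox

# Remove the cached model file when finished evaluating
llm mrchatterbox delete-model
```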
Watchlist
- The author remains optimistic that a useful model can be trained entirely on public domain data and views this project as a promising start given it reached 2.93B tokens using nanochat.
Unknowns
- What objective evaluations (benchmarks or structured task suites) does Mr. Chatterbox achieve, beyond the author's subjective testing?
- What are the exact model architecture details and parameter count, and how were they chosen relative to the available public-domain token budget?
- What training hyperparameters and compute budget were used (steps, batch size, optimizer, learning rate schedule), and what evidence exists that training converged appropriately?
- How much additional public-domain text is realistically available in the same or adjacent domains, and what token scale increases are feasible for future runs?
- Does increasing public-domain token count (e.g., doubling or quadrupling) materially improve conversational usefulness for this model family under the same tooling stack?