
Jane Austen SLM
A compact language model fine-tuned to write in Jane Austen’s style.
Overview
Jane Austen SLM is a focused experiment in style-driven text generation. The project fine-tunes a compact language model on Jane Austen’s complete works to produce original prose that echoes her tone, cadence, and narrative style. It began as an attempt to build a small model from scratch, but evolved into a more practical fine-tuning pipeline after running into real-world data limitations.
Problem
The original goal was straightforward but unrealistic: train a language model purely on Austen’s writing and have it generate convincing 19th-century prose.
That ran into a hard constraint quickly. Even after aggregating all available texts, the dataset was too small to support training a model from scratch. Early experiments overfit rapidly and produced repetitive output: the model memorised surface patterns instead of learning language structure.
So the problem shifted from “how to train a model on Austen” to “how to inject Austen’s style into a model that already understands English.”
Solution
The project pivoted to fine-tuning a lightweight pre-trained model instead of building one from scratch.
Design Approach
- Use a compact base model (DistilGPT2) to keep compute requirements reasonable
- Build a clean, unified corpus from Austen’s works
- Fine-tune just enough to imprint style without degrading coherence
Customisation Options
- Prompt-based generation allows experimentation with tone and narrative direction
- Training parameters (epochs, batch size, learning rate) can be adjusted to control how strongly the style is imprinted
- Corpus pipeline can be reused for other authors or domains
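As a rough illustration of prompt-based generation, the sketch below samples a continuation from a saved fine-tuned checkpoint. The directory name `austen-distilgpt2`, the helper names, and every sampling value are assumptions made for this example, not settings taken from the project. It also assumes the tokenizer was saved alongside the model.

```python
# Sampling defaults are illustrative assumptions, not the project's tuned values.
SAMPLING_DEFAULTS = {
    "max_new_tokens": 120,
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
}


def sampling_settings(**overrides) -> dict:
    """Return the defaults with any per-call overrides applied."""
    return {**SAMPLING_DEFAULTS, **overrides}


def generate(prompt: str, model_dir: str = "austen-distilgpt2", **overrides) -> str:
    """Continue `prompt` with the model saved in `model_dir` (hypothetical path)."""
    # Heavy import done lazily so the settings can be inspected without it.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
        **sampling_settings(**overrides),
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("It was a fine morning when", temperature=1.0)` would trade some coherence for variety, while a lower temperature keeps the output closer to the most likely continuation.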
Architecture Notes
- Data ingestion is automated from Project Gutenberg
- Cleaning pipeline removes boilerplate and inconsistencies
- Text is chunked and tokenised for efficient training
- Training uses Hugging Face’s Trainer API with GPU acceleration when available
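The cleaning and chunking steps above could look roughly like the following. This is a minimal sketch using only the standard library: the function names, the marker patterns, and the block size of 128 are assumptions for illustration, and the notebook's actual cleaning rules may differ.

```python
import re

# Project Gutenberg wraps each book in "*** START OF ..." / "*** END OF ..." lines.
# The exact wording varies between files, so the patterns are deliberately loose.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.IGNORECASE | re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.IGNORECASE | re.MULTILINE)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the text between the START and END marker lines, if present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


def normalise_whitespace(text: str) -> str:
    """Unify line endings, join hard-wrapped lines, and keep paragraph breaks."""
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def chunk_token_ids(token_ids: list[int], block_size: int = 128) -> list[list[int]]:
    """Split one long token sequence into equal blocks, dropping the short remainder."""
    usable = len(token_ids) - len(token_ids) % block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]
```

Fixed-size blocks avoid padding, so every training example carries the same number of real tokens.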
A key insight during development was that less training often produced better results. Over-training caused the model to become overly rigid and repetitive, while lighter fine-tuning preserved fluency and added just enough stylistic flavour.
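A light fine-tuning run with the Trainer API might be set up as sketched below. The `FineTuneConfig` dataclass, the `build_trainer` helper, the output directory, and all hyperparameter values are hypothetical choices for this example and are not the project's recorded settings. `train_dataset` is assumed to yield examples with an `input_ids` field of pre-tokenised blocks.

```python
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    # All values below are illustrative assumptions, not the project's settings.
    base_model: str = "distilgpt2"
    output_dir: str = "austen-distilgpt2"
    epochs: float = 2.0  # kept small on purpose: lighter fine-tuning preserved fluency
    batch_size: int = 8
    learning_rate: float = 5e-5


def build_trainer(cfg: FineTuneConfig, train_dataset):
    """Assemble a Trainer for causal-LM fine-tuning on pre-tokenised blocks."""
    # Imported here so the config can be inspected without the heavy dependencies.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(cfg.base_model)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained(cfg.base_model)

    args = TrainingArguments(
        output_dir=cfg.output_dir,
        num_train_epochs=cfg.epochs,
        per_device_train_batch_size=cfg.batch_size,
        learning_rate=cfg.learning_rate,
        fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
    )
    # mlm=False makes the collator copy input_ids into labels for causal LM training.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    return Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=collator)
```

Keeping the hyperparameters in one small config object makes it easy to compare a one-epoch run against a longer one and check which keeps the output fluent.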
Developer Notes
Building and Running
- The entire workflow is packaged in a single notebook for simplicity
- Running it in Google Colab avoids local setup friction
- Execution is linear: download → clean → preprocess → train → generate
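The download stage at the start of that sequence could be sketched as follows. The ebook IDs, the URL layout, and the `data/raw` directory are assumptions for illustration and should be checked against the Project Gutenberg catalogue. The notebook may fetch its files differently.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumed Gutenberg ebook IDs; verify them against the catalogue before relying on them.
BOOK_IDS = {"pride_and_prejudice": 1342, "emma": 158}


def gutenberg_url(book_id: int) -> str:
    """Plain-text URL in Gutenberg's cache layout (an assumption about the site's paths)."""
    return f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"


def download_corpus(target_dir: str = "data/raw") -> list[Path]:
    """Fetch each book once and return the local file paths."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, book_id in BOOK_IDS.items():
        path = out_dir / f"{name}.txt"
        if not path.exists():  # skip files already downloaded on a previous run
            with urlopen(gutenberg_url(book_id)) as response:
                path.write_bytes(response.read())
        paths.append(path)
    return paths
```

Caching the raw files means a restarted Colab session only repeats the download when the files are actually missing.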
Testing
- Outputs were evaluated manually using prompt-based sampling
- Early tests revealed common failure modes like repetition and structural drift
- Iterative tweaks focused on balancing coherence with stylistic imitation
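Manual review can be backed by a simple number for the repetition failure mode. The sketch below is one possible heuristic and not part of the project's evaluation: it reports the share of word trigrams that repeat an earlier trigram in the same sample. The function name and the choice of `n = 3` are assumptions.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the same text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words has nothing to repeat
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)
```

A value near 0 suggests varied output, while a sample stuck in a loop scores much higher. Comparing this score across checkpoints gives a quick signal of when further training starts to hurt.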
Extensibility
- The pipeline is reusable for other authors or niche corpora
- Swapping datasets requires minimal changes
- Model size can be scaled depending on available hardware
Behind the Scenes
- The initial attempt to train from scratch failed fast, which helped define the project direction early
- Cleaning the Gutenberg texts took more effort than expected due to inconsistent formatting across files
- The most noticeable improvement came not from architecture changes, but from better preprocessing
- Fine-tuning felt less like “training a model” and more like “nudging it toward a personality”
This project is less about achieving state-of-the-art performance and more about understanding practical constraints in applied machine learning. It reflects a shift from idealistic design to workable engineering decisions.