Jane Austen SLM
Oct 2025 – Nov 2025

A compact language model fine-tuned on Jane Austen's works to imitate her writing style.

Python, PyTorch, Jupyter Notebook

Overview

Jane Austen SLM is a focused experiment in style-driven text generation. The project fine-tunes a compact language model on Jane Austen’s complete works to produce original prose that echoes her tone, cadence, and narrative style. It began as an attempt to build a small model from scratch, but evolved into a more practical fine-tuning pipeline after running into real-world data limitations.


Problem

The original goal was straightforward but unrealistic: train a language model purely on Austen’s writing and have it generate convincing 19th-century prose.

That goal quickly ran into a hard constraint. Even after aggregating all available texts, the dataset was too small to support training a model from scratch. Early experiments overfit rapidly and produced repetitive output: the model memorised passages instead of learning general language structure.

So the problem shifted from “how to train a model on Austen” to “how to inject Austen’s style into a model that already understands English.”


Solution

The project pivoted to fine-tuning a lightweight pre-trained model instead of building one from scratch.

Design Approach

  • Use a compact base model (DistilGPT-2) to keep compute requirements reasonable
  • Build a clean, unified corpus from Austen’s works
  • Fine-tune just enough to imprint style without degrading coherence

Customisation Options

  • Prompt-based generation allows experimenting with tone and narrative direction
  • Training parameters (epochs, batch size, learning rate) are adjustable to control how strongly the style is imprinted
  • Corpus pipeline can be reused for other authors or domains
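
As a rough illustration of how the training parameters might be grouped, the sketch below collects them in one place. The class name FineTuneConfig, the default values, and the output directory are assumptions made for this example, not the notebook's actual settings. The dictionary keys match the argument names used by Hugging Face's TrainingArguments.

```python
import math
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    # All defaults are illustrative assumptions, not the project's real values.
    epochs: int = 2
    batch_size: int = 8
    learning_rate: float = 5e-5
    output_dir: str = "austen-distilgpt2"  # hypothetical path

    def total_steps(self, num_examples: int) -> int:
        """Optimiser steps on a single device with no gradient accumulation."""
        return math.ceil(num_examples / self.batch_size) * self.epochs

    def training_arguments_kwargs(self) -> dict:
        """Keyword arguments under the names transformers.TrainingArguments expects."""
        return {
            "output_dir": self.output_dir,
            "num_train_epochs": self.epochs,
            "per_device_train_batch_size": self.batch_size,
            "learning_rate": self.learning_rate,
        }


# A lighter and a heavier setting to compare stylistic intensity.
light = FineTuneConfig(epochs=1)
heavy = FineTuneConfig(epochs=5)
```

Keeping the knobs in one object makes it easy to compare a light run against a heavy one without touching the rest of the notebook.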

Architecture Notes

  • Data ingestion is automated from Project Gutenberg
  • Cleaning pipeline removes boilerplate and inconsistencies
  • Text is chunked and tokenised for efficient training
  • Training uses Hugging Face’s Trainer API with GPU acceleration when available
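
The cleaning step can be sketched roughly as follows. This is a minimal example, not the notebook's actual code: the function names are hypothetical, and it assumes each file carries the usual Project Gutenberg "*** START OF ..." and "*** END OF ..." marker lines, whose exact wording varies between files.

```python
import re

# Marker lines that Project Gutenberg places around the book text.
# The wording differs slightly across files ("THE" vs "THIS"), so the pattern is loose.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the text between the START and END marker lines, if present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


def normalise_whitespace(text: str) -> str:
    """Unify line endings, unwrap hard-wrapped lines, and keep paragraph breaks."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

Files that lack the markers pass through unchanged apart from whitespace normalisation, so they are worth spot-checking by hand.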

A key insight during development was that less training often produced better results. Over-training caused the model to become overly rigid and repetitive, while lighter fine-tuning preserved fluency and added just enough stylistic flavour.


Developer Notes

Building and Running

  • The entire workflow is packaged in a single notebook for simplicity
  • Running it in Google Colab avoids local setup friction
  • Execution is linear: download → clean → preprocess → train → generate
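
The preprocess, train, and generate stages might look roughly like the sketch below. It is an illustration under stated assumptions rather than the notebook's code: the function names, the block size of 128, the hyperparameters, the sampling settings, and the output directory are all made up for the example, and it assumes the base checkpoint is the one published on Hugging Face as distilgpt2 and that the datasets library is used to load the cleaned corpus.

```python
def group_into_blocks(batch: dict, block_size: int) -> dict:
    """Concatenate tokenised texts and cut them into equal blocks of block_size tokens."""
    joined = {key: sum(batch[key], []) for key in batch}
    # Drop the short remainder so every block has exactly block_size tokens.
    usable = (len(joined["input_ids"]) // block_size) * block_size
    return {
        key: [values[i:i + block_size] for i in range(0, usable, block_size)]
        for key, values in joined.items()
    }


def fine_tune_and_sample(corpus_path: str, prompt: str,
                         output_dir: str = "austen-distilgpt2") -> str:
    # Imported lazily so the helper above works without these libraries installed.
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
        Trainer, TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")

    # Preprocess: one example per line of the cleaned corpus, then fixed-size blocks.
    dataset = load_dataset("text", data_files={"train": corpus_path})["train"]
    tokenised = dataset.map(lambda rows: tokenizer(rows["text"]),
                            batched=True, remove_columns=["text"])
    blocks = tokenised.map(lambda batch: group_into_blocks(batch, 128), batched=True)

    # Train: deliberately light settings (assumed values).
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=2,
        per_device_train_batch_size=8,
        learning_rate=5e-5,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=blocks,
        # mlm=False makes the collator build causal language modelling labels.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

    # Generate: sample a continuation of the prompt.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=80, do_sample=True, top_p=0.95,
        temperature=0.9, pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Trainer picks up a GPU automatically when one is visible, which is why the sketch contains no explicit device handling before training.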

Testing

  • Outputs were evaluated manually using prompt-based sampling
  • Early tests revealed common failure modes like repetition and structural drift
  • Iterative tweaks focused on balancing coherence with stylistic imitation
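
Repetition, one of the failure modes above, can also be given a rough number to sit alongside manual review. The following is a minimal sketch and was not part of the original evaluation: the function name and the choice of word trigrams are assumptions.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an n-gram seen earlier in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words
    counts = Counter(ngrams)
    # Every occurrence after the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)
```

A value near 0 means few repeated phrases, while values climbing toward 1 suggest the model is looping. Comparing the score across checkpoints is one way to spot over-training before reading the samples.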

Extensibility

  • The pipeline can be pointed at other authors or niche corpora with minimal changes
  • Model size can be scaled depending on available hardware

Behind the Scenes

  • The initial attempt to train from scratch failed fast, which helped define the project direction early
  • Cleaning the Gutenberg texts took more effort than expected due to inconsistent formatting across files
  • The most noticeable improvement came not from architecture changes, but from better preprocessing
  • Fine-tuning felt less like “training a model” and more like “nudging it toward a personality”

This project is less about achieving state-of-the-art performance and more about understanding practical constraints in applied machine learning. It reflects a shift from idealistic design to workable engineering decisions.