Oct 2025 – Nov 2025

Jane Austen SLM

Fine-tuned a small language model to emulate Jane Austen’s writing style, using a custom corpus built from her works.

Python · PyTorch · Jupyter Notebook

Overview

Jane Austen SLM is a focused experiment in style-driven text generation. The project fine-tunes a compact language model on Jane Austen’s complete works to produce original prose that echoes her tone, cadence, and narrative style. It began as an attempt to build a small model from scratch, but evolved into a more practical fine-tuning pipeline after running into real-world data limitations.

Problem

The original goal was straightforward but unrealistic: train a language model purely on Austen’s writing and have it generate convincing 19th-century prose.

That goal ran into a hard constraint quickly. Even with all of Austen’s available texts combined, the corpus was too small to support training a model from scratch. Early experiments overfitted rapidly and produced repetitive output: the model memorised passages instead of learning how the language is structured.

So the problem shifted from “how to train a model on Austen” to “how to inject Austen’s style into a model that already understands English.”

Solution

The project pivoted to fine-tuning a lightweight pre-trained model instead of building one from zero.

Design Approach

Use a compact base model (DistilGPT-2) to keep compute requirements reasonable
Build a clean, unified corpus from Austen’s works
Fine-tune just enough to imprint style without degrading coherence

Customisation Options

Prompt-based generation allows experimenting with tone and narrative direction
Training parameters (epochs, batch size, learning rate) can be adjusted to control how strongly the style is imprinted
Corpus pipeline can be reused for other authors or domains
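
As a rough illustration of prompt-based generation, the sketch below samples a continuation from a fine-tuned checkpoint. The function names, the `austen-distilgpt2` directory (assumed to hold a saved model and tokenizer), and every default value are assumptions made for this example, not settings taken from the project.

```python
def sampling_kwargs(temperature=0.9, top_p=0.95, max_new_tokens=120):
    """Collect generation settings. The defaults are illustrative guesses."""
    if temperature <= 0:
        raise ValueError("temperature must be positive when sampling")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    return {
        "do_sample": True,            # sample instead of greedy decoding
        "temperature": temperature,   # lower = safer, higher = more adventurous
        "top_p": top_p,               # nucleus sampling cutoff
        "max_new_tokens": max_new_tokens,
        "no_repeat_ngram_size": 3,    # blocks verbatim 3-gram loops
    }


def generate(prompt, model_dir="austen-distilgpt2", **overrides):
    """Continue `prompt` with a checkpoint stored in `model_dir` (hypothetical path)."""
    # Imported lazily so sampling_kwargs can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 style models have no pad token
        **sampling_kwargs(**overrides),
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("It was a fine morning when", temperature=0.7)` would be one way to compare a more conservative tone against the default sampling settings.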

Architecture Notes

Data ingestion is automated from Project Gutenberg
Cleaning pipeline removes boilerplate and inconsistencies
Text is chunked and tokenised for efficient training
Training uses Hugging Face’s Trainer API with GPU acceleration when available
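
The cleaning and chunking steps might look something like the sketch below. It assumes the usual `*** START OF ... ***` and `*** END OF ... ***` marker lines found in Project Gutenberg plain-text files. The helper names and the choice to drop the final short block are assumptions for illustration, not the notebook's actual code.

```python
import re

# Marker lines Project Gutenberg places around the body of a plain-text ebook.
# The wording varies between files, so the patterns are kept deliberately loose.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Keep only the text between the START and END marker lines, if present."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


def normalise_whitespace(text: str) -> str:
    """Unify line endings, trim trailing spaces, and collapse runs of blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+\n", "\n", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()


def chunk_token_ids(token_ids: list[int], block_size: int) -> list[list[int]]:
    """Split token ids into fixed-size blocks, dropping the short remainder."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    usable = len(token_ids) - len(token_ids) % block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]
```

Fixed-size blocks keep every training example the same length, so batches need no padding. The trade-off is that up to `block_size - 1` tokens are discarded from the end of the corpus.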

A key insight during development was that less training often produced better results. Over-training caused the model to become overly rigid and repetitive, while lighter fine-tuning preserved fluency and added just enough stylistic flavour.
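
A deliberately light run could be configured along these lines. This is a minimal sketch: `light_finetune_settings`, `build_trainer`, the output directory name, and all hyperparameter values are assumptions chosen for illustration, and `train_dataset` is assumed to be a list of `{"input_ids": [...]}` dicts holding equal-length token blocks.

```python
def light_finetune_settings(epochs=1, batch_size=8, learning_rate=5e-5):
    """Hyperparameters for a short run. Illustrative values, not the project's."""
    return {
        "num_train_epochs": epochs,
        "per_device_train_batch_size": batch_size,
        "learning_rate": learning_rate,
    }


def build_trainer(train_dataset, output_dir="austen-distilgpt2", **overrides):
    """Assemble a Hugging Face Trainer for causal-LM fine-tuning of distilgpt2."""
    # Imported lazily so the settings helper works without the heavy dependencies.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers ship without a pad token
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")

    args = TrainingArguments(
        output_dir=output_dir,
        fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
        save_strategy="epoch",
        logging_steps=50,
        report_to="none",
        **light_finetune_settings(**overrides),
    )
    # mlm=False makes the collator copy input_ids into labels for next-token prediction.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=collator,
    )
```

Starting from a single epoch and raising `epochs` only while sampled output stays fluent is one way to act on the observation above. After `build_trainer(blocks).train()`, the tokenizer would also need to be saved alongside the model if the checkpoint is to be reloaded from `output_dir`.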

Developer Notes

Building and Running

The entire workflow is packaged in a single notebook for simplicity
Running it in Google Colab avoids local setup friction
Execution is linear: download → clean → preprocess → train → generate

Testing

Outputs were evaluated manually using prompt-based sampling
Early tests revealed common failure modes like repetition and structural drift
Iterative tweaks focused on balancing coherence with stylistic imitation
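
Evaluation in this project was manual, but the repetition failure mode can also be roughly quantified. The helper below is a hypothetical supplement rather than something the notebook contained: it reports the share of word n-grams in a sample that repeat an earlier n-gram.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the same text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)
```

A score near zero suggests varied output, while a rising score across checkpoints would be consistent with the over-training behaviour described earlier. It says nothing about coherence or style, so it complements reading the samples rather than replacing it.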

Extensibility

The pipeline is reusable for other authors or niche corpora
Swapping datasets requires minimal changes
Model size can be scaled depending on available hardware

Behind the Scenes

The initial attempt to train from scratch failed fast, which helped define the project direction early
Cleaning the Gutenberg texts took more effort than expected due to inconsistent formatting across files
The most noticeable improvement came not from architecture changes, but from better preprocessing
Fine-tuning felt less like “training a model” and more like “nudging it toward a personality”

This project is less about achieving state-of-the-art performance and more about understanding practical constraints in applied machine learning. It reflects a shift from idealistic design to workable engineering decisions.