Jan 2026 – Present

Local-CoT

Built a local framework for running and fine-tuning reasoning-focused language models without relying on external APIs.

PythonPyTorchScikit-learnJupyter NotebookMLX

Overview

A compact research project exploring how small language models can be taught to “think out loud” while solving maths problems. The focus was on fine-tuning Microsoft’s Phi-3 model to produce step-by-step reasoning using Chain-of-Thought (CoT), and validating whether that reasoning actually improves accuracy rather than just verbosity. Built and tested end-to-end on an Apple Silicon laptop, this project doubles as both an experimentation sandbox and a reproducible pipeline.

Problem

Out-of-the-box small models tend to jump straight to answers, often getting them wrong with no visibility into why. For something like grade-school maths, that is a limitation. The need here was twofold:

Improve answer accuracy on structured reasoning tasks
Make the model’s reasoning process explicit and inspectable

There is also a practical constraint. Most research setups assume access to large GPUs. This project explores whether meaningful gains can be achieved on consumer hardware without cutting too many corners.

Solution

The approach was to fine-tune Phi-3 on the GSM8K dataset using Chain-of-Thought style responses, where each answer includes intermediate reasoning steps followed by a final answer block.

Design-wise, the pipeline was broken into clear stages:

Data preparation that reformats GSM8K into conversational CoT prompts
Lightweight LoRA-based fine-tuning to stay within memory limits
Evaluation scripts that check both correctness and reasoning consistency
Optional self-consistency decoding to improve reliability by sampling multiple reasoning paths

A few practical decisions shaped the project:

LoRA over full fine-tuning to keep training feasible on an M4 Air
Longer sequence lengths to allow full reasoning traces
Mixed training option to prevent the model from becoming overly verbose
Self-consistency decoding as a cheap way to boost performance without retraining

The result is a modular setup where experimentation is straightforward. You can swap training strategies, adjust parameters, or test inference tricks without rewriting the pipeline.

Developer Notes

This project was built with iteration speed in mind rather than perfection.

The pipeline script (run_pipeline.py) exists because manually running each stage got tedious very quickly. Automating it made debugging far easier.
Early runs failed silently due to formatting issues in the dataset. A lot of time went into making the data structure predictable before training even worked.
Thermal throttling on the M4 Air was a real constraint. Training sessions had to be split, and parameter choices were often dictated by hardware limits rather than theory.
Self-consistency looked trivial on paper but added noticeable overhead in practice. It forced a trade-off between accuracy and runtime that had to be tuned manually.
Evaluation was more revealing than training. Many outputs looked correct in reasoning but failed in the final answer, which led to adding explicit parsing and error categorisation.

To build and run:

Install dependencies from requirements.txt
Run the validation phase first to confirm the environment works
Use phased execution to avoid wasting time on full training runs if something breaks
Test inference separately before trusting evaluation metrics

Extending the project is straightforward:

Swap in a different model with minimal changes
Add new decoding strategies in the inference module
Plug in alternative datasets for broader reasoning tasks

Overall, the project is less about squeezing maximum accuracy and more about understanding how reasoning emerges, breaks, and can be guided in smaller models under real-world constraints.