
Local-CoT
A framework for fine-tuning and running a chain-of-thought model prompting locally on a MacBook without relying on external APIs.
Overview
A compact research project exploring how small language models can be taught to “think out loud” while solving maths problems. The focus was on fine-tuning Microsoft’s Phi-3 model to produce step-by-step reasoning using Chain-of-Thought (CoT), and validating whether that reasoning actually improves accuracy rather than just verbosity. Built and tested end-to-end on an Apple Silicon laptop, this project doubles as both an experimentation sandbox and a reproducible pipeline.
Problem
Out-of-the-box small models tend to jump straight to answers, often getting them wrong with no visibility into why. For something like grade-school maths, that is a limitation. The need here was twofold:
- Improve answer accuracy on structured reasoning tasks
- Make the model’s reasoning process explicit and inspectable
There is also a practical constraint. Most research setups assume access to large GPUs. This project explores whether meaningful gains can be achieved on consumer hardware without cutting too many corners.
Solution
The approach was to fine-tune Phi-3 on the GSM8K dataset using Chain-of-Thought style responses, where each answer includes intermediate reasoning steps followed by a final answer block.
Design-wise, the pipeline was broken into clear stages:
- Data preparation that reformats GSM8K into conversational CoT prompts
- Lightweight LoRA-based fine-tuning to stay within memory limits
- Evaluation scripts that check both correctness and reasoning consistency
- Optional self-consistency decoding to improve reliability by sampling multiple reasoning paths
A few practical decisions shaped the project:
- LoRA over full fine-tuning to keep training feasible on an M4 Air
- Longer sequence lengths to allow full reasoning traces
- Mixed training option to prevent the model from becoming overly verbose
- Self-consistency decoding as a cheap way to boost performance without retraining
The result is a modular setup where experimentation is straightforward. You can swap training strategies, adjust parameters, or test inference tricks without rewriting the pipeline.
Developer Notes
This project was built with iteration speed in mind rather than perfection.
- The pipeline script (
run_pipeline.py) exists because manually running each stage got tedious very quickly. Automating it made debugging far easier. - Early runs failed silently due to formatting issues in the dataset. A lot of time went into making the data structure predictable before training even worked.
- Thermal throttling on the M4 Air was a real constraint. Training sessions had to be split, and parameter choices were often dictated by hardware limits rather than theory.
- Self-consistency looked trivial on paper but added noticeable overhead in practice. It forced a trade-off between accuracy and runtime that had to be tuned manually.
- Evaluation was more revealing than training. Many outputs looked correct in reasoning but failed in the final answer, which led to adding explicit parsing and error categorisation.
To build and run:
- Install dependencies from
requirements.txt - Run the validation phase first to confirm the environment works
- Use phased execution to avoid wasting time on full training runs if something breaks
- Test inference separately before trusting evaluation metrics
Extending the project is straightforward:
- Swap in a different model with minimal changes
- Add new decoding strategies in the inference module
- Plug in alternative datasets for broader reasoning tasks
Overall, the project is less about squeezing maximum accuracy and more about understanding how reasoning emerges, breaks, and can be guided in smaller models under real-world constraints.