Fine-tuning & alignment
A pre-trained model continues text. To make it follow instructions or match a style, fine-tune and align it — SFT, then DPO — all with the same binary and the same checkpoint format. No separate framework, no reward model to train.
SFT — supervised fine-tuning
Start from a pre-trained checkpoint and train on instruction/response pairs in JSONL — one object per line:

    {"prompt": "Translate to French: good morning", "response": "bonjour"}
    {"prompt": "Capital of Sweden?", "response": "Stockholm"}

Then run SFT. The defaults (low LR, short schedule) are tuned for adaptation, not from-scratch training:

    smedjan sft \
      --checkpoint checkpoints/final.bin --tokenizer tokenizer.bin \
      --data instructions.jsonl \
      --steps 1000 --lr 2e-5 --batch-size 8 --seq-len 256 \
      --output-dir sft_checkpoints/

| Flag | Default | What it does |
|---|---|---|
| --checkpoint | — | Pre-trained model to fine-tune from. |
| --data | — | JSONL with {"prompt", "response"} per line. |
| --lr | 2e-5 | Fine-tuning learning rate (much lower than pre-training). |
| --steps | 1000 | Optimizer steps. |
| --batch-size | 8 | Sequences per step. |
| --output-dir | sft_checkpoints | Where the fine-tuned checkpoints land. |
No instruction data yet? smedjan sft-prepare converts tab-separated or alternating-line pairs (e.g. NL2Bash) into the JSONL format.
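To make the target format concrete, here is a minimal Python sketch of the same kind of conversion: tab-separated prompt/response rows in, one JSON object per line out. It only illustrates the JSONL layout described above. It is not how smedjan sft-prepare is implemented, and the function name and sample rows are made up for the example.

```python
import json


def tsv_to_jsonl_lines(tsv_text):
    """Turn 'prompt<TAB>response' rows into JSONL strings, skipping bad rows."""
    lines = []
    for row in tsv_text.splitlines():
        if not row.strip():
            continue  # ignore blank rows
        # Split on the first tab only, so a response may itself contain tabs.
        prompt, sep, response = row.partition("\t")
        if not sep or not prompt.strip() or not response.strip():
            continue  # no tab found, or one side is empty
        record = {"prompt": prompt.strip(), "response": response.strip()}
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Hypothetical sample input: two good rows, one blank row, one row with no tab.
sample = "List files\tls -la\n\nShow disk usage\tdu -sh .\nbroken row without a tab\n"
jsonl_lines = tsv_to_jsonl_lines(sample)
print("\n".join(jsonl_lines))
```

Rows that cannot be split into a non-empty prompt and response are dropped rather than written as partial records, so every output line has both keys.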
DPO — preference alignment
Direct Preference Optimization aligns the model using chosen vs rejected answer pairs — lighter than RLHF, with no separate reward model. Data is JSONL:

    {"prompt": "Explain gravity", "chosen": "a clear, correct answer", "rejected": "a vague or wrong one"}

    # 1: convert preference pairs to the binary format
    smedjan dpo-prepare --input prefs.jsonl --output prefs.bin --tokenizer tokenizer.bin

    # 2: align the SFT model against a frozen reference (usually the same checkpoint)
    smedjan dpo \
      --checkpoint sft_checkpoints/final.bin \
      --ref-checkpoint sft_checkpoints/final.bin \
      --tokenizer tokenizer.bin --dataset prefs.bin \
      --beta 0.1 --lr 1e-6 --output-dir dpo_checkpoints/

| Flag | Default | What it does |
|---|---|---|
| --checkpoint | — | Policy model to update (usually your SFT checkpoint). |
| --ref-checkpoint | — | Frozen reference anchor, typically the same checkpoint. |
| --beta | 0.1 | DPO temperature (β); higher keeps the policy closer to the reference, i.e. more conservative. |
| --lr | 1e-6 | Alignment learning rate (very low). |
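For readers who want to see what the policy, the frozen reference and beta do together, the following is a sketch of the standard DPO loss for a single preference pair, as defined in the DPO paper. It is an illustration only: the source does not show smedjan's implementation, and the log-probabilities below are invented numbers.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full answer under the
    policy or under the frozen reference model.
    """
    # How much more the policy favours each answer than the reference does.
    chosen_shift = policy_chosen - ref_chosen
    rejected_shift = policy_rejected - ref_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is 0, so the loss is log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has moved toward the chosen answer and away from the rejected one.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(improved < start)
```

Because the policy usually starts as a copy of the reference, training begins at a loss of log(2) and falls as the policy separates chosen from rejected answers. Beta multiplies that separation before it enters the loss.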
Distillation
Knowledge distillation trains a small student against a local teacher checkpoint, blending KL divergence with cross-entropy. It is a flag on train, not a separate command:

    # knowledge distillation from a local teacher checkpoint (KL + CE)
    smedjan train \
      --dataset train.bin --tokenizer tokenizer.bin --size small \
      --teacher-checkpoint big_teacher.bin \
      --distill-temperature 4.0 --distill-alpha 0.5

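As a rough picture of what blending KL divergence with cross-entropy means, here is a sketch of the common Hinton-style distillation loss for one token position. Two things in it are assumptions, not taken from the source: that alpha weights the KL term (with 1 - alpha on cross-entropy), and that the KL term is scaled by the squared temperature. smedjan may define either differently.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_logits, target, temperature=4.0, alpha=0.5):
    """Blend KL(teacher || student) on softened distributions with hard-label CE.

    ASSUMPTION: alpha weights the KL term and (1 - alpha) the cross-entropy,
    and the KL term is multiplied by temperature**2 (Hinton et al., 2015).
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    # Cross-entropy against the true token uses the unsoftened student.
    ce = -math.log(softmax(student_logits)[target])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Made-up logits over a three-token vocabulary; the true token is index 0.
teacher_logits = [4.0, 1.0, 0.5]
matching = distill_loss([4.0, 1.0, 0.5], teacher_logits, target=0)
off_target = distill_loss([0.5, 1.0, 4.0], teacher_logits, target=0)
print(matching < off_target)
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative ranking of unlikely tokens rather than only its top choice.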
Synthetic-data distillation is the distill subcommand: it generates instruction data by calling an external model (Ollama locally, or a Claude/OpenAI-style endpoint) and writes JSONL you then SFT on:

    # generate instruction data from an external model (Ollama by default)
    smedjan distill \
      --api-url http://localhost:11434/api/generate --model qwen2.5:7b \
      --n-samples 1000 --max-tokens 512 --output synth.jsonl

    # …then SFT on what it produced
    smedjan sft --checkpoint base.bin --tokenizer tokenizer.bin --data synth.jsonl

Don't confuse them: train --teacher-checkpoint distils from weights you already have; the distill command distils from an API into a dataset.
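Generated data is often noisy, so it can be worth filtering the JSONL before running SFT on it. The sketch below drops malformed lines, empty fields and repeated prompts. It assumes the distill output uses the same {"prompt", "response"} keys as the SFT format, which the source implies but does not spell out. The function name and sample lines are hypothetical.

```python
import json


def clean_jsonl(lines):
    """Keep well-formed, non-empty, first-seen prompt/response records."""
    seen_prompts = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed generation
        if not isinstance(record, dict):
            continue
        # ASSUMPTION: same keys as the SFT instruction format.
        prompt = record.get("prompt")
        response = record.get("response")
        if not isinstance(prompt, str) or not isinstance(response, str):
            continue
        if not prompt.strip() or not response.strip():
            continue
        if prompt in seen_prompts:
            continue  # duplicate prompt: keep the first answer only
        seen_prompts.add(prompt)
        kept.append({"prompt": prompt, "response": response})
    return kept


# Invented sample: one good record, a duplicate, an empty answer, a cut-off line.
raw = [
    '{"prompt": "Capital of Sweden?", "response": "Stockholm"}',
    '{"prompt": "Capital of Sweden?", "response": "Stockholm."}',
    '{"prompt": "Explain gravity", "response": ""}',
    '{"prompt": "Unfinished',
]
cleaned = clean_jsonl(raw)
print(len(cleaned))
```

Writing the surviving records back out with one json.dumps call per line gives a file in the same layout the SFT step reads.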
A typical chain
Pre-train → SFT → DPO, each reading the previous stage's checkpoint: train → sft → dpo. Then quantize and export for serving — see Quantize & export.
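The hand-off between stages can be sketched as a small dry-run script that only builds and prints the SFT and DPO commands, using the flags and paths shown earlier on this page. Nothing is executed. The helper name and its arguments are hypothetical, and the final.bin file name inside each output directory follows the examples above.

```python
import shlex


def build_chain(pretrained, tokenizer, sft_data, prefs_jsonl):
    """Return the SFT -> DPO commands as argument lists (dry run, nothing is run)."""
    # The DPO example above reads final.bin from the SFT output directory.
    sft_checkpoint = "sft_checkpoints/final.bin"
    return [
        ["smedjan", "sft", "--checkpoint", pretrained, "--tokenizer", tokenizer,
         "--data", sft_data, "--output-dir", "sft_checkpoints/"],
        ["smedjan", "dpo-prepare", "--input", prefs_jsonl,
         "--output", "prefs.bin", "--tokenizer", tokenizer],
        ["smedjan", "dpo", "--checkpoint", sft_checkpoint,
         "--ref-checkpoint", sft_checkpoint, "--tokenizer", tokenizer,
         "--dataset", "prefs.bin", "--output-dir", "dpo_checkpoints/"],
    ]


commands = build_chain(
    "checkpoints/final.bin", "tokenizer.bin", "instructions.jsonl", "prefs.jsonl"
)
for command in commands:
    print(shlex.join(command))  # shell-quoted, ready to copy and paste
```

Keeping each command as an argument list makes it easy to pass the same lists to subprocess.run later, once the printed commands look right.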