// guide

Fine-tuning & alignment

A pre-trained model only continues text. To make it follow instructions or match a style, fine-tune it with SFT, then align it with DPO. Both stages use the same binary and the same checkpoint format: no separate framework, no reward model to train.

// 01

SFT — supervised fine-tuning

teach it to respond

Start from a pre-trained checkpoint and train on instruction/response pairs in JSONL — one object per line:

{"prompt": "Translate to French: good morning", "response": "bonjour"}
{"prompt": "Capital of Sweden?", "response": "Stockholm"}
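Before spending compute, it can help to sanity-check the file. The following is a minimal sketch, not part of smedjan: `validate_sft_lines` is a hypothetical helper, and its strictness (both keys must be non-empty strings, no blank lines) is an assumption about what a clean dataset should look like rather than a documented rule of the tool.

```python
import json

REQUIRED_KEYS = ("prompt", "response")


def validate_sft_lines(lines):
    """Return (line_number, problem) tuples for lines that do not fit the SFT format.

    Assumption: every line must be a JSON object whose "prompt" and
    "response" values are non-empty strings.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value:
                problems.append((number, f"missing or empty {key!r}"))
    return problems


sample = [
    '{"prompt": "Translate to French: good morning", "response": "bonjour"}',
    '{"prompt": "Capital of Sweden?", "response": "Stockholm"}',
    '{"prompt": "No answer here"}',
]
problems = validate_sft_lines(sample)
print(problems)  # [(3, "missing or empty 'response'")]
```

To check a real file, pass an open file handle instead of the `sample` list; iterating over it yields one line at a time.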

Then run SFT. The defaults (low LR, short schedule) are tuned for adaptation, not from-scratch training:

smedjan sft \
  --checkpoint checkpoints/final.bin --tokenizer tokenizer.bin \
  --data instructions.jsonl \
  --steps 1000 --lr 2e-5 --batch-size 8 --seq-len 256 \
  --output-dir sft_checkpoints/

| Flag | Default | What it does |
| --- | --- | --- |
| --checkpoint | | Pre-trained model to fine-tune from. |
| --data | | JSONL with {"prompt", "response"} per line. |
| --lr | 2e-5 | Fine-tuning learning rate (much lower than pre-training). |
| --steps | 1000 | Optimizer steps. |
| --batch-size | 8 | Sequences per step. |
| --output-dir | sft_checkpoints | Where the fine-tuned checkpoints land. |

No instruction data yet? smedjan sft-prepare converts tab-separated or alternating-line pairs (e.g. NL2Bash) into the JSONL format.
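To make the target shape concrete, here is a rough sketch of the kind of conversion described above. It is not smedjan's implementation: `tsv_to_jsonl` and `alternating_to_jsonl` are hypothetical names, and the layout assumptions (prompt before the tab; prompt on odd lines, response on even lines) are guesses about the input, so prefer sft-prepare itself for real data.

```python
import json


def tsv_to_jsonl(tsv_lines):
    """Turn 'prompt<TAB>response' lines into JSONL strings.

    Assumption: the prompt comes before the first tab. Lines without a
    tab, or with an empty side, are skipped.
    """
    out = []
    for line in tsv_lines:
        prompt, sep, response = line.rstrip("\n").partition("\t")
        prompt, response = prompt.strip(), response.strip()
        if not sep or not prompt or not response:
            continue
        out.append(json.dumps({"prompt": prompt, "response": response},
                              ensure_ascii=False))
    return out


def alternating_to_jsonl(lines):
    """Turn alternating prompt / response lines into JSONL strings.

    Assumption: after dropping blank lines, odd lines are prompts and
    even lines are responses. A trailing unpaired line is ignored.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    return [
        json.dumps({"prompt": p, "response": r}, ensure_ascii=False)
        for p, r in zip(cleaned[0::2], cleaned[1::2])
    ]


from_tsv = tsv_to_jsonl([
    "list files in the current directory\tls\n",
    "malformed line without a tab\n",
])
from_pairs = alternating_to_jsonl([
    "show disk usage\n",
    "df -h\n",
    "dangling prompt\n",
])
print(from_tsv)
print(from_pairs)
```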

// 02

DPO — preference alignment

nudge toward better answers

Direct Preference Optimization aligns the model using chosen vs rejected answer pairs — lighter than RLHF, with no separate reward model. Data is JSONL:

{"prompt": "Explain gravity", "chosen": "a clear, correct answer", "rejected": "a vague or wrong one"}

# 1: convert preference pairs to the binary format
smedjan dpo-prepare --input prefs.jsonl --output prefs.bin --tokenizer tokenizer.bin

# 2: align the SFT model against a frozen reference (usually the same checkpoint)
smedjan dpo \
  --checkpoint sft_checkpoints/final.bin \
  --ref-checkpoint sft_checkpoints/final.bin \
  --tokenizer tokenizer.bin --dataset prefs.bin \
  --beta 0.1 --lr 1e-6 --output-dir dpo_checkpoints/

| Flag | Default | What it does |
| --- | --- | --- |
| --checkpoint | | Policy model to update (usually your SFT checkpoint). |
| --ref-checkpoint | | Frozen reference anchor; typically the same checkpoint. |
| --beta | 0.1 | DPO temperature; in the standard DPO formulation, a higher value keeps the policy closer to the reference. |
| --lr | 1e-6 | Alignment learning rate (very low). |

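
For intuition about what the policy, the reference, and beta each contribute, the standard DPO loss for a single preference pair can be sketched as follows. This is the textbook formulation, assumed rather than confirmed to match smedjan's internals; `dpo_loss` and its arguments (summed log-probabilities of each answer under each model) are illustrative names.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of an answer under the
    policy or the frozen reference. The margin measures how much more
    the policy favours the chosen answer than the reference does.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable log(1 + exp(-z)), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# At step 0 the policy and the reference are the same checkpoint,
# so the margin is zero and the loss is log(2) whatever beta is.
start = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# After some training the policy likes the chosen answer more and the
# rejected answer less than the reference does, so the loss drops.
later = dpo_loss(-10.0, -15.0, -12.0, -12.0)

print(start, later)
```

Because the reference terms are subtracted out, the loss only rewards movement relative to the frozen checkpoint, which is why the same SFT checkpoint can serve as both policy and reference at the start.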
// 03

Distillation

two different things

Knowledge distillation trains a small student against a local teacher checkpoint, blending KL divergence with cross-entropy. It is a flag on train, not a separate command:

# knowledge distillation from a local teacher checkpoint (KL + CE)
smedjan train \
  --dataset train.bin --tokenizer tokenizer.bin --size small \
  --teacher-checkpoint big_teacher.bin \
  --distill-temperature 4.0 --distill-alpha 0.5
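
As a rough picture of what the two distillation flags control, here is the common Hinton-style blend for a single token position. Several details are assumptions, not documented smedjan behaviour: that alpha weights the KL term and (1 - alpha) the cross-entropy term, that the KL term is scaled by temperature squared, and the function names themselves.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_logits, target, temperature=4.0, alpha=0.5):
    """Blend of soft-target KL and hard-target cross-entropy for one position.

    Assumed weighting: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Cross-entropy against the true next token, at temperature 1.
    ce = -math.log(softmax(student_logits)[target])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


teacher = [2.0, 0.5, -1.0]
hard_ce = -math.log(softmax(teacher)[0])

# A student that already matches the teacher pays no KL penalty.
matched = distill_loss(teacher, teacher, target=0)
# An untrained (uniform) student pays both terms.
untrained = distill_loss([0.0, 0.0, 0.0], teacher, target=0)

print(matched, untrained)
```

A higher temperature flattens both distributions, so the student also learns from the teacher's ranking of unlikely tokens rather than only its top choice.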

Synthetic-data distillation is the distill subcommand: it generates instruction data by calling an external model (Ollama locally, or a Claude/OpenAI-style endpoint) and writes JSONL you then SFT on:

# generate instruction data from an external model (Ollama by default)
smedjan distill \
  --api-url http://localhost:11434/api/generate --model qwen2.5:7b \
  --n-samples 1000 --max-tokens 512 --output synth.jsonl
# …then SFT on what it produced
smedjan sft --checkpoint base.bin --tokenizer tokenizer.bin --data synth.jsonl
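
For orientation, one round trip of this kind against a local Ollama server could look roughly like the sketch below. It is not smedjan's code: the helper names are hypothetical, and mapping a token limit onto Ollama's `num_predict` option is an assumption about how --max-tokens might be passed along. `generate` needs a running Ollama instance and is only defined here, not called.

```python
import json
import urllib.request


def build_request(prompt, model="qwen2.5:7b", max_tokens=512):
    """Body for Ollama's /api/generate endpoint, with streaming turned off."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},  # assumed mapping of the token limit
    }


def generate(prompt, api_url="http://localhost:11434/api/generate", **kwargs):
    """POST one prompt and return the generated text (requires a live server)."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        api_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["response"]


def to_sft_record(prompt, answer):
    """One JSONL line in the {"prompt", "response"} shape that SFT reads."""
    return json.dumps({"prompt": prompt, "response": answer.strip()},
                      ensure_ascii=False)


payload = build_request("Capital of Sweden?")
record = to_sft_record("Capital of Sweden?", " Stockholm\n")
print(record)
```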

Don't confuse them: train --teacher-checkpoint distils from weights you already have; the distill command distils from an API into a dataset.

// 04

A typical chain

base → assistant

Pre-train → SFT → DPO, each reading the previous stage's checkpoint: train → sft → dpo. Then quantize and export for serving; see Quantize & export.
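
The hand-off between stages can be written down as a dry run that only prints the commands. This sketch uses just the flags and file names shown in this guide; that `train` writes checkpoints/final.bin is inferred from the SFT example rather than stated, and `build_chain` is a hypothetical helper.

```python
import shlex


def build_chain(tokenizer="tokenizer.bin"):
    """Return the argv list for each stage, wired checkpoint to checkpoint."""
    base = "checkpoints/final.bin"      # assumed output of `smedjan train`
    sft = "sft_checkpoints/final.bin"   # read by the DPO example in this guide
    return [
        ["smedjan", "train", "--dataset", "train.bin",
         "--tokenizer", tokenizer, "--size", "small"],
        ["smedjan", "sft", "--checkpoint", base, "--tokenizer", tokenizer,
         "--data", "instructions.jsonl", "--output-dir", "sft_checkpoints/"],
        ["smedjan", "dpo-prepare", "--input", "prefs.jsonl",
         "--output", "prefs.bin", "--tokenizer", tokenizer],
        ["smedjan", "dpo", "--checkpoint", sft, "--ref-checkpoint", sft,
         "--tokenizer", tokenizer, "--dataset", "prefs.bin",
         "--output-dir", "dpo_checkpoints/"],
    ]


chain = build_chain()
for argv in chain:
    # Dry run: print each command instead of executing it.
    print(shlex.join(argv))
```

To execute the chain for real, each list can be passed to `subprocess.run(argv, check=True)` so that a failing stage stops the ones after it.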