Training
Pre-train a model from scratch. Start with five knobs that matter, then reach for memory levers, optimizers, and the research menu only when you need them. Every flag here is real; the full list is `smedjan train --help`.
A real training command
    smedjan train \
      --dataset train.bin --tokenizer tokenizer.bin \
      --size medium \
      --batch-size 4 --seq-len 512 \
      --steps 50000 --lr 3e-4 --warmup 2000 \
      --lr-schedule cosine \
      --checkpoint-dir checkpoints/
| Flag | Default | What it does |
|---|---|---|
| --size | small | Preset tiny · small · medium · large · max, or custom. |
| --batch-size | 32 | Sequences per step. Lower it first if you run out of memory. |
| --seq-len | 256 | Tokens per sequence (context length during training). |
| --steps | 50000 | Total optimizer steps. |
| --lr | 3e-4 | Peak learning rate. |
| --warmup | 2000 | Steps to ramp the LR up before the schedule decays it. |
| --lr-schedule | cosine | cosine · wsd · wso · invsqrt · trapezoid. |
| --checkpoint-dir | checkpoints | Where checkpoints are written. |
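To make the schedule flags concrete, here is a minimal Python sketch of how --lr, --warmup, --steps, and --lr-schedule cosine could combine. It assumes a linear warmup from zero and a cosine decay down to a floor of zero; Smedjan's actual curve (for example its minimum LR) may differ. The names `lr_at`, `min_lr`, and `tokens_per_step` are illustrative and not part of the CLI.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup=2000, total_steps=50000, min_lr=0.0):
    """Learning rate at an optimizer step (assumed: linear warmup, then cosine decay)."""
    if step < warmup:
        # Linear ramp from 0 up to the peak over the warmup steps.
        return peak_lr * step / warmup
    # Fraction of the post-warmup schedule completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    # Cosine decay from peak_lr down to min_lr (min_lr is an assumed parameter).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def tokens_per_step(batch_size=4, seq_len=512):
    """Tokens consumed by one optimizer step, before any gradient accumulation."""
    return batch_size * seq_len
```

With the example values (batch size 4, sequence length 512), each step sees 2,048 tokens, so 50,000 steps cover about 102 million tokens.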
Custom architectures
    # a fully custom architecture
    smedjan train … --size custom \
      --dim 1024 --layers 16 --heads 16 --ffn-mult 4 \
      --kv-heads 4    # Grouped-Query Attention
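The following Python sketch shows what those numbers imply for attention shapes under common Grouped-Query Attention conventions. It assumes the head dimension is dim / heads and that the K and V projections produce kv-heads × head-dim outputs; Smedjan's internal layout is not documented here, and `gqa_shapes` is a hypothetical helper.

```python
def gqa_shapes(dim, heads, kv_heads):
    """Derive attention shapes for Grouped-Query Attention (assumed conventions)."""
    if dim % heads != 0:
        raise ValueError("dim must be divisible by heads")
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    head_dim = dim // heads
    kv_dim = kv_heads * head_dim  # output width of the K and V projections
    return {
        "head_dim": head_dim,
        "queries_per_kv_head": heads // kv_heads,
        "kv_dim": kv_dim,
        # K and V values stored per token, per layer, in a KV cache.
        "kv_cache_values_per_token_per_layer": 2 * kv_dim,
        # Q and output projections are dim x dim; K and V are dim x kv_dim (biases ignored).
        "attn_params_per_layer": 2 * dim * dim + 2 * dim * kv_dim,
    }


shapes = gqa_shapes(dim=1024, heads=16, kv_heads=4)
```

Under these assumptions, 16 query heads share 4 KV heads (4 queries per KV head), so the K and V projections and any KV cache are a quarter of the size they would be with 16 KV heads.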
Running out of memory?
These flags compose. Reach for them in roughly this order: gradient accumulation with a smaller batch first, then gradient checkpointing, then fused cross-entropy and FP16 activations.
    # fit a larger effective batch / longer context on limited memory
    #   --grad-accum 8             effective batch = batch-size × 8
    #   --gradient-checkpointing   ~60% less activation memory, ~2× compute
    #   --fused-ce                 chunked logits, saves ~2 GB peak
    #   --fp16-activations         halves inter-layer activation memory
    smedjan train … \
      --grad-accum 8 \
      --gradient-checkpointing \
      --fused-ce \
      --fp16-activations
| Flag | Default | What it does |
|---|---|---|
| --grad-accum | 1 | Accumulate N micro-batches per step. Effective batch = batch-size × N. |
| --gradient-checkpointing | off | Recompute activations in backward; ~60% less activation memory, ~2× compute. |
| --fused-ce | off | Compute logits in chunks; saves ~2 GB peak. |
| --fp16-activations | off | Store inter-layer activations in FP16; roughly halves that memory. |
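As an illustration of why gradient accumulation trades memory for time without changing the update, the Python sketch below uses a toy one-parameter model. It assumes micro-batch gradients are averaged, which is the common convention but is not confirmed for Smedjan; `grad_mse` and `accumulated_grad` are illustrative names.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average the gradients of equal-sized micro-batches, one micro-batch at a time."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch must split evenly into micro-batches")
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Only one micro-batch is "live" at a time; its gradient is added to the running mean.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
full = grad_mse(0.5, xs, ys)                   # one batch of 8
accumulated = accumulated_grad(0.5, xs, ys, 2)  # 4 micro-batches of 2
```

Both values agree up to floating-point rounding, which is why --batch-size 4 with --grad-accum 8 can stand in for an effective batch of 32 at a fraction of the activation memory.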
Resume & validate
    # validate + early-stop on a held-out set
    smedjan train … --val-dataset val.bin
    # resume exactly (model + optimizer + step) after a stop or crash
    smedjan train … --resume checkpoints/state_25000.bin
With --val-dataset the trainer evaluates on the held-out set at each checkpoint interval and can stop early after several intervals without improvement.
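The early-stopping rule can be pictured as a patience counter. The Python sketch below is a generic version under stated assumptions: the `patience` value of 3 and the strict "lower is better" comparison are guesses, since the text only says "several intervals without improvement", and `early_stop_index` is a hypothetical name.

```python
def early_stop_index(val_losses, patience=3):
    """Index of the evaluation at which training would stop, or None if it never does."""
    best = float("inf")
    stale = 0  # consecutive evaluations without a new best loss
    for i, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


stop_at = early_stop_index([3.0, 2.5, 2.4, 2.45, 2.41, 2.43], patience=3)
```

Here the best loss (2.4) is reached at the third evaluation, and the run would stop three evaluations later.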
Optimizers
Set with --optimizer. AdamW is the safe default. Muon (and the hybrid that routes Muon to 2-D matrices and AdamW to embeddings/head/norms) often converges faster on the hidden matrices; --normuon and --cautious are near-free add-ons on top.
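The hybrid routing rule described above can be sketched in a few lines of Python. The parameter names and the substring matching are purely hypothetical; only the rule itself (2-D hidden matrices to Muon, embeddings, head, and norms to AdamW) comes from the description.

```python
def route_optimizer(name, shape):
    """Pick an optimizer for one parameter (hypothetical naming convention)."""
    is_matrix = len(shape) == 2
    # Embeddings and the output head are 2-D but are still routed to AdamW.
    is_special = any(key in name for key in ("embed", "head", "norm"))
    return "muon" if is_matrix and not is_special else "adamw"


routes = {
    name: route_optimizer(name, shape)
    for name, shape in [
        ("layers.0.attn.wq", (1024, 1024)),
        ("layers.0.norm.weight", (1024,)),
        ("tok_embed", (32000, 1024)),
        ("lm_head", (1024, 32000)),
    ]
}
```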
| Flag | Default | What it does |
|---|---|---|
| --optimizer | adamw | adamw · adamw-cpu · sophia · muon · hybrid/muon-adamw · adamw-8bit. |
| --normuon | off | Per-neuron second-moment normalization of the Muon update (~+11%). |
| --cautious | off | Mask Muon update components that disagree in sign with the gradient. |
| --adamw-beta2 | 0.95 | Second-moment decay; pairs with the hardened eps=1e-5. |
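The --cautious masking rule is simple enough to show directly. This Python sketch works on flat lists and keeps only the update components whose sign agrees with the gradient. It is an illustration of the idea, not Smedjan's implementation: any rescaling of the surviving components is omitted, and treating a zero product as a disagreement is an assumption.

```python
def cautious_mask(update, grad):
    """Zero every update component whose sign disagrees with the gradient."""
    # u * g > 0 holds only when both are nonzero and share a sign.
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]


masked = cautious_mask([0.5, -0.2, 0.3, -0.1], [1.0, 0.4, -2.0, -0.3])
```

The first and last components agree in sign with the gradient and pass through; the middle two are zeroed.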
The research menu
Smedjan carries a wide menu of architectural and training research. Swap the sequence mixer, route through experts, extend context, or train ternary weights — all from the same binary.
    # O(N) sequence mixers instead of softmax attention
    smedjan train … --ssm                     # Mamba-2 / SSD
    smedjan train … --linear-attn             # linear (kernel) attention
    smedjan train … --linear-attn-period 4    # hybrid: every 4th layer linear, rest softmax
    # Mixture-of-Experts feed-forward
    smedjan train … --n-experts 8 --top-k-experts 2 --z-loss 1e-4
    # context extension & memory
    smedjan train … --yarn-scale 2.0          # extend context ~2× (NTK-by-parts)
    smedjan train … --sliding-window 1024     # attend only the last 1024 tokens
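Three of these options have simple mechanics that a short Python sketch can illustrate: the hybrid layer layout, the sliding-window attention span, and top-k expert routing. All three functions are generic illustrations with hypothetical names. Which layer index counts as "every 4th", whether the window includes the current token, and how router weights are normalized are assumptions, not documented Smedjan behavior.

```python
import math


def layer_mixers(n_layers, linear_period):
    """Mixer type per layer, assuming layers 4, 8, ... (1-based) are the linear ones."""
    return ["linear" if (i + 1) % linear_period == 0 else "softmax"
            for i in range(n_layers)]


def sliding_window_keys(query_pos, window):
    """Causal key positions visible to a query: the last `window` tokens, itself included."""
    start = max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


layout = layer_mixers(n_layers=8, linear_period=4)
visible = sliding_window_keys(query_pos=5, window=3)
routed = top_k_route([0.1, 2.0, -1.0, 1.0, 0.5, 0.0, 0.3, -0.5], k=2)
```

With 8 experts and k = 2, each token activates only two expert feed-forward blocks, which is where the compute saving of a Mixture-of-Experts layer comes from.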
Also available: --mla-latent-dim (latent-attention KV-cache shrink), --block-sparse-top-k, --mup-base (µP transfer), --n-predict (multi-token prediction), --curriculum, --dropout, --stochastic-depth, --bitnet, --lowrank / --relora-interval, --ema-decay, --noise-scale, --shared-layers, --freeze-fraction. See the CLI reference.
Current state: every mixer trains: --ssm, --linear-attn, --rwkv (numerically stable WKV; converges at long sequence lengths), and --block-sparse-top-k (trains like dense). Block-sparse currently runs with a step-level pool bypass, so its sub-quadratic speedup is a pending performance follow-up, not a correctness gap.
Growing & merging
smedjan grow expands a small trained checkpoint into a larger architecture so that you can continue training it (progressive growth). smedjan merge averages several checkpoints (weight-space averaging, for a small benchmark gain). Both are covered under Quantize & export.
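Weight-space averaging itself is easy to picture. The Python sketch below uniformly averages checkpoints represented as plain dictionaries of float lists. This representation and the name `merge_checkpoints` are illustrative assumptions; the real checkpoint format and any weighting used by smedjan merge are not described here.

```python
def merge_checkpoints(checkpoints):
    """Uniformly average checkpoints given as {param_name: [floats]} dicts."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        # Averaging only makes sense when every checkpoint has the same parameters.
        if ckpt.keys() != names:
            raise ValueError("checkpoints have different parameter names")
        for name in names:
            if len(ckpt[name]) != len(checkpoints[0][name]):
                raise ValueError("parameter sizes differ for " + name)
    n = len(checkpoints)
    merged = {}
    for name in names:
        # zip(*...) walks the same element position across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(col) / n for col in columns]
    return merged


merged = merge_checkpoints([
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
])
```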