// guide

Training

Pre-train a model from scratch. Start with the five knobs that matter, then reach for memory levers, optimizers, and the research menu only when you need them. Every flag shown here exists in the CLI; run `smedjan train --help` for the full list.

// 01

A real training command

the five knobs that matter
smedjan train \
  --dataset train.bin --tokenizer tokenizer.bin \
  --size medium \
  --batch-size 4 --seq-len 512 \
  --steps 50000 --lr 3e-4 --warmup 2000 \
  --lr-schedule cosine \
  --checkpoint-dir checkpoints/

Flag              Default      What it does
--size            small        Preset: tiny · small · medium · large · max, or custom.
--batch-size      32           Sequences per step. Lower it first if you run out of memory.
--seq-len         256          Tokens per sequence (context length during training).
--steps           50000        Total optimizer steps.
--lr              3e-4         Peak learning rate.
--warmup          2000         Steps to ramp the LR up before the schedule decays it.
--lr-schedule     cosine       cosine · wsd · wso · invsqrt · trapezoid.
--checkpoint-dir  checkpoints  Where checkpoints are written.
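
To make the interaction of --lr, --warmup, --steps, and --lr-schedule cosine concrete, here is a minimal Python sketch of one common warmup-plus-cosine shape. It is an illustration, not the trainer's code: the linear warmup, the decay floor `min_lr` (assumed 0), and the function name `lr_at` are all assumptions.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=2000, total_steps=50000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward min_lr (assumed 0)."""
    if step < warmup:
        # Ramp linearly so the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup) / max(1, total_steps - warmup), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 1999, 2000, 26000, 50000):
    print(s, f"{lr_at(s):.2e}")
```

With the values from the command above, the rate peaks at the end of warmup, falls to half the peak midway through the decay phase, and approaches the floor at the final step.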
// 02

Custom architectures

when a preset isn't the shape you want
# a fully custom architecture
smedjan train … --size custom \
  --dim 1024 --layers 16 --heads 16 --ffn-mult 4 \
  --kv-heads 4                 # Grouped-Query Attention
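
A quick way to sanity-check a custom shape before training is to work out what the numbers imply. The Python sketch below assumes the usual Grouped-Query Attention conventions (head size = dim / heads, query heads split evenly across KV heads, biases ignored); the helper name `attention_shape` is hypothetical and nothing here reads Smedjan's internals.

```python
def attention_shape(dim, heads, kv_heads):
    """Derive per-layer attention sizes under standard GQA conventions."""
    if dim % heads != 0:
        raise ValueError("dim must be divisible by heads")
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    head_dim = dim // heads
    return {
        "head_dim": head_dim,
        # How many query heads share each key/value head.
        "queries_per_kv_head": heads // kv_heads,
        # Weight counts for the Q projection and the combined K+V projections.
        "q_proj_params": dim * heads * head_dim,
        "kv_proj_params": 2 * dim * kv_heads * head_dim,
        # Values cached per token per layer at inference time (K and V).
        "kv_cache_values_per_token": 2 * kv_heads * head_dim,
    }

shape = attention_shape(dim=1024, heads=16, kv_heads=4)
for name, value in shape.items():
    print(f"{name}: {value}")
```

Under these assumptions, --kv-heads 4 with --heads 16 caches a quarter of the key/value entries that full multi-head attention would.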
// 03

Running out of memory?

trade compute and precision for room

These flags compose. Reach for them in roughly this order: gradient accumulation and a smaller batch first, then gradient checkpointing, then the fused cross-entropy and FP16 activations.

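# fit a larger effective batch / longer context on limited memory
#   --grad-accum 8            effective batch = batch-size × 8
#   --gradient-checkpointing  ~60% less activation memory, ~2× compute
#   --fused-ce                chunked logits, saves ~2 GB peak
#   --fp16-activations        halves inter-layer activation memory
# (comments sit above the command: a comment after a trailing "\" breaks the line continuation)
smedjan train … \
  --grad-accum 8 \
  --gradient-checkpointing \
  --fused-ce \
  --fp16-activations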

Flag                      Default  What it does
--grad-accum              1        Accumulate N micro-batches per step. Effective batch = batch-size × N.
--gradient-checkpointing  off      Recompute activations in backward; ~60% less activation memory, ~2× compute.
--fused-ce                off      Compute logits in chunks; saves ~2 GB peak.
--fp16-activations        off      Store inter-layer activations in FP16; roughly halves that memory.
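
Why "effective batch = batch-size × N" holds can be shown on a toy problem. The Python sketch below assumes the loss is a mean over samples and that micro-batches are equal in size; under those assumptions, averaging the gradients of 8 micro-batches of 4 matches the gradient of one batch of 32. The one-parameter model and the helper names are illustrative only.

```python
def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size, grad_accum):
    """Average the gradients of grad_accum micro-batches, as one optimizer step would."""
    total = 0.0
    for i in range(grad_accum):
        micro = data[i * micro_batch_size:(i + 1) * micro_batch_size]
        total += mse_grad(w, micro) / grad_accum
    return total

data = [(float(x), 3.0 * x + 1.0) for x in range(1, 33)]  # 32 toy samples
w = 0.5
full = mse_grad(w, data)                                   # one batch of 32
accum = accumulated_grad(w, data, micro_batch_size=4, grad_accum=8)
print(full, accum)
```

Only one micro-batch of activations is alive at a time, which is where the memory saving comes from; the cost is N forward/backward passes per optimizer step.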
// 04

Resume & validate

never lose a run
# validate + early-stop on a held-out set
smedjan train … --val-dataset val.bin

# resume exactly (model + optimizer + step) after a stop or crash
smedjan train … --resume checkpoints/state_25000.bin

With --val-dataset the trainer evaluates on the held-out set at each checkpoint interval and can stop early after several intervals without improvement.
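
The early-stopping rule described above can be sketched as a patience check over the validation losses recorded at each checkpoint interval. This is a minimal Python illustration: the patience of 3 intervals, the `min_delta` threshold, and the function name `should_stop` are assumptions, not Smedjan's actual settings.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True once the last `patience` evaluations all failed to beat the earlier best."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before - min_delta for loss in val_losses[-patience:])

history = [3.10, 2.85, 2.70, 2.71, 2.73, 2.72]
print(should_stop(history[:5]))  # the 2.70 evaluation is still inside the window
print(should_stop(history))      # three intervals in a row without improvement
```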

// 05

Optimizers

AdamW by default, Muon when it pays

Set with --optimizer. AdamW is the safe default. Muon (and the hybrid that routes Muon to 2-D matrices and AdamW to embeddings/head/norms) often converges faster on the hidden matrices; --normuon and --cautious are near-free add-ons on top.

Flag           Default  What it does
--optimizer    adamw    adamw · adamw-cpu · sophia · muon · hybrid/muon-adamw · adamw-8bit.
--normuon      off      Per-neuron second-moment normalization of the Muon update (~+11%).
--cautious     off      Mask Muon update components that disagree in sign with the gradient.
--adamw-beta2  0.95     Second-moment decay; pairs with the hardened eps=1e-5.
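
The masking that --cautious describes can be illustrated in a few lines of Python. This sketch assumes `update` is the step that would be subtracted from the weights (so agreeing components share the gradient's sign) and only zeroes the disagreeing components; some published cautious variants also rescale the surviving components, which is omitted here. The function name is hypothetical.

```python
def cautious_mask(update, grad):
    """Zero update components whose sign disagrees with the gradient's sign."""
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]

update = [0.2, -0.1, 0.05, -0.3]   # proposed optimizer step, per component
grad = [1.0, 0.4, -2.0, -0.5]      # current gradient, per component
print(cautious_mask(update, grad))
```

Components 0 and 3 agree in sign with the gradient and pass through; components 1 and 2 disagree and are dropped for this step.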
// 06

The research menu

explore — some paths are experimental

Smedjan carries a wide menu of architectural and training research. Swap the sequence mixer, route through experts, extend context, or train ternary weights — all from the same binary.

# O(N) sequence mixers instead of softmax attention
smedjan train … --ssm                    # Mamba-2 / SSD
smedjan train … --linear-attn            # linear (kernel) attention
smedjan train … --linear-attn-period 4   # hybrid: every 4th layer linear, rest softmax

# Mixture-of-Experts feed-forward
smedjan train … --n-experts 8 --top-k-experts 2 --z-loss 1e-4

# context extension & memory
smedjan train … --yarn-scale 2.0         # extend context ~2× (NTK-by-parts)
smedjan train … --sliding-window 1024    # attend only the last 1024 tokens
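
What "attend only the last N tokens" means in mask form, as a small Python sketch. It assumes causal attention and that the window counts the current token; whether Smedjan counts it the same way is not stated here, and the function name is illustrative.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so attention cost per token stops growing once the sequence is longer than the window.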

Also available: --mla-latent-dim (latent-attention KV-cache shrink), --block-sparse-top-k, --mup-base (µP transfer), --n-predict (multi-token prediction), --curriculum, --dropout, --stochastic-depth, --bitnet, --lowrank / --relora-interval, --ema-decay, --noise-scale, --shared-layers, --freeze-fraction. See the CLI reference.

Current state: every mixer trains, including --ssm, --linear-attn, --rwkv (numerically stable WKV, converges at long sequence lengths), and --block-sparse-top-k (trains like dense). Block-sparse currently runs with a step-level pool bypass, so its sub-quadratic speedup is a performance follow-up, not a correctness gap.

// 07

Growing & merging

reuse what you trained

smedjan grow expands a small trained checkpoint into a larger architecture so you can continue training it (progressive growth); smedjan merge averages several checkpoints (weight-space averaging, a small benchmark gain). Both are covered under Quantize & export.
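
Weight-space averaging itself is simple to picture. The Python sketch below assumes a uniform average and represents each checkpoint as a dict from parameter name to a flat list of floats; both are illustrative assumptions and say nothing about how smedjan merge weights checkpoints or stores them on disk.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters that share a name across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must have identical parameter names")
    n = len(checkpoints)
    return {
        # zip(*...) lines up the i-th value of this parameter from every checkpoint.
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in names
    }

a = {"w": [1.0, 2.0], "b": [0.0]}
b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_checkpoints([a, b])
print(merged)
```

Averaging only makes sense for checkpoints of the same architecture, which is why the sketch rejects mismatched parameter names.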