// guide

Training

Pre-train a model from scratch. Start with five knobs that matter, then reach for memory levers, optimizers, and the research menu only when you need them. Every flag here is real; the full list is `smedjan train --help`.

// 01

A real training command

the five knobs that matter

smedjan train \
  --dataset train.bin --tokenizer tokenizer.bin \
  --size medium \
  --batch-size 4 --seq-len 512 \
  --steps 50000 --lr 3e-4 --warmup 2000 \
  --lr-schedule cosine \
  --checkpoint-dir checkpoints/

Flag	Default	What it does
`--size`	small	Preset tiny · small · medium · large · max, or custom.
`--batch-size`	32	Sequences per step. Lower it first if you run out of memory.
`--seq-len`	256	Tokens per sequence (context length during training).
`--steps`	50000	Total optimizer steps.
`--lr`	3e-4	Peak learning rate.
`--warmup`	2000	Steps to ramp the LR up before the schedule decays it.
`--lr-schedule`	cosine	cosine · wsd · wso · invsqrt · trapezoid.
`--checkpoint-dir`	checkpoints	Where checkpoints are written.
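
How `--lr`, `--warmup`, `--steps` and the cosine schedule interact can be pictured with a small sketch. This is not Smedjan's code: it assumes a linear warmup and a half-cosine decay to a floor of zero, and the function name `lr_at` and the `min_lr` parameter are made up for illustration.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=2000, total_steps=50000, min_lr=0.0):
    """Learning rate at a 0-based optimizer step (illustrative only)."""
    if step < warmup:
        # Assumed linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    # Half-cosine from peak_lr down to min_lr (floor of 0.0 is an assumption).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Print the rate at a few points of the run configured above.
for s in (0, 1999, 2000, 26000, 50000):
    print(s, f"{lr_at(s):.2e}")
```

Under these assumptions the rate peaks at the end of warmup, sits at half the peak midway through the decay phase, and reaches the floor at the final step.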

// 02

Custom architectures

when a preset isn't the shape you want

# a fully custom architecture
smedjan train … --size custom \
  --dim 1024 --layers 16 --heads 16 --ffn-mult 4 \
  --kv-heads 4                 # Grouped-Query Attention
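
The arithmetic behind `--kv-heads` can be sketched as follows. It assumes the per-head width is `dim / heads` and that `--kv-heads` must divide `--heads` evenly; neither is confirmed here, and `gqa_shape` is a hypothetical helper, not part of Smedjan.

```python
def gqa_shape(dim, heads, kv_heads):
    """Derive attention shapes for grouped-query attention (illustrative only)."""
    if dim % heads != 0:
        raise ValueError("dim must be divisible by heads")
    if heads % kv_heads != 0:
        raise ValueError("heads must be divisible by kv_heads")
    head_dim = dim // heads          # assumed: head width = dim / heads
    group_size = heads // kv_heads   # query heads that share one K/V head
    # K and V values stored per token, per layer.
    kv_values = 2 * kv_heads * head_dim
    mha_values = 2 * heads * head_dim
    return {
        "head_dim": head_dim,
        "group_size": group_size,
        "kv_values_per_token": kv_values,
        "kv_reduction": mha_values // kv_values,
    }

shape = gqa_shape(dim=1024, heads=16, kv_heads=4)
print(shape)
```

With the values from the command above, four query heads share each K/V head, so the keys and values kept per token are a quarter of what 16 full K/V heads would need.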

// 03

Running out of memory?

trade compute and precision for room

These compose. Reach for them in roughly this order — gradient accumulation and a smaller batch first, then gradient checkpointing, then the fused cross-entropy and FP16 activations.

# fit a larger effective batch / longer context on limited memory
#   --grad-accum 8              effective batch = batch-size × 8
#   --gradient-checkpointing    ~60% less activation memory, ~2× compute
#   --fused-ce                  chunked logits, saves ~2 GB peak
#   --fp16-activations          halves inter-layer activation memory
smedjan train … \
  --grad-accum 8 \
  --gradient-checkpointing \
  --fused-ce \
  --fp16-activations
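
As a worked example of the effective-batch rule, the sketch below uses the batch size, sequence length and step count from the command in section 01 purely as sample values. `batch_budget` is a hypothetical helper.

```python
def batch_budget(batch_size, seq_len, grad_accum=1, steps=1):
    """Return (effective batch, tokens per optimizer step, tokens over the run)."""
    effective_batch = batch_size * grad_accum      # sequences per optimizer step
    tokens_per_step = effective_batch * seq_len    # tokens per optimizer step
    return effective_batch, tokens_per_step, tokens_per_step * steps

plain = batch_budget(4, 512, steps=50000)
accumulated = batch_budget(4, 512, grad_accum=8, steps=50000)
print(plain)
print(accumulated)
```

Accumulating 8 micro-batches multiplies the tokens seen per optimizer step by 8 while the memory-resident micro-batch stays at 4 sequences.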

Flag	Default	What it does
`--grad-accum`	1	Accumulate N micro-batches per step. Effective batch = batch-size × N.
`--gradient-checkpointing`	off	Recompute activations in backward; ~60% less activation memory, ~2× compute.
`--fused-ce`	off	Compute logits in chunks; saves ~2 GB peak.
`--fp16-activations`	off	Store inter-layer activations in FP16; roughly halves that memory.
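
The idea behind computing logits in chunks can be sketched in plain Python. This is a forward-pass-only illustration under simple assumptions (tiny list-based tensors, hypothetical function names); it is not Smedjan's kernel, which would also have to handle the backward pass.

```python
import math

def logits_for(h, W):
    """Logits for one position: dot product of h with each vocabulary row of W."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in W]

def token_loss(logits, target):
    """Cross-entropy for one position, using a numerically stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def chunked_ce(hidden, W, targets, chunk=2):
    """Mean cross-entropy that materializes logits for `chunk` positions at a time."""
    total = 0.0
    for start in range(0, len(hidden), chunk):
        # Only this slice's logits exist; they are released before the next slice.
        chunk_logits = [logits_for(h, W) for h in hidden[start:start + chunk]]
        for logits, t in zip(chunk_logits, targets[start:start + chunk]):
            total += token_loss(logits, t)
    return total / len(hidden)

# Toy data: 5 positions, hidden width 3, vocabulary of 4.
hidden = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.5],
          [1.0, 0.0, 0.0], [-0.3, 0.2, 0.1]]
W = [[0.2, -0.1, 0.0], [0.1, 0.3, -0.2], [0.0, 0.0, 0.5], [-0.4, 0.2, 0.1]]
targets = [0, 3, 1, 2, 0]

full = chunked_ce(hidden, W, targets, chunk=len(hidden))
small = chunked_ce(hidden, W, targets, chunk=2)
print(full, small)
```

Both calls give the same loss; the difference is that the peak number of live logits scales with the chunk size instead of with the full batch × sequence length.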

// 04

Resume & validate

never lose a run

# validate + early-stop on a held-out set
smedjan train … --val-dataset val.bin

# resume exactly (model + optimizer + step) after a stop or crash
smedjan train … --resume checkpoints/state_25000.bin

With `--val-dataset`, the trainer evaluates on the held-out set at each checkpoint interval and can stop early after several intervals without improvement.
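
The early-stop rule described above can be sketched as a patience check. The patience of 3 and the `min_delta` threshold are assumptions (this guide only says "several intervals"), and `should_stop` is a hypothetical name.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """True once the last `patience` evaluations all failed to beat the earlier best."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    # "No improvement" means the loss did not drop below the earlier best by min_delta.
    return all(loss >= best_before - min_delta for loss in val_losses[-patience:])

stalled = should_stop([3.0, 2.5, 2.4, 2.41, 2.45, 2.42])
improving = should_stop([3.0, 2.5, 2.4, 2.41, 2.45, 2.3])
print(stalled, improving)
```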

// 05

Optimizers

AdamW by default, Muon when it pays

Choose the optimizer with `--optimizer`. AdamW is the safe default. Muon (and the hybrid that routes Muon to 2-D matrices and AdamW to embeddings/head/norms) often converges faster on the hidden matrices; `--normuon` and `--cautious` are near-free add-ons on top.

Flag	Default	What it does
`--optimizer`	adamw	adamw · adamw-cpu · sophia · muon · hybrid/muon-adamw · adamw-8bit.
`--normuon`	off	Per-neuron second-moment normalization of the Muon update (~+11%).
`--cautious`	off	Mask Muon update components that disagree in sign with the gradient.
`--adamw-beta2`	0.95	Second-moment decay; pairs with the hardened eps=1e-5.
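
Two of the ideas above, the hybrid routing rule and the `--cautious` mask, can be sketched in a few lines. Both functions are illustrations only: the parameter names and the substring-based detection of embeddings, head and norms are hypothetical, and the mask assumes the update is the direction subtracted from the weights, so "agreeing" means the update and gradient share a sign. Any rescaling after masking is not shown because this guide does not describe one.

```python
def route_optimizer(name, shape):
    """Pick an optimizer for one parameter under the assumed hybrid rule."""
    # Hypothetical naming convention; a real trainer may tag tensors differently.
    adamw_keys = ("embed", "head", "norm")
    if len(shape) == 2 and not any(key in name for key in adamw_keys):
        return "muon"    # hidden 2-D matrices
    return "adamw"       # embeddings, output head, norms, other non-2-D tensors

def cautious_mask(update, grad):
    """Zero every update component whose sign disagrees with the gradient."""
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]

print(route_optimizer("layers.0.attn.wq", (1024, 1024)))
print(route_optimizer("tok_embed", (32000, 1024)))
print(cautious_mask([0.5, -0.2, 0.1], [1.0, 0.3, -0.4]))
```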

// 06

The research menu

explore — some paths are experimental

Smedjan carries a wide menu of architectural and training research. Swap the sequence mixer, route through experts, extend context, or train ternary weights — all from the same binary.

# O(N) sequence mixers instead of softmax attention
smedjan train … --ssm                    # Mamba-2 / SSD
smedjan train … --linear-attn            # linear (kernel) attention
smedjan train … --linear-attn-period 4   # hybrid: every 4th layer linear, rest softmax

# Mixture-of-Experts feed-forward
smedjan train … --n-experts 8 --top-k-experts 2 --z-loss 1e-4

# context extension & memory
smedjan train … --yarn-scale 2.0         # extend context ~2× (NTK-by-parts)
smedjan train … --sliding-window 1024    # attend only the last 1024 tokens
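
For the Mixture-of-Experts flags, a minimal sketch of top-k routing for a single token, plus a router z-loss, is shown below. It assumes softmax gating with the selected weights renormalized to sum to 1, and a z-loss defined as the coefficient times the mean squared log-sum-exp of the router logits. Whether Smedjan renormalizes, or applies `--z-loss` to the router at all, is not stated in this guide.

```python
import math

def route_token(router_logits, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k experts of one token."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]   # stable softmax numerators
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)                 # renormalize over chosen experts
    return [(i, probs[i] / norm) for i in top]

def router_z_loss(batch_logits, coeff=1e-4):
    """coeff * mean over tokens of (log-sum-exp of router logits) squared."""
    def lse(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))
    return coeff * sum(lse(xs) ** 2 for xs in batch_logits) / len(batch_logits)

# One token's logits over 8 experts (sample values).
sample_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
choice = route_token(sample_logits, top_k=2)
print(choice)
print(router_z_loss([sample_logits]))
```

Only the selected experts run their feed-forward block for that token, and their outputs are combined using the returned weights.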

Also available: `--mla-latent-dim` (latent-attention KV-cache shrink), `--block-sparse-top-k`, `--mup-base` (µP transfer), `--n-predict` (multi-token prediction), `--curriculum`, `--dropout`, `--stochastic-depth`, `--bitnet`, `--lowrank` / `--relora-interval`, `--ema-decay`, `--noise-scale`, `--shared-layers`, `--freeze-fraction`. See the CLI reference.

Current status: every mixer trains. That covers `--ssm`, `--linear-attn`, `--rwkv` (numerically stable WKV, converges at long sequence lengths), and `--block-sparse-top-k` (trains like dense). Block-sparse currently runs with a step-level pool bypass, so its sub-quadratic speedup is a pending performance follow-up, not a correctness gap.

// 07

Growing & merging

reuse what you trained

`smedjan grow` expands a small trained checkpoint into a larger architecture so you can continue training it (progressive growth). `smedjan merge` averages several checkpoints (weight-space averaging, a small benchmark gain). Both are covered under Quantize & export.
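
Weight-space averaging itself is simple to picture. The sketch below assumes a uniform average and represents each checkpoint as a dict of flat float lists; the real checkpoint format and any weighting used by `smedjan merge` are not described here, and `average_checkpoints` is a hypothetical name.

```python
def average_checkpoints(checkpoints):
    """Uniformly average checkpoints given as {tensor_name: flat list of floats}."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same tensor names")
    n = len(checkpoints)
    merged = {}
    for name in names:
        tensors = [ckpt[name] for ckpt in checkpoints]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"size mismatch for tensor {name!r}")
        # Element-wise mean across all checkpoints.
        merged[name] = [sum(values) / n for values in zip(*tensors)]
    return merged

merged = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
print(merged)
```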

Fine-tuning & alignment → Adapt the base model you just trained.
Performance & tuning → Throughput numbers and configs by hardware.
CLI reference → The complete train flag list.