// reference

Performance & tuning

What Smedjan actually does on real hardware, what made it fast, and the levers that trade memory for speed. Numbers are measured, not extrapolated.

// 01

Measured throughput

Apple M1 Mac mini · 16 GB · batch 4 · seq 128

Preset	Inference (fwd)	Decode (1 tok)	Train (fwd+bwd)
small (7.2M · d256/6L)	22,900 tok/s	173 tok/s	4,400 tok/s
medium (45M · d512/12L)	5,090 tok/s	65 tok/s	1,150 tok/s

Real throughput from smedjan bench, not theoretical peaks — reproduce with smedjan bench --size <preset>.
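
To turn the table into a rough wall-clock estimate, divide the tokens in a step by the measured training throughput. The sketch below uses the medium-preset training figure from the table. The helper names and the 10M-token budget are hypothetical, and it assumes throughput stays roughly constant at this batch size and sequence length.

```python
# Rough run-time estimate from a measured training throughput.
# The 1,150 tok/s figure is the medium preset at batch 4, seq 128 (table above).
# The 10M-token budget is a hypothetical example, not a recommendation.


def seconds_per_step(batch_size: int, seq_len: int, tok_per_s: float) -> float:
    """Time for one forward+backward step over batch_size * seq_len tokens."""
    return batch_size * seq_len / tok_per_s


def hours_for_tokens(total_tokens: int, tok_per_s: float) -> float:
    """Wall-clock hours to process total_tokens at a constant throughput."""
    return total_tokens / tok_per_s / 3600


step = seconds_per_step(batch_size=4, seq_len=128, tok_per_s=1150)
hours = hours_for_tokens(10_000_000, tok_per_s=1150)
print(f"{step:.3f} s/step, {hours:.2f} h for 10M tokens")
```

Throughput usually shifts with batch size and sequence length, so re-run the bench at the shape you intend to train with before trusting the estimate.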

// 02

the Metal optimization pass

Hardware simdgroup-MMA matmul: on by default, bit-identical. On the medium model it runs ~1.4–1.5× the scalar fallback (training 785 → 1,150 tok/s, inference 3,600 → 5,090 tok/s).
Batched matmul shaders that collapse the many small GPU dispatches per attention layer into a few.
FP16 mixed precision with float accumulators (clamped FP16 casts to avoid overflow).
A merged forward + backward GPU command batch.
Single-instruction RoPE sincos.
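
The FP16 item can be illustrated on the CPU. This is a minimal sketch of the idea, not Smedjan's shader code: values are clamped to the finite half-precision range before the cast, and products are summed in a wider accumulator. The function names are hypothetical, and a Python float (64-bit) stands in for the GPU's float accumulator.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Clamp x into the finite FP16 range, then round it to half precision."""
    clamped = max(-FP16_MAX, min(FP16_MAX, x))
    # struct's "e" format is IEEE 754 binary16; the round trip applies FP16 rounding.
    return struct.unpack("<e", struct.pack("<e", clamped))[0]


def dot_mixed(a: list[float], b: list[float]) -> float:
    """Dot product with FP16 inputs and a wider accumulator."""
    acc = 0.0  # stands in for the float accumulator (assumption: 64-bit here)
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc


# Without the clamp, packing 1e6 as binary16 would overflow.
print(to_fp16(1e6))
# Each product (90000) already exceeds the FP16 range; the accumulator holds the sum.
print(dot_mixed([300.0] * 4, [300.0] * 4))
```

The second call shows why the accumulator stays in float: the inputs fit in FP16, but their products and running sum do not.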

// 03

the levers

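When you hit a memory ceiling, pull these in order. Shrinking the batch or sequence frees memory directly, and --grad-accum restores the effective batch size at the cost of more steps per update; the remaining flags cut activation or logit memory, with --gradient-checkpointing paying for it in extra compute.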

Flag	Default	What it does
`--batch-size ↓ / --seq-len ↓`	—	The bluntest, cheapest reduction.
`--grad-accum N`	1	Keep a large effective batch with a small real one.
`--gradient-checkpointing`	off	~60% less activation memory for ~2× compute.
`--fused-ce`	off	Chunked logits; ~2 GB peak saved.
`--fp16-activations`	off	Halve inter-layer activation memory.
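
Why --grad-accum N preserves the effective batch: for a mean loss, averaging the gradients of N equal-sized micro-batches gives the same gradient as one batch N times larger. The sketch below checks this on a toy scalar model. It illustrates the principle only; how Smedjan implements accumulation is not described here, and all names are hypothetical.

```python
import math


def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    """d/dw of mean((w*x - y)^2) over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: list[float], ys: list[float], micro_batch: int) -> float:
    """Average the gradients of equal-sized micro-batches."""
    assert len(xs) % micro_batch == 0, "sketch assumes equal-sized micro-batches"
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Scale each micro-batch gradient by 1/steps so the sum is a mean.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [0.5, -1.0, 2.0, 3.0, -0.25, 1.5, 4.0, -2.0]
ys = [1.0, 0.0, 3.5, 7.0, -1.0, 2.0, 9.0, -3.0]
full = grad_mse(0.3, xs, ys)                          # one batch of 8
accum = accumulated_grad(0.3, xs, ys, micro_batch=2)  # 4 micro-batches of 2
print(math.isclose(full, accum, rel_tol=1e-9))
```

Only one micro-batch of activations is live at a time, which is where the memory saving comes from; the price is N forward and backward passes per optimizer update.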

// 04

the bench command

smedjan bench --size small --batch-size 4 --seq-len 128 --iters 20
# add --simdgroup-matmul to measure the hardware MMA fast path

It reports inference and training throughput with detailed metrics, so you can size a run before committing to it.
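
To compare presets at the same shape, the same command can be generated per preset. This is a small sketch that only builds and prints the command lines, using the flags and preset names shown above; the helper name is hypothetical, and nothing here assumes a particular output format from smedjan bench.

```python
import shlex


def bench_cmd(size: str, batch_size: int = 4, seq_len: int = 128, iters: int = 20) -> list[str]:
    """Build the argv for one `smedjan bench` run from the flags shown above."""
    return [
        "smedjan", "bench",
        "--size", size,
        "--batch-size", str(batch_size),
        "--seq-len", str(seq_len),
        "--iters", str(iters),
    ]


commands = [bench_cmd(size) for size in ("small", "medium")]
for argv in commands:
    # Printed rather than executed; pass argv to subprocess.run(argv, check=True) to run it.
    print(shlex.join(argv))
```

Keeping batch size, sequence length, and iteration count fixed across presets makes the resulting numbers directly comparable with the table in section 01.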

// 05

by hardware

Hardware	Suggested starting point
8 GB Apple Silicon	--size small, --batch-size 4–8, --seq-len 256; add --grad-accum to grow the effective batch.
16 GB Apple Silicon	--size medium, --batch-size 8, --seq-len 512.
Large CUDA GPU	Scale --size and --batch-size up with VRAM; use --gradient-checkpointing for the biggest models.

Always build with --release. A debug build is many times slower and not representative of anything.