// reference

Performance & tuning

What Smedjan actually does on real hardware, what made it fast, and the levers that trade memory for speed. Numbers are measured, not extrapolated.

// 01

Measured throughput

Apple M1 Mac mini · 16 GB · batch 4 · seq 128
Preset                     Inference (fwd)   Decode (1 tok)   Train (fwd+bwd)
small (7.2M · d256/6L)     22,900 tok/s      173 tok/s        4,400 tok/s
medium (45M · d512/12L)    5,090 tok/s       65 tok/s         1,150 tok/s

Real throughput from smedjan bench, not theoretical peaks. Reproduce with smedjan bench --size <preset>.
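The measured rates translate directly into wall-clock estimates. The sketch below is a rough back-of-the-envelope helper, not part of Smedjan: it reuses the training figures from the table, and the 100M-token budget is a made-up example. It assumes throughput stays constant over the run, which real runs only approximate.

```python
# Measured training throughput (tok/s) from the table above:
# M1 Mac mini, batch 4, seq 128.
TRAIN_TOK_PER_S = {"small": 4_400, "medium": 1_150}


def training_hours(preset: str, total_tokens: int) -> float:
    """Hours needed to process total_tokens at the measured rate."""
    return total_tokens / TRAIN_TOK_PER_S[preset] / 3600


def steps_per_second(preset: str, batch_size: int = 4, seq_len: int = 128) -> float:
    """Optimizer steps per second, given batch_size * seq_len tokens per step."""
    return TRAIN_TOK_PER_S[preset] / (batch_size * seq_len)


# Hypothetical token budget, chosen only for illustration.
BUDGET = 100_000_000

for preset in TRAIN_TOK_PER_S:
    hours = training_hours(preset, BUDGET)
    rate = steps_per_second(preset)
    print(f"{preset}: {hours:.1f} h for {BUDGET:,} tokens, {rate:.2f} steps/s")
```

Swap in figures from your own smedjan bench run before relying on the estimate.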

// 02

What made it fast

the Metal optimization pass
  • Hardware simdgroup-MMA matmul: on by default, with bit-identical results. On the medium model it runs ~1.4–1.5× the scalar fallback (training 785 → 1,150 tok/s, inference 3,600 → 5,090 tok/s).
  • Batched matmul shaders that collapse the many small GPU dispatches per attention layer into a few.
  • FP16 mixed precision with FP32 (float) accumulators; casts to FP16 are clamped to avoid overflow.
  • A merged forward + backward GPU command batch.
  • Single-instruction RoPE sincos.
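
The simdgroup-MMA speedup can be checked from the two pairs of medium-model figures quoted in the list. A minimal sketch, using only those numbers:

```python
# Medium-model throughput (tok/s) with the scalar fallback and with the
# hardware simdgroup-MMA matmul, as quoted in the list above.
SCALAR = {"train": 785, "inference": 3_600}
SIMDGROUP = {"train": 1_150, "inference": 5_090}


def speedup(kind: str) -> float:
    """Ratio of simdgroup-MMA throughput to scalar-fallback throughput."""
    return SIMDGROUP[kind] / SCALAR[kind]


for kind in SCALAR:
    print(f"{kind}: {speedup(kind):.2f}x")
# train: 1.46x
# inference: 1.41x
```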

// 03

Memory vs. speed

the levers

When you hit a memory ceiling, pull these in order. Each frees memory; the first two cost compute.

Flag                            Default   What it does
--batch-size ↓ / --seq-len ↓    n/a       The bluntest, cheapest reduction.
--grad-accum N                  1         Keep a large effective batch with a small real one.
--gradient-checkpointing        off       ~60% less activation memory for ~2× compute.
--fused-ce                      off       Chunked logits; ~2 GB peak saved.
--fp16-activations              off       Halve inter-layer activation memory.
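
To see how these levers combine, the sketch below works through the arithmetic. It is an illustrative estimator, not Smedjan's own accounting: the function names and the baseline figure are hypothetical, and it assumes the two activation savings stack multiplicatively, which the figures above do not guarantee.

```python
def effective_batch(batch_size: int, grad_accum: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return batch_size * grad_accum


def activation_gb(baseline_gb: float,
                  gradient_checkpointing: bool = False,
                  fp16_activations: bool = False) -> float:
    """Rough activation-memory estimate using the approximate factors above.

    Assumption: the savings multiply when both flags are on.
    """
    gb = baseline_gb
    if gradient_checkpointing:
        gb *= 0.4  # "~60% less activation memory"
    if fp16_activations:
        gb *= 0.5  # "halve inter-layer activation memory"
    return gb


# A real batch of 2 with --grad-accum 4 matches an effective batch of 8.
print(effective_batch(2, 4))

# Hypothetical 10 GB activation baseline with both flags enabled.
print(round(activation_gb(10.0, gradient_checkpointing=True,
                          fp16_activations=True), 2))
```

Treat the result as a first guess and confirm against measured peak memory on your own hardware.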

// 04

Benchmark your own box

the bench command
smedjan bench --size small --batch-size 4 --seq-len 128 --iters 20
# add --simdgroup-matmul to measure the hardware MMA fast path

It reports inference and training throughput with detailed metrics, so you can size a run before committing to it.
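
To compare presets on the same box, the bench invocation can be generated per preset. The helper below is a hypothetical convenience wrapper that only assembles the command shown above with its documented flags; it does not parse the output, since the report format is not specified here.

```python
import shlex


def bench_command(size: str, batch_size: int = 4,
                  seq_len: int = 128, iters: int = 20) -> list[str]:
    """Build the argv for one smedjan bench run (flags as shown above)."""
    return [
        "smedjan", "bench",
        "--size", size,
        "--batch-size", str(batch_size),
        "--seq-len", str(seq_len),
        "--iters", str(iters),
    ]


# Print one command line per preset listed in the throughput table.
for size in ("small", "medium"):
    print(shlex.join(bench_command(size)))
```

To execute a command rather than print it, pass the list to subprocess.run(cmd, check=True).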

// 05

Sensible starting points

by hardware
Hardware               Starting point
8 GB Apple Silicon     --size small, --batch-size 4–8, --seq-len 256; add --grad-accum to grow the effective batch.
16 GB Apple Silicon    --size medium, --batch-size 8, --seq-len 512.
Large CUDA GPU         Scale --size and --batch-size up with VRAM; use --gradient-checkpointing for the biggest models.

Always build with --release. A debug build is many times slower and not representative of anything.