// reference
Performance & tuning
What Smedjan actually does on real hardware, what made it fast, and the levers that trade memory for speed. Numbers are measured, not extrapolated.
// 01
Measured throughput
| Preset | Inference (fwd) | Decode (1 tok) | Train (fwd+bwd) |
|---|---|---|---|
| small (7.2M · d256/6L) | 22,900 tok/s | 173 tok/s | 4,400 tok/s |
| medium (45M · d512/12L) | 5,090 tok/s | 65 tok/s | 1,150 tok/s |
Real throughput from smedjan bench, not theoretical peaks — reproduce with smedjan bench --size <preset>.
// 02
What made it fast
- Hardware simdgroup-MMA matmul — on by default, bit-identical. On the medium model it runs ~1.3–1.4× the scalar fallback (training 785 → 1,150 tok/s, inference 3,600 → 5,090 tok/s).
- Batched matmul shaders that collapse the many small GPU dispatches per attention layer into a few.
- FP16 mixed precision with float accumulators (clamped FP16 casts to avoid overflow).
- A merged forward + backward GPU command batch.
- Single-instruction RoPE sincos.
// 03
Memory vs. speed
When you hit a memory ceiling, pull these in order. Each frees memory; the first two cost compute.
| Flag | Default | What it does |
|---|---|---|
--batch-size ↓ / --seq-len ↓ | — | The bluntest, cheapest reduction. |
--grad-accum N | 1 | Keep a large effective batch with a small real one. |
--gradient-checkpointing | off | ~60% less activation memory for ~2× compute. |
--fused-ce | off | Chunked logits; ~2 GB peak saved. |
--fp16-activations | off | Halve inter-layer activation memory. |
// 04
Benchmark your own box
smedjan bench --size small --batch-size 4 --seq-len 128 --iters 20 # add --simdgroup-matmul to measure the hardware MMA fast path
It reports inference and training throughput with detailed metrics, so you can size a run before committing to it.
// 05
Sensible starting points
| Flag | Default | What it does |
|---|---|---|
8 GB Apple Silicon | — | --size small, --batch-size 4–8, --seq-len 256, add --grad-accum to grow the effective batch. |
16 GB Apple Silicon | — | --size medium, --batch-size 8, --seq-len 512. |
Large CUDA GPU | — | Scale --size up and --batch-size with VRAM; --gradient-checkpointing for the biggest models. |
Always build --release. A debug build is many times slower and not representative of anything.