// reference

Troubleshooting & FAQ

Common failures and how to clear them, a plain account of what is and isn't done yet, and quick answers.

// 01

Troubleshooting

the usual suspects

Loss is NaN or diverging

Gradient clipping with NaN/Inf detection is on by default. If the loss still blows up, lower --lr, increase --warmup, and check your data. --bf16-matmul exists for genuine FP16 overflow, but BF16 has fewer mantissa bits than FP16, so its coarser precision can destabilize an otherwise healthy run. Reach for it last.
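
As a rough illustration of what clipping with NaN/Inf detection involves, the sketch below computes a global L2 norm over all gradients, bails out if any value is non-finite, and rescales otherwise. It is not Smedjan's implementation: the names clip_grad_norm and ClipOutcome and the max_norm parameter are assumptions made for the example.

```rust
/// Result of one clipping pass (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub enum ClipOutcome {
    /// All gradients were finite; `norm` is the global L2 norm before clipping.
    Ok { norm: f32 },
    /// At least one gradient was NaN or Inf; nothing was modified.
    NonFinite,
}

/// Clip gradients in place so their global L2 norm does not exceed `max_norm`.
/// Returns `NonFinite` without touching the gradients if any value is NaN/Inf,
/// so the caller can skip the optimizer step for that batch.
pub fn clip_grad_norm(grads: &mut [Vec<f32>], max_norm: f32) -> ClipOutcome {
    // Accumulate the squared norm in f64 to limit rounding error.
    let mut sum_sq = 0.0f64;
    for g in grads.iter() {
        for &v in g {
            if !v.is_finite() {
                return ClipOutcome::NonFinite;
            }
            sum_sq += (v as f64) * (v as f64);
        }
    }

    let norm = sum_sq.sqrt() as f32;
    if norm > max_norm {
        // The small epsilon avoids overshooting max_norm after rescaling.
        let scale = max_norm / (norm + 1e-6);
        for g in grads.iter_mut() {
            for v in g.iter_mut() {
                *v *= scale;
            }
        }
    }
    ClipOutcome::Ok { norm }
}
```

A training loop built on this sketch would skip the optimizer update whenever it sees NonFinite. A run that keeps hitting that branch usually has a learning-rate or data problem, which is why --lr, --warmup, and the data come first.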

Out of memory

In order: lower --batch-size / --seq-len, add --grad-accum, turn on --gradient-checkpointing, then --fused-ce and --fp16-activations. See Performance & tuning.
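
The reason --grad-accum pairs with a smaller --batch-size is simple arithmetic, sketched below. The sketch assumes --grad-accum is the number of micro-batches accumulated before each optimizer step (the usual meaning, not confirmed here). Both helper functions are hypothetical and not part of Smedjan.

```rust
/// Tokens that contribute to one optimizer step.
/// Assumes `grad_accum` counts micro-batches accumulated per step.
pub fn tokens_per_step(batch_size: usize, seq_len: usize, grad_accum: usize) -> usize {
    batch_size * seq_len * grad_accum
}

/// Accumulation steps needed to reach `target_batch` sequences per optimizer
/// step when only `micro_batch` sequences fit in memory at once (rounds up).
pub fn accum_steps_for(target_batch: usize, micro_batch: usize) -> usize {
    assert!(micro_batch > 0, "micro_batch must be non-zero");
    (target_batch + micro_batch - 1) / micro_batch
}
```

Under that assumption, cutting the batch size from 32 to 8 and accumulating over 4 micro-batches keeps the tokens per optimizer step unchanged, while peak activation memory follows the smaller micro-batch.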

Throughput is terrible

Build with cargo build --release. The hardware simdgroup matmul is on by default; a debug build or a fallback path is many times slower.

CUDA build

Install the CUDA toolkit (12.x) and build with cargo build --release --no-default-features --features cuda. The Metal path is the default on macOS and needs no setup.

// 02

Limitations & roadmap

what's done, what isn't

Smedjan is one engineer's engine, and it says so. The honest state:

  • safetensors import reads F32, BF16, and F16, and config.json maps straight to a Smedjan model via import-hf. Export works too.
  • GGUF export covers f32, q8_0, and q4_0 as standard GGML blocks (norms stay f32).
  • Bit-exact HuggingFace inference parity is still on the roadmap. The config.json → model + BF16/F16 import path works for continued training. Reproducing HF inference to the bit is a separate matter: the RoPE and QK-norm conventions (half-split RoPE, fixed QK-norm) deliberately diverge, and continued training adapts the difference away.
  • RWKV and block-sparse both train. The RWKV WKV now uses a numerically-stable decay form and converges at long sequence; block-sparse trains like dense. --ssm and --linear-attn train as well.
  • CUDA backward parity for a few specialized kernels is still being completed; the Metal path is the most exercised.
  • Long-context evaluation (NIAH / RULER) ships — run smedjan eval --longctx. The strength of the curve depends on how well the model was trained.

// 03

FAQ

quick answers

Does it need Python or PyTorch?

No. The entire dependency tree is a handful of small crates (clap, rand, memmap2, byteorder) plus the GPU FFI bindings.

Can I move a checkpoint between Metal and CUDA?

Yes — the checkpoint format is portable across both backends. Train on a Mac, resume on an NVIDIA GPU.

What's the smallest useful model?

tiny (~2M parameters) or small is enough to prove the pipeline, and both suit very small on-device models. Scale up from there.

Can I export to GGUF / run under llama.cpp?

Export standard GGML weights with smedjan export-gguf --checkpoint final.bin --output model.gguf --quant q4_0 (f32, q8_0, or q4_0). The blocks are validated against the reference GGUF dequantizer, but a Smedjan checkpoint is not yet a turnkey llama.cpp inference model — the tokenizer isn't embedded and the RoPE/QK-norm conventions differ. Direct llama.cpp inference is on the roadmap.
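
For readers unfamiliar with GGML block quantization, here is a minimal sketch of the q8_0 scheme: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit values. It is not Smedjan's exporter. The type and function names are hypothetical, and the scale is kept as f32 to stay dependency-free, whereas GGML stores it as f16.

```rust
/// Number of weights per block in GGML's q8_0 layout.
const QK: usize = 32;

/// One q8_0-style block (hypothetical type for illustration).
/// Simplification: GGML stores the scale `d` as f16; f32 is used here.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockQ8 {
    pub d: f32,
    pub qs: [i8; QK],
}

/// Quantize 32 weights: the scale maps the largest magnitude to 127.
pub fn quantize_q8_0(x: &[f32; QK]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    // An all-zero block gets a zero scale; avoid dividing by it.
    let inv_d = if d != 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; QK];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (v * inv_d).round() as i8;
    }
    BlockQ8 { d, qs }
}

/// Dequantize a block back to f32: each weight is quant * scale.
pub fn dequantize_q8_0(b: &BlockQ8) -> [f32; QK] {
    let mut out = [0.0f32; QK];
    for (o, q) in out.iter_mut().zip(b.qs.iter()) {
        *o = f32::from(*q) * b.d;
    }
    out
}
```

A round trip through these two functions changes each weight by at most about half the block scale. Comparing exported blocks against a reference dequantizer checks the same kind of property.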

License?

MIT. Own it, fork it, ship it.