OWN THE STACK. ZERO PYTHON.
The smithy for language models. Pure Rust, your hardware, your weights.
Every headline above is written, right now, by a 295K-parameter Smedjan model compiled to WebAssembly and running on your machine — no server, no API.
cargo install smedjan
Own the stack
Every other path to training your own model drags in PyTorch, a CUDA-only stack, a Python runtime, and cloud lock-in. Smedjan is the opposite: you own the entire stack, end to end, in one language.
Own the stack
Every line, every GPU kernel, every byte of the model lives in one repo. No framework you don't control sits between you and the metal.
Zero Python, zero PyTorch
The entire dependency tree is four small crates — clap, rand, memmap2, byteorder — plus the GPU FFI bindings. Nothing else to install, audit, or trust.
Two backends, one codebase
Metal on Apple Silicon, CUDA on NVIDIA, selected at compile time. Checkpoints are portable across both — train on a Mac, resume on an H100.
From scratch, but complete
Not a toy. Tokenizer → pretrain → distill → SFT → DPO → quantize → export → serve. The whole pipeline, in one language.
From raw text to a served checkpoint
One binary, one tokenizer, one checkpoint format. No glue scripts, no handoff between frameworks, no step that secretly needs Python.
Architecture
The model
Decoder-only transformer, pre-norm (RMSNorm), Rotary Position Embeddings with NTK-aware and YaRN scaling, SwiGLU feed-forward, and an lm_head weight-tied to the embedding.
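As a rough illustration of three of the building blocks named above, here is a minimal CPU sketch of RMSNorm, the SwiGLU gate, and a rotary embedding applied to one head vector. It is not Smedjan's code: the function names, the interleaved (even, odd) pairing in `rope`, and the omission of NTK-aware and YaRN scaling are assumptions made for brevity.

```rust
/// RMSNorm: divide by the root-mean-square of the vector, then apply a
/// learned per-channel weight. No mean subtraction, unlike LayerNorm.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU gate: silu(gate) * up, element-wise. In a full feed-forward block,
/// `gate` and `up` are two linear projections of the same input, and the
/// result goes through a third (down) projection.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len());
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}

/// Rotary position embedding for one head vector at position `pos`.
/// Assumption: consecutive (even, odd) elements form a rotated pair;
/// some implementations pair element i with element i + d/2 instead.
fn rope(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head dimension must be even");
    for i in 0..d / 2 {
        // Pair i rotates at frequency base^(-2i/d).
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
```

Because `rope` is a pure rotation, it leaves the vector's length unchanged and is the identity at position 0, which makes both properties handy sanity checks for any implementation.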
Attention
Multi-Head or Grouped-Query (GQA) via --kv-heads, with Flash Attention on both backends. Plus a research menu of alternative mixers — see below.
Sizes
Presets from tiny (2M) through 6.5B, or fully custom: --size custom --dim --layers --heads.
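For a feel of how the custom flags translate into model size, the sketch below estimates the parameter count of a decoder with the shape described above. The `Config` struct, its field names, and the `ffn_hidden` width are hypothetical. It assumes no bias terms, three SwiGLU matrices per layer, two RMSNorm weights per layer plus a final one, and a tied lm_head, so it will not necessarily reproduce the preset figures.

```rust
/// Hypothetical model shape; field names are illustrative, not Smedjan's.
struct Config {
    vocab: usize,
    dim: usize,
    layers: usize,
    heads: usize,
    kv_heads: usize,
    ffn_hidden: usize,
}

/// Estimated parameter count under the assumptions stated above.
fn param_count(c: &Config) -> usize {
    let head_dim = c.dim / c.heads;
    // With grouped-query attention, K and V project to fewer heads than Q.
    let kv_dim = head_dim * c.kv_heads;
    let attn = 2 * c.dim * c.dim      // Wq and Wo
             + 2 * c.dim * kv_dim;    // Wk and Wv
    let ffn = 3 * c.dim * c.ffn_hidden; // gate, up, down
    let norms = 2 * c.dim;              // pre-attention and pre-FFN RMSNorm
    let embedding = c.vocab * c.dim;    // shared with lm_head (weight tying)
    embedding + c.layers * (attn + ffn + norms) + c.dim // + final norm
}
```

Lowering `kv_heads` shrinks only the K and V projections, so the saving in weights is modest; the larger effect of grouped-query attention is a smaller KV cache at decode time.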
The mixer menu
Swap the sequence mixer without leaving the codebase. Pick one and watch how it attends — the experimental ones are labelled. Flash Attention backs the softmax path on both backends; Mixture-of-Experts routing lives in the feed-forward.
Multi-Head / GQA
Grouped-Query attention via --kv-heads — the default, production path, with Flash Attention on both backends.
    scores = (q @ kᵀ) * scale
    scores += causal_mask
    p = softmax(scores)
    out = p @ v
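The four steps above can be sketched on the CPU as a naive single-head reference, alongside the usual way grouped-query attention maps query heads onto shared K/V heads. This is an illustration under assumptions, not the Flash Attention kernels the project ships: `causal_attention` and `kv_head_for` are hypothetical names, and the contiguous head grouping is one common convention.

```rust
/// Naive single-head causal attention over [seq][head_dim] inputs.
/// O(seq^2) time; no tiling or fusion as in Flash Attention.
fn causal_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    assert_eq!(q.len(), k.len());
    assert_eq!(k.len(), v.len());
    if q.is_empty() {
        return Vec::new();
    }
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    let mut out = Vec::with_capacity(q.len());
    for (i, qi) in q.iter().enumerate() {
        // scores = (q @ kT) * scale. Visiting only keys j <= i is equivalent
        // to adding a causal mask of -inf to the future positions.
        let scores: Vec<f32> = k[..=i]
            .iter()
            .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // p = softmax(scores), shifted by the max for numerical stability.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let denom: f32 = exps.iter().sum();
        // out = p @ v
        let mut row = vec![0.0f32; v[0].len()];
        for (p, vj) in exps.iter().zip(&v[..=i]) {
            for (o, x) in row.iter_mut().zip(vj) {
                *o += (p / denom) * x;
            }
        }
        out.push(row);
    }
    out
}

/// Grouped-query attention: which K/V head a given query head reads from,
/// assuming contiguous groups of n_heads / n_kv_heads query heads.
fn kv_head_for(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    q_head / (n_heads / n_kv_heads)
}
```

With `n_kv_heads == n_heads` the mapping is the identity (plain multi-head attention); with `n_kv_heads == 1` every query head shares one K/V head.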
Performance
| Preset | Inference (fwd) | Decode (1 tok) | Train (fwd+bwd) |
|---|---|---|---|
| small · 7.2M | 22,900 tok/s | 173 tok/s | 4,400 tok/s |
| medium · 45M | 5,090 tok/s | 65 tok/s | 1,150 tok/s |
Batch 4, sequence length 128 — real throughput, not theoretical peaks. The hardware simdgroup-MMA matmul (on by default) runs about 1.3–1.4× the scalar fallback; the rest of the Metal pass is batched matmul shaders, FP16 mixed precision with float accumulators, a merged forward+backward command batch, and single-instruction RoPE sincos. Reproduce with smedjan bench --size <preset>.
Quick start
    # clone and build (Metal is the default on macOS)
    git clone https://github.com/smedjan/smedjan.git
    cd smedjan && cargo build --release   # → ./target/release/smedjan

    # or, on NVIDIA
    cargo build --release --no-default-features --features cuda

    # train a BPE tokenizer, prepare data, train a model
    smedjan tokenizer --input corpus.txt --vocab-size 8192 --output tok.bin
    smedjan prepare --input corpus.txt --tokenizer tok.bin --output train.bin
    smedjan train --dataset train.bin --tokenizer tok.bin --size medium --stream
Forge it with me
The ground rules
No heavyweight dependencies. New GPU kernels ship with gradient checks. Everything is verified against real hardware — the test suite runs on the GPU, not a CPU stand-in. Keep the stack ownable.
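For contributors new to gradient checks, the idea can be sketched in a few lines: compare a kernel's analytic gradient against central finite differences of its forward pass. The version below runs on the CPU in f64 purely to show the technique. The helper names, step size, and tolerance are assumptions, not the project's test harness, which per the rules above runs on the GPU.

```rust
/// Central-difference estimate of the gradient of a scalar function `f` at `x`.
fn numerical_grad<F: Fn(&[f64]) -> f64>(f: F, x: &[f64], h: f64) -> Vec<f64> {
    let mut probe = x.to_vec();
    (0..x.len())
        .map(|i| {
            let orig = probe[i];
            probe[i] = orig + h;
            let plus = f(&probe);
            probe[i] = orig - h;
            let minus = f(&probe);
            probe[i] = orig; // restore before moving to the next coordinate
            (plus - minus) / (2.0 * h)
        })
        .collect()
}

/// Largest symmetric relative error between two gradients.
fn max_rel_error(analytic: &[f64], numeric: &[f64]) -> f64 {
    analytic
        .iter()
        .zip(numeric)
        .map(|(a, n)| (a - n).abs() / (a.abs() + n.abs()).max(1e-12))
        .fold(0.0, f64::max)
}
```

A check then evaluates `max_rel_error(&analytic, &numerical_grad(loss, &x, 1e-5))` and fails if it exceeds a small tolerance. A backward pass with a wrong sign or a missing factor typically shows up as an error near 1 rather than near zero.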
Where it's headed
Faithful HF inference parity (config.json → model, bf16 loading), CUDA backward parity for the remaining specialized kernels, chunked O(N) RWKV forward, and longer-context evaluation. Open lanes, all of them.
This is one engineer's engine. If you want to understand and own ML infrastructure down to the kernel, there's room to build here.
Back the forge
Sponsor
Smedjan is free, MIT-licensed, and built without a company behind it. Sponsorship buys hardware hours and the time to keep building — no Python, no cloud, no strings.