Pure-Rust LLM engine · Metal + CUDA

OWN THE STACK. ZERO PYTHON.

The smithy for language models. Pure Rust, your hardware, your weights.

Every headline above is written, right now, by a 295K-parameter Smedjan model compiled to WebAssembly and running on your machine — no server, no API.

$ cargo install smedjan
// 01

Own the stack

the manifesto

Most paths to training your own model drag in PyTorch, a CUDA-only stack, a Python runtime, and cloud lock-in. Smedjan is the opposite: you own the entire stack, end to end, in one language.

Own the stack

Every line, every GPU kernel, every byte of the model lives in one repo. No framework you don't control sits between you and the metal.

Zero Python, zero PyTorch

The entire dependency tree is four small crates — clap, rand, memmap2, byteorder — plus the GPU FFI bindings. Nothing else to install, audit, or trust.

Two backends, one codebase

Metal on Apple Silicon, CUDA on NVIDIA, selected at compile time. Checkpoints are portable across both — train on a Mac, resume on an H100.

From scratch, but complete

Not a toy. Tokenizer → pretrain → distill → SFT → DPO → quantize → export → serve. The whole pipeline, in one language.

// 02

From raw text to a served checkpoint

every stage is a real subcommand
tokenizer      train a BPE tokenizer
prepare        text → binary format
process        filter + dedup
mix            blend datasets
train          pretrain from scratch
distill        KL + CE from a teacher
sft            supervised fine-tune
dpo            preference alignment
quantize       Q4 / Q8 post-training
export-gguf    llama.cpp-compatible
generate       sample · stream · spec-decode

One binary, one tokenizer, one checkpoint format. No glue scripts, no handoff between frameworks, no step that secretly needs Python.
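To make the quantize stage concrete, here is a minimal sketch of symmetric per-block 8-bit quantization, the general idea behind a Q8 format. It is an illustration only: the names `Q8Block`, `quantize_q8`, and `dequantize_q8`, the block size, and the one-scale-per-block layout are all assumptions, not Smedjan's actual on-disk format.

```rust
/// One quantized block: a single f32 scale plus one i8 per weight.
/// (Hypothetical layout, chosen for illustration.)
pub struct Q8Block {
    pub scale: f32,
    pub values: Vec<i8>,
}

/// Quantize weights in fixed-size blocks so that the largest magnitude
/// in each block maps to +/-127.
pub fn quantize_q8(weights: &[f32], block_size: usize) -> Vec<Q8Block> {
    assert!(block_size > 0, "block_size must be non-zero");
    weights
        .chunks(block_size)
        .map(|block| {
            let max_abs = block.iter().fold(0.0f32, |m, w| m.max(w.abs()));
            // An all-zero block gets scale 1.0 to avoid dividing by zero.
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
            let values = block
                .iter()
                .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
                .collect();
            Q8Block { scale, values }
        })
        .collect()
}

/// Reconstruct approximate f32 weights from the quantized blocks.
pub fn dequantize_q8(blocks: &[Q8Block]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.values.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}
```

The round-trip error per weight is bounded by half a quantization step, that is `scale / 2` for the block the weight lives in. Smaller blocks track local weight ranges more closely at the cost of storing more scales.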

// 03

Architecture

decoder-only transformer
src/
model.rs          transformer: presets + custom sizes
attention.rs      GQA + Linear · SSM · RWKV · MLA · block-sparse
tensor.rs         GPU tensor operations
autograd.rs       tape-based reverse-mode autodiff
train.rs          loop · grad-accum · validation · resume
generate.rs       sampling · speculative decoding
optim.rs          AdamW · Muon / NorMuon · schedulers
quantize.rs       Q4/Q8 · GGUF export
safetensors.rs    safetensors + HF-Llama interop
backend-agnostic above this line
metal/            Apple Silicon · MSL kernels
cuda/             NVIDIA · cudarc

The model

Decoder-only transformer, pre-norm (RMSNorm), Rotary Position Embeddings with NTK-aware and YaRN scaling, SwiGLU feed-forward, and an lm_head weight-tied to the embedding.
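Two of the building blocks named above, RMSNorm and the SwiGLU gate, are small enough to sketch on the CPU. This is a reference illustration under assumed names (`rms_norm`, `silu`, `swiglu`) and a plain `&[f32]` layout. Smedjan's real versions run as GPU kernels and will look different.

```rust
/// RMSNorm over one vector: y_i = x_i / sqrt(mean(x^2) + eps) * gain_i.
pub fn rms_norm(x: &[f32], gain: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), gain.len(), "gain must match the input width");
    assert!(!x.is_empty(), "input must be non-empty");
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gain).map(|(v, g)| v * inv_rms * g).collect()
}

/// SiLU activation: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU gating on already-projected activations: silu(gate) * up.
/// The gate/up/down linear projections are left out of this sketch.
pub fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up must have equal width");
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}
```

In a pre-norm block, `rms_norm` is applied to the residual stream before each sub-layer (attention or feed-forward), and the sub-layer output is added back to the un-normalized residual.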

Attention

Multi-Head or Grouped-Query (GQA) via --kv-heads, with Flash Attention on both backends. Plus a research menu of alternative mixers — see below.
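The core of Grouped-Query attention is an index mapping: several query heads read from one shared K/V head. A hedged sketch of that mapping follows. The function names are hypothetical, and the contiguous grouping shown here is the common convention rather than something confirmed for Smedjan.

```rust
/// Index of the shared K/V head that a given query head reads from,
/// assuming query heads are grouped contiguously.
pub fn kv_head_for(query_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_kv_heads > 0 && n_heads % n_kv_heads == 0,
            "n_heads must be a multiple of n_kv_heads");
    assert!(query_head < n_heads, "query_head out of range");
    let group_size = n_heads / n_kv_heads;
    query_head / group_size
}

/// Number of cached K and V elements for a full context.
/// The factor of 2 covers keys plus values.
pub fn kv_cache_elements(layers: usize, seq_len: usize,
                         n_kv_heads: usize, head_dim: usize) -> usize {
    2 * layers * seq_len * n_kv_heads * head_dim
}
```

With `n_kv_heads == n_heads` the mapping is the identity (plain Multi-Head attention). With `n_kv_heads == 1` every query head shares one K/V head. The cache shrinks in proportion to `n_kv_heads`, which is the main reason to use GQA at decode time.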

Sizes

Presets from tiny (2M) through 6.5B, or fully custom: --size custom --dim --layers --heads.

// 04

The mixer menu

one AttnKind enum — explore it

Swap the sequence mixer without leaving the codebase. Pick one and watch how it attends — the experimental ones are labelled. Flash Attention backs the softmax path on both backends; Mixture-of-Experts routing lives in the feed-forward.

O(N²) compute vs. sequence length

Multi-Head / GQA

Grouped-Query attention via --kv-heads — the default, production path, with Flash Attention on both backends.

scores = (q @ kᵀ) * scale
scores += causal_mask
p = softmax(scores)
out = p @ v
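The four lines above translate almost directly into a CPU reference for a single head. This is a sketch for reading and testing, not the Flash Attention kernel: the function name `causal_attention` and the row-major `[seq_len, head_dim]` layout are assumptions, and the causal mask is applied by only visiting keys `0..=i` instead of adding negative infinity to the scores.

```rust
/// Causal scaled dot-product attention for one head.
/// `q`, `k`, `v` are row-major with shape [seq_len, head_dim].
pub fn causal_attention(q: &[f32], k: &[f32], v: &[f32],
                        seq_len: usize, head_dim: usize) -> Vec<f32> {
    let n = seq_len * head_dim;
    assert!(q.len() == n && k.len() == n && v.len() == n, "shape mismatch");
    let scale = 1.0 / (head_dim as f32).sqrt();
    let mut out = vec![0.0f32; n];

    for i in 0..seq_len {
        let qi = &q[i * head_dim..(i + 1) * head_dim];
        // scores = (q @ k^T) * scale, restricted to positions j <= i.
        let scores: Vec<f32> = (0..=i)
            .map(|j| {
                let kj = &k[j * head_dim..(j + 1) * head_dim];
                qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale
            })
            .collect();
        // Numerically stable softmax: subtract the row maximum first.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let denom: f32 = exps.iter().sum();
        // out = p @ v
        for (j, e) in exps.iter().enumerate() {
            let p = e / denom;
            let vj = &v[j * head_dim..(j + 1) * head_dim];
            for d in 0..head_dim {
                out[i * head_dim + d] += p * vj[d];
            }
        }
    }
    out
}
```

The nested loops make the O(N²) cost visible: row `i` touches `i + 1` keys, so the total work grows with the square of the sequence length.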
// 05

Performance

Apple M1 Mac mini · 16 GB · measured with smedjan bench
Preset          Inference (fwd)    Decode (1 tok)    Train (fwd+bwd)
small · 7.2M    22,900 tok/s       173 tok/s         4,400 tok/s
medium · 45M    5,090 tok/s        65 tok/s          1,150 tok/s

Batch 4, sequence length 128 — real throughput, not theoretical peaks. The hardware simdgroup-MMA matmul (on by default) runs about 1.3–1.4× the scalar fallback; the rest of the Metal pass is batched matmul shaders, FP16 mixed precision with float accumulators, a merged forward+backward command batch, and single-instruction RoPE sincos. Reproduce with smedjan bench --size <preset>.
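As a rough way to read those numbers, throughput converts to per-step latency with simple arithmetic. The sketch below assumes the reported tok/s counts every token in the batch (batch × sequence length per step). That is an assumption about how the benchmark reports, and the helper names are hypothetical.

```rust
/// Tokens processed in one forward (or forward+backward) step.
pub fn tokens_per_step(batch: usize, seq_len: usize) -> usize {
    batch * seq_len
}

/// Milliseconds per step implied by a tokens-per-second figure.
pub fn step_millis(batch: usize, seq_len: usize, tokens_per_sec: f64) -> f64 {
    assert!(tokens_per_sec > 0.0, "throughput must be positive");
    tokens_per_step(batch, seq_len) as f64 / tokens_per_sec * 1000.0
}

/// Milliseconds per generated token at a given decode rate.
pub fn decode_millis_per_token(tokens_per_sec: f64) -> f64 {
    assert!(tokens_per_sec > 0.0, "throughput must be positive");
    1000.0 / tokens_per_sec
}
```

Under that assumption, batch 4 at sequence length 128 is 512 tokens per step, so 4,400 tok/s for training on the small preset works out to roughly 116 ms per step, and 173 tok/s decode is about 5.8 ms per token.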

// 06

Quick start

a recent stable Rust toolchain is all you need
# clone and build (Metal is the default on macOS)
git clone https://github.com/smedjan/smedjan.git
cd smedjan && cargo build --release        # → ./target/release/smedjan

# or, on NVIDIA
cargo build --release --no-default-features --features cuda

# train a BPE tokenizer, prepare data, train a model
smedjan tokenizer --input corpus.txt --vocab-size 8192 --output tok.bin
smedjan prepare   --input corpus.txt --tokenizer tok.bin --output train.bin
smedjan train     --dataset train.bin --tokenizer tok.bin --size medium --stream
// 07

Forge it with me

solo-built · looking for collaborators

The ground rules

No heavyweight dependencies. New GPU kernels ship with gradient checks. Everything is verified against real hardware — the test suite runs on the GPU, not a CPU stand-in. Keep the stack ownable.

Where it's headed

Faithful HF inference parity (config.json → model, bf16 loading), CUDA backward parity for the remaining specialized kernels, chunked O(N) RWKV forward, and longer-context evaluation. Open lanes, all of them.

This is one engineer's engine. If you want to understand and own ML infrastructure down to the kernel, there's room to build here.