Pure-Rust LLM engine · Metal + CUDA

OWN THE STACK. ZERO PYTHON.

The smithy for language models. Pure Rust, your hardware, your weights.

Every headline above is written, right now, by a 295K-parameter Smedjan model compiled to WebAssembly and running on your machine — no server, no API.

$ cargo install smedjan

// 01

Own the stack

the manifesto

Every other path to training your own model drags in PyTorch, a CUDA-only stack, a Python runtime, and cloud lock-in. Smedjan is the opposite: you own the entire stack, end to end, in one language.

Own the stack

Every line, every GPU kernel, every byte of the model lives in one repo. No framework you don't control sits between you and the metal.

Zero Python, zero PyTorch

The entire dependency tree is four small crates — clap, rand, memmap2, byteorder — plus the GPU FFI bindings. Nothing else to install, audit, or trust.

Two backends, one codebase

Metal on Apple Silicon, CUDA on NVIDIA, selected at compile time. Checkpoints are portable across both — train on a Mac, resume on an H100.
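
Compile-time selection of this kind is commonly done with Cargo features. The sketch below assumes a `cuda` feature (the name used in the NVIDIA build command) and treats Metal as the fallback. The `Backend` enum and `selected_backend` function are hypothetical names for illustration, not Smedjan's actual API.

```rust
/// Hypothetical backend tag; the real crate's types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Metal,
    Cuda,
}

/// Resolved at compile time: `cfg!` expands to a constant `true` or `false`,
/// so only one branch survives in the final binary.
pub const fn selected_backend() -> Backend {
    if cfg!(feature = "cuda") {
        Backend::Cuda
    } else {
        // Assumption: Metal is what you get when `cuda` is not enabled.
        Backend::Metal
    }
}
```

Because the choice is a build-time constant, code above the backend layer never branches on it at runtime, which is one plausible way checkpoints can stay backend-neutral.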

From scratch, but complete

Not a toy. Tokenizer → pretrain → distill → SFT → DPO → quantize → export → serve. The whole pipeline, in one language.

// 02

From raw text to a served checkpoint

every stage is a real subcommand

tokenizer     train a BPE tokenizer
prepare       text → binary format
process       filter + dedup
mix           blend datasets
train         pretrain from scratch
distill       KL + CE from a teacher
sft           supervised fine-tune
dpo           preference alignment
quantize      Q4 / Q8 post-training
export-gguf   llama.cpp-compatible
generate      sample · stream · spec-decode

One binary, one tokenizer, one checkpoint format. No glue scripts, no handoff between frameworks, no step that secretly needs Python.
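
To make the "KL + CE from a teacher" objective of the distill stage concrete, here is a minimal CPU sketch for a single token position. It is an illustration under assumptions: the mixing weight `alpha` and the function names are hypothetical, temperature scaling is left out, and Smedjan's own loss may be weighted differently.

```rust
/// Numerically stable log-softmax over one row of logits.
fn log_softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum = logits.iter().map(|&x| (x - max).exp()).sum::<f32>().ln();
    logits.iter().map(|&x| x - max - log_sum).collect()
}

/// alpha * KL(teacher || student) + (1 - alpha) * CE(student, target).
/// `alpha` is a hypothetical mixing weight, not a documented flag.
pub fn distill_loss(student: &[f32], teacher: &[f32], target: usize, alpha: f32) -> f32 {
    assert_eq!(student.len(), teacher.len());
    let ls = log_softmax(student);
    let lt = log_softmax(teacher);
    // KL(teacher || student) = sum_i p_t(i) * (log p_t(i) - log p_s(i))
    let kl: f32 = lt.iter().zip(&ls).map(|(&t, &s)| t.exp() * (t - s)).sum();
    // Cross-entropy against the hard label.
    let ce = -ls[target];
    alpha * kl + (1.0 - alpha) * ce
}
```

The KL term pulls the student's full distribution toward the teacher's, while the CE term keeps it anchored to the ground-truth token.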

// 03

Architecture

decoder-only transformer

src/
  model.rs        transformer: presets + custom sizes
  attention.rs    GQA + Linear · SSM · RWKV · MLA · block-sparse
  tensor.rs       GPU tensor operations
  autograd.rs     tape-based reverse-mode autodiff
  train.rs        loop · grad-accum · validation · resume
  generate.rs     sampling · speculative decoding
  optim.rs        AdamW · Muon / NorMuon · schedulers
  quantize.rs     Q4/Q8 · GGUF export
  safetensors.rs  safetensors + HF-Llama interop
  ---- backend-agnostic above this line ----
  metal/          Apple Silicon · MSL kernels
  cuda/           NVIDIA · cudarc
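
For readers new to the term, "tape-based reverse-mode autodiff" means recording each operation as it runs, then walking the record backwards to accumulate gradients. The scalar sketch below shows the idea only. `Tape`, `Node`, and the method names are hypothetical, and the real autograd.rs operates on GPU tensors rather than single floats.

```rust
/// One recorded operation: two parent indices with local partial derivatives.
struct Node {
    parents: [(usize, f32); 2],
}

/// A minimal scalar tape (illustrative only).
pub struct Tape {
    nodes: Vec<Node>,
    values: Vec<f32>,
}

impl Tape {
    pub fn new() -> Self {
        Tape { nodes: Vec::new(), values: Vec::new() }
    }

    fn push(&mut self, value: f32, parents: [(usize, f32); 2]) -> usize {
        self.nodes.push(Node { parents });
        self.values.push(value);
        self.nodes.len() - 1
    }

    /// Leaf variable: its "parents" point at itself with zero weight.
    pub fn var(&mut self, value: f32) -> usize {
        let id = self.nodes.len();
        self.push(value, [(id, 0.0), (id, 0.0)])
    }

    pub fn add(&mut self, a: usize, b: usize) -> usize {
        let v = self.values[a] + self.values[b];
        self.push(v, [(a, 1.0), (b, 1.0)])
    }

    pub fn mul(&mut self, a: usize, b: usize) -> usize {
        let (va, vb) = (self.values[a], self.values[b]);
        // d(a*b)/da = b and d(a*b)/db = a
        self.push(va * vb, [(a, vb), (b, va)])
    }

    /// Walk the tape backwards from `out`, accumulating d(out)/d(node).
    pub fn backward(&self, out: usize) -> Vec<f32> {
        let mut grad = vec![0.0; self.nodes.len()];
        grad[out] = 1.0;
        for i in (0..=out).rev() {
            let g = grad[i];
            for &(p, w) in &self.nodes[i].parents {
                grad[p] += w * g;
            }
        }
        grad
    }
}
```

Nodes are appended in execution order, so a parent always has a lower index than its child and a single reverse pass visits them in a valid order.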

The model

Decoder-only transformer, pre-norm (RMSNorm), Rotary Position Embeddings with NTK-aware and YaRN scaling, SwiGLU feed-forward, and an lm_head weight-tied to the embedding.
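
Two of the building blocks named above, RMSNorm and rotary embeddings, are small enough to sketch on the CPU. This is an illustration under assumptions: the function names are hypothetical, the RoPE variant shown rotates interleaved pairs (some implementations split the vector in halves instead), and NTK-aware and YaRN scaling are not included.

```rust
/// RMSNorm: x_i * weight_i / sqrt(mean(x^2) + eps). No mean subtraction, no bias.
pub fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len());
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Plain RoPE: rotate each pair (x[2i], x[2i+1]) by pos * base^(-2i/d).
pub fn rope(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head dimension must be even");
    for i in 0..d / 2 {
        let theta = pos as f32 * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
```

Each RoPE step is a pure rotation, so it changes the direction of a query or key vector but not its length.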

Attention

Multi-Head or Grouped-Query (GQA) via --kv-heads, with Flash Attention on both backends. Plus a research menu of alternative mixers — see below.
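
In Grouped-Query attention, several query heads share one key/value head. A minimal sketch of that mapping follows; `kv_head_for` is a hypothetical helper, and the contiguous grouping it uses is the common convention rather than something confirmed for Smedjan.

```rust
/// Index of the shared K/V head that query head `q_head` reads from.
pub fn kv_head_for(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    assert!(
        n_kv_heads > 0 && n_heads % n_kv_heads == 0,
        "query heads must divide evenly into K/V heads"
    );
    assert!(q_head < n_heads);
    // Each group of (n_heads / n_kv_heads) consecutive query heads shares one K/V head.
    q_head / (n_heads / n_kv_heads)
}
```

With 8 query heads and 2 K/V heads, heads 0 to 3 read K/V head 0 and heads 4 to 7 read K/V head 1. Setting the two counts equal recovers plain Multi-Head attention.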

Sizes

Presets from tiny (2M) through 6.5B, or fully custom: --size custom --dim --layers --heads.

// 04

The mixer menu

one AttnKind enum — explore it

Swap the sequence mixer without leaving the codebase. The experimental ones are labelled. Flash Attention backs the softmax path on both backends; Mixture-of-Experts routing lives in the feed-forward.

O(N²) compute vs. sequence length

Multi-Head / GQA

Grouped-Query attention via --kv-heads — the default, production path, with Flash Attention on both backends.

scores = (q @ kᵀ) * scale
scores += causal_mask
p = softmax(scores)
out = p @ v
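
The four lines above can be written out as a naive single-head reference in Rust. This is a sketch for reading, not the production path: the function name and the `Vec<Vec<f32>>` layout are assumptions, and a Flash Attention kernel computes the same result without materializing the score rows.

```rust
/// Causal softmax attention for one head.
/// `q`, `k`, `v` each hold one row per position.
pub fn causal_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    assert!(!q.is_empty() && q.len() == k.len() && k.len() == v.len());
    let n = q.len();
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    let mut out = vec![vec![0.0f32; v[0].len()]; n];
    for i in 0..n {
        // scores = (q @ k^T) * scale, restricted to j <= i (the causal mask).
        let scores: Vec<f32> = (0..=i)
            .map(|j| q[i].iter().zip(&k[j]).map(|(a, b)| a * b).sum::<f32>() * scale)
            .collect();
        // p = softmax(scores), with the max subtracted for stability.
        let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let denom: f32 = exps.iter().sum();
        // out = p @ v
        for (j, e) in exps.iter().enumerate() {
            let p = e / denom;
            for (o, x) in out[i].iter_mut().zip(&v[j]) {
                *o += p * x;
            }
        }
    }
    out
}
```

The nested loops over `i` and `j` are where the O(N²) cost in sequence length comes from.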

// 05

Performance

Apple M1 Mac mini · 16 GB · measured with smedjan bench

Preset	Inference (fwd)	Decode (1 tok)	Train (fwd+bwd)
small · 7.2M	22,900 tok/s	173 tok/s	4,400 tok/s
medium · 45M	5,090 tok/s	65 tok/s	1,150 tok/s

Batch 4, sequence length 128 — real throughput, not theoretical peaks. The hardware simdgroup-MMA matmul (on by default) runs about 1.3–1.4× the scalar fallback; the rest of the Metal pass is batched matmul shaders, FP16 mixed precision with float accumulators, a merged forward+backward command batch, and single-instruction RoPE sincos. Reproduce with smedjan bench --size <preset>.

// 06

Quick start

a recent stable Rust toolchain is all you need

# clone and build (Metal is the default on macOS)
git clone https://github.com/smedjan/smedjan.git
cd smedjan && cargo build --release        # → ./target/release/smedjan

# or, on NVIDIA
cargo build --release --no-default-features --features cuda

# train a BPE tokenizer, prepare data, train a model
smedjan tokenizer --input corpus.txt --vocab-size 8192 --output tok.bin
smedjan prepare   --input corpus.txt --tokenizer tok.bin --output train.bin
smedjan train     --dataset train.bin --tokenizer tok.bin --size medium --stream

// 07

Forge it with me

solo-built · looking for collaborators

The ground rules

No heavyweight dependencies. New GPU kernels ship with gradient checks. Everything is verified against real hardware — the test suite runs on the GPU, not a CPU stand-in. Keep the stack ownable.
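
A gradient check typically compares a kernel's analytic backward pass against central finite differences. The sketch below shows the general technique on the CPU; `numeric_grad` and `max_rel_error` are hypothetical names, and the step size and tolerance a real suite uses are not stated here.

```rust
/// Central-difference estimate of df/dx_i for every input coordinate.
pub fn numeric_grad(f: impl Fn(&[f32]) -> f32, x: &[f32], h: f32) -> Vec<f32> {
    let mut probe = x.to_vec();
    let mut grad = Vec::with_capacity(x.len());
    for i in 0..x.len() {
        let orig = probe[i];
        probe[i] = orig + h;
        let plus = f(probe.as_slice());
        probe[i] = orig - h;
        let minus = f(probe.as_slice());
        probe[i] = orig; // restore before moving to the next coordinate
        grad.push((plus - minus) / (2.0 * h));
    }
    grad
}

/// Largest symmetric relative error between analytic and numeric gradients.
pub fn max_rel_error(analytic: &[f32], numeric: &[f32]) -> f32 {
    assert_eq!(analytic.len(), numeric.len());
    analytic
        .iter()
        .zip(numeric)
        .map(|(a, n)| (a - n).abs() / (a.abs() + n.abs()).max(1e-6))
        .fold(0.0, f32::max)
}
```

A new kernel would pass when `max_rel_error` between its backward output and `numeric_grad` of its forward pass stays under a chosen threshold.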

Where it's headed

Faithful HF inference parity (config.json → model, bf16 loading), CUDA backward parity for the remaining specialized kernels, chunked O(N) RWKV forward, and longer-context evaluation. Open lanes, all of them.

This is one engineer's engine. If you want to understand and own ML infrastructure down to the kernel, there's room to build here.
