Getting started
Smedjan trains, fine-tunes, aligns, quantizes, and serves decoder-only transformer language models from scratch — on your own hardware, in one Rust binary. No Python, no PyTorch, no cloud. This page takes you from a text file to a model that generates.
Install
git clone https://github.com/smedjan/smedjan.git
cd smedjan
cargo build --release    # Metal (macOS / Apple Silicon), the default
# NVIDIA: cargo build --release --no-default-features --features cuda
Get Rust from rustup.rs. On macOS the Metal backend is the default and needs no extra setup. For NVIDIA you need the CUDA toolkit (12.x). The release build lands at ./target/release/smedjan — put it on your PATH or call it directly.
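If you would rather type smedjan than the full path, one way to put the release build on your PATH for the current shell session is sketched below. It assumes you are still inside the smedjan checkout and that the build has finished; the check uses only the standard command -v builtin and does not invoke any Smedjan flags.

```shell
# Assumes the current directory is the smedjan checkout and
# `cargo build --release` has already completed.
export PATH="$PWD/target/release:$PATH"

# Check that the shell can now resolve the binary.
if command -v smedjan >/dev/null 2>&1; then
    echo "smedjan found at $(command -v smedjan)"
else
    echo "smedjan not found; call ./target/release/smedjan directly"
fi
```

The export only lasts for the current session. Add the same line, with an absolute path, to your shell profile if you want it to persist.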
The whole pipeline, four commands
One binary owns every stage. Point it at a plain-text corpus and walk it from tokenizer to trained checkpoint to sampled text:
# 1. train a byte-pair tokenizer on your text
smedjan tokenizer --input corpus.txt --vocab-size 8192 --output tokenizer.bin

# 2. pack the text into the binary training format
smedjan prepare --input corpus.txt --tokenizer tokenizer.bin --output train.bin

# 3. train a model (writes checkpoints/ as it goes; resume any time)
smedjan train \
  --dataset train.bin --tokenizer tokenizer.bin \
  --size small --batch-size 8 --seq-len 256 \
  --steps 20000 --lr 3e-4 --checkpoint-dir checkpoints/

# 4. generate from the trained checkpoint
smedjan generate \
  --checkpoint checkpoints/final.bin --tokenizer tokenizer.bin \
  --prompt "Once upon a time" --stream
Start small. A --size small model on a few megabytes of text trains in minutes on a laptop and is enough to see the pipeline work end to end. Scale up the size, data, and steps once the loop is proven.
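If your corpus is much larger than a few megabytes, one way to carve out a small sample for a first run is to copy its first few million bytes with the standard head tool. This is only a sketch: make_sample and corpus_small.txt are hypothetical names chosen for illustration, not part of Smedjan, and the 4 MB figure is an arbitrary choice.

```shell
# make_sample SRC DST BYTES: copy the first BYTES bytes of SRC into DST.
# (make_sample and corpus_small.txt are hypothetical names, not part of Smedjan.)
make_sample() {
    head -c "$3" "$1" > "$2"
}

# Take roughly the first 4 MB of the corpus, if it exists in this directory.
if [ -f corpus.txt ]; then
    make_sample corpus.txt corpus_small.txt 4000000
    wc -c corpus_small.txt
fi
```

You would then pass corpus_small.txt as --input wherever the four commands use corpus.txt. A byte cut can split the final line or a multi-byte character; for a smoke test that should rarely matter, but trim the tail if it does.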
What just happened
Four artefacts, one format. The tokenizer command learns a vocabulary and converts text to integer tokens and back. prepare encodes the corpus with that vocabulary and streams the tokens into a memory-mappable binary file. train fits a transformer to predict the next token, writing checkpoints (weights, optimizer state, and step count) to checkpoints/ so you can resume exactly where you stopped. generate loads a checkpoint and samples new tokens.
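To make "text to integer tokens and back" concrete, the sketch below does the round trip at the raw byte level using only standard tools. It is not Smedjan's tokenizer and says nothing about its file formats: byte-pair tokenizers typically start from a small base alphabet like this and then merge frequent adjacent pairs into new ids, which this toy example leaves out.

```shell
text='Once upon'

# Encode: one integer (0-255) per byte of the input.
ids=$(printf '%s' "$text" | od -An -v -tu1 | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//')
echo "ids:  $ids"        # ids:  79 110 99 101 32 117 112 111 110

# Decode: turn each integer back into its byte via an octal escape.
decoded=$(for i in $ids; do printf "\\$(printf '%03o' "$i")"; done)
echo "text: $decoded"
```

A learned vocabulary (8192 entries in the tokenizer command above) would represent the same text with fewer, larger-valued ids, but the contract is the same: encoding followed by decoding returns the original text.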
Every later stage — fine-tuning, alignment, distillation, quantization, export — reads and writes the same checkpoint format, so you chain them without ever leaving Smedjan or touching another framework.