// documentation

Getting started

Smedjan trains, fine-tunes, aligns, quantizes, and serves decoder-only transformer language models from scratch — on your own hardware, in one Rust binary. No Python, no PyTorch, no cloud. This page takes you from a text file to a model that generates.

// 01

Install

needs a recent stable Rust toolchain

git clone https://github.com/smedjan/smedjan.git
cd smedjan
cargo build --release        # Metal (macOS / Apple Silicon) — the default
# NVIDIA:
cargo build --release --no-default-features --features cuda

Get Rust from rustup.rs. On macOS the Metal backend is the default and needs no extra setup. For NVIDIA you need the CUDA toolkit (12.x). The release build lands at ./target/release/smedjan — put it on your PATH or call it directly.

// 02

The whole pipeline, four commands

corpus.txt → generated text

One binary owns every stage. Point it at a plain-text corpus and walk it from tokenizer to trained checkpoint to sampled text:

# 1 — train a byte-pair tokenizer on your text
smedjan tokenizer --input corpus.txt --vocab-size 8192 --output tokenizer.bin

# 2 — pack the text into the binary training format
smedjan prepare --input corpus.txt --tokenizer tokenizer.bin --output train.bin

# 3 — train a model (writes checkpoints/ as it goes; resume any time)
smedjan train \
  --dataset train.bin --tokenizer tokenizer.bin \
  --size small --batch-size 8 --seq-len 256 \
  --steps 20000 --lr 3e-4 --checkpoint-dir checkpoints/

# 4 — generate from the trained checkpoint
smedjan generate \
  --checkpoint checkpoints/final.bin --tokenizer tokenizer.bin \
  --prompt "Once upon a time" --stream

Start small. A --size small model on a few megabytes of text trains in minutes on a laptop and is enough to see the pipeline work end to end. Scale up the size, data, and steps once the loop is proven.

// 03

What just happened

the mental model

Four artefacts, one format. The tokenizer step learns a vocabulary and converts text to integer tokens and back. The prepare step streams those tokens into a memory-mappable binary file. The train step fits a transformer to predict the next token, writing weights, optimizer state, and the step count to checkpoints/ so you can resume exactly. The generate step loads a checkpoint and samples new tokens.
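To make the data path above concrete, here is a minimal sketch of text becoming tokens, tokens becoming a flat binary buffer, and that buffer becoming next-token training pairs. Everything in it is an assumption made for illustration: the byte-level encode/decode stands in for a trained byte-pair tokenizer, the little-endian u16 layout is a guess at what a memory-mappable token file could look like, and the function names are hypothetical. It does not describe Smedjan's actual on-disk format.

```rust
// Hypothetical sketch: not Smedjan's actual tokenizer or file format.

/// Byte-level stand-in for a tokenizer: each UTF-8 byte becomes one id (0..=255).
/// A trained byte-pair tokenizer would also merge frequent pairs into new ids.
fn encode(text: &str) -> Vec<u16> {
    text.bytes().map(u16::from).collect()
}

/// Inverse of `encode`. Only valid for ids below 256, as produced above.
fn decode(tokens: &[u16]) -> String {
    let bytes: Vec<u8> = tokens.iter().map(|&t| t as u8).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}

/// Pack token ids as consecutive little-endian u16 values (assumed layout).
fn pack(tokens: &[u16]) -> Vec<u8> {
    tokens.iter().flat_map(|t| t.to_le_bytes()).collect()
}

/// Read the flat buffer back into token ids.
fn unpack(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks_exact(2)
        .map(|c| u16::from_le_bytes([c[0], c[1]]))
        .collect()
}

/// Cut the stream into (input, target) windows of `seq_len` tokens, where the
/// target is the input shifted by one token. `seq_len` must be greater than 0.
fn next_token_pairs(tokens: &[u16], seq_len: usize) -> Vec<(&[u16], &[u16])> {
    tokens
        .windows(seq_len + 1)
        .step_by(seq_len)
        .map(|w| (&w[..seq_len], &w[1..]))
        .collect()
}

fn main() {
    let tokens = encode("Once upon a time");
    let restored = unpack(&pack(&tokens));
    assert_eq!(decode(&restored), "Once upon a time");

    // Each pair is what a next-token objective sees: predict `target` from `input`.
    for (input, target) in next_token_pairs(&restored, 4) {
        println!("{:?} -> {:?}", decode(input), decode(target));
    }
}
```

Fixed-width ids in a flat buffer are what make such a file cheap to memory-map: token i sits at byte offset 2 × i, so a trainer can take windows anywhere in the file without parsing it.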

Every later stage — fine-tuning, alignment, distillation, quantization, export — reads and writes the same checkpoint format, so you chain them without ever leaving Smedjan or touching another framework.
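One way to picture why a shared checkpoint lets stages chain and resume: if a checkpoint carries everything a stage needs (weights, optimizer state, step), then stopping, saving, reloading, and continuing gives the same result as never stopping. The sketch below is illustrative only. The Checkpoint struct, its byte layout, and the toy run_stage update are all hypothetical and are not Smedjan's real format or optimizer.

```rust
// Hypothetical sketch: field names and byte layout are illustrative,
// not Smedjan's actual checkpoint format.

#[derive(Debug, Clone, PartialEq)]
struct Checkpoint {
    step: u64,
    weights: Vec<f32>,
    optimizer_state: Vec<f32>,
}

/// Read a little-endian u64 from exactly 8 bytes.
fn read_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    u64::from_le_bytes(buf)
}

impl Checkpoint {
    /// Assumed layout: step, weight count, weights, optimizer state (little-endian).
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend_from_slice(&self.step.to_le_bytes());
        out.extend_from_slice(&(self.weights.len() as u64).to_le_bytes());
        for v in self.weights.iter().chain(self.optimizer_state.iter()) {
            out.extend_from_slice(&v.to_le_bytes());
        }
        out
    }

    /// Returns None if the buffer is too short or has a malformed length.
    fn from_bytes(bytes: &[u8]) -> Option<Checkpoint> {
        if bytes.len() < 16 || (bytes.len() - 16) % 4 != 0 {
            return None;
        }
        let step = read_u64(&bytes[0..8]);
        let n = read_u64(&bytes[8..16]) as usize;
        let floats: Vec<f32> = bytes[16..]
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        if n > floats.len() {
            return None;
        }
        let (weights, optimizer_state) = floats.split_at(n);
        Some(Checkpoint {
            step,
            weights: weights.to_vec(),
            optimizer_state: optimizer_state.to_vec(),
        })
    }
}

/// Stand-in for any stage: takes a checkpoint, runs `steps` toy updates,
/// and returns a newer checkpoint in the same shape.
fn run_stage(mut ckpt: Checkpoint, steps: u64) -> Checkpoint {
    for _ in 0..steps {
        for (w, m) in ckpt.weights.iter_mut().zip(ckpt.optimizer_state.iter_mut()) {
            *m = 0.9 * *m + 0.1; // toy momentum-like state, not a real optimizer
            *w -= 0.01 * *m;
        }
        ckpt.step += 1;
    }
    ckpt
}

fn main() {
    let start = Checkpoint {
        step: 0,
        weights: vec![1.0, -2.0],
        optimizer_state: vec![0.0, 0.0],
    };

    // Ten steps in one go, versus four steps, a save/load, then six more.
    let straight = run_stage(start.clone(), 10);
    let saved = run_stage(start, 4).to_bytes();
    let resumed = run_stage(Checkpoint::from_bytes(&saved).expect("valid bytes"), 6);

    assert_eq!(resumed, straight);
    println!("resumed at step {} matches the uninterrupted run", resumed.step);
}
```

The equality holds here because the saved bytes capture the optimizer state and step as well as the weights. Drop either one from the file and the resumed run would diverge from the uninterrupted one.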

// 04

Where to next

pick a guide

Core concepts → Tokenizer, model architecture, checkpoints, the two backends.
Data & tokenizer → Build, clean, dedup, and mix your training corpus.
Training → Sizes, the knobs that matter, optimizers, and memory.
Fine-tuning & alignment → Turn a base model into an instruction follower with SFT and DPO.
Inference & evaluation → Sampling, streaming, speculative decoding, perplexity, eval.
CLI reference → Every subcommand and flag, straight from the binary.