// reference

CLI reference

Every subcommand and its main flags, taken straight from the binary. For the exhaustive, always-current list, run `smedjan <command> --help`.

// 01

Data

build the corpus

tokenizer

Train a BPE tokenizer from a text corpus.

smedjan tokenizer --input corpus.txt --vocab-size 8192 --output tokenizer.bin

`--input`	Text corpus (required).
`--vocab-size`	Default 32000.
`--output`	Default tokenizer.bin.
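
To make the BPE training step concrete, here is a minimal Python sketch of the classic merge loop: count adjacent symbol pairs, merge the most frequent pair, repeat. It is illustrative only. The function name `train_bpe`, the character-level start symbols, and the tie-breaking rule are assumptions, not smedjan's implementation or its `tokenizer.bin` format.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merges over whitespace-split words (toy BPE)."""
    # Each word is a tuple of symbols, starting as single characters.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = train_bpe("low low low lower lowest", 2)
```

A real trainer keeps merging until the vocabulary reaches the requested size, which is the role `--vocab-size` plays here.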

import-bpe

Import a GPT-2 / HuggingFace merges.txt as a byte-level BPE tokenizer.

smedjan import-bpe --merges merges.txt --output tokenizer.bin

`--merges`	Path to merges.txt (required).
`--output`	Tokenizer output (required).

prepare

Tokenize raw text into the binary training format.

smedjan prepare --input corpus.txt --tokenizer tokenizer.bin --output dataset.bin

`--input`	Raw text (required).
`--tokenizer`	Tokenizer (required).
`--output`	Default dataset.bin.

process

Clean a text file through the data pipeline, with provenance.

smedjan process --input raw.txt --tokenizer tokenizer.bin --output clean.bin

`--separator`	Document boundary; default "\n\n".
`--provenance-log`	Provenance log file.
`--source-name / --source-url / --license`	Provenance fields.

dedup

Deduplicate (MinHash) and quality-filter documents.

smedjan dedup --input docs.txt --output filtered.txt

`--threshold`	Similarity 0–1; default 0.8.
`--min-quality`	Quality 0–1; default 0.3.
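
For intuition about what a MinHash similarity between 0 and 1 means, the sketch below estimates Jaccard similarity between two documents from their MinHash signatures. The shingle size, the hash function (BLAKE2b), the number of hashes, and all function names are assumptions chosen for illustration; smedjan's actual parameters are not documented here.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set[str]:
    """Set of overlapping k-word shingles (k is an illustrative choice)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(items: set[str], num_hashes: int = 64) -> list[int]:
    """One minimum per salted hash function forms the signature."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature

def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Under this reading, a pair of documents whose estimate reaches the `--threshold` value would be treated as near-duplicates; whether smedjan compares with "greater than" or "greater than or equal" is not stated in this reference.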

mix

Mix tokenized shards into one dataset with weights.

smedjan mix --shards a.bin:0.7,b.bin:0.3 --output train.bin

`--shards`	path:weight,path:weight,… (required).
`--output`	Output dataset (required).
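
The `path:weight,path:weight,…` syntax can be read as a list of shards with relative sampling weights. A small Python sketch of parsing such a spec follows; the helper name `parse_shards` and the normalization of weights to sum to 1 are assumptions, since this reference does not say how smedjan treats weights that do not sum to 1.

```python
def parse_shards(spec: str) -> list[tuple[str, float]]:
    """Parse 'a.bin:0.7,b.bin:0.3' into (path, normalized_weight) pairs."""
    shards = []
    for item in spec.split(","):
        # Split on the last colon so paths containing ':' still parse.
        path, _, weight = item.rpartition(":")
        if not path:
            raise ValueError(f"expected path:weight, got {item!r}")
        shards.append((path, float(weight)))
    total = sum(w for _, w in shards)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing is an assumption made for this sketch.
    return [(p, w / total) for p, w in shards]
```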

hash

SHA-256 of a file (pin your inputs).

smedjan hash --file train.bin

`--file`	File to hash (required).
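
If you want to check a pinned digest independently of the binary, a SHA-256 can be computed with Python's standard library. This sketch streams the file in chunks; the helper name `sha256_file` is hypothetical, and the exact output format of `smedjan hash` (for example, whether it prints the filename) is not specified in this reference.

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```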

// 02

Train & adapt

fit and refine

train

Pre-train a model from scratch, optionally distilling from a teacher.

smedjan train --dataset train.bin --tokenizer tokenizer.bin --size medium --steps 50000

`--size`	tiny · small · medium · large · max · custom (default small).
`--batch-size / --seq-len / --steps / --lr / --warmup`	Core schedule.
`--grad-accum / --gradient-checkpointing / --fused-ce`	Memory.
`--optimizer`	adamw · muon · hybrid · sophia · adamw-8bit.
`--resume / --val-dataset`	Resume + validation.
`--teacher-checkpoint`	KL+CE distillation.
`(many more)`	See `smedjan train --help`.
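
"KL+CE distillation" usually means the student is trained on a mix of ordinary cross-entropy against the data and a KL term pulling it toward the teacher's distribution. The sketch below shows one common form for a single token position. The mixing weight `alpha`, the `temperature`, and the `temperature ** 2` scaling are assumptions taken from the standard formulation, not confirmed details of smedjan's loss.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]

def distill_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=1.0):
    """alpha * CE(student, target) + (1 - alpha) * T^2 * KL(teacher || student)."""
    ce = -log_softmax(student_logits)[target]
    s = log_softmax(student_logits, temperature)
    t = log_softmax(teacher_logits, temperature)
    # KL(teacher || student) summed over the vocabulary.
    kl = sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t, s))
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl
```

When student and teacher logits agree, the KL term vanishes and only the cross-entropy part remains.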

sft

Supervised fine-tuning on instruction/response JSONL.

smedjan sft --checkpoint final.bin --tokenizer tokenizer.bin --data instr.jsonl

`--data`	JSONL {"prompt","response"} (required).
`--lr`	Default 2e-5.
`--steps`	Default 1000.
`--batch-size`	Default 8.
`--output-dir`	Default sft_checkpoints.

sft-prepare

Convert paired text (e.g. NL2Bash) into SFT JSONL.

smedjan sft-prepare --input pairs.tsv --output instr.jsonl

`--input`	Tab-separated or alternating-line pairs.
`--output`	JSONL output.
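
To show the shape of the data involved, here is a Python sketch of the tab-separated case: each `prompt<TAB>response` line becomes one JSON object per output line. The `"prompt"` and `"response"` keys are the ones `sft --data` expects; the helper name `pairs_to_jsonl` and the handling of blank lines are assumptions, and the alternating-line input form is not covered.

```python
import json

def pairs_to_jsonl(tsv_lines, out) -> int:
    """Write one {"prompt", "response"} JSON object per tab-separated pair."""
    count = 0
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skipping blank lines is an assumption of this sketch
        prompt, sep, response = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        record = {"prompt": prompt, "response": response}
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count
```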

dpo-prepare

Convert preference pairs JSONL to the binary DPO format.

smedjan dpo-prepare --input prefs.jsonl --output prefs.bin --tokenizer tokenizer.bin

`--input`	JSONL {"prompt","chosen","rejected"}.
`--output`	Binary dataset.

dpo

Direct Preference Optimization alignment.

smedjan dpo --checkpoint sft.bin --ref-checkpoint sft.bin --tokenizer tokenizer.bin --dataset prefs.bin

`--checkpoint`	Policy (updated).
`--ref-checkpoint`	Frozen reference.
`--beta`	Default 0.1.
`--lr`	Default 1e-6.
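
The roles of the policy, the frozen reference, and `--beta` are easiest to see in the standard DPO objective for one preference pair: the loss is the negative log-sigmoid of `beta` times the difference between how much the policy has raised the chosen response and how much it has raised the rejected one, both measured against the reference. The sketch below implements that published formula on sequence log-probabilities; the function name is hypothetical, and nothing here is confirmed as smedjan's exact implementation.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

With `--checkpoint` and `--ref-checkpoint` pointing at the same file, as in the example command, the margin starts at zero and the initial loss is log 2.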

distill

Generate instruction data from an external model (Ollama/Claude/OpenAI).

smedjan distill --model qwen2.5:7b --n-samples 1000 --output synth.jsonl

`--api-url`	Default Ollama localhost.
`--api-key`	For hosted APIs.
`--model / --n-samples / --max-tokens`	Generation controls.

grow

Grow a small trained model into a larger architecture.

smedjan grow --checkpoint small.bin --output big.bin --dim 1024 --layers 24 --heads 16

`--dim / --layers / --heads`	Target architecture (required).

merge

Average several checkpoints (weight-space merging).

smedjan merge --checkpoints a.bin b.bin --output merged.bin

`--checkpoints`	Two or more checkpoints.
`--output`	Default merged.bin.
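
Weight-space merging in its simplest form is an element-wise mean over tensors with the same name and shape. The Python sketch below models a checkpoint as a dict from tensor name to a flat list of floats; that representation, the uniform weighting, and the name `average_checkpoints` are assumptions for illustration and say nothing about smedjan's on-disk format.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of same-named tensors across two or more checkpoints."""
    if len(checkpoints) < 2:
        raise ValueError("need two or more checkpoints")
    names = set(checkpoints[0])
    for ckpt in checkpoints[1:]:
        if set(ckpt) != names:
            raise ValueError("checkpoints have different tensor names")
    merged = {}
    for name in checkpoints[0]:
        tensors = [c[name] for c in checkpoints]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for tensor {name!r}")
        # Uniform average; other weightings are possible but not shown.
        merged[name] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return merged
```

Averaging only makes sense when the checkpoints share one architecture, which is why the sketch rejects mismatched names and sizes.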

// 03

Inference & evaluation

run and measure

generate

Generate text from a checkpoint.

smedjan generate --checkpoint m.bin --tokenizer tokenizer.bin --prompt "Hi" --stream

`--temperature / --top-p / --top-k / --min-p / --typical-p`	Sampling.
`--repetition-penalty / --no-repeat-ngram-size`	Anti-repetition.
`--speculative / --draft-checkpoint / --draft-tokens`	Speculative decoding.
`--batch-file`	Batched prompts.
`--stream`	Stream tokens.
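
As a rough guide to what the top-k and top-p sampling flags control, the sketch below filters a probability distribution: keep at most `top_k` of the most likely tokens, then keep the smallest prefix whose cumulative probability reaches `top_p`, and renormalize. The order in which filters are applied and the function name `filter_probs` are assumptions; smedjan may combine its sampling flags differently.

```python
def filter_probs(probs, top_k=0, top_p=1.0):
    """Apply top-k then top-p (nucleus) filtering and renormalize."""
    # Token indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # nucleus reached
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

filtered = filter_probs([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.7)
```

Here the two most likely tokens already cover 0.8 of the mass, so only they survive and are rescaled to sum to 1.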

perplexity

Perplexity (the exponential of the mean negative log-likelihood) over inline text or a file.

smedjan perplexity --checkpoint m.bin --tokenizer tokenizer.bin --file heldout.txt

`--text`	Inline text to score.
`--file`	Score a file instead.
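
The definition is short enough to state in code. Given the natural-log probability the model assigned to each token, perplexity is the exponential of the mean negative log-likelihood. The function name is hypothetical, and how smedjan windows long inputs is not covered by this sketch.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that spreads probability uniformly over four choices at every step has a perplexity of 4, which is a handy sanity check for the scale of the number.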

eval

Evaluate against built-in benchmarks.

smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin

`--longctx`	Synthetic long-context (NIAH/RULER) suite.
`--longctx-lengths`	Default 256,512,1024.
`--longctx-depths`	Default 0.0,0.5,1.0.
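
Needle-in-a-haystack suites typically test every combination of context length and needle depth, where depth 0.0 places the needle at the start and 1.0 at the end. The sketch below enumerates that grid for the default values. Treating lengths as token counts, the `needle_len` value, and the mapping from depth to position are assumptions, not documented smedjan behaviour.

```python
from itertools import product

def needle_cases(lengths, depths, needle_len=8):
    """Return (length, depth, start_position) for every length/depth pair."""
    cases = []
    for length, depth in product(lengths, depths):
        # Depth scales over the positions where the needle still fits.
        position = round(depth * (length - needle_len))
        cases.append((length, depth, position))
    return cases

cases = needle_cases([256, 512, 1024], [0.0, 0.5, 1.0])
```

With three lengths and three depths, the defaults give nine test cases.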

info

Show a checkpoint's architecture and parameter count.

smedjan info --checkpoint m.bin

`--checkpoint`	Checkpoint to inspect (required).

sizes

Print preset model sizes and parameter counts.

smedjan sizes --vocab-size 8192

`--vocab-size`	Vocab to compute counts for; default 8192.

bench

Benchmark inference and training throughput.

smedjan bench --size small --batch-size 4 --seq-len 128 --iters 20

`--size`	tiny · small · medium · large.
`--iters / --warmup`	Timed + warmup iterations.
`--simdgroup-matmul`	Measure the hardware MMA fast path.

// 04

Export & deploy

ship it

quantize

Quantize a checkpoint (Q8 ≈ 4×, Q4 ≈ 8× smaller).

smedjan quantize --checkpoint final.bin --output model.qbin --bits 4

`--bits`	4 or 8 (default 4).
`--output`	Default model.qbin.
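
The approximate 4× and 8× figures follow from simple bit arithmetic if the unquantized weights are 32-bit floats, which is an assumption of this sketch rather than something this reference states. The helpers below are hypothetical and ignore per-block scale factors and file headers, which is one reason the real ratios are only approximate.

```python
def compression_ratio(bits: int, baseline_bits: int = 32) -> float:
    """How many times smaller weights get going from baseline_bits to bits."""
    return baseline_bits / bits

def weight_bytes(n_params: int, bits: int) -> int:
    """Bytes needed for n_params weights at `bits` each, rounded up."""
    return (n_params * bits + 7) // 8
```

For example, `compression_ratio(8)` gives 4.0 and `compression_ratio(4)` gives 8.0, matching the figures quoted for Q8 and Q4.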

export-gguf

Export to GGUF for llama.cpp.

smedjan export-gguf --checkpoint final.bin --output model.gguf --quant q8_0

`--quant`	f32 or q8_0 (default f32).
`--output`	Default model.gguf.

export-safetensors

Export to safetensors (HuggingFace).

smedjan export-safetensors --checkpoint final.bin --output model.safetensors

`--output`	Default model.safetensors.