// reference

CLI reference

Every subcommand and its main flags, taken straight from the binary. For the exhaustive, always-current list, run `smedjan <command> --help`.

// 01

Data

build the corpus

tokenizer: Train a BPE tokenizer from a text corpus.
smedjan tokenizer --input corpus.txt --vocab-size 8192 --output tokenizer.bin
--input: Text corpus (required).
--vocab-size: Default 32000.
--output: Default tokenizer.bin.
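
For readers new to BPE, one training iteration can be sketched as: count adjacent token pairs, then merge the most frequent pair into a new token. This is a minimal illustration of the general technique; the function names are hypothetical and nothing here reflects how smedjan implements its tokenizer.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair that occurs most often in the sequence."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair, new_token):
    """Replace every non-overlapping occurrence of pair with new_token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# One merge step on a toy character sequence.
tokens = list("aaabdaaabac")
pair = most_frequent_pair(tokens)
tokens = merge_pair(tokens, pair, "Z")
```

Repeating this step until the vocabulary reaches the requested size is the idea behind a setting like --vocab-size.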

import-bpe: Import a GPT-2 / HuggingFace merges.txt as a byte-level BPE.
smedjan import-bpe --merges merges.txt --output tokenizer.bin
--merges: Path to merges.txt (required).
--output: Tokenizer output (required).

prepare: Tokenize raw text into the binary training format.
smedjan prepare --input corpus.txt --tokenizer tokenizer.bin --output dataset.bin
--input: Raw text (required).
--tokenizer: Tokenizer (required).
--output: Default dataset.bin.

process: Clean a text file through the data pipeline, with provenance.
smedjan process --input raw.txt --tokenizer tokenizer.bin --output clean.bin
--separator: Document boundary; default "\n\n".
--provenance-log: Provenance log file.
--source-name / --source-url / --license: Provenance fields.

dedup: Deduplicate (MinHash) and quality-filter documents.
smedjan dedup --input docs.txt --output filtered.txt
--threshold: Similarity 0–1; default 0.8.
--min-quality: Quality 0–1; default 0.3.
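
To give a feel for what a MinHash similarity threshold means, here is a minimal sketch of the general technique. The shingle size, the number of hash functions, and all function names are assumptions chosen for illustration; smedjan's own shingling and hashing are not documented here and may differ.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles in text (k=3 is an assumption)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        prefix = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(prefix + s.encode("utf-8"), digest_size=8).digest(),
                "little",
            )
            for s in items
        ))
    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "completely different words appear in this other sentence entirely"
sim = estimated_similarity(minhash_signature(shingles(doc_a)),
                           minhash_signature(shingles(doc_b)))
is_duplicate = sim >= 0.8  # 0.8 mirrors the documented default threshold
```

Under this reading, two documents count as near-duplicates when the estimated similarity reaches the threshold, so raising --threshold keeps more documents.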

mix: Mix tokenized shards into one dataset with weights.
smedjan mix --shards a.bin:0.7,b.bin:0.3 --output train.bin
--shards: path:weight,path:weight,… (required).
--output: Output dataset (required).
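
The path:weight list is easy to generate or validate from a script before calling the command. The sketch below parses such a spec and normalizes the weights to sum to 1; whether smedjan itself normalizes weights is not stated above, so treat that step and the helper names as assumptions.

```python
def parse_shards(spec):
    """Parse 'a.bin:0.7,b.bin:0.3' into a list of (path, weight) tuples."""
    shards = []
    for entry in spec.split(","):
        # Split on the last colon so paths containing ':' still work.
        path, _, weight = entry.rpartition(":")
        if not path:
            raise ValueError(f"expected path:weight, got {entry!r}")
        shards.append((path, float(weight)))
    return shards

def normalize(shards):
    """Scale weights so they sum to 1 (an assumption, not documented behavior)."""
    total = sum(weight for _, weight in shards)
    return [(path, weight / total) for path, weight in shards]

shards = parse_shards("a.bin:0.7,b.bin:0.3")
```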

hash: SHA-256 of a file (pin your inputs).
smedjan hash --file train.bin
--file: File to hash (required).
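
A digest can also be cross-checked independently of the binary. The sketch below streams a file through Python's standard hashlib; that smedjan prints the digest as lowercase hex is an assumption, so compare case-insensitively if in doubt.

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording this value next to a training run makes it possible to confirm later that the dataset file has not changed.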
// 02

Train & adapt

fit and refine

train: Pre-train a model from scratch, optionally distilling from a teacher.
smedjan train --dataset train.bin --tokenizer tokenizer.bin --size medium --steps 50000
--size: tiny · small · medium · large · max · custom (default small).
--batch-size / --seq-len / --steps / --lr / --warmup: Core schedule.
--grad-accum / --gradient-checkpointing / --fused-ce: Memory.
--optimizer: adamw · muon · hybrid · sophia · adamw-8bit.
--resume / --val-dataset: Resume + validation.
--teacher-checkpoint: KL+CE distillation.
(many more): See `smedjan train --help`.
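
When choosing schedule values it helps to estimate how many tokens a run will see. The sketch below assumes that --batch-size counts sequences per micro-batch and that --grad-accum multiplies it; neither is confirmed above, so check `smedjan train --help` before relying on the numbers.

```python
def tokens_per_step(batch_size, seq_len, grad_accum=1):
    """Tokens consumed per optimizer step under the stated assumptions."""
    return batch_size * seq_len * grad_accum

def total_tokens(batch_size, seq_len, steps, grad_accum=1):
    """Tokens consumed over a whole run."""
    return tokens_per_step(batch_size, seq_len, grad_accum) * steps

# Hypothetical settings, for illustration only.
budget = total_tokens(batch_size=8, seq_len=1024, steps=50000, grad_accum=4)
```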

sft: Supervised fine-tuning on instruction/response JSONL.
smedjan sft --checkpoint final.bin --tokenizer tokenizer.bin --data instr.jsonl
--data: JSONL {"prompt","response"} (required).
--lr: Default 2e-5.
--steps: Default 1000.
--batch-size: Default 8.
--output-dir: Default sft_checkpoints.

sft-prepare: Convert paired text (e.g. NL2Bash) into SFT JSONL.
smedjan sft-prepare --input pairs.tsv --output instr.jsonl
--input: Tab-separated or alternating-line pairs.
--output: JSONL output.
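
To make the target format concrete, the sketch below turns tab-separated pairs into one JSON object per line with the "prompt" and "response" keys that sft expects. It only covers the tab-separated case, and the function name is hypothetical; the command's exact handling of blank lines and escaping is not documented here.

```python
import json

def pairs_to_jsonl(lines):
    """Convert 'prompt<TAB>response' lines into JSONL record strings."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines (an assumption)
        prompt, sep, response = line.partition("\t")
        if not sep:
            raise ValueError(f"missing tab separator: {line!r}")
        records.append(json.dumps({"prompt": prompt, "response": response},
                                  ensure_ascii=False))
    return records

records = pairs_to_jsonl(["list files\tls -la\n", "\n"])
```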

dpo-prepare: Convert preference pairs JSONL to the binary DPO format.
smedjan dpo-prepare --input prefs.jsonl --output prefs.bin --tokenizer tokenizer.bin
--input: JSONL {"prompt","chosen","rejected"}.
--output: Binary dataset.

dpo: Direct Preference Optimization alignment.
smedjan dpo --checkpoint sft.bin --ref-checkpoint sft.bin --tokenizer tokenizer.bin --dataset prefs.bin
--checkpoint: Policy model (updated during training).
--ref-checkpoint: Frozen reference model.
--beta: Default 0.1.
--lr: Default 1e-6.
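
The roles of the policy, the frozen reference, and --beta are easiest to see in the per-example loss. The sketch below is the standard DPO objective from the literature, computed on summed log-probabilities of the chosen and rejected responses; that smedjan uses exactly this form is an assumption.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair.

    Inputs are sequence log-probabilities under the policy and the
    frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)

# Policy identical to the reference: the margin is zero.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy prefers the chosen response more than the reference does.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

A larger beta penalizes drifting from the reference more sharply for the same log-probability margin.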

distill: Generate instruction data from an external model (Ollama/Claude/OpenAI).
smedjan distill --model qwen2.5:7b --n-samples 1000 --output synth.jsonl
--api-url: Default Ollama localhost.
--api-key: For hosted APIs.
--model / --n-samples / --max-tokens: Generation controls.

grow: Grow a small trained model into a larger architecture.
smedjan grow --checkpoint small.bin --output big.bin --dim 1024 --layers 24 --heads 16
--dim / --layers / --heads: Target architecture (required).

merge: Average several checkpoints (weight-space merging).
smedjan merge --checkpoints a.bin b.bin --output merged.bin
--checkpoints: Two or more checkpoints.
--output: Default merged.bin.
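
Weight-space merging can be pictured as an element-wise mean over parameters that share a name. The sketch below models a checkpoint as a dict from parameter name to a flat list of floats, which is purely illustrative: smedjan's checkpoint format is not described here, and uniform (unweighted) averaging is an assumption.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints of identical shape."""
    names = checkpoints[0].keys()
    if any(ckpt.keys() != names for ckpt in checkpoints):
        raise ValueError("checkpoints must contain the same parameters")
    merged = {}
    for name in names:
        tensors = [ckpt[name] for ckpt in checkpoints]
        if any(len(t) != len(tensors[0]) for t in tensors):
            raise ValueError(f"shape mismatch for {name!r}")
        merged[name] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return merged

merged = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```

Averaging only makes sense when the checkpoints share one architecture, which is why the sketch rejects mismatched names or lengths.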
// 03

Inference & evaluation

run and measure

generate: Generate text from a checkpoint.
smedjan generate --checkpoint m.bin --tokenizer tokenizer.bin --prompt "Hi" --stream
--temperature / --top-p / --top-k / --min-p / --typical-p: Sampling.
--repetition-penalty / --no-repeat-ngram-size: Anti-repetition.
--speculative / --draft-checkpoint / --draft-tokens: Speculative decoding.
--batch-file: Batched prompts.
--stream: Stream tokens.
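
As a rough guide to what the sampling flags do, the sketch below applies temperature scaling, then top-k, then top-p (nucleus) filtering to a list of logits and returns the renormalized candidate distribution. The order in which smedjan combines these filters is an assumption, and min-p and typical-p are not covered.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep the k most likely tokens
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # smallest prefix whose mass reaches top_p
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

candidates = filter_probs([2.0, 1.0, 0.0], top_k=2)
```

A token would then be drawn from the returned distribution, for example with random.choices.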

perplexity: Perplexity (the exponential of the mean negative log-likelihood) over inline text or a file.
smedjan perplexity --checkpoint m.bin --tokenizer tokenizer.bin --file heldout.txt
--text: Inline text to score.
--file: Score a file instead.
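
The definition above translates directly into a few lines of code. This sketch takes per-token log-probabilities as input and assumes natural logarithms; how smedjan obtains those values from a checkpoint is outside its scope.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the scored tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token.
ppl = perplexity([math.log(0.25)] * 4)
```

A perplexity of 4 here reads as the model being as uncertain as a uniform choice among four tokens.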

eval: Evaluate against built-in benchmarks.
smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin
--longctx: Synthetic long-context (NIAH/RULER) suite.
--longctx-lengths: Default 256,512,1024.
--longctx-depths: Default 0.0,0.5,1.0.

info: Show a checkpoint's architecture and parameter count.
smedjan info --checkpoint m.bin
--checkpoint: Checkpoint to inspect (required).

sizes: Print preset model sizes and parameter counts.
smedjan sizes --vocab-size 8192
--vocab-size: Vocab to compute counts for; default 8192.

bench: Benchmark inference and training throughput.
smedjan bench --size small --batch-size 4 --seq-len 128 --iters 20
--size: tiny · small · medium · large.
--iters / --warmup: Timed + warmup iterations.
--simdgroup-matmul: Measure the hardware MMA fast path.
// 04

Export & deploy

ship it

quantize: Quantize a checkpoint (Q8 ≈ 4×, Q4 ≈ 8× smaller).
smedjan quantize --checkpoint final.bin --output model.qbin --bits 4
--bits: 4 or 8 (default 4).
--output: Default model.qbin.
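
The size ratios quoted above follow from simple arithmetic if the unquantized checkpoint stores 32-bit floats, which is an assumption here. The sketch ignores per-block scales and file headers, which is why real files come out only approximately 4× or 8× smaller.

```python
def compression_ratio(bits, baseline_bits=32):
    """Ideal size reduction versus a baseline_bits-per-weight checkpoint."""
    return baseline_bits / bits

def quantized_size_bytes(num_params, bits):
    """Payload size in bytes, rounded up, excluding scales and metadata."""
    return (num_params * bits + 7) // 8

ratio_q8 = compression_ratio(8)
ratio_q4 = compression_ratio(4)
# Hypothetical 100M-parameter model, for illustration only.
size_q4 = quantized_size_bytes(100_000_000, 4)
```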

export-gguf: Export to GGUF for llama.cpp.
smedjan export-gguf --checkpoint final.bin --output model.gguf --quant q8_0
--quant: f32 or q8_0 (default f32).
--output: Default model.gguf.

export-safetensors: Export to safetensors (HuggingFace).
smedjan export-safetensors --checkpoint final.bin --output model.safetensors
--output: Default model.safetensors.