// reference
CLI reference
Every subcommand and its main flags, taken straight from the binary. For the exhaustive, always-current list run `smedjan <command> --help`.
// 01
Data
build the corpus
tokenizer
Train a BPE tokenizer from a text corpus.
smedjan tokenizer --input corpus.txt --vocab-size 8192 --output tokenizer.bin
--input | Text corpus (required). |
--vocab-size | Default 32000. |
--output | Default tokenizer.bin. |
import-bpe
Import a GPT-2 / HuggingFace merges.txt as a byte-level BPE tokenizer.
smedjan import-bpe --merges merges.txt --output tokenizer.bin
--merges | Path to merges.txt (required). |
--output | Tokenizer output (required). |
prepare
Tokenize raw text into the binary training format.
smedjan prepare --input corpus.txt --tokenizer tokenizer.bin --output dataset.bin
--input | Raw text (required). |
--tokenizer | Tokenizer (required). |
--output | Default dataset.bin. |
process
Clean a text file through the data pipeline, with provenance.
smedjan process --input raw.txt --tokenizer tokenizer.bin --output clean.bin
--separator | Document boundary; default "\n\n". |
--provenance-log | Provenance log file. |
--source-name / --source-url / --license | Provenance fields. |
dedup
Deduplicate (MinHash) and quality-filter documents.
smedjan dedup --input docs.txt --output filtered.txt
--threshold | Similarity 0–1; default 0.8. |
--min-quality | Quality 0–1; default 0.3. |
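To make the `--threshold` value concrete, here is a minimal sketch of MinHash near-duplicate detection. The word-level 3-shingles, the 64 SHA-256-seeded hash functions, and every function name are assumptions for illustration; this page does not describe how `smedjan dedup` shingles or hashes documents.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles in text (assumed shingling scheme)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    if not items:
        return []
    signature = []
    for seed in range(num_hashes):
        prefix = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha256(prefix + s.encode("utf-8")).digest()[:8], "big")
            for s in items
        ))
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    if not sig_a or not sig_b:
        return 0.0
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    """True when the estimated similarity reaches the threshold."""
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    return estimate_similarity(sig_a, sig_b) >= threshold
```

Under these assumptions, raising the threshold toward 1 keeps more near-identical documents, and lowering it removes looser matches.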
mix
Mix tokenized shards into one dataset with weights.
smedjan mix --shards a.bin:0.7,b.bin:0.3 --output train.bin
--shards | path:weight,path:weight,… (required). |
--output | Output dataset (required). |
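The `path:weight,path:weight` syntax can be illustrated with a small parser. This is a sketch only: `parse_shards` and `normalize` are hypothetical helpers, and whether smedjan normalizes the weights or requires them to sum to 1 is not stated on this page.

```python
def parse_shards(spec):
    """Parse 'a.bin:0.7,b.bin:0.3' into [(path, weight), ...]."""
    shards = []
    for entry in spec.split(","):
        # rpartition splits on the last colon, so paths may contain colons.
        path, sep, weight = entry.strip().rpartition(":")
        if not sep or not path:
            raise ValueError(f"expected path:weight, got {entry!r}")
        value = float(weight)
        if value <= 0:
            raise ValueError(f"weight must be positive, got {value}")
        shards.append((path, value))
    return shards

def normalize(shards):
    """Rescale weights to sum to 1 (an assumption, not documented behavior)."""
    total = sum(weight for _, weight in shards)
    return [(path, weight / total) for path, weight in shards]
```

With normalized weights, each value reads as the fraction of the mixed dataset drawn from that shard.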
hash
Print the SHA-256 of a file (use it to pin your inputs).
smedjan hash --file train.bin
--file | File to hash (required). |
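To cross-check a pinned input outside smedjan, a standard SHA-256 digest can be computed with Python's `hashlib`. The helper name `sha256_file` is hypothetical, and the output format of `smedjan hash` is not shown on this page, so the two outputs may need to be compared by eye.

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording this digest next to a training run makes it possible to confirm later that `train.bin` has not changed.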
// 02
Train & adapt
fit and refine
train
Pre-train a model from scratch, optionally distilling from a teacher.
smedjan train --dataset train.bin --tokenizer tokenizer.bin --size medium --steps 50000
--size | tiny · small · medium · large · max · custom (default small). |
--batch-size / --seq-len / --steps / --lr / --warmup | Core schedule. |
--grad-accum / --gradient-checkpointing / --fused-ce | Memory. |
--optimizer | adamw · muon · hybrid · sophia · adamw-8bit. |
--resume / --val-dataset | Resume + validation. |
--teacher-checkpoint | KL+CE distillation. |
(many more) | See `smedjan train --help`. |
sft
Supervised fine-tuning on instruction/response JSONL.
smedjan sft --checkpoint final.bin --tokenizer tokenizer.bin --data instr.jsonl
--data | JSONL {"prompt","response"} (required). |
--lr | Default 2e-5. |
--steps | Default 1000. |
--batch-size | Default 8. |
--output-dir | Default sft_checkpoints. |
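The `--data` file holds one JSON object per line with `prompt` and `response` keys. A minimal sketch for writing and checking such a file follows; the helper names are hypothetical, and how smedjan treats extra keys or blank lines is an assumption not covered here.

```python
import json

def write_sft_jsonl(path, pairs):
    """Write (prompt, response) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, response in pairs:
            record = {"prompt": prompt, "response": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def validate_sft_jsonl(path):
    """Return the record count; raise ValueError on a malformed record."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines (assumed to be harmless)
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            for key in ("prompt", "response"):
                if not isinstance(record.get(key), str):
                    raise ValueError(f"line {lineno}: missing string field {key!r}")
            count += 1
    return count
```

Validating the file before a long run catches schema mistakes early.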
sft-prepare
Convert paired text (e.g. NL2Bash) into SFT JSONL.
smedjan sft-prepare --input pairs.tsv --output instr.jsonl
--input | Tab-separated or alternating-line pairs. |
--output | JSONL output. |
dpo-prepare
Convert preference-pair JSONL to the binary DPO format.
smedjan dpo-prepare --input prefs.jsonl --output prefs.bin --tokenizer tokenizer.bin
--input | JSONL {"prompt","chosen","rejected"}. |
--output | Binary dataset. |
dpo
Direct Preference Optimization alignment.
smedjan dpo --checkpoint sft.bin --ref-checkpoint sft.bin --tokenizer tokenizer.bin --dataset prefs.bin
--checkpoint | Policy (updated). |
--ref-checkpoint | Frozen reference. |
--beta | Default 0.1. |
--lr | Default 1e-6. |
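To show what `--beta` and the frozen reference do, here is a sketch of the standard per-example DPO loss as it is usually written in the literature. It assumes summed sequence log-probabilities as inputs; whether smedjan computes exactly this form is an assumption, and `dpo_loss` is a hypothetical name.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)), where margin compares policy and reference.

    Each argument is the log-probability of a full response under one model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference, as at the start of the example command above, the margin is zero and the loss is log 2. A larger beta penalizes drift from the reference more sharply.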
distill
Generate instruction data from an external model (Ollama/Claude/OpenAI).
smedjan distill --model qwen2.5:7b --n-samples 1000 --output synth.jsonl
--api-url | Default Ollama localhost. |
--api-key | For hosted APIs. |
--model / --n-samples / --max-tokens | Generation controls. |
grow
Grow a small trained model into a larger architecture.
smedjan grow --checkpoint small.bin --output big.bin --dim 1024 --layers 24 --heads 16
--dim / --layers / --heads | Target architecture (required). |
merge
Average several checkpoints (weight-space merging).
smedjan merge --checkpoints a.bin b.bin --output merged.bin
--checkpoints | Two or more checkpoints. |
--output | Default merged.bin. |
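Weight-space merging can be sketched as an element-wise mean over matching tensors. The example below represents each checkpoint as a dict of flat float lists, which is a stand-in for illustration; smedjan's checkpoint format is not described here, and a uniform (unweighted) average is an assumption.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of checkpoints given as {tensor_name: [floats]}."""
    if len(checkpoints) < 2:
        raise ValueError("need two or more checkpoints")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same tensor names")
    merged = {}
    for name in names:
        tensors = [ckpt[name] for ckpt in checkpoints]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for tensor {name!r}")
        # zip(*tensors) yields the i-th value from every checkpoint.
        merged[name] = [sum(values) / len(values) for values in zip(*tensors)]
    return merged
```

Averaging only makes sense for checkpoints with identical architectures, which is why the sketch rejects mismatched names and shapes.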
// 03
Inference & evaluation
run and measure
generate
Generate text from a checkpoint.
smedjan generate --checkpoint m.bin --tokenizer tokenizer.bin --prompt "Hi" --stream
--temperature / --top-p / --top-k / --min-p / --typical-p | Sampling. |
--repetition-penalty / --no-repeat-ngram-size | Anti-repetition. |
--speculative / --draft-checkpoint / --draft-tokens | Speculative decoding. |
--batch-file | Batched prompts. |
--stream | Stream tokens. |
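As an illustration of three of the sampling flags, the sketch below applies temperature scaling, then top-k, then top-p (nucleus) filtering to a list of logits. The order of operations and the function name `filter_probs` are assumptions; smedjan may combine these filters differently.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    # Candidate token indices, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break  # smallest prefix whose mass reaches top_p
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

A sampler would then draw the next token from the returned distribution, for example with `random.choices`.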
perplexity
Compute perplexity (the exponential of the mean negative log-likelihood per token) over inline text or a file.
smedjan perplexity --checkpoint m.bin --tokenizer tokenizer.bin --file heldout.txt
--text | Inline text to score. |
--file | Score a file instead. |
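The definition "exp of the mean NLL" can be written out directly. This sketch assumes natural-log token probabilities are already available from a model; the function name is hypothetical, and it does not reproduce how smedjan batches or windows the input.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 1/4 to every token has perplexity 4, which is why perplexity is often read as an effective branching factor.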
eval
Evaluate against built-in benchmarks.
smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin
--longctx | Synthetic long-context (NIAH/RULER) suite. |
--longctx-lengths | Default 256,512,1024. |
--longctx-depths | Default 0.0,0.5,1.0. |
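The default lengths and depths form a 3 × 3 grid of test cases. The sketch below enumerates that grid and shows one plausible way to turn a depth fraction into a needle position; the placement rule and both function names are assumptions, not smedjan's documented behavior.

```python
from itertools import product

def needle_position(length, depth, needle_len=1):
    """Start index for a needle so that it fits inside a context of `length` tokens."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    return int(depth * (length - needle_len))

def build_cases(lengths, depths):
    """One (length, depth, needle_start) tuple per grid cell."""
    return [(n, d, needle_position(n, d)) for n, d in product(lengths, depths)]
```

Under this rule, depth 0.0 places the needle at the start of the context and depth 1.0 at the very end.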
info
Show a checkpoint's architecture and parameter count.
smedjan info --checkpoint m.bin
--checkpoint | Checkpoint to inspect (required). |
sizes
Print preset model sizes and parameter counts.
smedjan sizes --vocab-size 8192
--vocab-size | Vocab to compute counts for; default 8192. |
bench
Benchmark inference and training throughput.
smedjan bench --size small --batch-size 4 --seq-len 128 --iters 20
--size | tiny · small · medium · large. |
--iters / --warmup | Timed + warmup iterations. |
--simdgroup-matmul | Measure the hardware MMA fast path. |
// 04
Export & deploy
ship it
quantize
Quantize a checkpoint (Q8 ≈ 4×, Q4 ≈ 8× smaller).
smedjan quantize --checkpoint final.bin --output model.qbin --bits 4
--bits | 4 or 8 (default 4). |
--output | Default model.qbin. |
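The quoted ratios follow from simple bit arithmetic if the source checkpoint stores 32-bit floats, which is an assumption consistent with the 4× and 8× figures. The sketch ignores per-block scales and file metadata, so real files will be somewhat larger; both function names are hypothetical.

```python
def estimated_size_bytes(num_params, bits):
    """Approximate weight storage: one value of `bits` bits per parameter."""
    return num_params * bits / 8

def compression_ratio(bits, source_bits=32):
    """How many times smaller the quantized weights are than the source."""
    return source_bits / bits
```

For example, under these assumptions one million parameters take about 4 MB at 32 bits, 1 MB at Q8, and 0.5 MB at Q4.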
export-gguf
Export to GGUF for llama.cpp.
smedjan export-gguf --checkpoint final.bin --output model.gguf --quant q8_0
--quant | f32 or q8_0 (default f32). |
--output | Default model.gguf. |
export-safetensors
Export to safetensors (HuggingFace).
smedjan export-safetensors --checkpoint final.bin --output model.safetensors
--output | Default model.safetensors. |