// guide
Inference & evaluation
Sample text from a trained checkpoint, control the decoding, speed it up with speculative decoding, and measure quality with perplexity and the built-in evals.
// 01
Generate
    smedjan generate \
      --checkpoint checkpoints/final.bin --tokenizer tokenizer.bin \
      --prompt "The smithy" --max-tokens 256 \
      --temperature 0.8 --top-p 0.95 --top-k 50 \
      --repetition-penalty 1.2 --no-repeat-ngram-size 3 \
      --stream
The sampling parameters stack, with temperature applied first and the truncation filters after it:
| Flag | Default | What it does |
|---|---|---|
| --temperature | 0.8 | Higher = more random, lower = more deterministic. |
| --top-p | 0.95 | Nucleus: keep the smallest set of most-likely tokens whose probabilities sum to at least p. |
| --top-k | 50 | Keep only the k most likely tokens. |
| --min-p | 0.0 | Keep tokens with probability ≥ min-p × max probability (0.0 = off). |
| --typical-p | 1.0 | Locally-typical sampling mass (1.0 = off). |
| --repetition-penalty | 1.2 | Penalize already-seen tokens (1.0 = off). |
| --no-repeat-ngram-size | 0 | Hard-ban repeating any n-gram of this size (3 is a good default for assistants). |
| --max-tokens | 256 | Maximum number of tokens to generate. |
| --stream | off | Print tokens as they are produced. |
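To make the stacking concrete, here is a minimal Python sketch of how temperature and the truncation filters could compose. It is not smedjan's implementation. The source only says that temperature comes first, so the order among top-k, min-p, and top-p is an assumption, as are the hypothetical names `filter_probs` and `sample`, the convention that `top_k=0` disables top-k, and the choice to measure nucleus mass on the distribution before renormalizing. Typical-p and the repetition controls are left out.

```python
import math
import random


def filter_probs(logits, temperature=0.8, top_k=50, top_p=0.95, min_p=0.0):
    """Return {token_id: probability} after temperature and truncation filters."""
    # Temperature: scale logits before the softmax (assumes temperature > 0).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 = off in this sketch).
    if top_k > 0:
        order = order[:top_k]

    # Min-p: drop tokens below min_p times the highest probability.
    floor = min_p * probs[order[0]]
    order = [i for i in order if probs[i] >= floor]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize the survivors so they sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

Calling `sample(logits, rng=random.Random(0))` gives a reproducible draw. Because each filter only removes candidates, tightening any one of them can shrink the surviving set but never grow it.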
// 02
Go faster
Speculative decoding pairs a small, fast draft model with your full model: the draft proposes several tokens and the full model verifies them in a single pass. Batch mode decodes many equal-length prompts together through a single KV cache.
    # speculative decoding: a small draft model proposes, the big one verifies
    smedjan generate --checkpoint big.bin --tokenizer tokenizer.bin \
      --speculative --draft-checkpoint small.bin --draft-tokens 8 \
      --prompt "Once" --stream

    # batch mode: decode many equal-length prompts through one KV cache
    smedjan generate --checkpoint m.bin --tokenizer tokenizer.bin \
      --batch-file prompts.txt
| Flag | Default | What it does |
|---|---|---|
| --speculative | off | Enable draft-and-verify decoding. |
| --draft-checkpoint | none | The small draft model (required with --speculative). |
| --draft-tokens | 8 | Tokens the draft proposes per verification step. |
| --batch-file | none | File of prompts (one per line, equal token length). |
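The draft-and-verify loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not smedjan's code. It uses the greedy variant, where a drafted token is accepted only if it equals the full model's own greedy choice, whereas sampling-based implementations typically use a probabilistic acceptance test. The two "models" are hypothetical stand-in functions, and verification is done position by position here where a real engine would score all drafted positions in one batched forward pass.

```python
def greedy_decode(next_token, prompt, max_tokens):
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_tokens):
        out.append(next_token(out))
    return out


def speculative_decode(target_next, draft_next, prompt, max_tokens, draft_tokens=8):
    """Greedy draft-and-verify; yields the same tokens as greedy_decode(target_next, ...)."""
    out = list(prompt)
    limit = len(prompt) + max_tokens
    while len(out) < limit:
        # 1. The draft model proposes up to draft_tokens tokens autoregressively.
        k = min(draft_tokens, limit - len(out))
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))

        # 2. The target model checks each proposed position in order.
        for i in range(k):
            expected = target_next(out + proposal[:i])
            if expected != proposal[i]:
                # First mismatch: keep the agreed prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposal matched; the target adds one bonus token if room remains.
            out.extend(proposal)
            if len(out) < limit:
                out.append(target_next(out))
    return out


# Hypothetical stand-in models: deterministic functions of the context.
def target_next(ctx):
    return (31 * sum(ctx) + len(ctx)) % 50


def draft_next(ctx):
    # Agrees with the target except when the context length is a multiple of 4.
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess
```

In this greedy form the output is identical to decoding with the full model alone, so the draft model affects speed only. The gain depends on how often the draft agrees with the full model: each verification step emits at least one token and at most `draft_tokens + 1`.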
// 03
Measure it
Perplexity is the quick scalar: the exponential of the mean per-token negative log-likelihood over a text; lower means the model predicts that text better. The eval suite runs built-in benchmarks, with an optional synthetic long-context retrieval mode.
    # score how surprised the model is by held-out text (lower = better)
    smedjan perplexity --checkpoint m.bin --tokenizer tokenizer.bin --file heldout.txt

    # built-in benchmark suite
    smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin

    # synthetic long-context retrieval (needle-in-a-haystack / RULER-style)
    smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin \
      --longctx --longctx-lengths 256,512,1024 --longctx-depths 0.0,0.5,1.0

    # inspect a checkpoint's architecture and parameter count
    smedjan info --checkpoint m.bin
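The perplexity definition above reduces to a short calculation. The Python sketch below assumes you already have the natural-log probability the model assigned to each actual next token. The function name `perplexity` and its input format are illustrative and say nothing about how `smedjan perplexity` is implemented or how it windows a long file.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log probabilities as input."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A useful sanity check: a model that assigns probability 0.25 to every token scores a perplexity of about 4, the same as guessing uniformly among four options, and a model that is always certain and correct scores exactly 1.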