// guide

Inference & evaluation

Sample text from a trained checkpoint, control the decoding, speed it up with speculative decoding, and measure quality with perplexity and the built-in evals.

// 01

Generate

sampling you control
smedjan generate \
  --checkpoint checkpoints/final.bin --tokenizer tokenizer.bin \
  --prompt "The smithy" --max-tokens 256 \
  --temperature 0.8 --top-p 0.95 --top-k 50 \
  --repetition-penalty 1.2 --no-repeat-ngram-size 3 \
  --stream

The sampling parameters stack. Temperature is applied first, then the truncation filters:

Flag                     Default  What it does
--temperature            0.8      Higher = more random, lower = more deterministic.
--top-p                  0.95     Nucleus: keep the smallest set of tokens summing to p.
--top-k                  50       Keep only the k most likely tokens.
--min-p                  0.0      Keep tokens with probability ≥ min-p × max probability.
--typical-p              1.0      Locally-typical sampling mass (1.0 = off).
--repetition-penalty     1.2      Penalize already-seen tokens (1.0 = off).
--no-repeat-ngram-size   0        Hard-ban repeating any n-gram (3 is a good default for assistants).
--max-tokens             256      How many tokens to generate.
--stream                 off      Print tokens as they are produced.
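
To make the stacking concrete, here is a minimal Python sketch of temperature followed by the top-k, top-p and min-p filters. It is an illustration, not smedjan's actual code. The function names, the order of the three filters relative to each other, and measuring top-p mass on the pre-truncation probabilities are all assumptions.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_probs(probs, top_k=0, top_p=1.0, min_p=0.0):
    """Apply top-k, top-p and min-p, then renormalize. Returns {token_id: prob}."""
    # Rank token ids from most to least likely.
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p < 1.0:
        kept, mass = [], 0.0
        for tok, p in ranked:  # smallest prefix whose mass reaches top_p
            kept.append((tok, p))
            mass += p
            if mass >= top_p:
                break
        ranked = kept
    if min_p > 0.0:
        floor = min_p * ranked[0][1]  # threshold relative to the top token
        ranked = [(tok, p) for tok, p in ranked if p >= floor]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

def sample(logits, temperature=0.8, top_k=50, top_p=0.95, min_p=0.0, rng=random):
    """Temperature first, then the truncation filters, then one random draw."""
    kept = filter_probs(softmax(logits, temperature), top_k, top_p, min_p)
    tokens = list(kept)
    return rng.choices(tokens, weights=[kept[t] for t in tokens], k=1)[0]
```

With `top_k=1` the sketch reduces to greedy decoding, which is a quick way to sanity-check any sampler built this way.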
// 02

Go faster

speculative & batched decoding

Speculative decoding pairs a small fast draft model with your full model: the draft proposes several tokens, the full model verifies them in one pass. Batch mode decodes many equal-length prompts together through a single KV cache.

# speculative decoding: a small draft model proposes, the big one verifies
smedjan generate --checkpoint big.bin --tokenizer tokenizer.bin \
  --speculative --draft-checkpoint small.bin --draft-tokens 8 \
  --prompt "Once" --stream

# batch mode: decode many equal-length prompts through one KV cache
smedjan generate --checkpoint m.bin --tokenizer tokenizer.bin \
  --batch-file prompts.txt

Flag                 Default  What it does
--speculative        off      Enable draft-and-verify decoding.
--draft-checkpoint   (none)   The small draft model (required with --speculative).
--draft-tokens       8        Tokens the draft proposes per verification step.
--batch-file         (none)   File of prompts (one per line, equal token length).
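
The draft-and-verify loop described above can be sketched in Python for the greedy case. Everything here is hypothetical: the toy `target_next` and `draft_next` functions stand in for real models, and the sketch does not show how smedjan implements it. A sampling-based variant would need a probabilistic accept/reject step instead of the exact-match check used below.

```python
def greedy_decode(target_next, prompt, max_new):
    """Baseline: ask the full model for one token at a time."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(target_next, draft_next, prompt, max_new, draft_tokens=8):
    """Greedy draft-and-verify decoding. Output matches greedy_decode."""
    seq = list(prompt)
    goal = len(seq) + max_new
    while len(seq) < goal:
        # 1. The draft model proposes a short run of tokens.
        proposal = []
        for _ in range(min(draft_tokens, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2. The full model checks each position. A real implementation scores
        #    all positions in one forward pass; this toy calls it per position.
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            accepted.append(expected)
            if expected != tok:
                break  # first mismatch: keep the full model's token, drop the rest
        seq.extend(accepted)
    return seq

# Toy "models" over integer tokens (assumptions for illustration only).
def target_next(seq):
    return (seq[-1] * 3 + 1) % 7

def draft_next(seq):
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if seq[-1] == 4 else target_next(seq)

if __name__ == "__main__":
    fast = speculative_decode(target_next, draft_next, [1], max_new=12, draft_tokens=4)
    slow = greedy_decode(target_next, [1], max_new=12)
    print(fast == slow)
```

The point of the design is that a wrong draft only costs speed, never quality: every emitted token is one the full model would have chosen itself.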
// 03

Measure it

perplexity, eval, info

Perplexity is the quick scalar: the exponential of the mean negative log-likelihood per token over a text. Lower means the model predicts that text better. The eval suite runs built-in benchmarks, with an optional synthetic long-context retrieval mode.
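
As a worked illustration of that definition, the following Python sketch computes perplexity from a list of per-token log-probabilities. The function name and input format are assumptions, and natural logarithms are assumed throughout. It does not describe how `smedjan perplexity` is implemented.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

if __name__ == "__main__":
    # A model that gives every token probability 1/4 is "as unsure as"
    # a uniform choice among 4 options, so its perplexity is about 4.
    print(perplexity([math.log(0.25)] * 10))
```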

# score how surprised the model is by held-out text (lower = better)
smedjan perplexity --checkpoint m.bin --tokenizer tokenizer.bin --file heldout.txt

# built-in benchmark suite
smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin

# synthetic long-context retrieval (needle-in-a-haystack / RULER-style)
smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin \
  --longctx --longctx-lengths 256,512,1024 --longctx-depths 0.0,0.5,1.0

# inspect a checkpoint's architecture and parameter count
smedjan info --checkpoint m.bin
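
For intuition about what the lengths and depths in a needle-in-a-haystack test mean, here is a small Python sketch that builds one test context. It is a guess at the general technique, not smedjan's generator: `build_haystack` is a hypothetical helper, and it counts whitespace-separated words where a real harness would count tokens.

```python
def build_haystack(needle, filler, length, depth):
    """Return `length` words of repeated filler with `needle` placed at `depth`.

    depth is a fraction: 0.0 puts the needle at the start, 1.0 at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    needle_words = needle.split()
    filler_words = filler.split()
    n_fill = length - len(needle_words)
    if n_fill < 0 or not filler_words:
        raise ValueError("length too short for the needle, or filler is empty")
    # Cycle through the filler text until the context is long enough.
    body = [filler_words[i % len(filler_words)] for i in range(n_fill)]
    cut = round(depth * n_fill)  # insertion point as a fraction of the filler
    return " ".join(body[:cut] + needle_words + body[cut:])

if __name__ == "__main__":
    needle = "The secret code is 7421."
    filler = "The forge was warm and the hammers rang all day."
    for depth in (0.0, 0.5, 1.0):
        context = build_haystack(needle, filler, length=256, depth=depth)
        print(depth, context.split().index("secret"))
```

A retrieval eval then appends a question about the needle to each context and checks whether the model's answer recovers it, one score per (length, depth) pair.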