// guide

Inference & evaluation

Sample text from a trained checkpoint, control the decoding, speed it up with speculative decoding, and measure quality with perplexity and the built-in evals.

// 01

Generate

sampling you control

smedjan generate \
  --checkpoint checkpoints/final.bin --tokenizer tokenizer.bin \
  --prompt "The smithy" --max-tokens 256 \
  --temperature 0.8 --top-p 0.95 --top-k 50 \
  --repetition-penalty 1.2 --no-repeat-ngram-size 3 \
  --stream

The sampling parameters stack. Temperature is applied first, then the truncation filters:

Flag	Default	What it does
`--temperature`	0.8	Higher = more random, lower = more deterministic.
`--top-p`	0.95	Nucleus sampling: keep the smallest set of most-likely tokens whose cumulative probability reaches p.
`--top-k`	50	Keep only the k most likely tokens.
`--min-p`	0.0	Keep tokens with probability ≥ min-p × the probability of the most likely token (0.0 = off).
`--typical-p`	1.0	Locally-typical sampling mass (1.0 = off).
`--repetition-penalty`	1.2	Penalize already-seen tokens (1.0 = off).
`--no-repeat-ngram-size`	0	Hard-ban repeating any n-gram of this size (0 = off; 3 is a good starting value for assistants).
`--max-tokens`	256	Maximum number of tokens to generate.
`--stream`	off	Print tokens as they are produced.
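
To make the stacking order concrete, here is a minimal Python sketch of temperature followed by the top-k, top-p and min-p filters. It is an illustration under assumptions, not smedjan's implementation: the function names `softmax`, `filter_probs` and `sample` are hypothetical, and the choice to evaluate every filter on the same temperature-scaled distribution and intersect the results is one reasonable reading of "stack".

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_probs(probs, top_k=0, top_p=1.0, min_p=0.0):
    """Zero out tokens rejected by any filter, then renormalize.

    Assumption: each filter looks at the same input distribution and the
    surviving sets are intersected. 0, 1.0 and 0.0 disable the filters here.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)
    if top_k > 0:
        keep &= set(order[:top_k])  # the k most likely tokens
    if top_p < 1.0:
        nucleus, mass = set(), 0.0
        for i in order:  # smallest prefix whose mass reaches top_p
            nucleus.add(i)
            mass += probs[i]
            if mass >= top_p:
                break
        keep &= nucleus
    if min_p > 0.0:
        floor = min_p * probs[order[0]]  # relative to the top token
        keep &= {i for i in order if probs[i] >= floor}
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)  # the top token always survives, so total > 0
    return [p / total for p in kept]

def sample(logits, temperature=0.8, top_k=50, top_p=0.95, min_p=0.0, rng=random):
    """Temperature first, then the truncation filters, then one random draw."""
    probs = filter_probs(softmax(logits, temperature), top_k, top_p, min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because the most likely token passes every filter, tightening any one of them (for example `top_k=1`) degrades gracefully towards greedy decoding rather than leaving nothing to sample from.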

// 02

Go faster

speculative & batched decoding

Speculative decoding pairs a small, fast draft model with your full model: the draft proposes several tokens, and the full model verifies them in a single forward pass. Batch mode decodes many equal-length prompts together through a single KV cache.

# speculative decoding: a small draft model proposes, the big one verifies
smedjan generate --checkpoint big.bin --tokenizer tokenizer.bin \
  --speculative --draft-checkpoint small.bin --draft-tokens 8 \
  --prompt "Once" --stream

# batch mode: decode many equal-length prompts through one KV cache
smedjan generate --checkpoint m.bin --tokenizer tokenizer.bin \
  --batch-file prompts.txt

Flag	Default	What it does
`--speculative`	off	Enable draft-and-verify decoding.
`--draft-checkpoint`	—	The small draft model (required with --speculative).
`--draft-tokens`	8	Tokens the draft proposes per verification step.
`--batch-file`	—	File of prompts (one per line, equal token length).
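
The draft-and-verify loop can be sketched in a few lines of Python. This is a simplified greedy variant under stated assumptions, not smedjan's code: `target_next` and `draft_next` are hypothetical callables that return the next token for a context, and verification is written as a loop for readability, whereas a real implementation scores all proposed positions in one batched pass and may use probabilistic acceptance instead of exact matching.

```python
def speculative_step(target_next, draft_next, tokens, draft_tokens=8):
    """Run one draft-and-verify step and return the newly accepted tokens."""
    # 1. The draft model proposes draft_tokens tokens autoregressively.
    proposal = []
    for _ in range(draft_tokens):
        proposal.append(draft_next(tokens + proposal))

    # 2. The full model checks each proposed position in order.
    accepted = []
    for tok in proposal:
        expected = target_next(tokens + accepted)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the full model adds one bonus token.
    accepted.append(target_next(tokens + accepted))
    return accepted

def generate(target_next, draft_next, prompt, max_tokens, draft_tokens=8):
    """Repeat speculative steps until max_tokens new tokens exist."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        tokens += speculative_step(target_next, draft_next, tokens, draft_tokens)
    return tokens[: len(prompt) + max_tokens]

# Toy models: the target counts upward modulo 10; the draft agrees except
# that it guesses wrongly right after a 4.
def toy_target(ctx):
    return (ctx[-1] + 1) % 10

def toy_draft(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

In this greedy form the output is identical to decoding with the full model alone. The draft only changes how many tokens each verification step yields, so a draft that agrees with the full model more often gives a larger speedup.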

// 03

Measure it

perplexity, eval, info

Perplexity is the quick scalar: the exponential of the mean per-token negative log-likelihood over a text. Lower means the model predicts the text better. The eval suite runs built-in benchmarks, with an optional synthetic long-context retrieval mode.
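
As a worked illustration of that definition, the sketch below computes perplexity from a list of per-token natural-log probabilities. The helper name `perplexity` and its input format are assumptions made for this example and say nothing about how smedjan computes or reports the number.

```python
import math

def perplexity(token_logprobs):
    """Return exp of the mean negative log-likelihood per token.

    token_logprobs holds the natural-log probability the model assigned to
    each actual next token in the evaluated text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that spreads probability evenly over 4 choices at every step is
# exactly as uncertain as a fair 4-way guess.
uniform_four = perplexity([math.log(0.25)] * 10)
```

One way to read the number: a perplexity of N means the model is, on average, as uncertain as if it were choosing uniformly among N tokens at each position.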

# score how surprised the model is by held-out text (lower = better)
smedjan perplexity --checkpoint m.bin --tokenizer tokenizer.bin --file heldout.txt

# built-in benchmark suite
smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin

# synthetic long-context retrieval (needle-in-a-haystack / RULER-style)
smedjan eval --checkpoint m.bin --tokenizer tokenizer.bin \
  --longctx --longctx-lengths 256,512,1024 --longctx-depths 0.0,0.5,1.0

# inspect a checkpoint's architecture and parameter count
smedjan info --checkpoint m.bin
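
For intuition about what the `--longctx-lengths` and `--longctx-depths` grid sweeps over, here is a hypothetical Python sketch of a needle-in-a-haystack case builder. Everything in it is an assumption made for illustration: the helper names, the filler text, the needle sentence, and measuring length in words rather than tokens. The actual prompts smedjan generates are not described in this guide.

```python
def build_case(length, depth, needle="The secret number is 7421.", filler="lorem"):
    """Build a haystack of `length` filler words with the needle inserted.

    depth is a fraction: 0.0 puts the needle at the start, 1.0 at the end.
    """
    words = [filler] * length
    position = round(depth * length)
    words.insert(position, needle)
    return " ".join(words)

def build_grid(lengths, depths):
    """One test case per (length, depth) pair, as a list of tuples."""
    return [(n, d, build_case(n, d)) for n in lengths for d in depths]

# Mirrors the example command: 3 lengths x 3 depths = 9 cases.
cases = build_grid([256, 512, 1024], [0.0, 0.5, 1.0])
```

A retrieval eval of this style then asks the model to recall the needle from each case and reports accuracy per (length, depth) cell, which shows whether recall degrades with context size or with position in the context.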
