// guide

Data & tokenizer

Everything upstream of training: learn a vocabulary, turn raw text into the binary token format, and clean, deduplicate, and mix your corpus. Every step is a subcommand of the same binary.

// 01

Train or import a tokenizer

byte-pair encoding

Train a BPE tokenizer on your corpus, or import an existing GPT-2 / HuggingFace merges file. Pick the vocabulary size up front: a larger vocabulary means shorter token sequences but a bigger embedding table.

# train a byte-pair tokenizer with a 16000-token vocabulary (the default is 32000)
smedjan tokenizer --input corpus.txt --vocab-size 16000 --output tokenizer.bin

# or import a GPT-2 / HuggingFace merges.txt as a byte-level BPE
smedjan import-bpe --merges merges.txt --output tokenizer.bin

Flag	Default	What it does
`--input`	—	Text corpus to learn the vocabulary from (tokenizer).
`--vocab-size`	32000	Target vocabulary size.
`--merges`	—	GPT-2/HF merges.txt to import (import-bpe).
`--output`	tokenizer.bin	Where to write the tokenizer.
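
To make the idea of learning merges concrete, here is a minimal Python sketch of byte-level BPE training. It is a toy illustration of the general technique, not smedjan's implementation: the function name `train_bpe`, the whitespace pre-splitting, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to `num_merges` byte-pair merges from `text` (toy version)."""
    # Start from raw bytes: each whitespace-separated word is a tuple of 1-byte tokens.
    words = Counter(
        tuple(bytes([b]) for b in w.encode("utf-8")) for w in text.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by how often each word occurs.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the winning pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = train_bpe("low low low lower lowest", num_merges=3)
# Byte-level vocabulary: 256 single bytes plus one new token per merge.
vocab_size = 256 + len(merges)
```

Each merge adds one token to the vocabulary and one row to the embedding table. At a hypothetical model width of 512, going from a 16000 to a 32000 vocabulary adds 16000 × 512 ≈ 8.2M embedding parameters.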

// 02

Prepare training data

text → binary

Tokenize raw text into the binary stream the trainer memory-maps. Run it once per corpus; the output is what you pass to train --dataset.

# tokenize raw text → memory-mappable binary token stream
smedjan prepare --input corpus.txt --tokenizer tokenizer.bin --output train.bin
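
The sketch below shows why a flat binary token stream is convenient to memory-map: any window of tokens can be read by byte offset without loading the whole file. The on-disk layout used here (one little-endian unsigned 16-bit integer per token, no header) is an assumption for illustration; the actual format written by `prepare` is not specified in this guide.

```python
import mmap
import os
import struct
import tempfile

TOKEN_FMT = "<H"  # assumption: one little-endian uint16 per token, no header
TOKEN_SIZE = struct.calcsize(TOKEN_FMT)

def write_tokens(path, token_ids):
    """Write token ids back to back as fixed-width integers."""
    with open(path, "wb") as f:
        for t in token_ids:
            f.write(struct.pack(TOKEN_FMT, t))

def read_window(path, start, length):
    """Return `length` token ids beginning at token index `start`."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo = start * TOKEN_SIZE
        hi = lo + length * TOKEN_SIZE
        # Only the bytes in [lo, hi) are copied out of the mapping.
        return [t for (t,) in struct.iter_unpack(TOKEN_FMT, mm[lo:hi])]

path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_tokens(path, [5, 17, 300, 42, 15999])
window = read_window(path, start=1, length=3)
```

Fixed-width tokens make the offset arithmetic trivial, and 16 bits are enough for any vocabulary up to 65536 entries.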

// 03

Clean with provenance

the process pipeline

For real corpora, run text through the cleaning pipeline first. It splits documents on a separator and can record provenance — source name, URL, and license — to a log, so you keep an audit trail of what went into the model.

# clean text and record where it came from, splitting documents on blank lines
smedjan process \
  --input raw.txt --tokenizer tokenizer.bin --output clean.bin \
  --separator "\n\n" \
  --provenance-log prov.log --source-name wikipedia --source-url https://… \
  --license CC-BY-SA

Flag	Default	What it does
`--separator`	"\n\n"	Document boundary. Empty = treat the file as one document.
`--provenance-log`	—	Append a provenance record to this file.
`--source-name`	unknown	Source label for provenance.
`--source-url`	""	Source URL for provenance.
`--license`	unknown	License string for provenance.
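
As a rough picture of what document splitting and provenance logging involve, the Python sketch below splits text on a separator and appends one record per run to a log. The JSON-lines layout, the field names, and the choice to drop empty documents are hypothetical; the defaults simply mirror the flag table above.

```python
import json
import os
import tempfile

def split_documents(text, separator="\n\n"):
    """Split text on `separator`; an empty separator means one document."""
    parts = [text] if separator == "" else text.split(separator)
    # Assumption: whitespace-only pieces are dropped, not kept as empty documents.
    return [p.strip() for p in parts if p.strip()]

def append_provenance(log_path, n_docs, source_name="unknown",
                      source_url="", license="unknown"):
    """Append one JSON line describing where a batch of documents came from."""
    record = {
        "source_name": source_name,
        "source_url": source_url,
        "license": license,
        "documents": n_docs,  # hypothetical extra field
    }
    # Mode "a" appends, so earlier records in the log are preserved.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record

raw = "First doc.\n\nSecond doc.\n\n\n\nThird doc.\n"
docs = split_documents(raw)
log_path = os.path.join(tempfile.mkdtemp(), "prov.log")
record = append_provenance(log_path, len(docs),
                           source_name="wikipedia", license="CC-BY-SA")
```

Appending rather than overwriting means one log can accumulate a record for every source that goes into the corpus.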

// 04

Deduplicate & filter

MinHash + quality

Near-duplicate documents waste training compute and can hurt generalization. The dedup command removes them using MinHash similarity and drops documents whose quality score falls below a threshold. Input is one document per line.

# MinHash near-duplicate removal + quality filtering (one document per line)
smedjan dedup --input docs.txt --output filtered.txt \
  --threshold 0.8 --min-quality 0.3

Flag	Default	What it does
`--threshold`	0.8	MinHash similarity (0–1) above which documents are considered duplicates.
`--min-quality`	0.3	Minimum quality score (0–1) to keep a document.
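
For intuition about how MinHash flags duplicates, here is a small Python sketch of the general technique: each document is reduced to a signature of minimum hash values over its word shingles, and the fraction of matching signature slots estimates Jaccard similarity. The shingle size, the number of hash functions, and the pairwise comparison loop are assumptions for this example. Quality scoring is left out because the guide does not describe how the score is computed.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word windows (at least one, even for short text)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text, num_perm=64):
    """Signature: for each salted hash function, the minimum hash over all shingles."""
    sh = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in sh
        ))
    return signature

def similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(docs, threshold=0.8):
    """Keep a document only if no already-kept document is more similar than `threshold`."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(similarity(sig, other) <= threshold for other in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept

docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "completely different text about tokenizers here",
]
kept = dedup(docs)
```

Comparing every pair is quadratic in the number of documents. Large-scale deduplication typically buckets signatures with locality-sensitive hashing so only likely matches are compared.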

// 05

Mix datasets

weighted blends

Combine tokenized shards in fixed proportions, which is useful for balancing domains (e.g. 70% books, 30% web). Weights are relative: only their ratio to one another matters.

# blend tokenized shards with weights (path:weight, comma-separated)
smedjan mix --shards books.bin:0.7,web.bin:0.3 --output train.bin
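
The Python sketch below shows one way relative weights could be interpreted: parse the `path:weight` list, normalize the weights so they sum to 1, and split a token budget proportionally. The function names and the proportional-budget strategy are assumptions for illustration; `mix` may blend shards differently, for example by sampling.

```python
def parse_shards(spec):
    """Parse 'path:weight,path:weight' into (path, normalized_weight) pairs."""
    shards = []
    for item in spec.split(","):
        # rpartition splits on the last colon, so paths may contain colons.
        path, _, weight = item.rpartition(":")
        shards.append((path, float(weight)))
    total = sum(w for _, w in shards)
    return [(path, w / total) for path, w in shards]

def allocate(shards, total_tokens):
    """Split a token budget across shards in proportion to their weights."""
    return {path: round(w * total_tokens) for path, w in shards}

# 7:3 and 0.7:0.3 describe the same blend once normalized.
by_fraction = parse_shards("books.bin:0.7,web.bin:0.3")
by_ratio = parse_shards("books.bin:7,web.bin:3")
budget = allocate(by_fraction, total_tokens=1000)
```

Normalizing first is what makes the weights relative: scaling every weight by the same factor leaves the blend unchanged.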

// 06

Reproducibility

hash your inputs

smedjan hash --file train.bin prints a SHA-256 digest of the file, so you can pin exactly which data produced a checkpoint.
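
If you want to compute a SHA-256 digest yourself, for instance to record it alongside a checkpoint, a chunked read with Python's standard `hashlib` is enough. This is a sketch; the helper name `sha256_file` is made up for the example, and the exact output format of `smedjan hash` is not shown in this guide.

```python
import hashlib
import os
import tempfile

def sha256_file(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks so large datasets fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a tiny stand-in file; a real run would point at train.bin.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as f:
    f.write(b"abc")
digest = sha256_file(demo_path)
```

Storing the digest next to each checkpoint lets you confirm later that a dataset file is byte-for-byte the one used for training.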

Training → Feed your dataset to the trainer.
Fine-tuning & alignment → Prepare SFT/DPO data from instructions and preferences.