// guide

Data & tokenizer

Everything upstream of training: learn a vocabulary, turn raw text into the binary token format, and clean, deduplicate, and mix your corpus. Every step is a subcommand of the same smedjan binary.

// 01

Train or import a tokenizer

byte-pair encoding

Train a BPE tokenizer on your corpus, or import an existing GPT-2 / HuggingFace merges file. Pick the vocabulary size up front: a larger vocabulary means shorter token sequences but a bigger embedding table.

# train a byte-pair tokenizer (vocab defaults to 32000)
smedjan tokenizer --input corpus.txt --vocab-size 16000 --output tokenizer.bin

# or import a GPT-2 / HuggingFace merges.txt as a byte-level BPE
smedjan import-bpe --merges merges.txt --output tokenizer.bin

Flag          Default        What it does
--input       (none)         Text corpus to learn the vocabulary from (tokenizer).
--vocab-size  32000          Target vocabulary size.
--merges      (none)         GPT-2/HF merges.txt to import (import-bpe).
--output      tokenizer.bin  Where to write the tokenizer.

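
To make the vocabulary trade-off concrete, here is a minimal Python sketch of byte-level BPE training. It is a toy illustration, not smedjan's implementation: the function name `learn_bpe` is hypothetical, the corpus is treated as one sequence with no pre-tokenization, and the vocabulary is assumed to be the 256 byte values plus one entry per learned merge.

```python
from collections import Counter


def learn_bpe(text: str, num_merges: int):
    """Toy byte-level BPE: return (merges, final token sequence)."""
    # Start from single bytes so any input text is representable.
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not help
        merges.append((a, b))
        # Replace every occurrence of the chosen pair, scanning left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq


merges, seq = learn_bpe("abababab", 2)
print(merges)             # [(b'a', b'b'), (b'ab', b'ab')]
print(seq)                # [b'abab', b'abab']
print(256 + len(merges))  # 258 vocabulary entries under the assumption above
```

Each merge adds one vocabulary entry (one more row in the embedding table) and shortens the sequence wherever that pair occurs, which is the trade-off behind `--vocab-size`.
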
// 02

Prepare training data

text → binary

Tokenize raw text into the binary stream the trainer memory-maps. Run it once per corpus; the output is what you pass to train --dataset.

# tokenize raw text → memory-mappable binary token stream
smedjan prepare --input corpus.txt --tokenizer tokenizer.bin --output train.bin
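
The sketch below shows why a flat binary token stream suits memory-mapping: a reader can pull an arbitrary window of tokens without loading the whole file. The on-disk layout here (a headerless array of native-endian unsigned 16-bit IDs) and the helper names are assumptions for illustration only; this guide does not specify smedjan's actual format.

```python
import mmap
import os
import tempfile
from array import array

ITEM = array("H").itemsize  # bytes per token ID in the assumed layout


def write_tokens(path, token_ids):
    """Write token IDs as a flat, headerless array (assumed layout)."""
    with open(path, "wb") as f:
        array("H", token_ids).tofile(f)


def read_window(path, start, length):
    """Memory-map the file and return `length` token IDs from index `start`."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            window = array("H")
            # Slicing the map copies only the requested bytes.
            window.frombytes(mm[start * ITEM:(start + length) * ITEM])
            return window.tolist()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.bin")
    write_tokens(path, [5, 17, 300, 42, 9])
    window = read_window(path, start=1, length=3)

print(window)  # [17, 300, 42]
```
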
// 03

Clean with provenance

the process pipeline

For real corpora, run text through the cleaning pipeline first. It splits documents on a separator and can record provenance — source name, URL, and license — to a log, so you keep an audit trail of what went into the model.

# clean text and record where it came from, splitting documents on blank lines
smedjan process \
  --input raw.txt --tokenizer tokenizer.bin --output clean.bin \
  --separator "\n\n" \
  --provenance-log prov.log --source-name wikipedia --source-url https://… \
  --license CC-BY-SA

Flag              Default   What it does
--separator       "\n\n"    Document boundary. Empty = treat the file as one document.
--provenance-log  (none)    Append a provenance record to this file.
--source-name     unknown   Source label for provenance.
--source-url      ""        Source URL for provenance.
--license         unknown   License string for provenance.

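
As a rough illustration of the two ideas in this step, the Python sketch below splits text on a separator and builds one provenance record. Everything here is hypothetical: the function names, the choice to drop empty documents, and the JSON record layout are assumptions, and smedjan's real log format may look quite different.

```python
import json


def split_documents(text, separator="\n\n"):
    """Split text into documents; an empty separator means one document."""
    if separator == "":
        return [text] if text else []
    # Dropping empty pieces is an assumption made for this sketch.
    return [doc.strip() for doc in text.split(separator) if doc.strip()]


def provenance_record(num_documents, source_name="unknown",
                      source_url="", license="unknown"):
    """Build one log line; this JSON layout is illustrative, not smedjan's."""
    return json.dumps({
        "source_name": source_name,
        "source_url": source_url,
        "license": license,
        "documents": num_documents,
    })


docs = split_documents("first doc\n\nsecond doc\n\n\n\nthird doc")
print(docs)  # ['first doc', 'second doc', 'third doc']
print(provenance_record(len(docs), source_name="wikipedia", license="CC-BY-SA"))
```

Appending one such line per run to a log file gives the kind of audit trail described above: which source, under which license, contributed how many documents.
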
// 04

Deduplicate & filter

MinHash + quality

Near-duplicate documents waste training compute and hurt generalization. dedup removes them with MinHash similarity and drops documents whose quality score falls below a threshold. Input is one document per line.

# MinHash near-duplicate removal + quality filtering (one document per line)
smedjan dedup --input docs.txt --output filtered.txt \
  --threshold 0.8 --min-quality 0.3

Flag           Default  What it does
--threshold    0.8      MinHash similarity (0–1) above which documents are considered duplicates.
--min-quality  0.3      Minimum quality score (0–1) to keep a document.

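
For intuition about what the similarity threshold measures, here is a small Python sketch of MinHash deduplication. It is a simplified stand-in, not smedjan's algorithm: the three-word shingles, 64 salted BLAKE2b hashes, and pairwise comparison are all assumptions, and quality scoring is left out because this guide does not describe how it is computed.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word shingles (never empty)."""
    words = text.lower().split()
    n = max(len(words) - k + 1, 1)
    return {" ".join(words[i:i + k]) for i in range(n)}


def minhash(items, num_hashes=64):
    """One minimum per salted hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in items
        ))
    return signature


def similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(docs, threshold=0.8):
    """Keep the first of each near-duplicate group (quadratic toy version)."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash(shingles(doc))
        if any(similarity(sig, other) > threshold for other in signatures):
            continue
        kept.append(doc)
        signatures.append(sig)
    return kept


docs = ["the cat sat on the mat", "The cat sat on the mat", "dogs chase cars daily"]
print(dedup(docs))  # ['the cat sat on the mat', 'dogs chase cars daily']
```

Real deduplicators usually avoid the all-pairs comparison with locality-sensitive hashing; the pairwise loop here only keeps the example short.
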
// 05

Mix datasets

weighted blends

Combine tokenized shards in fixed proportions, which is useful for balancing domains (e.g. 70% books, 30% web). Weights are relative: only their ratios matter.

# blend tokenized shards with weights (path:weight, comma-separated)
smedjan mix --shards books.bin:0.7,web.bin:0.3 --output train.bin
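
The following Python sketch shows one way to read relative weights: parse the `path:weight` list and normalize the weights so they sum to 1. The function name `parse_shards` is hypothetical, and whether smedjan normalizes in exactly this way is an assumption based on the weights being described as relative.

```python
def parse_shards(spec):
    """Parse 'path:weight,path:weight' into (path, normalized share) pairs."""
    shards = []
    for entry in spec.split(","):
        # Split on the last colon so paths containing ':' still parse.
        path, _, weight = entry.strip().rpartition(":")
        if not path:
            raise ValueError(f"expected path:weight, got {entry!r}")
        shards.append((path, float(weight)))
    total = sum(weight for _, weight in shards)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [(path, weight / total) for path, weight in shards]


# Under this reading, 7:3 describes the same blend as 0.7:0.3.
for path, share in parse_shards("books.bin:7,web.bin:3"):
    print(f"{path}: {share:.0%}")
```
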
// 06

Reproducibility

hash your inputs

smedjan hash --file train.bin prints a SHA-256 so you can pin exactly which data produced a checkpoint.
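
If you want to check a digest from a script, the Python sketch below computes SHA-256 over a file in chunks using the standard library. It assumes that smedjan hash digests the raw file bytes with nothing added, which this guide does not state explicitly; the helper name `sha256_file` is illustrative.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the resulting hex string next to a checkpoint lets you confirm later that train.bin has not changed.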