
Pre-Training & Tokenization

🏗️ Training & Alignment · 12 min · 150 BASE XP

Training an LLM From Scratch

Pre-training requires three components: a tokenizer, a dataset, and massive compute.

Tokenizer Selection

| Algorithm | Library | Used By |
|---|---|---|
| BPE (Byte Pair Encoding) | HF tokenizers, tiktoken | GPT-4, Llama 3/4, most models |
| SentencePiece | sentencepiece | Multilingual models |
| FlashTokenizer | Custom C++/GPU | Emerging high-speed option |
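
As a concrete starting point, here is a minimal sketch of training a byte-level BPE tokenizer with the Hugging Face tokenizers library. The corpus file, vocabulary size, and special token are placeholder assumptions, not values prescribed by this lesson:

```python
# Minimal byte-level BPE training sketch using the Hugging Face `tokenizers`
# library. The file path, vocab size, and special token are illustrative
# assumptions; tune them for your corpus and model.
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Byte-level BPE operates on raw bytes, so every string is representable
# and no <unk> token needs to be configured.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                 # hypothetical vocabulary size
    special_tokens=["<|endoftext|>"],  # hypothetical special token
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus
tokenizer.save("tokenizer.json")

# Quick round-trip sanity check.
ids = tokenizer.encode("Byte pair encoding merges frequent pairs.").ids
print(ids)
print(tokenizer.decode(ids))
```

Byte-level BPE sidesteps out-of-vocabulary tokens entirely, since any input decomposes into bytes; that is why most recent models in the table above use it.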

Pre-Training Datasets (2026)

| Dataset | Size | Key Feature |
|---|---|---|
| Common Corpus | ~2T tokens | Largest truly open, copyright-compliant corpus |
| RefinedWeb | ~5T tokens | Aggressive deduplication & filtering |
| The Pile | 825 GB | 22 diverse sources (books, code, papers) |
| RedPajama v2 | 30T tokens | Massive Common Crawl aggregation |
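
Corpora at this scale are generally too large to download outright, so streaming is the usual access pattern. A minimal sketch using the Hugging Face datasets library; the tiiuae/falcon-refinedweb dataset id and its content text column are assumptions based on the public RefinedWeb release, and other corpora use different ids and column names:

```python
# Stream a pre-training corpus instead of downloading it. The dataset id and
# the "content" column are assumptions based on the RefinedWeb release on the
# Hugging Face Hub; adjust both for other corpora.
from itertools import islice
from datasets import load_dataset

ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
for sample in islice(ds, 3):  # peek at the first few documents
    print(sample["content"][:120].replace("\n", " "), "...")
```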
⚠️ Reality Check: Pre-training from scratch requires thousands of GPU-hours at minimum, and frontier-scale runs cost millions of dollars in compute. For most use cases, continue pre-training from an existing base model or fine-tune it instead.
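
To put numbers on that warning, here is a back-of-envelope estimate using the widely cited ~6·N·D FLOPs approximation for dense transformer training (N = parameters, D = training tokens). The model size, token count, GPU throughput, and hourly price are illustrative assumptions, not figures from this lesson:

```python
# Back-of-envelope pre-training cost via the common ~6 * N * D FLOPs rule.
# All numbers below are illustrative assumptions.
n_params = 7e9   # 7B-parameter model
n_tokens = 2e12  # 2T training tokens
flops = 6 * n_params * n_tokens  # ~8.4e22 FLOPs

sustained_flops_per_gpu = 4e14  # ~40% utilization of a ~1 PFLOP/s BF16 GPU
usd_per_gpu_hour = 2.0          # hypothetical cloud price

gpu_hours = flops / sustained_flops_per_gpu / 3600
print(f"{gpu_hours:,.0f} GPU-hours, ~${gpu_hours * usd_per_gpu_hour:,.0f}")
# -> roughly 58,000 GPU-hours (~$117k). Cost scales linearly in N and D, so
# a 70B model on 15T tokens lands well into the millions of dollars.
```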
Knowledge Check (Question 1 of 2)

What is the industry-standard tokenization algorithm in 2026?

- Word-level
- Character-level
- BPE (Byte Pair Encoding)
- TF-IDF
Watch: 139x Rust Speedup