
VOID

A from-scratch LLM training engine built entirely in Rust with CUDA GPU acceleration. No PyTorch. No Python. Pure native performance — every tensor op, every gradient, every CUDA kernel hand-written.

Rust · CUDA · Rayon Parallel · Zero Dependencies* · Self-Evolving · 94 Tests

*No ML framework dependency — custom tensor + autograd library

CUDA KERNELS · CUSTOM TENSOR ENGINE · BPE TOKENIZER · GROUPED QUERY ATTENTION · RMSNORM · SWIGLU · ROPE POSITIONS · COSINE ANNEALING · GRADIENT ACCUMULATION · SELF-EVOLUTION · VOID STUDIO · cuBLAS SGEMM · AUTOGRAD ENGINE · KV CACHE · 303M PARAMS · 810 TOK/S · 62 SOURCE FILES
  • 20,000+ — Lines of Code
  • 62 — Source Files
  • 303M — Parameters (void-350m)
  • 32,000 — Vocab Size (BPE tokens)
  • RTX 5060 — GPU (68% VRAM · 60% Util)
  • 810 tok/s — Throughput (GPU)
🧠 Transformer Architecture · GPT-Style

  • Tokenizer — BPE · 8K vocab
  • Embedding — Token + RoPE pos
  • Transformer ×4 — GQA + SwiGLU FFN
  • RMSNorm — Pre-norm residual
  • LM Head — Tied weights
  • Softmax — Next-token probs
  • d_model: 768
  • n_layers: 12
  • n_heads: 12
  • d_ff: 2048
  • max_seq_len: 2048
  • rope_base: 10,000
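The RoPE positional step listed above rotates each (even, odd) pair of a head vector by a position-dependent angle θᵢ = pos · base^(−2i/d). A minimal sketch of that rotation — illustrative names only, not Void's actual API:

```rust
// Apply rotary position embedding to one head vector in place.
// `base` is the rope_base from the config (10,000 here).
fn rope_apply(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    for i in 0..d / 2 {
        // theta_i = pos * base^(-2i/d)
        let theta = (pos as f32) * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

fn main() {
    let mut v = vec![1.0_f32, 0.0, 1.0, 0.0];
    let norm_before: f32 = v.iter().map(|x| x * x).sum();
    rope_apply(&mut v, 7, 10_000.0);
    let norm_after: f32 = v.iter().map(|x| x * x).sum();
    // A pure rotation preserves the vector's norm.
    assert!((norm_before - norm_after).abs() < 1e-4);
    println!("rope ok");
}
```

Because the rotation is a function of absolute position, the dot product between two rotated vectors depends only on their relative offset — which is why RoPE composes cleanly with a KV cache.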
📊 Training Status · In Progress


Hyperparameters

  • Learning Rate: 6e-4
  • Batch Size: 4 (× 8 accum)
  • Weight Decay: 0.1
  • Grad Clip: 1.0
  • β₁ / β₂: 0.9 / 0.95
  • Warmup Steps: 500
  • Total Steps: 100,000
  • Scheduler: Cosine Annealing
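The schedule above pairs 500 steps of linear warmup with cosine annealing over the remaining 99,500. A hedged sketch of that curve — the floor (minimum) learning rate is an illustrative assumption, since the table doesn't state one:

```rust
// Warmup + cosine-annealing learning-rate schedule.
fn lr_at(step: u32, warmup: u32, total: u32, peak: f32, floor: f32) -> f32 {
    if step < warmup {
        // Linear warmup from ~0 up to the peak rate.
        peak * (step as f32 + 1.0) / warmup as f32
    } else {
        // Cosine decay from peak down to the floor over the remaining steps.
        let t = (step - warmup) as f32 / (total - warmup) as f32;
        floor + 0.5 * (peak - floor) * (1.0 + (std::f32::consts::PI * t).cos())
    }
}

fn main() {
    let (warmup, total, peak, floor) = (500, 100_000, 6e-4_f32, 6e-5_f32);
    // Last warmup step reaches the peak rate exactly.
    assert!((lr_at(499, warmup, total, peak, floor) - peak).abs() < 1e-9);
    // The final step lands on the floor.
    assert!((lr_at(total, warmup, total, peak, floor) - floor).abs() < 1e-8);
    println!("lr at step 10,000: {:e}", lr_at(10_000, warmup, total, peak, floor));
}
```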

GPU Telemetry

  • Progress: 0%
  • VRAM: 68%
  • GPU Util: 60%
  • GPU: NVIDIA RTX 5060
  • VRAM Used: 5,562 MB
  • Throughput: 810 tok/s
  • Current Step: 10 / 100,000
  • Prev Model: void-tiny — loss 4.65 @ step 3662
🚀 Model Lineup · Void Family

  • Void-Tiny (Complete) — ~5M params · d_model 128 · 4 layers · 4 heads (MHA)
  • Void-125M (Training) — ~125M params · d_model 768 · 12 layers · 12 heads (MHA)
  • Void-1B (Planned) — ~1B params · d_model 2048 · 24 layers · 16/4 heads (GQA)
  • Void-3B (Planned) — ~3B params · d_model 3200 · 26 layers · 32/8 heads (GQA)
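The 16/4 and 32/8 head counts denote grouped-query attention: each KV head is shared by a contiguous group of query heads, shrinking the KV cache by the group factor. A small sketch of the index mapping (function name is illustrative):

```rust
// Map a query head to the KV head it reads from under GQA.
// With n_q query heads and n_kv KV heads, consecutive groups of
// n_q / n_kv query heads share one KV head.
fn kv_head_for(query_head: usize, n_q: usize, n_kv: usize) -> usize {
    assert!(n_q % n_kv == 0, "query heads must split evenly into KV groups");
    query_head / (n_q / n_kv)
}

fn main() {
    // Void-1B's planned 16 query / 4 KV configuration: groups of 4.
    let groups: Vec<usize> = (0..16).map(|q| kv_head_for(q, 16, 4)).collect();
    assert_eq!(groups[0], 0);
    assert_eq!(groups[3], 0);
    assert_eq!(groups[4], 1);
    assert_eq!(groups[15], 3);
    println!("{:?}", groups);
}
```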
🔢 Custom Tensor Engine · No PyTorch


Tensor Core

  • N-dimensional storage with contiguous/strided layouts
  • Shape broadcasting & automatic reshape
  • CPU ↔ CUDA device transfer
  • In-place and out-of-place operations
  • Lazy computation with fused kernels
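Shape broadcasting in the bullet list above follows the usual NumPy-style rule: align shapes from the right, and a dimension of 1 stretches to match the other operand. A self-contained sketch of that rule (not Void's actual implementation):

```rust
// Compute the broadcast result shape of two shapes, or None if incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for i in 0..n {
        // Missing leading dimensions are treated as size 1.
        let da = if i < n - a.len() { 1 } else { a[i - (n - a.len())] };
        let db = if i < n - b.len() { 1 } else { b[i - (n - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible sizes, neither is 1
        };
    }
    Some(out)
}

fn main() {
    // A (batch, 1, d_model) tensor broadcasts against (heads, d_model).
    assert_eq!(broadcast_shape(&[4, 1, 768], &[12, 768]), Some(vec![4, 12, 768]));
    assert_eq!(broadcast_shape(&[3, 2], &[4, 2]), None);
    println!("broadcast ok");
}
```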

Operations (100+)

  • MatMul, BatchMatMul, BMM with transpose
  • Softmax, LogSoftmax, GELU, SiLU/SwiGLU
  • LayerNorm, RMSNorm, Dropout
  • RoPE positional encoding
  • Cross-entropy loss with label smoothing
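Of the normalization ops listed above, RMSNorm is the one the transformer blocks use: divide by the root-mean-square of the vector, then apply a learned per-channel gain. A minimal sketch (illustrative, not Void's tensor API):

```rust
// RMSNorm: x / sqrt(mean(x^2) + eps) * gain, elementwise.
fn rms_norm(x: &[f32], gain: &[f32], eps: f32) -> Vec<f32> {
    let ms: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (ms + eps).sqrt();
    x.iter().zip(gain).map(|(v, g)| v * inv_rms * g).collect()
}

fn main() {
    let x = [3.0_f32, -4.0]; // mean square = (9 + 16) / 2 = 12.5
    let y = rms_norm(&x, &[1.0, 1.0], 1e-6);
    let out_ms: f32 = y.iter().map(|v| v * v).sum::<f32>() / 2.0;
    // With unit gain, the output's mean square is ~1.
    assert!((out_ms - 1.0).abs() < 1e-4);
    println!("{:?}", y);
}
```

Unlike LayerNorm, RMSNorm skips the mean-subtraction and bias, which saves work per token — one reason it's the default in recent GPT-style stacks.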

Autograd Engine

  • Reverse-mode automatic differentiation
  • Dynamic computation graph
  • Gradient accumulation & clipping
  • Memory-efficient checkpointing
  • Custom backward for attention & FFN
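The reverse-mode idea behind the engine can be shown on scalars: record each op's parents and local derivatives on a tape, then sweep backwards accumulating chain-rule products. This toy tape is purely illustrative — Void's real engine works on tensors with a dynamic graph:

```rust
// Each node stores (parent index, local gradient) pairs.
struct Tape { nodes: Vec<Vec<(usize, f32)>> }

impl Tape {
    fn new() -> Self { Tape { nodes: Vec::new() } }
    fn leaf(&mut self) -> usize {
        self.nodes.push(vec![]);
        self.nodes.len() - 1
    }
    fn mul(&mut self, a: usize, av: f32, b: usize, bv: f32) -> usize {
        // d(a*b)/da = b, d(a*b)/db = a
        self.nodes.push(vec![(a, bv), (b, av)]);
        self.nodes.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.nodes.push(vec![(a, 1.0), (b, 1.0)]);
        self.nodes.len() - 1
    }
    // Reverse sweep: seed d(out)/d(out) = 1, push gradients to parents.
    fn grad(&self, out: usize) -> Vec<f32> {
        let mut g = vec![0.0; self.nodes.len()];
        g[out] = 1.0;
        for i in (0..self.nodes.len()).rev() {
            for &(p, local) in &self.nodes[i] {
                g[p] += g[i] * local;
            }
        }
        g
    }
}

fn main() {
    // f(x, y) = x*y + x at x = 3, y = 4  →  df/dx = y + 1 = 5, df/dy = x = 3
    let mut t = Tape::new();
    let (x, y) = (t.leaf(), t.leaf());
    let xy = t.mul(x, 3.0, y, 4.0);
    let f = t.add(xy, x);
    let g = t.grad(f);
    assert_eq!(g[x], 5.0);
    assert_eq!(g[y], 3.0);
    println!("grads ok");
}
```

Gradient accumulation falls out of the `+=` in the sweep: a node used twice (like `x` above) collects contributions from every path, and running several micro-batches before an optimizer step accumulates into the same buffers.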

CUDA Acceleration

  • cudarc 0.19 — safe Rust bindings
  • cuBLAS SGEMM/DGEMM for matmul
  • Custom NVRTC-compiled kernels
  • Async memory copies & streams
  • Pinned host memory for transfers
🧬 Self-Evolution Engine · Autonomous


Void includes a built-in autonomous evolution system that can mutate its own architecture — adding or removing layers, adjusting attention heads, modifying FFN dimensions — then evaluate each variant's fitness and keep the best-performing ones. This is the foundation for self-improving AI.

  • autonomous.rs — 301 LOC · architecture search & mutation scheduling
  • fitness.rs — 268 LOC · multi-objective fitness: loss, speed, memory
  • mutator.rs — 254 LOC · layer insertion, head pruning, dim scaling
  • sandbox.rs — 192 LOC · isolated evaluation with rollback safety
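A hedged sketch of what multi-objective selection along the lines of fitness.rs might look like — the scoring weights and field names here are illustrative assumptions, not Void's actual formula:

```rust
// One evaluated architecture variant: lower loss and VRAM are better,
// higher throughput is better.
struct Candidate { loss: f32, tok_per_s: f32, vram_mb: f32 }

// Collapse the three objectives into a single scalar score.
// The weights are made up for illustration.
fn fitness(c: &Candidate) -> f32 {
    -c.loss - 0.0002 * c.vram_mb + 0.001 * c.tok_per_s
}

// Pick the index of the fittest candidate in a population.
fn select_best(pop: &[Candidate]) -> usize {
    pop.iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| fitness(a).total_cmp(&fitness(b)))
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let pop = vec![
        Candidate { loss: 4.65, tok_per_s: 810.0, vram_mb: 5562.0 },
        Candidate { loss: 4.40, tok_per_s: 700.0, vram_mb: 6100.0 },
    ];
    // Here the mutant's lower loss outweighs its slower, heavier profile.
    assert_eq!(select_best(&pop), 1);
    println!("selected candidate {}", select_best(&pop));
}
```

In practice the sandbox would run each candidate briefly, measure these numbers, and roll back any mutation whose score regresses — the scalarization above is just the simplest way to compare variants.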
🖥️ Void Studio · Native GUI


Real-time GPU-accelerated training dashboard built with egui/glow. Monitors loss curves, learning rate schedules, GPU telemetry, and generation output — all running natively at 60fps with zero web overhead.

  • Live loss chart
  • LR schedule viz
  • GPU temp / VRAM
  • Generation preview
  • Training controls
  • Model config editor
  • Data pipeline view
  • Evolution monitor
  • Checkpoint manager
📦 Dependency Stack · 10 crates

  • cudarc 0.19 — safe CUDA bindings: nvrtc, cuBLAS, driver API
  • rayon 1.10 — CPU parallelism: data loading, tokenization
  • memmap2 0.9 — memory-mapped files: zero-copy dataset access
  • eframe/egui 0.31 — GPU-accelerated native GUI: training dashboard
  • serde + toml 1 / 0.8 — config serialization: model hyperparameters
  • half 2 — FP16 half-precision: reduced VRAM
  • indicatif 0.17 — training progress bars: ETA, throughput
  • clap 4 — CLI: train, generate, bench subcommands
  • ureq 2 — HTTP: auto-fetch Shakespeare/training data
  • chrono 0.4 — timestamps: checkpoint naming, logs
📁 Source Inventory · 62 files · 20,000+ LOC

  • studio/app.rs — 1254
  • tensor/ops.rs — 1057
  • src/main.rs — 857
  • gpu/dispatch.rs — 793
  • training/advanced.rs — 694
  • training/backward.rs — 654
  • training/optimizer.rs — 453
  • tensor/mod.rs — 447
  • data/synthetic.rs — 436
  • nn/attention.rs — 433
  • studio/state.rs — 330
  • nn/transformer.rs — 324
  • gpu/kernels.rs — 322
  • data/domains.rs — 316
  • model/config.rs — 314
  • gpu/cuda.rs — 310
  • evolution/autonomous.rs — 301
  • panels/dashboard.rs — 299
  • training/web_fetcher.rs — 277
  • evolution/fitness.rs — 268
  • tokenizer/bpe.rs — 263
  • panels/training.rs — 260
  • evolution/mutator.rs — 254
  • data/pipeline.rs — 233
  • training/trainer.rs — 226
  • src/metrics.rs — 225
  • nn/norm.rs — 224
  • tensor/autograd.rs — 199
  • src/bridge.rs — 193
  • evolution/sandbox.rs — 192
  • model/generate.rs — 189
  • panels/gpu_monitor.rs — 176
  • tensor/shape.rs — 164
  • data/curriculum.rs — 163
  • training/dataloader.rs — 159
  • panels/model_config.rs — 147
  • training/checkpoint.rs — 146
  • tensor/storage.rs — 136
  • nn/activation.rs — 126
  • nn/linear.rs — 113
  • training/scheduler.rs — 87
🦀 Why Rust for Machine Learning?

Zero-Cost Abstractions

Rust's type system and ownership model produce code that compiles down to machine code on par with hand-tuned C — with full memory safety guarantees and no garbage-collector pauses during training.

Fearless Concurrency

Data races are compile-time errors in Rust. Rayon parallelism for CPU-bound data loading, async CUDA stream management, and lock-free metric reporting — all verified at compile time.
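The "verified at compile time" claim can be shown with the standard library alone: scoped threads may borrow shared data, and any attempt at a mutable aliasing race simply won't compile. This stdlib sketch stands in for Void's rayon-based data loading — the names are illustrative:

```rust
use std::thread;

// Sum a token buffer in parallel: split into chunks, one scoped thread
// per chunk. The borrow checker proves the immutable borrows are safe.
fn parallel_sum(data: &[u64], n_threads: usize) -> u64 {
    let chunk = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    assert_eq!(parallel_sum(&data, 4), 500_500);
    println!("parallel sum ok");
}
```

Rayon's `par_iter` gives the same guarantee with less ceremony (work stealing, no manual chunking), which is why it's the crate of choice for CPU-bound tokenization and loading.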

Single Binary Deployment

Void compiles to a single static binary. No conda environments, no pip dependencies, no virtualenvs, no CUDA version mismatches. Just run the binary.

Native CUDA Integration

cudarc provides safe Rust bindings to the CUDA driver API — device management, memory allocation, kernel launches, cuBLAS — without unsafe blocks leaking into application code.