
VOID

A from-scratch LLM training engine built entirely in Rust with CUDA GPU acceleration. No PyTorch. No Python. Pure native performance — every tensor op, every gradient, every CUDA kernel hand-written.

Rust · CUDA · Rayon Parallel · Zero Dependencies* · Self-Evolving · 94 Tests

*No ML framework dependency — custom tensor + autograd library

CUDA KERNELS · CUSTOM TENSOR ENGINE · BPE TOKENIZER · GROUPED QUERY ATTENTION · RMSNORM · SWIGLU · ROPE POSITIONS · COSINE ANNEALING · GRADIENT ACCUMULATION · SELF-EVOLUTION · VOID STUDIO · cuBLAS SGEMM · AUTOGRAD ENGINE · KV CACHE · 303M PARAMS · 810 TOK/S · 62 SOURCE FILES
  • 20,000+ — Lines of Code
  • 62 — Source Files
  • 303M — Parameters (void-350m)
  • 32,000 — Vocab Size (BPE tokens)
  • RTX 5060 — GPU (68% VRAM · 60% Util)
  • 810 tok/s — Throughput (GPU)
🧠 Transformer Architecture · GPT-Style

  • Tokenizer — BPE · 8K vocab
  • Embedding — Token + RoPE pos
  • Transformer ×4 — GQA + SwiGLU FFN
  • RMSNorm — Pre-norm residual
  • LM Head — Tied weights
  • Softmax — Next-token probs
  • d_model: 768
  • n_layers: 12
  • n_heads: 12
  • d_ff: 2048
  • max_seq_len: 2048
  • rope_base: 10,000
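The RoPE positional step listed above rotates each (even, odd) pair of a head vector by a position-dependent angle θᵢ = pos · base^(−2i/d). A minimal sketch of that rotation — illustrative names only, not Void's actual API:

```rust
// Apply rotary position embedding to one head vector in place.
// `base` is the rope_base from the config (10,000 here).
fn rope_apply(x: &mut [f32], pos: usize, base: f32) {
    let d = x.len();
    for i in 0..d / 2 {
        // theta_i = pos * base^(-2i/d)
        let theta = (pos as f32) * base.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}

fn main() {
    let mut v = vec![1.0_f32, 0.0, 1.0, 0.0];
    let norm_before: f32 = v.iter().map(|x| x * x).sum();
    rope_apply(&mut v, 7, 10_000.0);
    let norm_after: f32 = v.iter().map(|x| x * x).sum();
    // A pure rotation preserves the vector's norm.
    assert!((norm_before - norm_after).abs() < 1e-4);
    println!("rope ok");
}
```

Because the rotation is a function of absolute position, the dot product between two rotated vectors depends only on their relative offset — which is why RoPE composes cleanly with a KV cache.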
📊 Training Status · In Progress


Hyperparameters

  • Learning Rate: 6e-4
  • Batch Size: 4 (× 8 accum)
  • Weight Decay: 0.1
  • Grad Clip: 1.0
  • β₁ / β₂: 0.9 / 0.95
  • Warmup Steps: 500
  • Total Steps: 100,000
  • Scheduler: Cosine Annealing
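The schedule above pairs 500 steps of linear warmup with cosine annealing over the remaining 99,500. A hedged sketch of that curve — the floor (minimum) learning rate is an illustrative assumption, since the table doesn't state one:

```rust
// Warmup + cosine-annealing learning-rate schedule.
fn lr_at(step: u32, warmup: u32, total: u32, peak: f32, floor: f32) -> f32 {
    if step < warmup {
        // Linear warmup from ~0 up to the peak rate.
        peak * (step as f32 + 1.0) / warmup as f32
    } else {
        // Cosine decay from peak down to the floor over the remaining steps.
        let t = (step - warmup) as f32 / (total - warmup) as f32;
        floor + 0.5 * (peak - floor) * (1.0 + (std::f32::consts::PI * t).cos())
    }
}

fn main() {
    let (warmup, total, peak, floor) = (500, 100_000, 6e-4_f32, 6e-5_f32);
    // Last warmup step reaches the peak rate exactly.
    assert!((lr_at(499, warmup, total, peak, floor) - peak).abs() < 1e-9);
    // The final step lands on the floor.
    assert!((lr_at(total, warmup, total, peak, floor) - floor).abs() < 1e-8);
    println!("lr at step 10,000: {:e}", lr_at(10_000, warmup, total, peak, floor));
}
```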

GPU Telemetry

  • Progress: 0%
  • VRAM: 68%
  • GPU Util: 60%
  • GPU: NVIDIA RTX 5060
  • VRAM Used: 5,562 MB
  • Throughput: 810 tok/s
  • Current Step: 10 / 100,000
  • Prev Model: void-tiny — loss 4.65 @ step 3662
🚀 Model Lineup · Void Family

  • Void-Tiny (Complete) — ~5M params · d_model 128 · 4 layers · 4 heads (MHA)
  • Void-125M (Training) — ~125M params · d_model 768 · 12 layers · 12 heads (MHA)
  • Void-1B (Planned) — ~1B params · d_model 2048 · 24 layers · 16/4 heads (GQA)
  • Void-3B (Planned) — ~3B params · d_model 3200 · 26 layers · 32/8 heads (GQA)
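The 16/4 and 32/8 head counts denote grouped-query attention: each KV head is shared by a contiguous group of query heads, shrinking the KV cache by the group factor. A small sketch of the index mapping (function name is illustrative):

```rust
// Map a query head to the KV head it reads from under GQA.
// With n_q query heads and n_kv KV heads, consecutive groups of
// n_q / n_kv query heads share one KV head.
fn kv_head_for(query_head: usize, n_q: usize, n_kv: usize) -> usize {
    assert!(n_q % n_kv == 0, "query heads must split evenly into KV groups");
    query_head / (n_q / n_kv)
}

fn main() {
    // Void-1B's planned 16 query / 4 KV configuration: groups of 4.
    let groups: Vec<usize> = (0..16).map(|q| kv_head_for(q, 16, 4)).collect();
    assert_eq!(groups[0], 0);
    assert_eq!(groups[3], 0);
    assert_eq!(groups[4], 1);
    assert_eq!(groups[15], 3);
    println!("{:?}", groups);
}
```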
🔢 Custom Tensor Engine · No PyTorch


Tensor Core

  • N-dimensional storage with contiguous/strided layouts
  • Shape broadcasting & automatic reshape
  • CPU ↔ CUDA device transfer
  • In-place and out-of-place operations
  • Lazy computation with fused kernels
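Shape broadcasting in the bullet list above follows the usual NumPy-style rule: align shapes from the right, and a dimension of 1 stretches to match the other operand. A self-contained sketch of that rule (not Void's actual implementation):

```rust
// Compute the broadcast result shape of two shapes, or None if incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for i in 0..n {
        // Missing leading dimensions are treated as size 1.
        let da = if i < n - a.len() { 1 } else { a[i - (n - a.len())] };
        let db = if i < n - b.len() { 1 } else { b[i - (n - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible sizes, neither is 1
        };
    }
    Some(out)
}

fn main() {
    // A (batch, 1, d_model) tensor broadcasts against (heads, d_model).
    assert_eq!(broadcast_shape(&[4, 1, 768], &[12, 768]), Some(vec![4, 12, 768]));
    assert_eq!(broadcast_shape(&[3, 2], &[4, 2]), None);
    println!("broadcast ok");
}
```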

Operations (100+)

  • MatMul, BatchMatMul, BMM with transpose
  • Softmax, LogSoftmax, GELU, SiLU/SwiGLU
  • LayerNorm, RMSNorm, Dropout
  • RoPE positional encoding
  • Cross-entropy loss with label smoothing
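Of the normalization ops listed above, RMSNorm is the one the transformer blocks use: divide by the root-mean-square of the vector, then apply a learned per-channel gain. A minimal sketch (illustrative, not Void's tensor API):

```rust
// RMSNorm: x / sqrt(mean(x^2) + eps) * gain, elementwise.
fn rms_norm(x: &[f32], gain: &[f32], eps: f32) -> Vec<f32> {
    let ms: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (ms + eps).sqrt();
    x.iter().zip(gain).map(|(v, g)| v * inv_rms * g).collect()
}

fn main() {
    let x = [3.0_f32, -4.0]; // mean square = (9 + 16) / 2 = 12.5
    let y = rms_norm(&x, &[1.0, 1.0], 1e-6);
    let out_ms: f32 = y.iter().map(|v| v * v).sum::<f32>() / 2.0;
    // With unit gain, the output's mean square is ~1.
    assert!((out_ms - 1.0).abs() < 1e-4);
    println!("{:?}", y);
}
```

Unlike LayerNorm, RMSNorm skips the mean-subtraction and bias, which saves work per token — one reason it's the default in recent GPT-style stacks.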

Autograd Engine

  • Reverse-mode automatic differentiation
  • Dynamic computation graph
  • Gradient accumulation & clipping
  • Memory-efficient checkpointing
  • Custom backward for attention & FFN
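The reverse-mode idea behind the engine can be shown on scalars: record each op's parents and local derivatives on a tape, then sweep backwards accumulating chain-rule products. This toy tape is purely illustrative — Void's real engine works on tensors with a dynamic graph:

```rust
// Each node stores (parent index, local gradient) pairs.
struct Tape { nodes: Vec<Vec<(usize, f32)>> }

impl Tape {
    fn new() -> Self { Tape { nodes: Vec::new() } }
    fn leaf(&mut self) -> usize {
        self.nodes.push(vec![]);
        self.nodes.len() - 1
    }
    fn mul(&mut self, a: usize, av: f32, b: usize, bv: f32) -> usize {
        // d(a*b)/da = b, d(a*b)/db = a
        self.nodes.push(vec![(a, bv), (b, av)]);
        self.nodes.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        self.nodes.push(vec![(a, 1.0), (b, 1.0)]);
        self.nodes.len() - 1
    }
    // Reverse sweep: seed d(out)/d(out) = 1, push gradients to parents.
    fn grad(&self, out: usize) -> Vec<f32> {
        let mut g = vec![0.0; self.nodes.len()];
        g[out] = 1.0;
        for i in (0..self.nodes.len()).rev() {
            for &(p, local) in &self.nodes[i] {
                g[p] += g[i] * local;
            }
        }
        g
    }
}

fn main() {
    // f(x, y) = x*y + x at x = 3, y = 4  →  df/dx = y + 1 = 5, df/dy = x = 3
    let mut t = Tape::new();
    let (x, y) = (t.leaf(), t.leaf());
    let xy = t.mul(x, 3.0, y, 4.0);
    let f = t.add(xy, x);
    let g = t.grad(f);
    assert_eq!(g[x], 5.0);
    assert_eq!(g[y], 3.0);
    println!("grads ok");
}
```

Gradient accumulation falls out of the `+=` in the sweep: a node used twice (like `x` above) collects contributions from every path, and running several micro-batches before an optimizer step accumulates into the same buffers.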

CUDA Acceleration

  • cudarc 0.19 — safe Rust bindings
  • cuBLAS SGEMM/DGEMM for matmul
  • Custom NVRTC-compiled kernels
  • Async memory copies & streams
  • Pinned host memory for transfers
🧬 Self-Evolution Engine · Autonomous


Void includes a built-in autonomous evolution system that can mutate its own architecture — adding or removing layers, adjusting attention heads, modifying FFN dimensions — then evaluate each variant's fitness and keep the best-performing ones. This is the foundation for self-improving AI.

  • autonomous.rs — 301 LOC · architecture search & mutation scheduling
  • fitness.rs — 268 LOC · multi-objective fitness: loss, speed, memory
  • mutator.rs — 254 LOC · layer insertion, head pruning, dim scaling
  • sandbox.rs — 192 LOC · isolated evaluation with rollback safety
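A hedged sketch of what multi-objective selection along the lines of fitness.rs might look like — the scoring weights and field names here are illustrative assumptions, not Void's actual formula:

```rust
// One evaluated architecture variant: lower loss and VRAM are better,
// higher throughput is better.
struct Candidate { loss: f32, tok_per_s: f32, vram_mb: f32 }

// Collapse the three objectives into a single scalar score.
// The weights are made up for illustration.
fn fitness(c: &Candidate) -> f32 {
    -c.loss - 0.0002 * c.vram_mb + 0.001 * c.tok_per_s
}

// Pick the index of the fittest candidate in a population.
fn select_best(pop: &[Candidate]) -> usize {
    pop.iter()
        .enumerate()
        .max_by(|(_, a), (_, b)| fitness(a).total_cmp(&fitness(b)))
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let pop = vec![
        Candidate { loss: 4.65, tok_per_s: 810.0, vram_mb: 5562.0 },
        Candidate { loss: 4.40, tok_per_s: 700.0, vram_mb: 6100.0 },
    ];
    // Here the mutant's lower loss outweighs its slower, heavier profile.
    assert_eq!(select_best(&pop), 1);
    println!("selected candidate {}", select_best(&pop));
}
```

In practice the sandbox would run each candidate briefly, measure these numbers, and roll back any mutation whose score regresses — the scalarization above is just the simplest way to compare variants.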
🖥️ Void Studio · Native GUI


Real-time GPU-accelerated training dashboard built with egui/glow. Monitors loss curves, learning rate schedules, GPU telemetry, and generation output — all running natively at 60fps with zero web overhead.

  • Live loss chart
  • LR schedule viz
  • GPU temp / VRAM
  • Generation preview
  • Training controls
  • Model config editor
  • Data pipeline view
  • Evolution monitor
  • Checkpoint manager
📦 Dependency Stack · 10 crates

  • cudarc 0.19 — safe CUDA bindings: nvrtc, cuBLAS, driver API
  • rayon 1.10 — CPU parallelism: data loading, tokenization
  • memmap2 0.9 — memory-mapped files: zero-copy dataset access
  • eframe/egui 0.31 — GPU-accelerated native GUI: training dashboard
  • serde + toml 1 / 0.8 — config serialization: model hyperparameters
  • half 2 — FP16 half-precision: reduced VRAM
  • indicatif 0.17 — training progress bars: ETA, throughput
  • clap 4 — CLI: train, generate, bench subcommands
  • ureq 2 — HTTP: auto-fetch Shakespeare/training data
  • chrono 0.4 — timestamps: checkpoint naming, logs
📁 Source Inventory · 62 files · 20,000+ LOC

  • studio/app.rs — 1254
  • tensor/ops.rs — 1057
  • src/main.rs — 857
  • gpu/dispatch.rs — 793
  • training/advanced.rs — 694
  • training/backward.rs — 654
  • training/optimizer.rs — 453
  • tensor/mod.rs — 447
  • data/synthetic.rs — 436
  • nn/attention.rs — 433
  • studio/state.rs — 330
  • nn/transformer.rs — 324
  • gpu/kernels.rs — 322
  • data/domains.rs — 316
  • model/config.rs — 314
  • gpu/cuda.rs — 310
  • evolution/autonomous.rs — 301
  • panels/dashboard.rs — 299
  • training/web_fetcher.rs — 277
  • evolution/fitness.rs — 268
  • tokenizer/bpe.rs — 263
  • panels/training.rs — 260
  • evolution/mutator.rs — 254
  • data/pipeline.rs — 233
  • training/trainer.rs — 226
  • src/metrics.rs — 225
  • nn/norm.rs — 224
  • tensor/autograd.rs — 199
  • src/bridge.rs — 193
  • evolution/sandbox.rs — 192
  • model/generate.rs — 189
  • panels/gpu_monitor.rs — 176
  • tensor/shape.rs — 164
  • data/curriculum.rs — 163
  • training/dataloader.rs — 159
  • panels/model_config.rs — 147
  • training/checkpoint.rs — 146
  • tensor/storage.rs — 136
  • nn/activation.rs — 126
  • nn/linear.rs — 113
  • training/scheduler.rs — 87
🦀 Why Rust for Machine Learning?

Zero-Cost Abstractions

Rust's type system and ownership model produce code that compiles down to machine code on par with hand-tuned C — with full memory safety guarantees and no garbage-collector pauses during training.

Fearless Concurrency

Data races are compile-time errors in Rust. Rayon parallelism for CPU-bound data loading, async CUDA stream management, and lock-free metric reporting — all verified at compile time.
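The "verified at compile time" claim can be shown with the standard library alone: scoped threads may borrow shared data, and any attempt at a mutable aliasing race simply won't compile. This stdlib sketch stands in for Void's rayon-based data loading — the names are illustrative:

```rust
use std::thread;

// Sum a token buffer in parallel: split into chunks, one scoped thread
// per chunk. The borrow checker proves the immutable borrows are safe.
fn parallel_sum(data: &[u64], n_threads: usize) -> u64 {
    let chunk = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    assert_eq!(parallel_sum(&data, 4), 500_500);
    println!("parallel sum ok");
}
```

Rayon's `par_iter` gives the same guarantee with less ceremony (work stealing, no manual chunking), which is why it's the crate of choice for CPU-bound tokenization and loading.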

Single Binary Deployment

Void compiles to a single static binary. No conda environments, no pip dependencies, no virtualenvs, no CUDA version mismatches. Just run the binary.

Native CUDA Integration

cudarc provides safe Rust bindings to the CUDA driver API — device management, memory allocation, kernel launches, cuBLAS — without unsafe blocks leaking into application code.