Reproduced GPT-2 Small from scratch in PyTorch

No pre-trained weights. No high-level abstractions.

I implemented and trained a full replica of the GPT-2 Small architecture end-to-end on the TinyStories dataset, with a focus on first-principles understanding of Transformer systems.

What this involved
- Rebuilding a 12-layer Transformer (12 heads, d_model = 768) from the ground up
- Implementing multi-head self-attention with explicit Q/K/V projections + causal masking
- Designing a Pre-LayerNorm residual architecture for training stability
- Constructing token + positional embedding pipelines
- Developing an autoregressive generation loop (sampling, temperature, context windowing)

All components, including the training loop, checkpointing, and inference, were implemented manually in PyTorch.

Results
- Trained on ~100k sequences (context length = 128)
- Final loss: 1.57
- ~12 hours of training on a Kaggle P100 GPU
- Coherent, structured short narrative generation from raw prompts

Why this matters
Most projects use Transformers. This project focuses on reproducing and understanding them at the systems level:
- Attention as tensor operations (not black-box APIs)
- Training stability (normalization, residual flow)
- Sequence modeling constraints (causal masking, context limits)
- Sampling strategies and their impact on output quality

🔗 What’s next
- PPO-based fine-tuning (policy + value heads)
- KL-regularized training vs reference models
- Reward modeling and alignment pipelines

🔗 GitHub Repository: https://lnkd.in/gr4fd7jY

More advanced projects coming soon insha’Allah.
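
For readers curious about the generation loop mentioned in the post (sampling, temperature, context windowing), here is a minimal sketch of the idea. It is not the code from the repository: it uses plain Python instead of PyTorch so it runs without dependencies, and the names `softmax_with_temperature`, `generate`, and `toy_logits` are hypothetical, with `toy_logits` standing in for a trained model.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_token_logits, prompt_ids, max_new_tokens,
             context_length=128, temperature=1.0, rng=None):
    """Autoregressive sampling: feed the model its own output, one token at a time.

    `next_token_logits` is an assumed callable that maps a list of token ids
    to a list of logits over the vocabulary (a stand-in for the real model).
    """
    rng = rng or random.Random()
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_length:]  # context windowing: keep only the most recent tokens
        probs = softmax_with_temperature(next_token_logits(window), temperature)
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
    return ids


def toy_logits(window, vocab_size=5):
    """Hypothetical toy 'model': strongly prefers (last token + 1) mod vocab_size."""
    favored = (window[-1] + 1) % vocab_size
    return [10.0 if i == favored else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    sample = generate(toy_logits, [0], max_new_tokens=10,
                      context_length=4, temperature=1.0, rng=random.Random(0))
    print(sample)
```

With a temperature close to zero, the toy model almost always picks its favored token. Higher temperatures flatten the distribution and make the output more varied, which is the trade-off behind the "sampling strategies" point in the post.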
Great job Reda keep going 👏👏
fantastic job buddy
Great job
Very good work, Reda, bravo