Reda MALKI’s Post

1mo

Reproduced GPT-2 Small from scratch in PyTorch

No pre-trained weights. No high-level abstractions.

I implemented and trained a full replica of the GPT-2 Small architecture end-to-end on the TinyStories dataset, with a focus on first-principles understanding of Transformer systems.

What this involved
- Rebuilding a 12-layer Transformer (12 heads, d_model = 768) from the ground up
- Implementing multi-head self-attention with explicit Q/K/V projections + causal masking
- Designing a Pre-LayerNorm residual architecture for training stability
- Constructing token + positional embedding pipelines
- Developing an autoregressive generation loop (sampling, temperature, context windowing)

All components, including the training loop, checkpointing, and inference, were implemented manually in PyTorch.

Results
- Trained on ~100k sequences (context length = 128)
- Final loss: 1.57
- ~12 hours training on Kaggle P100 GPU
- Coherent, structured short narrative generation from raw prompts

Why this matters
Most projects use Transformers. This project focuses on reproducing and understanding them at the systems level:
- Attention as tensor operations (not black-box APIs)
- Training stability (normalization, residual flow)
- Sequence modeling constraints (causal masking, context limits)
- Sampling strategies and their impact on output quality

🔗 What’s next
- PPO-based fine-tuning (policy + value heads)
- KL-regularized training vs reference models
- Reward modeling and alignment pipelines

🔗 GitHub Repository: https://lnkd.in/gr4fd7jY

More advanced projects coming soon insha’Allah.
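For readers who want to see what an autoregressive generation loop with temperature and context windowing looks like, here is a minimal sketch in plain Python. It is not the code from the repository: `toy_model`, `generate`, and the 8-token vocabulary are hypothetical stand-ins, and only the context length of 128 comes from the post.

```python
import math
import random

CONTEXT_LEN = 128  # context length stated in the post


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature must be > 0.

    Lower temperatures sharpen the distribution, higher ones flatten it.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(context, vocab_size=8):
    """Hypothetical stand-in for a trained network.

    Returns logits that favour (last token + 1) modulo the vocabulary size.
    """
    target = (context[-1] + 1) % vocab_size
    return [4.0 if i == target else 0.0 for i in range(vocab_size)]


def generate(model, prompt, max_new_tokens, temperature=1.0, rng=None):
    """Sample tokens one at a time, feeding each one back as input."""
    rng = rng or random.Random(0)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        context = tokens[-CONTEXT_LEN:]  # context windowing: keep the last 128 tokens
        probs = softmax(model(context), temperature)
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens
```

With a very low temperature, for example `generate(toy_model, [0], 5, temperature=0.05)`, sampling becomes effectively greedy; raising the temperature spreads probability onto the other tokens.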

GitHub - Reda-MALKI/Reproducing-ChatGPT-2-small-using-P100-GPU-and-2-training-rounds-4-hours-8-hours- github.com

8 Comments

Anass Ighachouten 1mo

Very good work, Reda, bravo

NOUREDDINE BLIBLI 1mo

Great job Reda keep going 👏👏

Taha Laina 1mo

fantastic job buddy

Mohamed Drioua 1mo

Great job


More Relevant Posts

Yubisono P.

Experienced Credit Specialist with a demonstrated history of working in the Financial Services Industry. Data Scientist and Machine Learnings using Python, SQL, PostgreSQL, Tableau, Pentaho, Chat GPT, Gemini 2.5 Flash
1mo
Workflow Experiment Tracking using tensorwatch #machinelearning #datascience #workflowexperimenttracking #tensorwatch

Interactive Realtime Debugging and Visualization for AI

TensorWatch is a debugging and visualization tool from Microsoft Research, designed for data science, deep learning, and reinforcement learning. It works in Jupyter Notebook to show real-time visualizations of your machine learning training and perform several other key analysis tasks for your models and data. TensorWatch is designed to be flexible and extensible, so you can also build your own custom visualizations, UIs, and dashboards. Besides the traditional "what-you-see-is-what-you-log" approach, it also has a unique capability to execute arbitrary queries against your live ML training process, return a stream as a result of the query, and view this stream using your choice of visualizer (we call this Lazy Logging Mode). TensorWatch is under heavy development, with the goal of providing a platform for debugging machine learning in one easy-to-use, extensible, and hackable package. https://lnkd.in/gzUcmafE

GitHub - microsoft/tensorwatch: Debugging, monitoring and visualization for Python Machine Learning and Data Science github.com
Modeepx AI Lab

65 followers
4w
At Modeepx AI Lab, precision begins with the first line of code. Today, we share a glimpse into our engineer's 'engineering notes' during the construction of the lab's foundational models. Understanding the inner workings of PyTorch and how to manage Parameters within the network is what ensures we build AI systems characterized by high stability and efficiency.
Bayan M Mahfoud

All-Round AI Engineer | Machine Learning | Deep Learning | Computer Vision | Python Developer student
4w

The line that carries the model's 'memory'... and why your model might collapse without it

When building models from scratch in PyTorch, I've grown accustomed to dividing the work within a class into two rooms: __init__ for setting up layers, and forward for defining the data flow path. However, there's one line at the beginning that some might consider merely a 'programming protocol' or an optional line:

super().__init__()

This line is not just a formality; it is the 'tracking system' without which your model remains inert code, devoid of memory.

Why is this line the 'backbone' of your model?
Without calling super().__init__(), PyTorch cannot register the layers you define within the network. This call initializes the parameter tracking system: the registries for the weights and biases that we expect to update during training.

What happens if you remove it?
Simply put, PyTorch has no place to record what the model would 'learn.' In practice the failure shows up early rather than at training time: the first layer you assign in __init__ raises an AttributeError (cannot assign module before Module.__init__() call), because the registries that would hold it do not exist yet. You have built the body, but you forgot to build the 'nervous system' that transmits signals.

In AI engineering, it's the small details that make the difference between a model that 'works' and a model that 'learns.' In my lab, I don't just write code; I design systems capable of evolution.

What is the line of code whose importance you discovered 'the hard way' while building your own architecture?

#PyTorch #AI_Engineering #DeepLearning #ModeepxAI #BuildInPublic #CodingSecrets
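The registration idea can be illustrated without PyTorch. The sketch below is a deliberately simplified, hypothetical stand-in (`MiniModule` and `MiniParameter` are invented names, not PyTorch classes): the base-class `__init__` creates a registry, and attribute assignment fails if that registry is missing. Real `torch.nn.Module` behaves similarly in spirit, raising an `AttributeError` when a module or parameter is assigned before `Module.__init__()` has run.

```python
class MiniParameter:
    """Hypothetical stand-in for a trainable tensor."""

    def __init__(self, value):
        self.value = value


class MiniModule:
    """Simplified stand-in for the bookkeeping a module base class performs."""

    def __init__(self):
        # Create the registry directly, bypassing our own __setattr__.
        object.__setattr__(self, "_parameters", {})

    def __setattr__(self, name, value):
        if isinstance(value, MiniParameter):
            registry = self.__dict__.get("_parameters")
            if registry is None:
                # No registry means the base __init__ was never called.
                raise AttributeError(
                    "cannot assign parameters before MiniModule.__init__() call"
                )
            registry[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        """Return every registered parameter, as an optimizer would need."""
        return list(self._parameters.values())


class Good(MiniModule):
    def __init__(self):
        super().__init__()  # sets up the registry first
        self.weight = MiniParameter(0.5)


class Broken(MiniModule):
    def __init__(self):
        # super().__init__() deliberately omitted
        self.weight = MiniParameter(0.5)
```

Constructing `Good()` yields one registered parameter, while constructing `Broken()` raises `AttributeError` at the assignment, before any training could start.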
Aditya Dasika
3w Edited
I recently built a mini-GPT entirely from scratch in PyTorch. No Hugging Face wrappers. No API calls. Just the core Transformer architecture implemented from the ground up.

Here is what went into it:
- Token & Positional Embeddings
- Masked Self-Attention & Multi-Head Attention
- Feed-Forward Blocks & Residual Connections
- Layer Normalization & Autoregressive Generation

The Specs & Results:
• Architecture: ~0.21M parameters
• Dataset: Tiny Shakespeare
• Training: After 5,000 steps, validation loss dropped from ~4.40 to ~1.83.
• Output: Despite its size, the model successfully learned character names, dialogue formatting, archaic phrasing, and sentence rhythm.

Following Andrej Karpathy’s approach remains one of the best ways to build a deep, working intuition for language models. Stripping away the modern abstractions makes concepts like causal attention, context windows, and token prediction highly tangible.

This was a great reminder that LLMs aren't magic. At their core, they are just matrix multiplications, attention masks, probability distributions, gradient descent, and scale.

Small model. Big learning.

Check out the repo at https://lnkd.in/dsu9RZpy

#PyTorch #MachineLearning #ArtificialIntelligence #LLMs #Transformers #DeepLearning #GenerativeAI #AIEngineering #SoftwareEngineering
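To make "masked self-attention" concrete, here is a small single-head sketch in plain Python. It is an illustration under simplifying assumptions (no learned projections, no batching, nested lists instead of tensors), and `causal_attention` is a hypothetical name rather than anything from the linked repo.

```python
import math


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats). Position t may only
    attend to positions 0..t, so future tokens get exactly zero weight.
    """
    d = len(q[0])
    seq_len = len(q)
    outputs, weights = [], []
    for t in range(seq_len):
        # Scores against the visible prefix only; the mask is implicit
        # in the range(t + 1) bound.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        row = [e / z for e in exps] + [0.0] * (seq_len - t - 1)
        weights.append(row)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(row[s] * v[s][j] for s in range(seq_len)) for j in range(len(v[0]))]
        )
    return outputs, weights
```

Calling it with the same list for `q`, `k`, and `v` shows the mask at work: the first position can only attend to itself, so its output equals its own value vector.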

1 Comment
Yubisono P.

Experienced Credit Specialist with a demonstrated history of working in the Financial Services Industry. Data Scientist and Machine Learnings using Python, SQL, PostgreSQL, Tableau, Pentaho, Chat GPT, Gemini 2.5 Flash
1mo
Reinforcement Learning using tf agents #machinelearning #datascience #reinforcementlearning #tfagents

TF-Agents is an open-source library built on TensorFlow that facilitates the development and testing of RL algorithms. It provides a comprehensive suite of tools, including pre-implemented algorithms, utilities for environment interaction, and support for policy evaluation and optimization. TF-Agents makes designing, implementing, deploying, and testing new Bandits and RL algorithms easier by providing well-tested, modular components that can be modified and extended. It enables fast code iteration, with good test integration and benchmarking. https://lnkd.in/gcnn_kAN

GitHub - tensorflow/agents: TF-Agents: A reliable, scalable and easy to use TensorFlow library for Contextual Bandits and Reinforcement Learning. github.com
Asghar Asghari
3w Edited
A local, no-server client for generating embeddings, designed for semantic search and vector comparison in AI systems such as RAG (Retrieval-Augmented Generation).

I wrote a Python class that runs Qwen3-Embedding-0.6B directly: no API calls, no embedding server, no extra moving parts.

Why? In most RAG or semantic search projects, I had two options:
1. Pay for external APIs (latency + cost + internet dependency)
2. Spin up a dedicated embedding server (overhead for small-to-medium projects)

I wanted something dead simple that just works.

What this client does:
- Auto-detects GPU, CPU, or Apple Silicon (MPS)
- FP16 and 4-bit quantization support
- Built-in caching for repeated texts
- Batched inference (3-5x faster than one-by-one)
- Async interface for FastAPI
- Cosine similarity and ranking helpers out of the box

The result? ~75% less memory with 4-bit, and batch processing that actually scales.

I've put the code on GitHub (link in comments). Would love your thoughts, not just on bugs, but on the architecture itself:
- Is batching + caching the right trade-off for most RAG workloads?
- Could the async interface be simplified further?
- Any obvious optimization I'm missing for CPU-only deployments?

Genuinely looking for a second pair of eyes from someone who's done this before.

#python #nlp #LLM #SemanticSearch #rag #embeddings #opensource #AI
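As a rough illustration of what "cosine similarity and ranking helpers" plus "caching for repeated texts" can look like, here is a stdlib-only sketch. None of these names come from the author's client: `embed` is a hypothetical toy embedding (letter counts) standing in for a real model call, and the cache size is an arbitrary assumption.

```python
import math
from functools import lru_cache


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


@lru_cache(maxsize=1024)  # repeated texts are served from the cache
def embed(text):
    """Hypothetical toy embedding: counts of the letters a-z.

    A real client would run the embedding model here instead.
    """
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return tuple(counts)
```

A typical use would be `rank(embed(query), [embed(d) for d in docs])`; `embed.cache_info()` then shows how many lookups were served without recomputation.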

3 Comments
S. Pratham
4w
I built a tool that reads a video with AI so you don't have to.

drop any video file into it, and it gives you a full written breakdown. not a transcript. an actual analysis: what was said, what was shown, key moments, the tone, who was speaking, a chronological narrative. the kind of notes you'd write if you watched it three times and paid close attention.

and if you have a specific question, just pass it as a prompt. "what tools were mentioned?" "summarize the main argument." "what products appear in this video?" it'll answer it with evidence pulled from the video itself.

here's how you can use it right now: run a video through the CLI and get a multi-page summary. import it as a Python module inside your own project. or if your machine can't handle it, there's a Google Colab notebook linked below: free T4 GPU, 16GB VRAM, no setup needed.

some things worth knowing: it doesn't sample frames every N seconds like most tools do. it detects actual scene changes using histogram and SSIM comparison, so it only picks the frames that actually matter. it profiles your GPU at runtime and calculates its own batch size. everything runs locally: no API keys, no cloud, nothing leaves your device.

i built it because i wanted to understand what's inside a video without watching it. turns out that's a genuinely hard problem to solve cleanly.

repo: https://lnkd.in/dmey5a6c
colab: https://lnkd.in/djfj26Yq

#buildinpublic #python #AI #opensource #MachineLearning

- S. Pratham
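The histogram half of that scene-change idea can be sketched in a few lines of plain Python. This is only an illustration and not the tool's implementation: frames are assumed to be flat lists of 0-255 grayscale values, the 8-bin histogram and the 0.5 threshold are arbitrary assumptions, and the SSIM comparison mentioned in the post is left out entirely.

```python
def histogram(frame, bins=8):
    """Normalised grayscale histogram of a frame (flat list of 0-255 ints)."""
    counts = [0] * bins
    for px in frame:
        counts[px * bins // 256] += 1
    n = len(frame)
    return [c / n for c in counts]


def histogram_distance(h1, h2):
    """Half the L1 distance between two histograms, a value in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_scene_changes(frames, threshold=0.5):
    """Return indices of frames that start a new scene.

    Frame 0 always counts as a scene start; a later frame counts when its
    histogram differs from the previous frame's by more than the threshold.
    """
    if not frames:
        return []
    keyframes = [0]
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        if histogram_distance(prev, cur) > threshold:
            keyframes.append(i)
        prev = cur
    return keyframes
```

Compared with sampling every N seconds, this picks one frame per detected cut, so a long static shot contributes a single frame no matter how long it lasts.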
Natanya Gettineni
3w
Been spending time understanding what actually happens inside a Transformer — not through HuggingFace, but by implementing it component by component in PyTorch. Built multi-head attention, positional encoding, the encoder-decoder stack, causal masking, cross-attention — all from scratch. Trained it on a sequence reversal task to verify the full encoder-decoder pipeline actually works end to end. Loss went from 2.58 → 0.30 over 500 epochs. Small task, but it forced me to genuinely understand every moving part before the model learned anything. If you work with LLMs regularly, I'd recommend doing something like this at least once. It changes how you read model architectures. GitHub → https://lnkd.in/gpED_xqv #MachineLearning #DeepLearning #PyTorch #Transformers #NLP
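For anyone who wants to try the same kind of sanity check, a sequence reversal dataset is easy to generate. The sketch below shows one common way to build encoder inputs, teacher-forced decoder inputs, and labels; the special-token ids and the function name are hypothetical and not taken from the linked repository.

```python
import random

BOS, EOS = 0, 1  # hypothetical special-token ids; real tokens start at 2


def make_reversal_example(length, vocab_size, rng):
    """Build one (source, decoder_input, labels) triple for sequence reversal.

    The decoder input is the target shifted right by one position
    (teacher forcing), and the labels end with EOS.
    """
    src = [rng.randrange(2, vocab_size) for _ in range(length)]
    tgt = src[::-1]
    decoder_input = [BOS] + tgt
    labels = tgt + [EOS]
    return src, decoder_input, labels
```

Because the correct output depends on every input position, a model can only drive the loss down if cross-attention and the causal mask in the decoder are both wired correctly, which is what makes the task a useful end-to-end check.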

GitHub - cheeseeburger/transformer-from-scratch github.com

2 Comments
ARJUN S
1mo Edited
The best way to truly learn something is by building it.

Over the past few weeks, I’ve been diving deep into State Space Models (SSMs), not just reading papers, but implementing them from scratch to understand what’s really happening under the hood. During this process, I explored the NEMOTRON 3 SUPER, a hybrid architecture that combines "Attention", "State Space Models (Mamba-2)", and "Latent Mixture-of-Experts". Thanks to Sebastian Raschka, PhD for sharing the architectural overview.

Since I already had a foundation in building Transformers, I wanted to push further and build something closer to modern large-scale systems. So I built a nano-scale version of a Nemotron-style hybrid LLM from scratch in PyTorch. This included setting up a full pretraining pipeline on a Wiki-style dataset using causal language modeling, along with auxiliary multi-token prediction. I also implemented a complete inference pipeline with KV caching, recurrent SSM state updates, and speculative decoding to make generation more efficient.

What made this architecture especially interesting is how the different components complement each other:
-> The "MAMBA-2 (SSM)" blocks handle long sequences efficiently using recurrent state updates, avoiding the quadratic cost of attention.
-> The "Latent Mixture-of-Experts (MoE)" layers scale the model’s capacity using hundreds of experts, but only activate a small subset per token. By routing in a compressed latent space instead of the full token space, the model keeps computation efficient while improving stability.
-> The "Grouped Query Attention (GQA)" layers reduce KV cache memory by sharing key/value heads across multiple query heads, making attention more practical at scale.
-> The "Multi-Token Prediction (MTP)" heads enable 'speculative decoding' (drafting multiple tokens and verifying them in a single pass), which significantly improves inference throughput.

The hardest part wasn’t the implementation; it was understanding the math. SSMs require a very different way of thinking compared to attention:
- continuous-time dynamics
- state update equations
- decay matrices
- state space duality (for Mamba-2), etc.

It took real effort to connect the mathematical intuition with what’s actually happening in code.

My biggest takeaway is this: hybrid models still have a long way to go before outperforming pure attention-based models across the board, but they are incredibly promising. It feels clear that the future isn’t about replacing attention, but combining it with structured sequence modeling approaches like SSMs.

If you're exploring SSMs or hybrid LLM architectures, I’ve shared the Colab in the comments. Happy to discuss and exchange ideas.

#GenAI #LLM #DeepLearning #PyTorch #StateSpaceModels #AI Experion Technologies
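The "recurrent state update" at the heart of an SSM can be shown with a tiny discrete-time example. This is a simplified sketch, not Mamba-2: the decay `a`, input map `b`, and readout `c` are fixed per-state scalars here (a diagonal, time-invariant system), whereas Mamba-style layers make such parameters depend on the input. The function names are hypothetical.

```python
def ssm_step(h, x, a, b, c):
    """One recurrent update: h_t = a * h_{t-1} + b * x_t, y_t = <c, h_t>.

    h, a, b, c are lists of equal length (one entry per state dimension);
    x is a scalar input.
    """
    new_h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
    y = sum(ci * hi for ci, hi in zip(c, new_h))
    return new_h, y


def ssm_scan(xs, a, b, c):
    """Run the recurrence over a whole sequence, starting from a zero state."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h, y = ssm_step(h, x, a, b, c)
        ys.append(y)
    return ys, h
```

With `a = [0.5]`, `b = [1.0]`, `c = [1.0]`, an impulse input `[1.0, 0.0, 0.0]` produces a response that halves at every step, which is the decay behaviour in its simplest form. At inference time only `h` has to be carried between tokens, so the per-token cost stays constant instead of growing with sequence length as a KV cache does.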
2 Comments
Anshul Kumar singh
1mo
🚀 Built a 345M Parameter Transformer Model from Scratch… in C (on a Low-End PC)

Hi everyone, I’m Anshul Kumar Singh. For the past 6–7 months, I’ve been working on a project that really pushed me: building a transformer-based AI system from scratch using C and CUDA. No PyTorch, no TensorFlow, just trying to understand how things work at the lowest level.

🔧 What I built:
• A 345M parameter GPT-style language model
• An image generation model using the same core architecture

💻 Hardware: Just 8GB RAM + a GTX 1650 (4GB VRAM)

🧠 How it works (simple idea):
If I give input like: “I love machine learning”
The model:
• Breaks it into tokens → [I, love, machine, learning]
• Converts them into vector representations (embeddings)
• Passes them through 24 transformer layers
Inside these layers, something called self-attention helps each word understand how it relates to the others. Then the model predicts the next word step by step, like: “I love machine learning” → “because it is powerful”

🖼️ Image Generation:
I also experimented with using the same transformer core for image generation. Instead of predicting the next word, the model learns patterns in visual data and generates structured outputs, showing how the same architecture can be extended beyond text.

⚙️ Some things I worked on:
• Implemented the full transformer architecture from scratch
• Built the training and inference pipeline myself
• Used CUDA to speed things up on GPU
• Focused on making it run on limited hardware

💡 Why I built this:
Most people think models at this scale need huge servers. I aimed to maximize results with limited resources and better optimization.

🚀 What I want to do next:
Make these systems more efficient, so even low-end machines can handle larger models.

🎥 Demo in the video below

🔗 Project Links:
• GitHub: https://lnkd.in/g-krtaK3
• Hugging Face: https://lnkd.in/gU8424bR

Would love to hear your thoughts or feedback!

#AI #MachineLearning #DeepLearning #Transformers #CUDA #CProgramming #BuildInPublic
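To put the hardware numbers in perspective, here is a back-of-the-envelope calculation of how much memory the weights of a 345M-parameter model need. Only the parameter count and the 4 GB VRAM figure come from the post. The 16 bytes per parameter for training is an assumption (FP32 weights, gradients, and two Adam-style moment buffers) and says nothing about how the author actually trained; activations are not counted at all.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed for num_params values at the given width, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 345_000_000  # parameter count stated in the post
VRAM_GIB = 4.0            # GTX 1650 memory stated in the post

fp32_weights = weight_memory_gib(NUM_PARAMS, 4)  # 32-bit floats
fp16_weights = weight_memory_gib(NUM_PARAMS, 2)  # 16-bit floats

# Assumption: FP32 weights + gradients + two optimizer moment buffers.
adam_style_training = weight_memory_gib(NUM_PARAMS, 16)

print(f"FP32 weights: {fp32_weights:.2f} GiB")
print(f"FP16 weights: {fp16_weights:.2f} GiB")
print(f"Adam-style training state: {adam_style_training:.2f} GiB")
```

Under these assumptions the weights alone fit comfortably in 4 GB for inference, while a naive full-precision training setup would not, which helps explain why careful memory management matters on this class of hardware.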

1 Comment

910 followers

12 Posts
