My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling. Finally, we load openly available pretrained weights into our scratch-built model architecture.

Along with this pretraining tutorial, I also have bonus material on speeding up the LLM training. These tips apply not just to LLMs but also to other transformer-based models like vision transformers:

1. Instead of saving the causal mask, create it on the fly to reduce memory usage (here it has minimal effect, but it can add up in long-context models like Llama 3.2 with 131k-input-tokens support)
2. Use tensor cores (only works for Ampere GPUs like A100 and newer)
3. Use the fused CUDA kernels for `AdamW` by setting
4. Use pinned (page-locked) host memory via the pinned memory setting in the data loader to speed up CPU-to-GPU transfers
5. Switch from 32-bit float to 16-bit brain float (bfloat16) precision
6. Replace from-scratch implementations of attention mechanisms, layer normalizations, and activation functions with PyTorch counterparts that have optimized CUDA kernels
7. Use FlashAttention for more efficient memory read and write operations
8. Compile the model
9. Optimize the vocabulary size
10. After saving memory with the steps above, increase the batch size

Video tutorial: https://lnkd.in/gDRycWea
PyTorch speed-ups: https://lnkd.in/gChvGCJH
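A few of the speed-ups listed above can be sketched in a handful of lines. This is a minimal illustration, not the tutorial's code: the tiny `nn.Linear` model, the random dataset, and the helper name `causal_mask` are hypothetical stand-ins, and `fused=True` and `pin_memory=True` are assumed here to be the settings the list refers to. The sketch falls back to the CPU when no GPU is present.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")


def causal_mask(num_tokens, device=None):
    # Built when needed instead of stored as a (context x context) buffer.
    # True marks the future positions a token must not attend to.
    ones = torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=device)
    return torch.triu(ones, diagonal=1)


# Allow TF32 tensor cores for float32 matmuls where the hardware has them.
torch.set_float32_matmul_precision("high")

model = nn.Linear(8, 8).to(device)  # hypothetical stand-in for the LLM

# Assumption: the fused AdamW kernels are enabled with fused=True (GPU only here).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=use_cuda)

# Assumption: pinned host memory is enabled with pin_memory=True (GPU only here).
dataset = TensorDataset(torch.randn(32, 8), torch.randn(32, 8))
loader = DataLoader(dataset, batch_size=8, pin_memory=use_cuda)

for x, y in loader:
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    # Run the forward pass in bfloat16, then compute the loss in float32.
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        pred = model(x)
    loss = nn.functional.mse_loss(pred.float(), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

mask = causal_mask(4)
print(mask)
```

Each setting is independent, so they can be switched on one at a time and benchmarked separately.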
Common PyTorch Memory Management Strategies
Summary
Common PyTorch memory management strategies are practical methods used by developers to reduce memory usage and prevent crashes when training or deploying models with PyTorch. These approaches help make models run smoothly on available hardware, even when dealing with large datasets or complex architectures.
- Use gradient checkpointing: Store only important intermediate results during training and recompute others as needed, which lets you train larger models without running out of memory.
- Streamline data pipelines: Set up asynchronous data loading and remove unnecessary operations to keep your GPU busy and prevent memory waste during training.
- Switch to mixed precision: Run calculations using lower-precision numbers where possible to cut memory usage and speed up computation, usually with little or no loss in accuracy.
-
You're in a Senior ML Interview at NVIDIA. The interviewer sets a trap: "Your 7B model fits comfortably on a 24GB GPU. Yet, 10 minutes into a conversation, the service crashes with an Out-Of-Memory (OOM) error. Do we upgrade to an A100?"

90% of candidates walk right into it: "Yes, we need more VRAM." They think: "The model is running out of space, so we need a bigger bucket."

This is the "Brute Force" approach. It solves the symptom for exactly one week until their users type longer prompts, and then they crash an 80GB card too. They just 4x'd the cloud bill without solving the physics of the problem.

The reality is that they aren't optimizing for Static Memory (Weights). They are dying from Dynamic State (Context).

In a production environment, GPU memory is consumed by two things:
- Model Weights: Fixed (e.g., ~14GB for a 7B param model in FP16).
- KV Cache: Variable. This grows linearly with every single token generated.

A 7B model with a batch size of 64 and a context length of 2048 tokens can generate over 30GB of KV cache. The "Ghost Memory" is larger than the model itself.

-----

The Solution: The real problem isn't just the size of the cache, it's Memory Fragmentation. Standard PyTorch allocates contiguous memory blocks. As requests grow and shrink, they leave "holes" in your VRAM that are too small to use but add up to gigabytes of wasted space. This is The Swiss Cheese Effect.

The fix isn't hardware. It's Architecture:
1️⃣ PagedAttention (vLLM): Treat GPU memory like an Operating System treats RAM. Break the KV cache into non-contiguous "pages" so you can fill every byte of VRAM without needing a continuous block.
2️⃣ KV Cache Offloading: If a user pauses for 30 seconds, move their KV cache to CPU RAM (cheap) and swap it back to GPU (expensive) only when they type again.

The Answer That Gets You Hired: "Buying GPUs is a band-aid. The bottleneck is the KV Cache growing linearly with context. I would implement PagedAttention to eliminate memory fragmentation and KV Offloading to handle idle sessions. We only upgrade hardware if the active computation, not the idle state, saturates the compute units."

#MachineLearning #DeepLearning #GenerativeAI #LLM #AIEngineering #MLOps #NVIDIA
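The KV cache figure above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch under assumptions the post does not state: a Llama-2-7B-like layout (32 layers, 32 KV heads, head dimension 128, no grouped-query attention), FP16 values, and a page size of 16 tokens for the paging comparison. The function names are hypothetical.

```python
import math

GIB = 1024 ** 3


def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    # Factor 2: each layer stores one key tensor and one value tensor per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


def blocks_needed(seq_len, block_size=16):
    # Number of fixed-size pages required to hold seq_len tokens.
    return math.ceil(seq_len / block_size)


# Assumed model layout (not given in the post).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(1, 1, **cfg)
total = kv_cache_bytes(64, 2048, **cfg)
print(f"{per_token / 1024 ** 2} MiB per token")        # 0.5 MiB per token
print(f"{total / GIB} GiB at batch 64, context 2048")  # 64.0 GiB

# Contiguous allocation reserves the full context per request up front;
# paging only allocates the pages a request has actually filled.
reserved_slots = 2048                  # slots reserved for a 100-token request
paged_slots = blocks_needed(100) * 16  # 7 pages of 16 slots = 112
print(reserved_slots, paged_slots)
```

Under these assumptions the cache comes to 64 GiB, well above the ~14GB of weights, and a short request wastes at most one partially filled page instead of most of a reserved context window.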
-
87GB of VRAM on an H200 for a single forward pass of SAM3. That's what you get if you take Meta's inference notebook, wrap it in a script, and run it. It works. It's also doing a lot of things you never asked it to do.

I spent some time profiling SAM3 with Nsight Systems and wrote up everything I found. Full article linked in the comments 👇.

Here's the short version (pure PyTorch, no ONNX, no TensorRT, no quantization):
➤ 8x throughput at my default batch size of 16 images
➤ 87GB → 23GB peak memory (cheap GPU class instead of H200)
➤ 100 prompts per image is now faster than 10 prompts in the baseline

Five categories of fixes got it there:
1. Async data pipeline. The GPU was sitting idle between batches. Obvious in the profiler, trivial to fix with DataLoader workers and an async writer process.
2. Dead code in the model. Universal segmentation head, geometry encoder, training-time assertions, multi-task outputs that nobody reads downstream. Tracing what the model actually does vs. what it was built to do was the single largest memory win.
3. Tensors taking the scenic route. GPU scalars flowing into torch.arange, comparisons, and assertions, each one a silent CPU round-trip mid-inference. Four issues, all traceable to one upstream tensor placed on GPU for no reason.
4. Post-processing tax. 160 boolean-masked gathers per batch, each forcing a compaction kernel and a blocking DtoH. Replacing with masked_fill collapsed 53ms of GPU work into 37µs.
5. Mixed precision ghosts. LayerNorm silently falls back to fp32 under AMP. transformer_engine's drop-in replacement recovered ~70ms across three modules. One import swap, no retraining.

The non-obvious lesson: understanding the gap between what a model can do and what you actually need it to do is most of the work. The profiler tells you where time is going. That architectural understanding tells you what's safe to cut.

👇 Full walkthrough with Nsight timelines, code, and the trace-the-tensor-upstream story is in the article, friend link in the first comment.

What's the most interesting thing you've found hiding in a profiler timeline?
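The difference between a boolean-masked gather and `masked_fill` (fix 4 above) can be illustrated on a toy tensor. This is only a sketch of the general pattern, not the SAM3 post-processing code; the tensor values and the 0.5 threshold are made up for the example.

```python
import torch

scores = torch.tensor([[0.9, 0.2, 0.7],
                       [0.1, 0.8, 0.3]])
keep = scores > 0.5

# Boolean-mask gather: the output length depends on the data, so on a GPU
# the kept elements must be compacted and their count synchronized with the host.
gathered = scores[keep]

# masked_fill: the output keeps the input shape, so no compaction and no
# host synchronization are needed; rejected entries are simply zeroed.
filled = scores.masked_fill(~keep, 0.0)

print(gathered.shape, filled.shape)
```

Whether the swap is valid depends on what the downstream code expects: it works when later steps can consume a fixed-shape tensor with neutral values in the rejected positions.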
-
This is a well-structured and practical deep dive into PyTorch performance tuning and best practices. It covers proven techniques like mixed precision, torch.compile, inference optimizations, channels-last memory format, and activation checkpointing, all aimed at squeezing maximum performance from your models. It also includes practical coding tips and data pipeline advice to ensure your PyTorch code runs fast, uses less memory, and scales effectively. Link: https://lnkd.in/gVzHxsEX
-
Activation checkpointing (also called gradient checkpointing) saves GPU memory by storing just a small set of “checkpoint” activations during the forward pass instead of keeping every intermediate tensor. When the backward pass needs those missing activations, it briefly re-runs the corresponding forward computations to regenerate them. This recomputation trades extra compute for a much lower memory footprint, making it possible to train larger models or increase batch sizes. PyTorch’s torch.utils.checkpoint and libraries like FairScale or DeepSpeed handle the bookkeeping and recomputation; you decide which layers to checkpoint, balancing memory savings against the slowdown caused by the additional forward passes.
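A minimal sketch of the mechanism described above, using `torch.utils.checkpoint.checkpoint`. The `Block` and `TinyModel` classes are hypothetical toy modules, and checkpointing every block is just one possible choice. The last lines check that the recomputed backward pass yields the same gradients as the regular one.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.net(x)


class TinyModel(nn.Module):
    def __init__(self, dim=32, depth=4, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Only the block input is kept; activations inside the block
                # are recomputed when the backward pass reaches it.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


torch.manual_seed(0)
model = TinyModel()
model.train()
x = torch.randn(8, 32)


def grads_for(use_checkpoint):
    model.use_checkpoint = use_checkpoint
    model.zero_grad(set_to_none=True)
    model(x).pow(2).mean().backward()
    return [p.grad.clone() for p in model.parameters()]


grads_ckpt = grads_for(True)
grads_plain = grads_for(False)
same = all(torch.allclose(a, b) for a, b in zip(grads_ckpt, grads_plain))
print("gradients match:", same)
```

The memory saving only becomes visible on larger models; this toy example shows the wiring and that the results are unchanged.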
-
I used to overthink distributed training setup. ZeRO. FSDP. Tensor Parallelism. Pipeline Parallelism.

Most ML engineers never train past a single GPU. Then they hit OOM and panic.

Here's the decision tree that saves you 3 days of debugging:
→ Model fits on 1 GPU? Just train
→ OOM on activations? Gradient checkpointing
→ OOM on parameters? ZeRO-3 or FSDP
→ 70B+ model? Tensor Parallelism
→ Long sequences (32K+)? Context Parallelism
→ Multi-node + max throughput? 3D/5D Parallelism

The details that matter:
1️⃣ Gradient checkpointing is free lunch
Trade compute for memory. Recompute activations during backward pass. Always try this first.
2️⃣ ZeRO-3 vs FSDP
ZeRO-3 = DeepSpeed. FSDP = PyTorch native. Same idea: shard optimizer states, gradients, parameters across GPUs. FSDP is catching up. ZeRO-3 still has more knobs.
3️⃣ Tensor Parallelism for massive models
Split layers across GPUs. Communication-heavy. Works best within a node (NVLink).
4️⃣ Pipeline Parallelism for depth
Split model stages across GPUs. Micro-batching hides latency.
5️⃣ 5D Parallelism is the endgame
Data + Tensor + Pipeline + Context + Expert. Only for 100B+ scale. Most engineers will never need this.

The uncomfortable truth: Most OOMs are solved by gradient checkpointing + smaller batch size. Not a new parallelism strategy.

Where does your training usually break? 👇
💾 Save this for your next OOM at 3am
♻️ Repost for someone who thinks they need 8 GPUs
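To see why sharding helps with an "OOM on parameters", here is a rough estimate of the per-GPU training state. It assumes the common mixed-precision Adam accounting of 16 bytes per parameter (2 for half-precision weights, 2 for gradients, 12 for the float32 master weights and the two Adam moments) and that ZeRO-3 or FSDP shards all of it evenly. Activations, buffers, and fragmentation are ignored, and the function name is hypothetical.

```python
def training_state_gb(num_params, num_gpus=1, bytes_per_param=16):
    # Assumption: weights, gradients, and optimizer states are all sharded
    # evenly across GPUs; activations are not included in this estimate.
    return num_params * bytes_per_param / num_gpus / 1e9


single_gpu = training_state_gb(7_000_000_000)
sharded = training_state_gb(7_000_000_000, num_gpus=8)
print(f"7B model, 1 GPU:  {single_gpu:.0f} GB of training state")  # 112 GB
print(f"7B model, 8 GPUs: {sharded:.0f} GB per GPU")               # 14 GB
```

The estimate also shows why gradient checkpointing alone does not help here: it reduces activation memory, which is not part of these numbers.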
-
Supercharge Your Model Training: Essential Techniques and Tricks 🚀

Are you tired of long model training times and inefficient training processes? I have always struggled to understand which techniques can be chained together for cumulative improvement, and how large the improvement from each one is. Here is an array of powerful techniques to accelerate training, with their effect sizes.

The key in most cases is to know the memory architecture of the GPU 💾 and use it optimally by reducing data movement between on-chip registers, cache, and off-chip high-bandwidth memory. Frameworks like PyTorch make this pretty simple, allowing you to do it in a few lines of code at most.

- Switch to Mixed Precision: 🔢 Implementing bfloat16 can lead to a potential 3x speedup by reducing the amount of data transferred, thus enabling larger batch sizes. Although GPUs may promise up to an 8x improvement, actual gains could be lower due to memory constraints. Benchmarking is essential!
- PyTorch Compile: 🖥️ Experience about a 2.5x speed increase by minimizing unnecessary memory bus traffic. This approach prepares your computations for more efficient execution.
- Flash Attention: ⚡ Utilize a fused kernel specifically optimized for attention-heavy models, which can boost performance by up to 40% by enhancing memory hierarchy utilization.
- Optimized Data Formats: 📊 Aligning your vocab size to a multiple of a power of 2 can provide a straightforward 10% speed boost by improving memory access efficiency.
- Hyperparameter Tuning: 🛠️ Gain an additional 5-10% speed by tweaking hyperparameters and employing fused kernels for optimizers like AdamW.
- Bespoke Fused Kernels: 🧩 Push the boundaries with custom kernels designed specifically for your model’s architecture to achieve optimal performance.
- Leverage Additional Optimizations: ➕ Employ vector operations (e.g., AVX-512) on CPUs or use sparse kernels for pruned models to further enhance memory efficiency.
- Scale Responsibly: 📈 Before moving to a multi-GPU setup, ensure you've maximized the potential of single-GPU optimizations to avoid inefficiencies. Once your setup is optimized, scaling across multiple GPUs can dramatically reduce training times by parallelizing the workload and minimizing data transfers. You can do this almost trivially by using things like Hugging Face Accelerate.

Remember, the effectiveness of these techniques can vary based on your specific model, hardware setup, and other variables. Extensive benchmarking is crucial to find the perfect balance between speed and accuracy.

Optimization is a continuous journey. Stay proactive in exploring new methods to reduce training times and remain competitive in the fast-evolving field of machine learning.

For more insights, check out Karpathy’s latest video where he replicates GPT-2 on 8x A100s, astonishingly beating GPT-3 on HellaSwag. It’s incredible to see such advancements, allowing what once took months to be accomplished virtually overnight. 🌙✨
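The vocabulary-size tip usually means padding the vocabulary up to the next multiple of a power of 2 rather than to an exact power of 2. A small sketch, assuming the GPT-2 vocabulary size of 50,257 and a multiple of 64; the helper name is hypothetical:

```python
def pad_to_multiple(vocab_size, multiple=64):
    # Round up to the next multiple; the extra token ids are never produced
    # by the tokenizer and only pad the embedding and output matrices.
    return ((vocab_size + multiple - 1) // multiple) * multiple


padded = pad_to_multiple(50257, 64)
print(padded, padded - 50257)  # 50304 47
```

The padded rows cost a little memory but give the matrix multiplications dimensions that tile more evenly on the GPU.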
-
Everyone uses PyTorch. Almost no one uses it like this. Here are 28 tricks the top 1% of engineers use regularly:

✅ Tensor Tricks
🔸 torch.einsum ↳ Compact tensor math with Einstein notation
🔸 torch.as_strided ↳ Create advanced views
🔸 tensor.unfold ↳ Sliding window views
🔸 torch.roll ↳ Circular shift (augmentations, Fourier ops)
🔸 In-place ops (trailing underscore) ↳ Memory efficient. Example: x.add_(5)

✅ Autograd & Hooks
🔸 x.register_hook ↳ Tensor hook to inspect/modify gradients
🔸 model.layer.register_forward_hook ↳ Capture activations (forward hook)
🔸 Custom autograd.Function ↳ Define custom forward/backward passes

✅ Training Hacks
🔸 loss.backward (with accumulation) ↳ Virtual larger batches
🔸 torch.cuda.amp.autocast ↳ Mixed precision, faster & less memory
🔸 p.requires_grad = False ↳ Freeze layers
🔸 optimizer.zero_grad(set_to_none=True) ↳ Faster gradient reset, lower memory traffic

✅ Data Handling
🔸 torch.utils.data.Subset ↳ Split datasets
🔸 torch.utils.data.ConcatDataset ↳ Merge datasets
🔸 worker_init_fn ↳ Seed DataLoader workers (reproducibility)
🔸 pin_memory=True ↳ Faster CPU→GPU transfers (CUDA only)

✅ Debugging & Monitoring
🔸 torch.autograd.set_detect_anomaly ↳ Detect NaNs in backward pass
🔸 torch.cuda.empty_cache ↳ Clear GPU cache
🔸 torch.cuda.memory_summary ↳ GPU memory usage report

✅ Performance Hacks
🔸 torch.compile ↳ Speed up Python training
🔸 torch.export ↳ Export for deployment
🔸 torch.utils.checkpoint.checkpoint ↳ Gradient checkpointing (save memory)
🔸 model.to(memory_format=torch.channels_last) ↳ Faster ConvNets on GPU
🔸 torch.vmap ↳ Vectorized function application

✅ Utility & Reproducibility
🔸 torch.manual_seed ↳ Set random seed (CPU)
🔸 torch.cuda.manual_seed_all ↳ Set random seed (GPU)
🔸 torch.use_deterministic_algorithms(True) ↳ Enforce strict determinism
🔸 torch.inference_mode ↳ Faster eval than no_grad

Without these, you’re leaving performance, insight, and elegance on the table. Which of these are you going to use from today?

Want to go deeper? Gradient Ascent has you covered. Join 25k+ readers from Google, Meta, Netflix, and 164+ countries worldwide.
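Several of the memory-related tricks in the list fit into one short example: an in-place op, gradient accumulation with `optimizer.zero_grad(set_to_none=True)`, and `torch.inference_mode`. The model, data, and the accumulation factor of 4 are hypothetical values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# In-place op: the trailing underscore modifies x without allocating a new tensor.
x = torch.zeros(3)
x.add_(5)

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Gradient accumulation: 4 micro-batches of 2 samples behave like one batch of 8.
accum_steps = 4
data = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]

num_updates = 0
optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(data, start=1):
    loss = nn.functional.mse_loss(model(inputs), targets)
    # Scale so the accumulated gradient is the mean over the virtual batch.
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        optimizer.step()
        # set_to_none=True frees the gradient tensors instead of zero-filling them.
        optimizer.zero_grad(set_to_none=True)
        num_updates += 1

# inference_mode disables autograd tracking for evaluation.
with torch.inference_mode():
    preds = model(torch.randn(2, 4))

print(x, num_updates, preds.requires_grad)
```

With 8 micro-batches and an accumulation factor of 4, the optimizer takes two steps while peak activation memory stays at the micro-batch size.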