Pretraining Strategies for Large Language Models

Summary

Pretraining strategies for large language models involve methods to train these AI systems on massive datasets to understand and generate human-like text. These techniques optimize both the training process and resource requirements, ultimately improving model performance and making advanced AI more accessible.

  • Focus on efficient compute: Use techniques like memory optimization, mixed precision training (e.g., bfloat16), and tensor cores to accelerate training while reducing resource consumption; a minimal sketch follows this summary.
  • Explore alternative data strategies: Utilize smaller, lower-resource models for fine-tuning or synthetic data generation to improve training efficiency and overcome data limitations.
  • Rethink scaling methods: Consider evolving traditional scaling laws by implementing techniques such as sparse model architectures, modular components like mixture-of-experts, or even scaling test-time compute capabilities.
Summarized by AI based on LinkedIn member posts
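
To make the compute-efficiency bullet above concrete, here is a minimal PyTorch sketch of mixed precision training with bfloat16 autocast and TF32 tensor cores; the model, sizes, and learning rate are placeholders rather than anything taken from the posts below.

```python
import torch

# Allow TF32 tensor-core matmuls (Ampere-class GPUs such as the A100 and newer).
torch.set_float32_matmul_precision("high")

model = torch.nn.Linear(1024, 1024).cuda()          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

# bfloat16 autocast: the forward pass runs in 16-bit brain float,
# while the master weights and optimizer state stay in fp32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```
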
  • Sebastian Raschka, PhD

    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

    My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling. And finally, we also load openly available pretrained weights into our scratch-built model architecture.

    Along with this pretraining tutorial, I also have bonus material on speeding up the LLM training. These tips apply not just to LLMs but also to other transformer-based models like vision transformers:

    1. Instead of saving the causal mask, create the causal mask on the fly to reduce memory usage (here it has minimal effect, but it can add up in long-context models like Llama 3.2 with 131k-input-token support)
    2. Use tensor cores (only works for Ampere GPUs like the A100 and newer)
    3. Use the fused CUDA kernels for `AdamW` by setting `fused=True`
    4. Pre-allocate and re-use GPU memory via the pinned memory setting in the data loader
    5. Switch from 32-bit float to 16-bit brain float (bfloat16) precision
    6. Replace from-scratch implementations of attention mechanisms, layer normalizations, and activation functions with PyTorch counterparts that have optimized CUDA kernels
    7. Use FlashAttention for more efficient memory read and write operations
    8. Compile the model
    9. Optimize the vocabulary size
    10. After saving memory with the steps above, increase the batch size

    Video tutorial: https://lnkd.in/gDRycWea
    PyTorch speed-ups: https://lnkd.in/gChvGCJH
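
    As a rough illustration of several of the items above - not the tutorial's actual code - here is a hedged PyTorch sketch combining a pinned-memory data loader (4), fused AdamW (3), an on-the-fly causal mask with FlashAttention-backed scaled_dot_product_attention (1, 7), bfloat16 autocast (5), and torch.compile (8); the toy model and hyperparameters are placeholders.

    ```python
    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset

    # (4) Pinned host memory lets CPU->GPU copies overlap with compute.
    dataset = TensorDataset(torch.randn(256, 64, 256))   # toy pre-embedded sequences
    loader = DataLoader(dataset, batch_size=8, pin_memory=True, num_workers=2)

    class TinyCausalSelfAttention(torch.nn.Module):
        """Toy attention block: (1) no stored mask, (7) FlashAttention via SDPA."""
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.n_heads = n_heads
            self.qkv = torch.nn.Linear(d_model, 3 * d_model)
            self.proj = torch.nn.Linear(d_model, d_model)

        def forward(self, x):
            b, t, d = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
            # is_causal=True builds the causal mask on the fly and lets PyTorch
            # dispatch to the FlashAttention kernel when it is available.
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            return self.proj(y.transpose(1, 2).reshape(b, t, d))

    model = torch.compile(TinyCausalSelfAttention().cuda())                 # (8) compile
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)  # (3) fused kernel

    for (batch,) in loader:
        batch = batch.cuda(non_blocking=True)    # non_blocking pairs with pin_memory
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):      # (5) bfloat16
            loss = model(batch).pow(2).mean()    # placeholder loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        break
    ```
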

  • Charles H. Martin, PhD

    AI Specialist and Distinguished Engineer (NLP & Search). Inventor of weightwatcher.ai. TEDx Speaker. Need help with AI? #talkToChuck

    🔫 One-shot Entropy Minimization (EM) replaces large-scale RL in LLM post-training: "We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled datum and 10-step optimization to achieve performance improvements greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models."

    📊 Experimental Highlights:
    • 13,440 models tested: 1 unlabeled example + 10 EM steps beats RL by +24.7 avg math, +25.8 MATH500, +26.2 AMC23
    • Fully unsupervised: no labels or rewards, just EM
    • Assumes high-quality pretrained checkpoints

    🔎 Key Findings: logits become heavy-tailed
    • Heavier tail → better answers
    • Best at ≤ 10 steps; works pre-RL, not post
    • > 10 steps degrade performance
    • Better base, better results (e.g., Qwen2.5-Math-7B)

    🔗 Paper: https://lnkd.in/g2yV8qff
    🐙 GitHub: https://lnkd.in/gWB5FBzv
    🐦 Source: https://lnkd.in/gvA92MFs
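
    A minimal sketch of what the entropy-minimization objective described above could look like in practice (this is not the paper's code; the checkpoint, prompt, and hyperparameters are placeholders, and the post reports results with stronger bases such as Qwen2.5-Math-7B): sample a continuation for one unlabeled prompt, then take about 10 gradient steps minimizing the average token-level entropy of the model's predictive distribution.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder small checkpoint, used here only so the sketch fits on one GPU.
    model_name = "Qwen/Qwen2.5-0.5B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    prompt = "What is the sum of the first 100 positive integers?"   # one unlabeled datum
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    prompt_len = inputs["input_ids"].shape[1]

    # Sample one response from the current model; EM uses no labels or rewards.
    with torch.no_grad():
        gen = model.generate(**inputs, max_new_tokens=128, do_sample=True)

    for step in range(10):                       # the post: ~10 steps is the sweet spot
        logits = model(input_ids=gen).logits[:, prompt_len - 1 : -1, :]  # response positions
        log_probs = torch.log_softmax(logits.float(), dim=-1)
        # Average token-level Shannon entropy of the predictive distribution.
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        entropy.backward()
        optimizer.step()
        optimizer.zero_grad()
    ```
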

  • Aleksei Dolgikh

    Aleksei Dolgikh $DLGH CVO Scout Investors Venture Capital 2025 PE FO LP GP. CALENDAR: tinyurl.com/DOLGIKH GLOCAL ORM: International Search Visibility Transactional Traffic - 24SIX9, ITIL, CNCF, ICANN, GITEX, BANKS, OSINT

    A recent paper by Google DeepMind researchers, including Hritik Bansal, Arian Hosseini, Rishabh A., Vishal M. Patel, and Mehran Kazemi, reveals that training #LargeLanguageModels (#LLMs) on data generated by smaller, less resource-intensive models can yield superior performance. This approach challenges the conventional wisdom of using larger, more expensive models to generate fine-tuning data. Here's a deep dive into their findings:

    🔹 Findings: Models fine-tuned on data from weaker, cheaper models outperform those trained on data from stronger models across benchmarks.
    🔹 Implications: This could revolutionize how we approach #LLMtraining, making AI more accessible and efficient.
    🔹 Key Takeaways:
    - Compute-optimal sampling from smaller models provides better coverage and diversity.
    - Efficiency in AI doesn't always require the biggest models.

    For those interested in AI efficiency and model compression, here are some techniques to consider:
    1. #Pruning - Reducing less important weights in neural networks.
    2. #Quantization - Lowering the precision of model parameters for efficiency.

    #ArtificialIntelligence #MachineLearning #DeepLearning #ModelCompression #AIResearch #GoogleDeepMind #HritikBansal #AO
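
    The two compression techniques listed above can be sketched with standard PyTorch utilities; this toy example (the model is a placeholder, unrelated to the DeepMind paper) applies L1 magnitude pruning via torch.nn.utils.prune and then post-training dynamic quantization of the Linear layers to int8.

    ```python
    import torch
    import torch.nn.utils.prune as prune

    # Toy model standing in for a much larger network.
    model = torch.nn.Sequential(
        torch.nn.Linear(512, 512),
        torch.nn.ReLU(),
        torch.nn.Linear(512, 128),
    )

    # 1. Pruning: zero out the 30% smallest-magnitude weights in each Linear layer.
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")    # bake the sparsity into the weight tensor

    # 2. Quantization: store Linear weights as int8 and dequantize on the fly at inference.
    quantized = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 512)
    print(quantized(x).shape)   # torch.Size([1, 128])
    ```
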

  • Daniel Han

    Co-founder @ Unsloth AI

    Ilya Sutskever gave a talk at NeurIPS about the post-pretraining world - here's my take on his talk.

    Ilya is implying we need to find something else to scale - the brain–body mass ratio graph in the talk showed human intelligence "scaled" better than that of other mammals, and LSTMs got out-scaled by transformers - the goal is to "edit" the scaling laws to make them more efficient. Evolution somehow first tried scaling intelligence for mammals, then pushed the frontier up for non-human primates. Large elephants which exceeded the 700-gram wall went extinct in the end. Then hominids came along, broke the wall, and scaled far better.

    (A) Kaplan et al's scaling laws show that if we increase TRAINING compute = N (# parameters) * D (# tokens / data), the test loss also decreases in a log-log setting.

    (A)* Instead of scaling TRAINING compute, Sutskever mentioned we can scale TEST TIME compute through search, or like O1 / QwQ etc.

    (B) First on D (scaling data). There exists a theoretical "Data Wall", which is when all the data in the world (the internet and everything else) gets consumed by large models. Once we reach that point, we have to find ways to overcome this barrier so models can continue to scale. This could mean Synthetic Data Generation, as Sutskever mentioned - literally using a trained model to augment datasets. The question is whether this will plateau or keep scaling. Another approach is to make data scaling more efficient through better filtering, like the FineWeb dataset. We can also do more RL & post-training via DPO, PPO etc. to squeeze more performance out of the same amount of tokens.

    (C) Second on N (# of parameters) - the trick is to move to active parameters instead of total parameters. Large labs like OpenAI replaced the MLP / FFNs in dense transformers with MoE layers. Instead of doing huge matrix multiplies, we smartly select only a few column groups to multiply and leave the rest as 0. Coincidentally, Meta released multiple papers, including one on Byte Latent Transformers and one on Memory Layers. BLTs edit the scaling laws themselves by changing the definition of "tokens" in data scaling and also adding more to the non-embedding parameters.

    (D) Memory Layers are what really interested me! They are essentially sparse lookup tables, first devised as Product Key layers in Lample et al's paper. We replace the FFN MLP with a gigantic learnable matrix of size (100M, d) called V (Values), then select only the top K rows of V (say 4) and combine them with a softmax-weighted sum.

    A long post, but my final take is that Ilya is saying we need to find something else to scale. This could be:
    1) Scaling test-time compute instead, via search, agents, O1-style reasoning
    2) Changing the architecture while holding training compute constant, like MoEs, Memory+ layers etc.
    3) Changing the scales for scaling laws, i.e. like BLTs
    4) Breaking the Data Wall via Synthetic Data Generation, RL, filtering etc.
    5) Or something else!

    You can watch Ilya's talk here: https://lnkd.in/gPS7mtsm
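
    As a rough sketch of the sparse-lookup idea in (D) - not Meta's or Lample et al's implementation, and with a far smaller table than the (100M, d) values matrix mentioned above - the layer below scores a query against learnable keys, keeps the top-K slots, and returns a softmax-weighted sum of the corresponding value rows, so only K rows are touched per token.

    ```python
    import torch
    import torch.nn.functional as F

    class TinyMemoryLayer(torch.nn.Module):
        """Simplified memory layer: a large learnable value table read via top-K lookup.

        Real memory / product-key layers factorize the key search so the table can
        reach ~100M rows; this flat version only shows the mechanics.
        """
        def __init__(self, d_model=256, n_slots=4096, top_k=4):
            super().__init__()
            self.top_k = top_k
            self.keys = torch.nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
            self.values = torch.nn.Parameter(torch.randn(n_slots, d_model) * 0.02)

        def forward(self, x):                        # x: (batch, seq, d_model)
            scores = x @ self.keys.T                 # (batch, seq, n_slots)
            top_scores, top_idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(top_scores, dim=-1)  # softmax over the K selected slots only
            selected = self.values[top_idx]          # (batch, seq, K, d_model), sparse read
            return (weights.unsqueeze(-1) * selected).sum(dim=-2)

    layer = TinyMemoryLayer()
    out = layer(torch.randn(2, 8, 256))
    print(out.shape)   # torch.Size([2, 8, 256])
    ```
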
