If you’re an AI engineer, understanding how LLMs are trained and aligned is essential for building high-performance, reliable AI systems. Most large language models follow a three-step training procedure:

Step 1: Pretraining
→ Goal: Learn general-purpose language representations.
→ Method: Self-supervised learning on massive unlabeled text corpora (e.g., next-token prediction).
→ Output: A pretrained LLM, rich in linguistic and factual knowledge but not grounded in human preferences.
→ Cost: Extremely high (billions of tokens, trillions of FLOPs).
→ Pretraining remains centralized in a few labs due to the scale required (e.g., Meta, Google DeepMind, OpenAI), but open-weight models like Llama 4, DeepSeek V3, and Qwen 3 are making it more accessible.

Step 2: Finetuning (Two Common Approaches)
→ 2a: Full-Parameter Finetuning
- Updates all weights of the pretrained model.
- Requires significant GPU memory and compute.
- Best when the model needs deep adaptation to a new domain or task.
- Used for: instruction following, multilingual adaptation, industry-specific models.
- Cons: expensive and storage-heavy.
→ 2b: Parameter-Efficient Finetuning (PEFT)
- Adds and updates only a small set of parameters (e.g., via LoRA, Adapters, or IA³); the base model stays frozen.
- Much cheaper, ideal for rapid iteration and deployment.
- Multi-LoRA architectures (e.g., in Fireworks AI or Hugging Face PEFT) host multiple finetuned adapters on the same base model, drastically reducing serving cost and latency.

Step 3: Alignment (Usually via RLHF)
Pretrained and task-tuned models can still produce unsafe or incoherent outputs. Alignment ensures they follow human intent. RLHF (Reinforcement Learning from Human Feedback) involves:
→ Step 1: Supervised Fine-Tuning (SFT)
- Human labelers craft ideal responses to prompts.
- The model is fine-tuned on this dataset to mimic helpful behavior.
- Limitation: costly and not scalable on its own.
→ Step 2: Reward Modeling (RM)
- Humans rank multiple model outputs per prompt.
- A reward model is trained to predict these human preferences.
- This provides a scalable, learnable signal of what “good” looks like.
→ Step 3: Reinforcement Learning (e.g., PPO, DPO)
- The LLM is trained against the reward model’s feedback.
- Algorithms like Proximal Policy Optimization (PPO) or the newer Direct Preference Optimization (DPO) iteratively improve model behavior.
- DPO is gaining popularity over PPO for being simpler and more stable: it optimizes directly on preference pairs, with no separate reward model or on-policy sampling loop.

Key Takeaways:
→ Pretraining = general knowledge (expensive)
→ Finetuning = domain or task adaptation (customize cheaply via PEFT)
→ Alignment = make it safe, helpful, and human-aligned (still labor-intensive but improving)

Save the visual reference, and follow me (Aishwarya Srinivasan) for more no-fluff AI insights ❤️
PS: Visual inspiration: Sebastian Raschka, PhD
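The DPO objective mentioned above fits in a few lines. A minimal, illustrative sketch in plain Python (per preference pair, with scalar summed log-probs; real trainers compute these from token-level model outputs and batch them):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    pi_* are summed log-probs of the chosen/rejected responses under the
    policy being trained; ref_* are the same quantities under the frozen
    reference model. No reward model or sampled trajectories are needed:
    the loss is computed directly from logged preference pairs."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy already favors the chosen response relative to the reference: low loss.
# Policy favors the rejected response instead: high loss.
assert dpo_loss(-10.0, -14.0, -12.0, -12.0) < dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

The variable names and the choice of `beta=0.1` are illustrative defaults, not prescriptions; in practice beta is a tuned hyperparameter controlling how far the policy may drift from the reference model.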
Natural Language Goal Alignment in Large Language Models
Explore top LinkedIn content from expert professionals.
Summary
Natural language goal alignment in large language models refers to the process of training AI systems to understand and reliably follow human instructions, preferences, and values when generating text. This ensures that the models respond in ways that are safe, factual, and consistent with what people actually want and expect.
- Gather diverse feedback: Collect input from a wide range of users and contexts to make sure AI systems understand varied human preferences and avoid cultural biases.
- Refine through reward modeling: Use ranking-based feedback and reward signals to teach models which responses are more desirable, guiding them to improve their output quality.
- Prioritize factual accuracy: Incorporate training methods that focus on reducing false or misleading information so that AI-generated text remains trustworthy and reliable.
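The ranking-based feedback described above is typically turned into a training signal with a pairwise (Bradley-Terry) loss. A minimal sketch in plain Python, with scalar scores standing in for a real reward model's outputs:

```python
import math

def reward_pair_loss(score_chosen, score_rejected):
    """Pairwise (Bradley-Terry) loss for reward-model training: model the
    probability that the preferred response wins as
    sigmoid(score_chosen - score_rejected), and minimize its negative log."""
    p_win = 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))
    return -math.log(p_win)

# The loss shrinks as the reward model scores the preferred response higher,
# and grows when it ranks the pair the wrong way around.
assert reward_pair_loss(3.0, 0.0) < reward_pair_loss(1.0, 0.0) < reward_pair_loss(0.0, 1.0)
```

Only the *difference* between the two scores matters, which is why ranked comparisons are enough to train a usable reward model without absolute quality labels.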
-
🦄 Today we're releasing Community Alignment - the largest open-source dataset to align LLMs with people's preferences in a variety of cultural contexts, containing ~200k comparisons from >3000 annotators in 5 countries and languages! There was a lot of research that went into this... 🧵

🔍 We started by conducting a joint human study and model evaluation with 15,000 nationally-representative participants from 5 countries & 21 LLMs. We found that the LLMs exhibited an *algorithmic monoculture*, where all models aligned with the same minority of human preferences.

🚫 Standard alignment methods fail to learn common human preferences (as identified from our joint human-model study) from existing preference datasets, because the candidate responses that people choose from are too homogeneous, even when they are sampled from multiple models.

🥭 Intuitively, if all the candidate responses only cover one set of values, then you'll never be able to learn preferences outside of those values. It's like being asked to pick between four types of apples: if what I really want is a mango, you won't be measuring that.

🌈 To produce more diverse candidate sets, rather than sampling them independently, you want some kind of "negatively-correlated (NC) sampling", where sampling one candidate makes other similar ones less likely. It turns out prompting can implement this decently well, with win rates jumping from random chance to ~0.8 🤡

💽 Finally, based on these insights we collect and open-source (CC-BY 4.0) the Community Alignment (CA) dataset. Features include:
- NC-sampled candidate responses
- Multilingual (64% non-English)
- >2500 prompts annotated by >= 10 people
- Natural language explanations for >1/4 of choices
and more!
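The post implements NC sampling via prompting; as a rough illustration of the underlying idea (each pick suppresses near-duplicates of what was already chosen), here is a greedy sketch in plain Python with a made-up similarity matrix, not the method from the paper:

```python
def nc_select(candidates, sim, k):
    """Greedy stand-in for negatively-correlated sampling: each pick is the
    candidate whose maximum similarity to the already-chosen set is lowest,
    so choosing one response makes close variants of it unlikely."""
    chosen = [0]  # seed with the first candidate
    while len(chosen) < k:
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        best = min(remaining, key=lambda i: max(sim[i][j] for j in chosen))
        chosen.append(best)
    return [candidates[i] for i in chosen]

# Toy similarity matrix: responses 0 and 1 are near-duplicates.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.5],
    [0.2, 0.4, 0.5, 1.0],
]
picks = nc_select(["apple A", "apple B", "mango", "banana"], sim, 2)
assert picks == ["apple A", "mango"]  # the near-duplicate "apple B" is skipped
```

Independent sampling would happily return both apples; anti-correlating the picks is what lets annotators express a preference for the mango at all.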
This was a big project and collective effort spanning FAIR, AI at Meta, Meta Governance, and Meta Policy, as well as NYU and Ecole Polytechnique -- major thanks to all the collaborators (see paper) and especially the amazing Smitha Milli and Kris R., who led this project masterfully from start to finish. Also, thanks to Joelle Pineau, Rob Fergus, Stephane Kasriel, and Rob Sherman for their support 🙏 And this is not the end! 😉 If you want to support us in doing more of these releases, email communityalignment@meta.com (or me) with feedback on what you liked about CA and what you want to see more of.
Paper: https://lnkd.in/ejJqGQfS
Dataset: https://lnkd.in/e5Vp6z2E
-
𝗗𝗲𝗲𝗽 𝗗𝗶𝘃𝗲 𝗶𝗻𝘁𝗼 𝗥𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀

A very enlightening survey authored by a team of researchers specializing in computer vision and NLP. It underscores that pretraining, while fundamental, only sets the stage for LLM capabilities, and highlights 𝗽𝗼𝘀𝘁-𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗺𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝗺𝘀 (𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴, 𝗿𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗹𝗲𝗮𝗿𝗻𝗶𝗻𝗴, 𝗮𝗻𝗱 𝘁𝗲𝘀𝘁-𝘁𝗶𝗺𝗲 𝘀𝗰𝗮𝗹𝗶𝗻𝗴) as the real game-changer for aligning LLMs with complex real-world needs. It offers:
◼️ A structured taxonomy of post-training techniques
◼️ Guidance on challenges such as hallucinations, catastrophic forgetting, reward hacking, and ethics
◼️ Future directions in model alignment and scalable adaptation
In essence, it's a playbook for making LLMs truly robust and user-centric.

𝗞𝗲𝘆 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆𝘀

𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴 𝗕𝗲𝘆𝗼𝗻𝗱 𝗩𝗮𝗻𝗶𝗹𝗹𝗮 𝗠𝗼𝗱𝗲𝗹𝘀
While raw pretrained LLMs capture broad linguistic patterns, they may lack domain expertise or the ability to follow instructions precisely. Targeted fine-tuning methods such as Instruction Tuning and Chain-of-Thought Tuning unlock more specialized, high-accuracy performance for tasks ranging from creative writing to medical diagnostics.

𝗥𝗲𝗶𝗻𝗳𝗼𝗿𝗰𝗲𝗺𝗲𝗻𝘁 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗳𝗼𝗿 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁
The authors show how RL-based methods (e.g., RLHF, DPO, GRPO) turn human or AI feedback into structured reward signals, nudging LLMs toward higher-quality, less toxic, and more logically sound outputs. This structured approach helps mitigate "hallucinations" and ensures models better reflect human values and domain-specific best practices.

⭐ 𝗜𝗻𝘁𝗲𝗿𝗲𝘀𝘁𝗶𝗻𝗴 𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀
◾ 𝗥𝗲𝘄𝗮𝗿𝗱 𝗠𝗼𝗱𝗲𝗹𝗶𝗻𝗴 𝗜𝘀 𝗞𝗲𝘆: Rather than using absolute numerical scores, ranking-based feedback (e.g., pairwise preferences or partial ordering of responses) often gives LLMs a crisper, more nuanced way to learn from human annotations.
◾ Process vs. Outcome Rewards: It's not just about the final answer; rewarding each step in a chain-of-thought fosters transparency and better "explainability."
◾ 𝗠𝘂𝗹𝘁𝗶-𝗦𝘁𝗮𝗴𝗲 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴: The paper discusses iterative techniques that combine RL, supervised fine-tuning, and model distillation. This multi-stage approach lets a single strong "teacher" model pass its refined skills on to smaller, more efficient architectures, democratizing advanced capabilities without requiring massive compute.
◾ 𝗣𝘂𝗯𝗹𝗶𝗰 𝗥𝗲𝗽𝗼𝘀𝗶𝘁𝗼𝗿𝘆: The authors maintain a GitHub repo tracking the rapid developments in LLM post-training, great for staying up to date on the latest papers and benchmarks.

Source: https://lnkd.in/gTKW4Jdh
☃ To continue getting such interesting Generative AI content/updates: https://lnkd.in/gXHP-9cW
#GenAI #LLM #AI RealAIzation
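The process-vs-outcome distinction is easy to see in code. A toy sketch in plain Python (the step checker only verifies simple arithmetic equalities and is purely illustrative; real process reward models are learned):

```python
def outcome_reward(steps, check):
    """Outcome reward: score only the final step of the chain of thought."""
    return 1.0 if check(steps[-1]) else 0.0

def process_reward(steps, check):
    """Process reward: score every step, so partially correct reasoning
    still earns partial credit and errors are localized to a step."""
    return sum(check(s) for s in steps) / len(steps)

def check_step(step):
    # Toy verifier: a step "a<op>b=c" holds iff evaluating the left side gives c.
    lhs, rhs = step.split("=")
    return eval(lhs) == int(rhs)

chain = ["2+2=4", "4*3=12", "12-5=8"]  # the last step is wrong
assert outcome_reward(chain, check_step) == 0.0
assert abs(process_reward(chain, check_step) - 2 / 3) < 1e-9
```

The outcome reward throws away the two correct steps, while the process reward both credits them and pinpoints where the chain went wrong, which is exactly the transparency argument made above.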
-
FLAME: Factuality-Aware Alignment for Large Language Models

Alignment is a standard procedure to fine-tune pre-trained large language models (LLMs) to follow natural language instructions and serve as helpful AI assistants. We have observed, however, that the conventional alignment process fails to enhance the factual accuracy of LLMs and often leads to the generation of more false facts (i.e., hallucination). In this paper, we study how to make the LLM alignment process more factual by first identifying factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL). In particular, we find that training the LLM on new knowledge or unfamiliar texts can encourage hallucination. This makes SFT less factual, as it trains on human-labeled data that may be novel to the LLM. Furthermore, reward functions used in standard RL can also encourage hallucination, because they guide the LLM to provide more helpful responses on a diverse set of instructions, often preferring longer and more detailed responses. Based on these observations, we propose factuality-aware alignment, comprising factuality-aware SFT and factuality-aware RL through direct preference optimization. Experiments show that our proposed factuality-aware alignment guides LLMs to output more factual responses while maintaining instruction-following capability.
-
This week I experimented with language model alignment, releasing my own fine-tuned version of Llama-3.2-1B that incorporates both instruction tuning and preference alignment!

Understanding how language models are created and refined is crucial. The process follows three major steps:
1. Pre-training - The foundation, where the original model learns natural language understanding and text generation from a massive corpus of text
2. Supervised Fine-Tuning (SFT) - The pre-trained model is adapted for task-specific performance, such as instilling domain knowledge or instruction tuning for chat-based interactions
3. Preference Alignment - The final refinement stage, where the fine-tuned model learns to distinguish between good and bad responses based on human preferences

An important insight about fine-tuning: most discussions of "fine-tuning" refer to Supervised Fine-Tuning, and while SFT effectively instills domain-specific knowledge and capabilities, it has one major limitation - the model learns to generate responses that are structurally similar to the human-written answers, but SFT never explicitly discourages unwanted responses.

This is where preference alignment becomes critical. By collecting feedback on chosen versus rejected generations for a prompt, we can align the model to consistently produce the more favorable, "human-preferred" response in a final training stage. This step is crucial not just for generating specific responses, but for generating responses of a specific quality.

Want to dive deeper? Check out the details of how this is done, including the specific techniques and papers I applied to align my own LLM, in my latest video: https://lnkd.in/ekVSWiDf
Make AI Think Like YOU: A Guide to LLM Alignment
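One concrete way to see the three stages above is by the shape of the data each one consumes. A hedged sketch (the example strings and field names are made up for illustration; real datasets vary in schema):

```python
# Stage 1, pre-training: raw text, learned via next-token prediction.
pretraining_example = "The quick brown fox jumps over the lazy dog."

# Stage 2, SFT: prompt/response pairs the model learns to imitate.
sft_example = {
    "prompt": "Explain LoRA in one sentence.",
    "response": "LoRA trains small low-rank adapter matrices while the base model stays frozen.",
}

# Stage 3, preference alignment: the same prompt with a chosen and a
# rejected completion. Training on the *gap* between them is what SFT,
# which only ever sees good answers, cannot do.
preference_example = {
    "prompt": sft_example["prompt"],
    "chosen": sft_example["response"],
    "rejected": "LoRA means retraining every weight of the model.",  # plausible-sounding but wrong
}

assert set(preference_example) == {"prompt", "chosen", "rejected"}
```

Notice that only the third format contains a negative example: that is the structural reason preference alignment can discourage unwanted responses while SFT alone cannot.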
-
🚀 Architecture of LLM Fine-Tuning: SFT vs RLHF

Supervised fine-tuning and reinforcement fine-tuning are two key stages used to align large language models for real-world use. This architecture shows how both approaches work internally and why both are needed.

👉 Supervised Fine-Tuning (SFT)
The model is trained on labeled input and output pairs written by humans. The goal is to teach the model what the correct answer looks like by imitation.
Flow
- Task input goes into the pretrained model.
- Model generates output.
- Output is compared with a reference answer.
- Cross-entropy loss is calculated.
- Model weights are updated to reduce the loss.
When to use SFT
- When you have high-quality labeled data.
- When the task has a clear correct answer.
- When you want to improve accuracy on a specific task or domain.
Examples
- Training a support bot using past customer questions and correct agent replies.
- Fine-tuning a medical summarization model using doctor-written summaries.
- Training a legal document classifier using labeled contracts and clauses.
SFT improves accuracy and task understanding, but it only learns what exists in the dataset.

👉 Reinforcement Fine-Tuning (RLHF or RLAIF)
The model is optimized using feedback instead of explicit labels. The goal is to align the model with human preferences like helpfulness, safety, tone, and usefulness.
Flow
- Task input goes into the pretrained model.
- Model generates multiple outputs.
- A reward model scores them based on preference or quality.
- The model is updated to maximize the reward signal.
When to use RLHF
- When quality is subjective and hard to label.
- When behavior, tone, or safety matters more than exact correctness.
- When you want to optimize for human satisfaction or policy constraints.
Examples
- Chatbots trained to be polite, safe, and helpful based on user feedback.
- Content moderation models trained to avoid toxic or unsafe responses.
- Recommendation systems trained to maximize engagement or satisfaction.
RLHF improves behavior, style, and alignment, but is more complex and expensive.

💡 Simple difference
- SFT teaches the model what is correct.
- RLHF teaches the model what is preferred.

🎯 In practice, both are used together.
- First SFT teaches the task.
- Then RLHF aligns the behavior.
That combination is what turns a raw model into a production-ready AI system.

➕ Follow Shyam Sundar D. for practical learning on Data Science, AI, ML, and Agentic AI
📩 Save this post for future reference
♻ Repost to help others learn and grow in AI
#AI #MachineLearning #GenAI #LLM #LLMOps #AgenticAI #DataScience #DeepLearning #AIAgents #AIOrchestration #RAGPipeline #LLMPipeline #MLOps #MLArchitecture
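The two flows above differ only in where the training signal comes from: a fixed reference answer versus a learned score over candidates. A minimal sketch in plain Python (the "reward model" here is a stand-in dictionary lookup, not a real scorer):

```python
import math

def sft_loss(token_probs, reference_idx):
    """SFT signal: cross-entropy against a single reference answer,
    i.e. the negative log-probability the model assigns to the labeled
    token. The gradient pushes probability mass toward that label."""
    return -math.log(token_probs[reference_idx])

def rlhf_pick(outputs, reward_model):
    """RLHF signal: no reference answer. A reward model scores whole
    candidate outputs, and training pushes the policy toward the
    higher-scoring ones (here we just select the best candidate)."""
    return max(outputs, key=reward_model)

# SFT: the more probability mass on the reference token, the lower the loss.
assert sft_loss([0.1, 0.7, 0.2], 1) < sft_loss([0.1, 0.2, 0.7], 1)

# RLHF: a stand-in reward model that prefers the polite reply.
scores = {"Answer it yourself.": -1.0, "Happy to help with that!": 1.0}
assert rlhf_pick(list(scores), scores.get) == "Happy to help with that!"
```

This is the "correct vs preferred" distinction in code: `sft_loss` needs a ground-truth index to exist, while `rlhf_pick` only needs a way to rank candidates, which is why RLHF works for subjective qualities like tone and safety.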