One of the biggest deals among yesterday's Meta announcements is Llama Stack. The way I describe it: the operating system for AI with Llama.

𝗪𝗵𝗮𝘁 𝗶𝘀 𝗟𝗹𝗮𝗺𝗮 𝗦𝘁𝗮𝗰𝗸? One of the biggest pain points for developers using Llama has been how to get started and how to make the whole experience more developer-friendly. To address that, Meta created a stack that standardizes the building blocks needed to bring generative AI applications to market. These blocks span the entire development lifecycle:
- model training
- fine-tuning
- AI product evaluation
- building and running AI agents in production
and more.

𝗪𝗵𝗮𝘁 𝗶𝘀 𝗶𝗻𝗰𝗹𝘂𝗱𝗲𝗱 𝗶𝗻 𝘁𝗵𝗶𝘀 𝗿𝗲𝗹𝗲𝗮𝘀𝗲?
- Llama CLI (command line interface) to build, configure, and run Llama Stack distributions
- Client code in multiple languages, including Python, Node, Kotlin, and Swift
- Docker containers for Llama Stack
- Distribution Server and Agents API Provider

𝗟𝗹𝗮𝗺𝗮 𝗦𝘁𝗮𝗰𝗸 𝗰𝗼𝗺𝗲𝘀 𝘄𝗶𝘁𝗵 𝘁𝗵𝗲 𝗳𝗼𝗹𝗹𝗼𝘄𝗶𝗻𝗴 𝗰𝗼𝗿𝗲 𝗰𝗼𝗺𝗽𝗼𝗻𝗲𝗻𝘁𝘀:
- PromptStore: for storing and managing prompts for Llama models.
- Batch Inference: for running inference requests over batches of data.
- Continual Pretraining: for continuously pretraining Llama models on new data.
- Realtime Inference: for making predictions in real time.
- Quantized Inference: for optimizing inference performance through quantization.
- Evals: for evaluating the performance of Llama models.
- Finetuning: for fine-tuning Llama models on specific tasks.
- Reward Scoring: for scoring model outputs with reward models.
- Synthetic Data Generation: for generating synthetic data with Llama models.
- Data: for managing and processing data for Llama models.
- Models: for managing and deploying Llama models.
- Hardware: for managing and optimizing hardware resources.
- Accelerators: hardware accelerators for Llama models.
- Storage: for managing and storing data.
- Safety: for ensuring the safety and reliability of Llama models.

𝗟𝗹𝗮𝗺𝗮 𝗦𝘁𝗮𝗰𝗸 𝗶𝘀 𝗮𝘃𝗮𝗶𝗹𝗮𝗯𝗹𝗲 𝗶𝗻 𝗠𝘂𝗹𝘁𝗶𝗽𝗹𝗲 𝗱𝗶𝘀𝘁𝗿𝗶𝗯𝘂𝘁𝗶𝗼𝗻𝘀:
- Single-node Llama Stack Distribution via Meta's internal implementation and Ollama
- Cloud Llama Stack Distributions via cloud providers
- On-device Llama Stack Distribution on iOS, implemented via PyTorch ExecuTorch
- On-prem Llama Stack Distribution supported by Dell

Another excellent gift for developers. I cannot wait to see what's built on top of Llama Stack and how the community embraces it.
Guide to Meta Llama Large Language Models
Summary
The “Guide to Meta Llama Large Language Models” introduces Meta’s Llama series, open-source AI models engineered to produce natural language and handle a wide variety of tasks like chatbots, content creation, and data analysis. Large language models (LLMs) like Llama use advanced algorithms to generate realistic text responses based on vast amounts of training data, making them powerful tools for developers and businesses.
- Explore distribution options: Consider running Llama models locally, in the cloud, or on devices to meet your privacy, speed, and hardware needs.
- Fine-tune for tasks: Adjust Llama models with custom data to improve their performance for specific applications such as chat, summarization, or recommendation systems.
- Check licensing restrictions: Review Meta’s open-source license to ensure your intended use—especially for commercial projects—complies with their guidelines, including user limits and model modifications.
-
If you've been using fine-tuned open-source LLMs (e.g. for generative A.I. functionality or natural-language conversations with your users), it's very likely time you switch your starting model over to Llama 2. Here's why:
• It's open-source and, unlike the original LLaMA, can be used commercially.
• Like the Alpaca and Vicuna models that used LLaMA 1 as their pretrained starting point, the "Llama 2-chat" variants are fine-tuned for chat applications (using a data set of over 1 million human annotations).
• For both pre-trained and chat-fine-tuned variants, the Llama 2 model family has four sizes: 7 billion, 13 billion (fits on a single GPU), 34 billion (not released publicly), and 70 billion model parameters (best performance on NLG benchmark tasks).
• The 70B chat-fine-tuned variant offers ChatGPT-level performance on a broad range of natural-language benchmarks (it's the first open-source model to do this convincingly; you can experience this yourself via the free Hugging Face chat interface, where Llama-2-70B-chat has become the default) and is generally now the leading open-source LLM.
• See the Llama 2 page for a table of details across 11 external benchmarks, which (according to Meta themselves, so perhaps take it with a grain of salt) shows how 13B Llama 2 is comparable to 40B Falcon, the previous top-ranked open-source LLM across a range of benchmarks. The 70B Llama 2 sets the new state of the art, on some benchmarks by a considerable margin. (N.B.: on tasks involving code or math, Llama 2 is not necessarily the best open-source option out there, however.)
• Time awareness: asking "Is the earth flat or round?" in a 2023 context versus an 800 CE context should yield different answers, and Llama 2 handles that distinction.
• It has double the context window (4k tokens) of the original LLaMA, which is a big jump from about eight pages to 16 pages of context.
• It uses a two-stage RLHF (reinforcement learning from human feedback) approach that is key to its outstanding generative capacity.
• A new method called "Ghost Attention" (GAtt) allows it to perform especially well in "multi-turn" (ongoing back-and-forth) conversation.
• Extensive safety and alignment testing (probably more extensive than for any other open-source LLM), including (again, Meta self-reported) charts from the Llama 2 technical paper showing A.I. safety violation percentages far below any other open-source LLM and even better than ChatGPT. (The exception is the 34B Llama 2 model, which perhaps explains why this is the only Llama 2 model size that Meta didn't release publicly.)

Like Hugging Face, at my company Nebula.io we've switched to Llama 2 as the starting point for our task-specific fine-tuning and have been blown away. To hear more, including implementation tips, check out today's episode! The SuperDataScience Podcast is available on all major podcasting platforms and a video version is on YouTube. I've left a comment for quick access to today's episode below ⬇️ #superdatascience #machinelearning #llms #chatgpt #generativeai
-
🥁 Llama 2 is Open Source Now. What does it entail? Let's deep dive 🥁

📒 A quick 101 on Llama 2:
⚡ Developed by Meta, successor of Llama 1.
⚡ Comes in three model sizes: 7B, 13B, and 70B parameters.
⚡ Tokens used to pretrain it: 2 trillion.
⚡ Context length: 4096.
⚡ Trained specially on chat data: 1 million human annotations.
⚡ Performance: it outperforms other open-source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests.
⚡ License: open-source and free for research and commercial use and distribution.
⛔ Caveat for commercial use: available for commercial use unless a product has more than 700 million monthly active users. Per Meta: "If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights." Furthermore, users are barred from using Llama 2 to enhance other large language models apart from Llama 2 itself. ⛔
⚡ Continuous learning: Llama-2-chat is iteratively refined using reinforcement learning from human feedback (RLHF), which includes rejection sampling and proximal policy optimization (PPO), to ensure safety and helpfulness.
📒 Key points:
⚡ Llama 2 outperforms GPT-3 and performs on par with ChatGPT in terms of chat model quality (it lags in coding, though).
⚡ As part of the Meta and Microsoft partnership, it will be available through Microsoft's Azure platform.
⚡ Qualcomm is also collaborating with Meta to integrate Llama 2 into laptops, phones, and headsets from 2024 onward.
⚡ Llama 2 will also be available through AWS, Hugging Face, and other providers.

💡 What next:
⚡ This will create intense pressure on proprietary and open-source LLM providers, given the power and accuracy of Llama 2, especially its natural-language capabilities for chatbots and virtual assistants, generating personalized content, and providing product recommendations.
⚡ It's not "true" democratization of AI, since Meta restricts direct competition by barring use of Llama 2 by companies with more than 700 million monthly active users. Would love to see this restriction removed from the agreement.
⚡ Microsoft is a key financial backer of OpenAI but is nonetheless supporting the launch of Llama 2. Will be interesting to see how it plays out.

🎩 To continue getting such interesting content/updates: https://lnkd.in/gXHP-9cW #llama2 #meta #microsoft #azure #openAI #gpt4 #generativeai #llm #opensource
-
Meta's release this week of the open source Llama 3.1 series of models is a big deal. These are the first GPT-4-quality open source models, opening the door for users previously limited by cost and/or security concerns.

The Llama 3.1 models come in three sizes: a gigantic 405B (B = billion, so the 405B model has 405 billion parameters), a midsize 70B model, and a mini 8B model. As you can see from the chart below, Scale AI (an independent firm performing rigorous AI model evaluations) currently rates the Llama 405B model as performing neck-and-neck with GPT-4o across a wide range of tasks. Even the much smaller 70B model, which runs fine on workstation-class desktops and laptops, performs close to GPT-4o, particularly on text summarization, writing, and multilingual tasks.

For those who have been holding back on the use of AI large language models because of concerns about protecting sensitive information: once installed, the Llama models can be run completely securely on a local machine, not even requiring a connection to the internet. Currently, the Llama 3.1 models are text-only, but Meta has stated that they are already working on multimodal capabilities for these models.

Here's a nice overview from Llama of these models, as well as the links to download them and associated documentation. https://lnkd.in/eGF622Pc

I have also pulled together answers to common questions about Llama 3.1 using Perplexity Pro, including detailed instructions on how to install them locally, the likely implications for OpenAI, Anthropic, Google, and other providers of frontier-level models, and a discussion of possible risks of making such powerful models available to all. https://lnkd.in/exdCZwvX

I hope you find this useful, and I look forward to hearing about your experiences if you give any of these models a try. #meta #llama31 #ai #llm #dataanalysis
-
Large Language Models (LLMs) are immensely complex Machine Learning systems, trained on the text of the entire internet, capable of generating plausible text responses to prompts. Popularized in November 2022 by OpenAI's ChatGPT, LLMs have created a wave of excitement, investment, and hype greater than any technology that I have ever seen in my 40-year career. Over 86,000 people viewed my recent post, "The Human Brain Is Not a Large Language Model". It seems that there is a widespread desire to understand LLMs, but it is hard to find friendly explanations that non-technical people can understand and relate to. So, here is my friendly yet authoritative explanation of a Large Language Model. LLMs are based on Transformers, which were first described in a famous paper, "Attention Is All You Need" in 2017 by researchers at Google. The famous block diagram of the Transformer is shown below, on the left. Unfortunately, this paper was really about Machine Translation, so the block diagram is not what is used in a modern LLM like ChatGPT or Llama-2 or Llama-3. You have to know about it, but don't use it. In the middle diagram below, I am showing a good block diagram of a Modern LLM. This diagram of a Decoder-Only Transformer was originally published by Umar Jamil, and I have added the Sampler (small purple block at the top) and Auto-Regressor (wire feeding the output back to the next input). Umar has a beautiful 70-minute video explaining how the Decoder-Only Transformer works in Llama-2. I made a nice 28-minute video too. We discuss the Embeddings, Multi-layer architecture, Self-Attention Blocks, Key-Value Cache, Layer Normalization, Rotary Positional Encoding, and final output of Next Token Probabilities. Links in the first comment. You can use this diagram as a complete abbreviated summary of how an LLM works. Finally, on the right, we have a friendly top-level summary of the middle diagram. 
We are showing all the complexity of the Decoder-Only Transformer in a single block called the Next Token Probability Distribution Predictor. Think of that block as the Billion-Dollar Machine. All that fancy machinery is just looking at the recent tokens, and producing a list of candidate next tokens and their probabilities. The colorful Pie Chart shows the candidate next tokens for the prompt "Why is the sky blue?". The next token could be "\n" (new line), "The", "What", "Why", or other less likely tokens. The Sampler is a Random Number Generator that produces a number between 0 and 1, used to choose the Next Token from the candidates in the Pie Chart. The Billion-Dollar Machine makes the Pie Chart of candidate Next Tokens, and the Sampler is like a Dart-Throwing Monkey, making the executive decision about which token to produce next. Finally, the Auto-Regressor feeds this output token back to the input of the Billion-Dollar Machine, to start the process over again for another new token. (continued briefly in first comment)
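The Billion-Dollar Machine / Sampler / Auto-Regressor loop described above can be sketched in a few lines of Python. This is a toy illustration, not anything resembling a real LLM: the probability table is entirely hypothetical (it loosely mirrors the pie-chart candidates for "Why is the sky blue?"), whereas a real model computes the distribution with the full Decoder-Only Transformer.

```python
import random

# Toy "Billion-Dollar Machine": maps the recent tokens to a next-token
# probability distribution. The table here is hypothetical, chosen only to
# mirror the pie-chart example for the prompt "Why is the sky blue?".
def next_token_probs(tokens):
    table = {
        ("Why", "is", "the", "sky", "blue", "?"):
            [("\n", 0.4), ("The", 0.3), ("What", 0.2), ("Why", 0.1)],
    }
    # Fallback: a uniform guess over the same candidate tokens.
    uniform = [("\n", 0.25), ("The", 0.25), ("What", 0.25), ("Why", 0.25)]
    return table.get(tuple(tokens[-6:]), uniform)

# The Sampler: draw r in [0, 1) and walk the cumulative distribution until
# it exceeds r -- the "dart-throwing monkey" picking a slice of the pie chart.
def sample(probs, rng):
    r = rng.random()
    cumulative = 0.0
    for token, p in probs:
        cumulative += p
        if r < cumulative:
            return token
    return probs[-1][0]

# The Auto-Regressor: feed each sampled token back as input and repeat.
def generate(prompt_tokens, n_new, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        tokens.append(sample(next_token_probs(tokens), rng))
    return tokens
```

Calling `generate(["Why", "is", "the", "sky", "blue", "?"], 3)` runs the loop three times: distribution, dart throw, feed back, repeat.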
-
Meta released its Llama 4 models Scout and Maverick yesterday (on a weekend, really!). While the tech benchmarks are impressive, what's really exciting are four enterprise-focused innovations:

Practicality Over Pure Reasoning: Unlike newer models from OpenAI, DeepSeek, or Google's Gemini, which prioritize abstract reasoning, Llama 4 is intentionally designed around efficiency and concrete use cases. This means faster deployment and a clear focus on outcomes.

Efficiency through Mixture-of-Experts (MoE): Llama 4 leverages an MoE architecture, intelligently activating specialized smaller expert networks tailored to the task at hand rather than one oversized monolithic model. The result is strong performance at significantly reduced computational demands and costs.

Making Sense of the Data Mountain (10M Tokens): Scout's 10-million-token context window is transformative. For diligence teams digging through vast data rooms or COOs analyzing extensive operational records, this capability simplifies processing massive, unstructured datasets like complex customer histories, supply chain logs, or detailed maintenance records. It empowers businesses to pinpoint performance improvements without the need for extensive analyst resources.

Seeing the Whole Picture (Multimodal Capabilities): These models natively understand both text and imagery, crucial for extracting insights hidden in graphical reports, visual operational logs, or customer feedback images. This holistic approach to data analysis unlocks actionable insights across multiple dimensions.

These are meaningful steps towards driving concrete use cases – a significant step forward for Enterprise AI.
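The MoE idea can be illustrated with a toy routing sketch. Everything below is hypothetical (the expert functions, gate weights, and top-2 routing are made up for illustration; Llama 4's actual router and expert configuration are not reproduced here). The point is only the mechanism: a learned gate scores every expert for the incoming input, and only the top-k experts actually run, so most parameters stay inactive per token.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of gate scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(x, experts, gate_weights, k=2):
    # gate_weights[i] is a hypothetical learned score vector for expert i;
    # the gate score is a dot product with the input vector x.
    scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights]
    gates = softmax(scores)
    # Activate only the k highest-gated experts (the efficiency win:
    # the other experts are never evaluated for this input).
    top = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    # Output is the gate-weighted mix of just those experts' outputs.
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)
        out = [o + (gates[i] / norm) * yi for o, yi in zip(out, y)]
    return out, top
```

With, say, three toy experts and k=2, only two of them ever run for a given input, yet their blended output still reflects the gate's learned preferences.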
-
🐪 Llama 3.1 is not just a technological advancement; it is a strategic move by Meta to 𝗱𝗲𝗺𝗼𝗰𝗿𝗮𝘁𝗶𝘇𝗲 𝗔𝗜, making powerful tools accessible to a broader audience. 🤩 Enhancements in 𝘤𝘰𝘯𝘵𝘦𝘹𝘵 𝘱𝘳𝘰𝘤𝘦𝘴𝘴𝘪𝘯𝘨, 𝘴𝘺𝘯𝘵𝘩𝘦𝘵𝘪𝘤 𝘥𝘢𝘵𝘢 𝘨𝘦𝘯𝘦𝘳𝘢𝘵𝘪𝘰𝘯, 𝘮𝘰𝘥𝘦𝘭 𝘥𝘪𝘴𝘵𝘪𝘭𝘭𝘢𝘵𝘪𝘰𝘯, 𝘮𝘶𝘭𝘵𝘪𝘭𝘪𝘯𝘨𝘶𝘢𝘭 𝘴𝘶𝘱𝘱𝘰𝘳𝘵, 𝘢𝘯𝘥 𝘴𝘦𝘤𝘶𝘳𝘪𝘵𝘺 𝘮𝘦𝘢𝘴𝘶𝘳𝘦𝘴 mark a new era in open-source AI, driving forward the capabilities and applications of large language models.

📐 𝗖𝗼𝗺𝗽𝗲𝘁𝗶𝘁𝗶𝘃𝗲 𝗘𝗱𝗴𝗲
Llama 3.1's benchmark scores are close to proprietary models like GPT-4o and Claude 3.5 Sonnet, showcasing its strong capabilities in general knowledge, math, and multilingual translation.

🤝 𝗢𝗽𝗲𝗻-𝗦𝗼𝘂𝗿𝗰𝗲 𝗜𝗺𝗽𝗮𝗰𝘁
Meta's investment in Llama 3.1 reflects its belief in the power of open-source AI. By providing a stable, customizable platform, Meta enables researchers, enterprises, and developers to innovate without sharing data with Meta. It represents a significant advancement in open-source AI, underscored by its partnership with NVIDIA and support from major cloud providers like Google Cloud, Azure, and AWS.

🔭 𝗞𝗲𝘆 𝗛𝗶𝗴𝗵𝗹𝗶𝗴𝗵𝘁𝘀
𝗘𝘅𝗽𝗮𝗻𝗱𝗲𝗱 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗪𝗶𝗻𝗱𝗼𝘄: The 16x jump (from an 8K to a 128K token context window) enables the model to process and understand much longer pieces of text, facilitating complex reasoning and improved performance on tasks requiring extensive context.
𝗦𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗗𝗮𝘁𝗮 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻: The 405-billion-parameter version allows users to create high-quality, task-specific datasets. This capability enhances model accuracy across various fields, including healthcare, finance, retail, and telecommunications.
𝗠𝗼𝗱𝗲𝗹 𝗗𝗶𝘀𝘁𝗶𝗹𝗹𝗮𝘁𝗶𝗼𝗻: The potential for model distillation enables the transfer of knowledge from the large 405B model into smaller, more efficient models. This process reduces costs and latency, making advanced AI accessible to resource-constrained environments.
𝗠𝘂𝗹𝘁𝗶𝗹𝗶𝗻𝗴𝘂𝗮𝗹 𝗦𝘂𝗽𝗽𝗼𝗿𝘁: The model supports multiple languages, enhancing its utility for a global user base.
𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗲𝗱 𝗳𝗼𝗿 𝗧𝗼𝗼𝗹 𝗨𝘀𝗲: The models are optimized for tool use, including generating tool calls for searches, image generation, code execution, and mathematical reasoning.
𝗥𝗼𝗯𝘂𝘀𝘁 𝗦𝗲𝗰𝘂𝗿𝗶𝘁𝘆 𝗠𝗲𝗮𝘀𝘂𝗿𝗲𝘀: Meta has introduced Llama Guard 3 for input and output moderation and Prompt Guard for detecting and responding to prompt injection and jailbreak inputs, supporting responsible deployment.

💻 You can download the paper here if you want to geek out even more: https://ai.meta.com/research/publications/the-llama-3-herd-of-models 💻
-
Yesterday's Llama 3.1 release marked a big milestone for LLM researchers and practitioners. Llama 3.1 405B is the biggest and most capable openly available LLM. And particularly exciting is that the new Llama release comes with a 93-page research paper this time. Below, I want to share a few interesting facts from the paper, and I will likely write a longer analysis this weekend.

Model sizes
Llama 3.1 now comes in 3 sizes: 8B, 70B, and 405B parameters. The 8B and 70B variants are slight upgrades from the previous Llama 3 models released in April 2024. (See the figure below for a brief performance comparison.) The 405B model was used to improve the 8B and 70B via synthetic data during the finetuning stages.

Pretraining Data
The 93-page report by Meta (a link to the report is in the comments below) offers amazing detail. In particular, the section on preparing the 15.6 trillion tokens for pretraining offers so much detail that it would make it possible to reproduce the dataset preparation. However, Meta doesn't share the dataset sources. All we know is that it's trained primarily on "web data." This is probably because of the usual copyright concerns and to prevent lawsuits. Still, it's a great writeup if you plan to prepare your own pretraining datasets, as it shares recipes on deduplication, formatting (removal of markdown markers), quality filters, removal of unsafe content, and more.

Long-context Support
The models support a context size of up to 128k tokens. The researchers achieved this via a multi-stage process. First, they pretrained the models on 8k context windows (due to resource constraints), followed by continued pretraining on longer 128k token windows. In the continued pretraining, they increased the context length in 6 stages. Moreover, they also observed that the finetuning data must include about 0.1% long-context instruction samples; otherwise, the long-context capabilities decline.
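As a tiny illustration of one of those data-preparation recipes, here is a minimal exact-deduplication sketch. This is my own simplification, not Meta's pipeline; the paper also describes fuzzier techniques (e.g. MinHash-style near-duplicate detection and line-level dedup) that this sketch does not implement.

```python
import hashlib

def dedup_exact(documents):
    """Keep the first occurrence of each document, dropping exact repeats."""
    seen = set()
    kept = []
    for doc in documents:
        # Hash whitespace-normalized text so trivially reformatted
        # copies of the same document collide on the same key.
        normalized = " ".join(doc.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```

Hashing instead of storing full texts keeps the `seen` set small, which matters when the corpus is measured in trillions of tokens.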
Alignment
In contrast to earlier rumors, Llama 3 was not finetuned using both RLHF with proximal policy optimization (PPO) and direct preference optimization (DPO). Following a supervised instruction finetuning stage (SFT), the models were trained only with DPO, not PPO. (Unlike in the Llama 2 paper, the researchers unfortunately didn't include a chart analyzing the improvements made via this process.) Although they didn't use PPO, they did use a reward model for rejection sampling during the instruction finetuning stage.

Inference
The 405B model required 16k H100 GPUs for training. During inference, the bfloat16 version of the model still requires 16 H100 GPUs. However, Meta also has an FP8 version that runs on a single server node (that is, 8x H100s).

Performance
You are probably curious how it compares to other models. The short answer is "very favorably," on par with GPT-4. Unfortunately, I exceeded the character limit for this LinkedIn post, so I will let the figure below speak for itself.
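The rejection-sampling idea from the alignment section is simple enough to sketch: sample several candidate responses, score each with the reward model, and keep only the best for finetuning. The `generate_fn` and `reward_fn` callables below are stand-ins I invented for illustration; in the real pipeline they are the LLM itself and a learned neural reward model.

```python
def rejection_sample(prompt, generate_fn, reward_fn, n_candidates=4):
    """Best-of-n sampling: generate n candidates, keep the highest-reward one.

    generate_fn(prompt, i) -> candidate response (i varies the sample)
    reward_fn(response)    -> scalar score from the reward model
    """
    candidates = [generate_fn(prompt, i) for i in range(n_candidates)]
    return max(candidates, key=reward_fn)
```

The selected responses then serve as high-quality training examples in the next SFT round, which is how a reward model improves the model even without PPO.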
-
Really nice, beginner-friendly, 7-step guide to fine-tuning LLMs from Unsloth! My simple breakdown 👇

🚀 Getting Started: The 7-Step Process

1️⃣ Choose Your Model & Method
For beginners, start with smaller models like Llama 3.1 (8B) and use QLoRA, which combines 4-bit quantization with LoRA to handle large models with minimal resources. This approach uses up to 4x less memory than standard methods!

2️⃣ Prepare Your Dataset
Quality matters more than quantity! Structure your data as question-answer pairs for best results. While simply dumping code data can work for certain applications, well-structured datasets generally lead to better performance.

3️⃣ Optimize Your Hyperparameters
The guide offers practical ranges for crucial settings:
>> Learning rate: 1e-4 to 5e-5 (balance between learning speed and stability)
>> Epochs: 1-3 (more than 3 reduces creativity but may decrease hallucinations)
>> Context length: start with 2048 tokens for testing

4️⃣ Avoid Common Pitfalls
>> Overfitting: when your model memorizes training data instead of learning to generalize. Solutions: reduce the learning rate, train for fewer epochs, or mix in generic datasets.
>> Underfitting: when your model doesn't learn enough from training. Solutions: increase the learning rate, train for more epochs, or use more relevant data.

5️⃣ Training
During training, aim for a loss value close to 0.5. The guide recommends:
>> per_device_train_batch_size = 2
>> gradient_accumulation_steps = 4
>> max_steps = 60 (or num_train_epochs = 1 for full runs)
>> learning_rate = 2e-4

6️⃣ Evaluation
For evaluation, you can either:
>> Vibe check: chat with the model to assess quality manually
>> Test check: set aside 20% of your data for testing
>> Use automatic evaluation tools like EleutherAI's lm-evaluation-harness

7️⃣ Save & Deploy
The fine-tuned model can be saved as a small 100MB LoRA adapter file or pushed directly to Hugging Face.
From there, you can run it using various inference engines like Ollama, vLLM, or Together via the LoRA inference feature.

💡 Why This Matters
Fine-tuning lets you create specialized AI agents that can:
>> Update domain knowledge without retraining from scratch
>> Match your desired tone and communication style
>> Optimize for specific tasks like sentiment analysis, customer service, or legal work

The most exciting part? Fine-tuning can replicate all of RAG's capabilities, but RAG can't replicate all of fine-tuning's benefits. https://lnkd.in/ggWkFMMp
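For quick reference, the guide's step-5 starting values can be collected into one framework-agnostic config. The field names below mirror the Hugging Face `TrainingArguments`-style names the guide itself uses; treat the values as starting points for experimentation, not universal defaults.

```python
# Starting hyperparameters from the Unsloth guide, as a plain config dict.
finetune_config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,   # effective batch size = 2 * 4 = 8
    "max_steps": 60,                    # or num_train_epochs = 1 for full runs
    "num_train_epochs": 1,
    "learning_rate": 2e-4,
    "max_seq_length": 2048,             # start small for testing
}

def effective_batch_size(cfg):
    # Gradient accumulation multiplies the per-device batch size,
    # trading wall-clock time for lower peak memory.
    return cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]
```

If you hit out-of-memory errors, lowering `per_device_train_batch_size` while raising `gradient_accumulation_steps` keeps the effective batch size constant.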