Understanding LLM Self-Routing in Inference

Summary

Understanding LLM self-routing in inference means using smart systems that decide which AI model should handle a user's query, based on how tough the question is and what each model does best. This approach helps balance response quality and computing costs by assigning simpler tasks to cheaper models and tougher questions to more powerful, expensive models.

Assess query difficulty: Build your routing system to analyze each incoming question and decide which model is best suited for the job.
Mix and match models: Combine specialized and general-purpose AI models so your system can pick the right one for each type of request.
Monitor performance: Regularly check how your routing decisions impact costs and response quality, and adjust your setup as your needs change.

Summarized by AI based on LinkedIn member posts

Zain Hasan

I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰

19,926 followers 1y
You don't need a 2 trillion parameter model to tell you the capital of France is Paris. Be smart and route between a panel of models according to query difficulty and model specialty!

A new paper proposes a framework for training a router that sends each query to the appropriate LLM, optimizing the trade-off between cost and performance.

Overview:

Model inference cost varies significantly. Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60), Haiku ($1.25) vs. Opus ($75).

The RouteLLM paper proposes a router training framework based on human preference data and augmentation techniques, demonstrating over 2x cost savings on widely used benchmarks.

They define the problem as having to choose between two classes of models:
(1) strong models - produce high quality responses but at a high cost (GPT-4o, Claude3.5)
(2) weak models - relatively lower quality and lower cost (Mixtral8x7B, Llama3-8b)

A good router requires a deep understanding of the question's complexity as well as the strengths and weaknesses of the available LLMs.

Explore different routing approaches:
- Similarity-weighted (SW) ranking
- Matrix factorization
- BERT query classifier
- Causal LLM query classifier

Neat Ideas to Build From:
- Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
- Can expand this problem from routing between a strong and weak LLM to a multiclass model routing approach where we have specialist models (language vision model, function calling model etc.)
- Larger framework controlled by a router - imagine a system of 15-20 tuned small models and the router as the n+1'th model responsible for picking the LLM that will handle a particular query at inference time.
- MoA architectures: Routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query you decide how many proposers there should be, how many layers in the mixture, what the aggregate models should be etc.
- Route based caching: If you get redundant queries that are slightly different, then route the query + previous answer to a small model for light rewriting instead of regenerating the answer.
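
The strong/weak routing idea in this post can be sketched in a few lines. This is a minimal illustration, not the RouteLLM implementation: `predict_strong_win_prob` is a hypothetical stand-in for a trained router (here a toy keyword and length heuristic), the 0.5 threshold is an arbitrary assumption, and the prices are the per-million-output-token figures quoted in the post.

```python
from dataclasses import dataclass

# Per-million-output-token prices quoted in the post (USD).
PRICE_PER_M_TOKENS = {"gpt-4-0613": 60.0, "llama-3-70b": 1.0}


@dataclass
class Route:
    model: str
    win_prob: float


def predict_strong_win_prob(query: str) -> float:
    """Hypothetical stand-in for a trained router.

    Returns a rough score in [0, 1] for how much the query is expected to
    benefit from the strong model. A real router would be learned from
    preference data rather than hand-written like this.
    """
    hard_markers = ("prove", "derive", "optimize", "debug", "step by step")
    score = 0.2 + 0.2 * sum(m in query.lower() for m in hard_markers)
    score += min(len(query.split()), 50) / 100  # longer queries score higher
    return min(score, 1.0)


def route(query: str, threshold: float = 0.5) -> Route:
    """Send the query to the strong model only when the score clears the threshold."""
    p = predict_strong_win_prob(query)
    model = "gpt-4-0613" if p >= threshold else "llama-3-70b"
    return Route(model=model, win_prob=p)


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Output-token cost for one response at the quoted prices."""
    return PRICE_PER_M_TOKENS[model] * output_tokens / 1_000_000
```

Raising the threshold sends more traffic to the cheap model; lowering it favors quality. With the quoted prices, every response kept on Llama-3-70b costs 1/60 of the same number of output tokens on GPT-4-0613.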
Faizan J.

Data Science & AI/ML for Healthcare, E-commerce/Retail, HRTech

7,305 followers 1y
Retrieval-Augmented Generation (RAG) and Long-Context Language Models (LC) are two actively discussed approaches for GenAI architectures. RAG combines the capabilities of large language models (LLMs) with a search/information retrieval system that pulls relevant information from external sources to augment the model's responses. LCs can process long sequences of text to understand and integrate large amounts of information and generate coherent, contextually accurate responses over extended interactions.

The paper "Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach" provides practical guidelines for applying RAG and LC and highlights the tradeoffs in cost and performance. While LCs provide better performance and accuracy, RAG has significantly lower computation cost due to decreased input lengths, especially considering that LLM API pricing is based on the number of input tokens.

The paper compares Gemini 1.5 Pro (with a million-token context window), GPT-4o, and GPT-3.5-Turbo, and uses the Contriever and Dragon retrievers (for the RAG extensions). The experiments show that LC and RAG predictions are identical for up to 70% of the queries. This leads to a design decision to use RAG for the majority of queries and reserve the computationally expensive LC for the small subset of queries where it truly excels.

The authors propose a hybrid approach called Self-Route, which has a RAG-and-Route step: send the retrieved RAG chunks and the query to an LLM, which decides whether the query is answerable from them or not. If it is, the RAG answer is taken as the final answer. If not, the query and the full context are provided to the LC for the final answer.

Examples:

Query: "Compare the features and prices of the top three noise-canceling headphones."
Self-Route Decision: RAG: The query is routed to RAG to fetch specific product details and current pricing from e-commerce websites.

Query: "Develop a comprehensive treatment plan for a patient with multiple chronic conditions, considering their medical history, current medications, and lifestyle factors."
Self-Route Decision: LC: The query is routed to the LC model because it requires integrating extensive patient history and detailed context.

https://lnkd.in/gsAkFQCu

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach aclanthology.org
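
The two-step Self-Route flow described in this post can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: `retrieve` and `llm` are hypothetical callables supplied by the caller, and the prompt wording and the `unanswerable` sentinel are placeholders chosen for the example.

```python
from typing import Callable, Dict, List

UNANSWERABLE = "unanswerable"  # assumed sentinel the model is asked to emit


def self_route(
    query: str,
    full_context: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """RAG-and-route first; fall back to the long-context call only if needed."""
    # Step 1: cheap call over the retrieved chunks only.
    chunks = retrieve(query)
    rag_prompt = (
        "Answer the question using only the passages below. "
        f"If they are not sufficient, reply exactly '{UNANSWERABLE}'.\n\n"
        + "\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )
    rag_answer = llm(rag_prompt).strip()
    if rag_answer.lower() != UNANSWERABLE:
        return {"route": "RAG", "answer": rag_answer}

    # Step 2: expensive call over the full context, used only on fallback.
    lc_prompt = f"{full_context}\n\nQuestion: {query}"
    return {"route": "LC", "answer": llm(lc_prompt).strip()}
```

The cost saving comes from step 2 being skipped whenever the model judges the retrieved chunks sufficient, so the long input is paid for only on the fallback path.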
Kuldeep Singh Sidhu

Senior Data Scientist @ Walmart | BITS Pilani

16,490 followers 9mo
Reasoning Agentic RAG: The Evolution from Static Pipelines to Intelligent Decision-Making Systems

The AI research community has just released a comprehensive survey that could reshape how we think about Retrieval-Augmented Generation. Moving beyond traditional static RAG pipelines, researchers from leading institutions including Beijing University of Posts and Telecommunications, University of Georgia, and SenseTime Research have mapped out the emerging landscape of Reasoning Agentic RAG.

The Core Innovation: System 1 vs System 2 Thinking

Drawing from cognitive science, the survey categorizes reasoning workflows into two distinct paradigms:

Predefined Reasoning (System 1): Fast, structured, and efficient approaches that follow fixed modular pipelines. These include route-based methods like RAGate that selectively trigger retrieval based on model confidence scores, loop-based systems like Self-RAG that enable iterative refinement through retrieval-feedback cycles, and tree-based architectures like RAPTOR that organize information hierarchically using recursive structures.

Agentic Reasoning (System 2): Slow, deliberative, and adaptive systems where the LLM autonomously orchestrates tool interaction during inference. The model actively monitors its reasoning process, identifies knowledge gaps, and determines when and how to retrieve external information.

Under the Hood: Technical Mechanisms

The most fascinating aspect is how these systems work internally. In prompt-based agentic approaches, frameworks like ReAct interleave reasoning steps with tool use through Thought-Action-Observation sequences, while function calling mechanisms provide structured interfaces for LLMs to invoke search APIs based on natural language instructions.

Training-based methods push even further. Systems like Search-R1 use reinforcement learning where the search engine becomes part of the RL environment, with the LLM learning policies to generate sequences including both internal reasoning steps and explicit search triggers. DeepResearcher takes this to the extreme by training agents directly in real-world web environments, fostering emergent behaviors like cross-validation of information sources and strategic plan adjustment.

The Technical Architecture

What sets these systems apart is their dynamic control logic. Unlike traditional RAG's static retrieve-then-generate pattern, agentic systems can rewrite failed queries, choose different retrieval methods, and integrate multiple tools (vector databases, SQL systems, and custom APIs) before finalizing responses. The distinguishing quality is the system's ability to own its reasoning process rather than executing predetermined scripts.

The research indicates we're moving toward truly autonomous information-seeking systems that can adapt their strategies based on the quality of retrieved information, marking a significant step toward human-like research and problem-solving capabilities.
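
The Thought-Action-Observation pattern mentioned above can be sketched as a small control loop. This is a generic illustration, not the ReAct reference code: `policy` is a hypothetical callable standing in for the LLM that decides the next step, `tools` is a caller-supplied mapping of tool names to functions, and the `finish` action name and step budget are assumptions.

```python
from typing import Callable, Dict, List, Tuple

# One step is (thought, action, action_input); the action "finish" ends the loop.
Step = Tuple[str, str, str]


def react_loop(
    question: str,
    policy: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[str, List[str]]:
    """Interleave reasoning and tool calls until the policy emits 'finish'."""
    trace: List[str] = []
    for _ in range(max_steps):
        # The policy sees the question plus everything observed so far.
        thought, action, action_input = policy(question, trace)
        trace.append(f"Thought: {thought}")
        if action == "finish":
            return action_input, trace
        trace.append(f"Action: {action}[{action_input}]")
        tool = tools.get(action)
        observation = tool(action_input) if tool else f"unknown tool '{action}'"
        trace.append(f"Observation: {observation}")
    return "no answer within step budget", trace
```

Because each observation is appended to the trace before the next decision, the policy can react to a poor retrieval result by rewriting the query or choosing a different tool, which is the dynamic control the post contrasts with a fixed retrieve-then-generate pipeline.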
Poonam Lamba

5,344 followers 2mo
For developers and platform engineers managing LLM infrastructure, the llm-d team just dropped a deep dive into solving one of the hardest problems in inference: load balancing requests for LLMs.

The standard approach uses heuristic weights (queue depth, memory pressure, cache locality). But in production, these signals conflict, and manual tuning can't keep up with bursty traffic.

The solution? Predicted-latency based scheduling. Instead of guessing, the scheduler uses a lightweight XGBoost model trained on live traffic. The model predicts:
🔹 TTFT (Time to First Token)
🔹 TPOT (Time Per Output Token)

The results are massive: a 43% improvement in P50 end-to-end latency and a 70% improvement in TTFT.

It dynamically balances "spreading" (to reduce batch size) vs. "consolidation" (to maximize KV cache reuse) based on real-time performance, not static guesses.

Check out the full breakdown of how they built it and the benchmark results: 🔗 https://lnkd.in/gwdB-kV9

#LLM #GenerativeAI #MLOps #Kubernetes #AIInfrastructure #LLMInference

Predicted-Latency Based Scheduling for LLMs | llm-d llm-d.ai
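
To make the scheduling idea concrete, here is a minimal sketch of choosing a replica by predicted latency. It is not llm-d's implementation: the `PodState` fields, the "lowest predicted end-to-end latency" selection rule, and `toy_predictor` (a hand-written placeholder where the post describes a trained XGBoost model) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PodState:
    name: str
    queue_depth: int           # requests already waiting on this pod
    kv_cache_hit_ratio: float  # fraction of the prompt already cached, 0..1


@dataclass
class Prediction:
    ttft_ms: float  # predicted time to first token
    tpot_ms: float  # predicted time per output token


def pick_pod(
    pods: List[PodState],
    prompt_tokens: int,
    expected_output_tokens: int,
    predict: Callable[[PodState, int], Prediction],
) -> PodState:
    """Choose the pod with the lowest predicted end-to-end latency."""
    def end_to_end_ms(pod: PodState) -> float:
        p = predict(pod, prompt_tokens)
        return p.ttft_ms + p.tpot_ms * expected_output_tokens

    return min(pods, key=end_to_end_ms)


def toy_predictor(pod: PodState, prompt_tokens: int) -> Prediction:
    """Hand-written placeholder with made-up coefficients, for illustration only."""
    uncached = prompt_tokens * (1.0 - pod.kv_cache_hit_ratio)
    ttft = 20.0 + 0.5 * uncached + 30.0 * pod.queue_depth
    tpot = 10.0 + 2.0 * pod.queue_depth
    return Prediction(ttft_ms=ttft, tpot_ms=tpot)
```

With this toy predictor, a busy pod holding a warm cache wins for short outputs (consolidation), while an idle pod with a cold cache wins for long outputs (spreading), so the trade-off falls out of the predictions rather than from fixed weights.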
Rachitt Shah

AI at Accel. Built an AI consulting firm before

29,963 followers 1y
Understanding LLM Routing

What is LLM Routing?

LLM routing is a technique used to dynamically direct user queries to the most appropriate Large Language Model (LLM) based on the complexity and specificity of the query. The primary goal is to balance response quality and computational cost by leveraging both high-quality closed LLMs (e.g., GPT-4) and cost-effective open-source LLMs (e.g., Mixtral-8x7B).

Key Points on Building an LLM Router by Anyscale (h/t: Amjad Almahairi)

1. Data Collection and Labeling:
- Anyscale collected a diverse set of queries from the Nectar dataset, which includes responses from various models, including GPT-4.
- Queries were labeled using a 1-5 scoring system based on the quality of responses from Mixtral-8x7B, with higher scores indicating better quality.

2. Model Selection:
- GPT-4 was chosen as the closed LLM for its superior response quality.
- Mixtral-8x7B was selected as the open-source LLM for its cost-effectiveness.

3. Causal LLM Classifier:
- A Llama3-8B model was finetuned as a causal LLM classifier to route queries based on their complexity.
- The classifier was trained to predict the quality score of Mixtral-8x7B's response to a given query.

4. Training Process:
- The training involved full-parameter finetuning of the Llama3-8B model using Anyscale's API.
- The dataset was balanced to ensure the model was not biased towards any specific label.

5. Evaluation:
- Offline evaluations were conducted using benchmarks such as MT Bench and GSM8K.
- The performance of the LLM router was compared against random routing and other public LLM routing systems.

6. Routing Decision:
- The router directs "simple" queries to Mixtral-8x7B if the predicted score is high (4-5), maintaining high response quality while reducing costs.
- More complex queries are routed to GPT-4 to ensure high-quality responses.

7. Results:
- The LLM router achieved significant cost reductions while maintaining response quality.
- Evaluations showed that the router could achieve up to a 70% cost reduction on MT Bench and a 40% cost reduction on GSM8K compared to using GPT-4 alone.

Advantages of LLM Routing
- Cost Efficiency: By routing simpler queries to cost-effective models, LLM routing significantly reduces computational costs.
- High-Quality Responses: Complex queries are directed to high-quality models like GPT-4, ensuring that response quality is not compromised.
- Scalability: The system can handle a high volume of queries by efficiently distributing the load between different models.
- Flexibility: The routing framework can be adapted to include new models and updated based on evolving performance metrics.
- Optimized Resource Utilization: Balances the use of computational resources, ensuring that high-cost models are only used when necessary.
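
The routing decision in step 6 reduces to a threshold on the predicted 1-5 score. The sketch below shows only that rule and a helper for inspecting the resulting traffic split; `score_fn` is a hypothetical stand-in for the finetuned Llama3-8B classifier, and the function names are invented for this example.

```python
from typing import Callable, Dict, Iterable

OSS_MODEL = "Mixtral-8x7B"
CLOSED_MODEL = "GPT-4"


def route_by_score(predicted_score: int, min_oss_score: int = 4) -> str:
    """Route to the open-source model when its predicted 1-5 quality score is high."""
    if not 1 <= predicted_score <= 5:
        raise ValueError("score must be in 1..5")
    return OSS_MODEL if predicted_score >= min_oss_score else CLOSED_MODEL


def routing_mix(
    queries: Iterable[str],
    score_fn: Callable[[str], int],
) -> Dict[str, float]:
    """Fraction of queries sent to each model under the score rule."""
    counts = {OSS_MODEL: 0, CLOSED_MODEL: 0}
    for q in queries:
        counts[route_by_score(score_fn(q))] += 1
    total = sum(counts.values())
    return {m: (c / total if total else 0.0) for m, c in counts.items()}
```

Tracking the mix on a sample of real traffic is a quick way to estimate cost impact before deployment, since the share routed to the open-source model is the share of calls that avoid the closed model's price.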
Shyam Sundar D.

Data Scientist | AI & ML Engineer | Generative AI, NLP, LLMs, RAG, Agentic AI | Deep Learning Researcher | 4M+ Impressions

6,187 followers 3mo
🚀 Prompt Routing Architecture

Prompt routing architecture determines response quality, latency, and reasoning depth in modern AI systems. Four execution modes exist, each optimized for a different workload profile.

👉 Instant mode routes directly to a fast inference model. Best for quick factual queries, autocomplete, and lightweight tasks where latency matters more than deep reasoning.

👉 Auto mode sends prompts through a router that selects the optimal model path. Routing decisions depend on prompt complexity, token length, and reasoning signals detected in the input.

👉 Thinking mode activates structured reasoning chains. Intermediate reasoning steps are generated internally before producing the final response. This improves accuracy for logic, math, debugging, and multi-step analysis.

👉 Pro mode runs multiple parallel reasoning paths. A reward model scores candidate outputs and selects the highest quality answer. This approach resembles ensemble inference and significantly boosts reliability for complex problem solving.

Safety layers then evaluate the selected response using topic classifiers and reasoning monitors before delivery to the interface.

Example:
- A simple question like "capital of Japan" is handled by Instant mode.
- A request like "optimize a distributed training pipeline with cost constraints" is routed to Thinking or Pro mode because it requires planning, trade-off analysis, and multi-step reasoning.

Understanding routing systems is essential for building production-grade AI platforms. Performance does not depend only on the model. It depends on orchestration, evaluation, and safety layers working together.

➕ Follow Shyam Sundar D. for practical learning on Data Science, AI, ML, and Agentic AI
📩 Save this post for future reference
♻ Repost to help others learn and grow in AI

#AI #GenAI #LLM #SystemDesign #MachineLearning #AIArchitecture #DeepLearning #TechExplained
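
The Pro-mode description above (several candidates, a reward model picks the best) corresponds to best-of-n selection, sketched below. This is a generic illustration rather than any vendor's pipeline: `generate` and `reward` are hypothetical callables, and the candidates are produced sequentially here for simplicity, whereas the post describes parallel reasoning paths.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],
    reward: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Generate n candidates and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Each seed stands for one independent reasoning path.
    candidates: List[str] = [generate(prompt, seed) for seed in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

Cost scales linearly with n, which is why a router would reserve this path for prompts where the extra reliability is worth several times the compute.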
Himanshu Joshi

Building Aligned, Safe and Secure AI

29,901 followers 1y
Introducing Symbolic-MoE, a framework from the recent study "Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning." The approach dynamically selects and combines pre-trained large language models (LLMs) based on their specialized skills, aiming for precise responses to diverse queries.

🔑 Key Insights from the paper:
✅ Skill-Based Routing: By leveraging the strengths of expert LLMs for specific skills like algebra and molecular biology, the model enhances accuracy across benchmarks such as MMLU-Pro, GPQA, AIME, and MedMCQA.
✅ Efficient Expert Aggregation: Each expert contributes its own reasoning process, and an aggregator LLM consolidates them into a single high-quality response.
✅ Scalable Inference: The batch inference strategy streamlines integration of multiple models on a single GPU by grouping instances according to expert assignments, delivering performance on par with or surpassing previous multi-agent methods.

🔹 How Does Symbolic-MoE Outperform Multi-Agent Systems?
✅ Fine-Grained Expertise Selection
Multi-Agent: Assigns experts at the task level (too broad).
Symbolic-MoE: Selects experts per instance, ensuring specialized expertise.
✅ Adaptive & Dynamic Skill-Based Routing
Multi-Agent: Uses predefined agents, limiting flexibility.
Symbolic-MoE: Dynamically recruits the best experts based on strengths.
✅ Single-Round Expert Aggregation
Multi-Agent: Requires expensive multi-round discussions.
Symbolic-MoE: Synthesizes expert outputs in one efficient step.
✅ Superior Computational Efficiency
Multi-Agent: High model loading/offloading costs.
Symbolic-MoE: Optimized batch inference strategy, integrating 16 models on a single GPU (vs. 4 GPUs for multi-agent).
✅ Higher Performance on Benchmarks
8.15% absolute improvement over multi-agent baselines on MMLU-Pro, GPQA, AIME, and MedMCQA.

This study underscores the potential of adaptive expert selection to strengthen AI systems' reasoning abilities while streamlining computational operations.

🔗 Dive deeper into the research: arxiv.org/abs/2503.05641
💻 GitHub - https://lnkd.in/dEJcjU4a
🌎 Website - https://lnkd.in/dnbRBkaC

📢 Join the conversation on the implications of expert routing in LLM architectures!

#ArtificialIntelligence #MachineLearning #DeepLearning #AIResearch #LLMs #NLP #SymbolicAI #MixtureOfExperts #MultiAgentSystems #AIOptimization #LLMScaling #AIInference
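
Two of the ideas above, per-instance expert selection and grouping instances by expert for batched inference, can be sketched in plain Python. This is a simplified illustration and not the paper's code: the numeric skill profiles, the additive fit score, the tie-break by name, and the function names are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical per-skill competence scores for one expert model.
SkillProfile = Dict[str, float]


def select_experts(
    required_skills: List[str],
    profiles: Dict[str, SkillProfile],
    k: int = 2,
) -> List[str]:
    """Pick the k experts whose profiles best cover the skills this instance needs."""
    def fit(expert: str) -> float:
        return sum(profiles[expert].get(s, 0.0) for s in required_skills)

    # Highest fit first; ties are broken alphabetically for determinism.
    ranked = sorted(profiles, key=lambda e: (-fit(e), e))
    return ranked[:k]


def group_by_expert(assignments: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert instance -> experts into expert -> instances, so each model can be
    loaded once and run all of its assigned instances in a single batch."""
    batches: Dict[str, List[str]] = defaultdict(list)
    for instance_id, experts in assignments.items():
        for expert in experts:
            batches[expert].append(instance_id)
    return dict(batches)
```

Grouping by expert is what keeps loading costs down: each model is brought onto the GPU once per batch instead of once per query.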
Manjeet Singh

Sr Director, Agentforce AI @ Salesforce | Building Autonomous AI Agent Platforms for Enterprise | Ex VP ServiceNow, Startups | AI Advisor

15,203 followers 1y
RouteLLM is such an incredible concept for the trade-off between model performance and computational costs 💰 This approach allows for the efficient use of resources by reserving powerful models for challenging tasks while routing simpler queries to more economical options (with 2X or more cost savings, as claimed in the paper).

🗞 Paper: "RouteLLM: Learning to Route LLMs with Preference Data". arxiv.org/abs/2406.18665

One of the key innovations of RouteLLM is its use of human preference data for training the router. The researchers leveraged data from the Chatbot Arena, a platform where users compare responses from different LLMs, to create a rich dataset of human preferences. This data provides valuable insights into the relative strengths and weaknesses of various models across different types of queries.

The RouteLLM framework employs several sophisticated routing techniques:
1. Similarity-weighted (SW) ranking
2. Matrix factorization
3. BERT-based classifier
4. Causal LLM classifier

Interestingly, the RouteLLM approach demonstrates strong transfer learning capabilities. The routers maintained their performance even when the underlying strong and weak models were changed at test time, suggesting a robust and generalizable solution for LLM deployment.
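
Whichever of the four router types produces the score, deployment still needs a cutoff that decides how much traffic reaches the strong model. The sketch below calibrates that cutoff from a sample of router scores; it is an assumption-laden illustration, not the RouteLLM API, and `calibrate_threshold` and `assign` are names invented for this example.

```python
from typing import List, Sequence


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Return a threshold so that roughly `strong_fraction` of the given scores
    are at or above it (those queries would go to the strong model)."""
    if not scores:
        raise ValueError("need at least one score")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in [0, 1]")
    ranked = sorted(scores, reverse=True)
    n_strong = round(strong_fraction * len(ranked))
    if n_strong == 0:
        return float("inf")  # nothing is routed to the strong model
    return ranked[n_strong - 1]


def assign(scores: Sequence[float], threshold: float) -> List[str]:
    """Label each score 'strong' or 'weak' under the given threshold."""
    return ["strong" if s >= threshold else "weak" for s in scores]
```

Calibrating on a sample of real queries turns a budget statement such as "at most 30% of calls may use the expensive model" into a concrete threshold, which can be recomputed when traffic or the underlying models change.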