# A Comprehensive Guide to NIM LLM Latency-Throughput Benchmarking

## Benchmarking Guide

- Overview
- Executive Summary
- Introduction to LLM Inference Benchmarking
  - Background On How LLM Inference Works
- Metrics
  - Time to First Token (TTFT)
  - End-to-End Request Latency (e2e_latency)
  - Inter-token Latency (ITL)
  - Tokens Per Second (TPS)
  - Requests Per Second (RPS)
- Parameters and Best Practices
  - Use Cases
  - Load Control
  - Other Parameters
- Using AIPerf to Benchmark
  - Step 1. Setting Up an OpenAI-Compatible Llama-3 Inference Service with NVIDIA NIM
  - Step 2. Setting Up AIPerf and Warming Up: Benchmarking a Single Use Case
  - Step 3. Sweeping through a Number of Use Cases
  - Step 4. Analyzing the Output
  - Step 5. Interpreting the Results
- Benchmarking LoRA Models
  - Best practices for Multi-LoRA deployment
  - Performance Benchmarking
- Performance Benchmarks
  - Llama-3.3-70b-instruct Results
    - Version: 1.8.0
    - Version: 1.5.0
  - Llama-3.1-8b-instruct Results
    - Version: 1.8.0
    - Version: 1.3.0
  - Llama-3.1-70b-instruct Results
    - Version: 1.3.0
- Hardware Specifications
  - NVIDIA H100
  - NVIDIA H200
  - NVIDIA A100
  - NVIDIA L40s