
Integrate and Deploy Tongyi Qwen3 Models into Production Applications with NVIDIA

Alibaba recently released Tongyi Qwen3, a family of open-source hybrid-reasoning large language models (LLMs). The Qwen3 family consists of two MoE models, 235B-A22B (235B total parameters and 22B active parameters) and 30B-A3B, and six dense models: the 0.6B, 1.7B, 4B, 8B, 14B, and 32B versions.

With ultra-fast token generation, developers can efficiently integrate and deploy Qwen3 models into production applications on NVIDIA GPUs, using different frameworks such as NVIDIA TensorRT-LLM, Ollama, SGLang, and vLLM. 

In this post, we walk through best practices for using models from the Qwen3 family. We demonstrate how to use these frameworks to deploy models for inference. Depending on use case requirements, such as high throughput, low latency, or GPU footprint, developers can choose the most appropriate framework. 

Qwen3 models

Qwen3 is the first hybrid-reasoning LLM in China, with state-of-the-art accuracy across popular benchmarks such as AIME, LiveCodeBench, ArenaHard, and BFCL. Qwen3 offers a comprehensive suite of dense and mixture-of-experts (MoE) models, built through advancements in reasoning, instruction following, agent capabilities, and multilingual support. It’s also one of the world’s leading open-source model families. 

LLM inference performance drives real-time, cost-effective production deployment

The dynamic and evolving LLM ecosystem, with the continuous introduction of new models and technologies, requires high-performance and flexible solutions to optimize LLMs for production deployments. 

Designing efficient inference systems is challenging with constantly increasing demands. These challenges include the varying computational and memory requirements of the prefill and decoding phases in LLM inference, the need for parallel distributed inference in extra-large models, massive concurrent requests, and highly dynamic input/output length requests.

There are many optimization techniques available in inference engines, including high-performance kernels, low-precision quantization, batch scheduling, sampling optimization, and KV cache optimization. Developers must experiment to learn which combination of technologies is best for their scenarios. 

TensorRT-LLM provides the latest highly optimized compute kernels, high-performance attention implementations, distributed multi-node, multi-GPU communication support, and various parallelism and quantization strategies to perform inference efficiently on NVIDIA GPUs. Further, the new TensorRT-LLM architecture, built on PyTorch, combines peak performance with a flexible and developer-friendly workflow. Use the LLM API to quickly set up inference. 

With TensorRT-LLM, developers can quickly get started with advanced optimizations, including custom attention kernels, in-flight batching, paged KV caching, quantization (FP8, FP4, INT4 AWQ, and INT8 SmoothQuant), speculative decoding, and much more.
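
As a rough illustration of that workflow, the following minimal sketch uses the TensorRT-LLM LLM API with the PyTorch backend. It assumes TensorRT-LLM is installed and that Qwen/Qwen3-4B is available locally or can be pulled from the Hugging Face Hub; exact option names may vary between releases.

# Minimal sketch of the TensorRT-LLM LLM API (PyTorch backend).
# Assumes TensorRT-LLM is installed and the model is reachable; exact
# options may differ across TensorRT-LLM versions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B")
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

for output in llm.generate(["What is the capital of France?"], sampling_params):
    print(output.outputs[0].text)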

Run Qwen3 inference optimization with TensorRT-LLM

We use Qwen3-4B as the example model to set up the PyTorch backend and then run benchmarks.

  1. First, prepare the benchmark test dataset and extra-llm-api-config.yml configuration file.
python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
    --tokenizer=/path/to/Qwen3-4B \
    --stdout token-norm-dist --num-requests=32768 \
    --input-mean=1024 --output-mean=1024 \
    --input-stdev=0 --output-stdev=0 > /path/to/dataset.txt

cat >/path/to/extra-llm-api-config.yml <<EOF
pytorch_backend_config:
    use_cuda_graph: true
    cuda_graph_padding_enabled: true
    cuda_graph_batch_sizes:
    - 1
    - 2
    - 4
    - 8
    - 16
    - 32
    - 64
    - 128
    - 256
    - 384
    print_iter_log: true
    enable_overlap_scheduler: true
EOF
  2. Run the benchmark command using trtllm-bench.
trtllm-bench \
      --model Qwen/Qwen3-4B \
      --model_path /path/to/Qwen3-4B \
      throughput \
      --backend pytorch \
      --max_batch_size 128 \
      --max_num_tokens 16384 \
      --dataset /path/to/dataset.txt \
      --kv_cache_free_gpu_mem_fraction 0.9 \
      --extra_llm_api_options /path/to/extra-llm-api-config.yml \
      --concurrency 128 \
      --num_requests 32768 \
      --streaming

With the same GPU configuration, at an input sequence length (ISL) of 1K and an output sequence length (OSL) of 1K, the Qwen3-4B dense model running on TensorRT-LLM in BF16 precision achieved up to a 16.04x inference throughput (tokens/sec) speedup compared to the BF16 baseline.

Figure 1. Inference throughput (tokens/sec) speedup of Qwen3-4B using TensorRT-LLM BF16 compared to the BF16 baseline

Similar steps can be applied to other Qwen3 models to perform model optimization. 

  3. Run the serve command using trtllm-serve.
trtllm-serve \
  /path/to/Qwen3-4B \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 128 \
  --max_num_tokens 16384 \
  --kv_cache_free_gpu_memory_fraction 0.95 \
  --extra_llm_api_options /path/to/extra-llm-api-config.yml
  4. After the model is successfully hosted, inference calls can be made using the standard OpenAI API, either with curl, as shown below, or from Python, as sketched after the curl example.
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen3-4B",
        "max_tokens": 1024,
        "temperature": 0,
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
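
Because trtllm-serve exposes an OpenAI-compatible endpoint, the same request can also be issued from Python with the openai client library. A minimal sketch follows; the api_key value is a placeholder, as the local server does not validate it.

# Minimal sketch: query the locally hosted trtllm-serve endpoint with the
# OpenAI Python client. The api_key is a dummy placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="Qwen/Qwen3-4B",
    max_tokens=1024,
    temperature=0,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)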

Run Qwen3-4B with Ollama, SGLang, and vLLM frameworks

Beyond TensorRT-LLM, Qwen models can also be deployed on NVIDIA GPUs using frameworks such as Ollama, SGLang, and vLLM with a few easy steps. This Qwen release provides several model sizes that can run on-device, for example on NVIDIA RTX GPUs for Windows developers and on NVIDIA Jetson.

To run it using Ollama for local execution:

    • Download and install the latest version of Ollama from ollama.com/download.
    • Execute the model using the ollama run command. This both retrieves the model and starts it for interaction; a Python sketch for calling the local Ollama server programmatically follows this list.
    ollama run qwen3:4b
    • Append /think (the default) or /no_think to user prompts or system messages to switch the model’s thinking mode. After running the ollama run command, you can test both modes directly in the terminal using the following example prompts:
    "Write a python lambda function to add two numbers" - Thinking mode enabled
    "Write a python lambda function to add two numbers /no_think" - Non-thinking mode
    • Refer to ollama.com/library/qwen3 for additional model variants, which are optimized based on the specific NVIDIA GPU being utilized.
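
Once the model is running, Ollama also serves a local REST API (port 11434 by default), so it can be called programmatically. A minimal sketch using the requests package; the prompt is the non-thinking example from above.

# Minimal sketch: call a locally running Ollama server through its
# /api/chat endpoint. Assumes `ollama run qwen3:4b` has already pulled
# the model and that the `requests` package is installed.
import requests

payload = {
    "model": "qwen3:4b",
    "messages": [
        {"role": "user",
         "content": "Write a python lambda function to add two numbers /no_think"}
    ],
    "stream": False,
}
response = requests.post("http://localhost:11434/api/chat", json=payload)
print(response.json()["message"]["content"])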

To run it using SGLang:

    • Install the SGLang library.
    pip install "sglang[all]"
    • Download the model. For this demo, we downloaded the model from Hugging Face using the huggingface-cli command. Note that you may need to provide a Hugging Face access token to download the model. A Python alternative using huggingface_hub is sketched after this list.
    huggingface-cli download --resume-download Qwen/Qwen3-4B --local-dir ./
    • Load and run the model. Note that there are additional arguments that can be passed for different needs. Refer to the documentation for more details. 
    python -m sglang.launch_server \
      --model-path /ssd4TB/huggingface/hub/models/ \
      --trust-remote-code \
      --device "cuda:0" \
      --port 30000 \
      --host 0.0.0.0
    • Call the model for inference
    curl -X POST "http://localhost:30000/v1/chat/completions" \
    	-H "Content-Type: application/json" \
    	--data '{
    		"model": "Qwen/Qwen3-4B",
    		"messages": [
    			{
    				"role": "user",
    				"content": "What is the capital of France?"
    			}
    		]
    	}'
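
As a Python alternative to the huggingface-cli download step above, the huggingface_hub library can fetch the same weights; the local_dir path below is illustrative.

# Minimal sketch: download Qwen3-4B with the huggingface_hub Python API
# instead of huggingface-cli. The local_dir path is illustrative.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen3-4B", local_dir="./Qwen3-4B")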

To run it using vLLM:

    • Install the vLLM library.
    pip install vllm
    • Load and run the model using vllm serve. Note that there are additional arguments that can be passed for different needs; refer to the documentation for more details. An offline Python alternative is sketched after this list.
    vllm serve "Qwen/Qwen3-4B" \
     --tensor-parallel-size 1 \
     --gpu-memory-utilization 0.85 \
     --device "cuda:0" \
     --max-num-batched-tokens 8192 \
     --max-num-seqs 256
    • Call the model for inference.
    curl -X POST "http://localhost:8000/v1/chat/completions" \
    	-H "Content-Type: application/json" \
    	--data '{
    		"model": "Qwen/Qwen3-4B",
    		"messages": [
    			{
    				"role": "user",
    				"content": "What is the capital of France?"
    			}
    		]
    	}'
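
As an alternative to serving over HTTP, vLLM also offers an offline inference API in Python. A minimal sketch, assuming vLLM is installed and the model fits on a single GPU; the sampling settings are illustrative.

# Minimal sketch: offline (serverless) inference with the vLLM Python API.
# Assumes vLLM is installed and Qwen/Qwen3-4B fits on the available GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B", gpu_memory_utilization=0.85)
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

outputs = llm.generate(["What is the capital of France?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)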

Summary

With a few commands, developers can test out the new family of Qwen models on NVIDIA GPUs using popular inference frameworks, such as TensorRT-LLM, to accelerate AI inference.

Lastly, the choice of framework for model deployment and inference comes down to balancing key factors such as performance, resources, and cost when deploying AI models in production.
