Tool Calling LLMs Guide

In this guide, you will learn how to use LLMs with tool calling, running inference via llama.cpp, llama-server, and OpenAI-compatible endpoints.

Our guide should work for nearly any model we provide, including:

  1. Qwen3-Coder, Qwen3-Next, and other Qwen models

  2. Nearly all other models in our model catalog

🔨 Tool Calling Setup

In a new terminal (if you are inside tmux, detach first with CTRL+B then D), we create some tools, such as adding two numbers, executing Python code, running terminal commands, and more:
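
The original tool definitions are not reproduced here, so below is a minimal sketch of what they might look like. The function names (`add_numbers`, `execute_python`, `execute_terminal`) and their JSON schemas are illustrative assumptions, not the guide's exact code:

```python
# A minimal sketch of the tools described above. Names and schemas are
# illustrative assumptions, not the guide's exact code.
import subprocess

def add_numbers(a: float, b: float) -> float:
    """Add two numbers and return the sum."""
    return a + b

def execute_python(code: str) -> str:
    """Execute a Python snippet and return its output."""
    result = subprocess.run(["python3", "-c", code], capture_output=True, text=True)
    return result.stdout or result.stderr

def execute_terminal(command: str) -> str:
    """Run a terminal command and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout or result.stderr

# Tool schemas advertised to the model, in the OpenAI "tools" format.
TOOLS = [
    {"type": "function", "function": {
        "name": "add_numbers",
        "description": "Add two numbers and return the sum.",
        "parameters": {"type": "object",
                       "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                       "required": ["a", "b"]}}},
    {"type": "function", "function": {
        "name": "execute_python",
        "description": "Execute Python code and return its output.",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}},
    {"type": "function", "function": {
        "name": "execute_terminal",
        "description": "Run a terminal command and return its output.",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
]

# Maps tool names coming back from the model to the Python callables above.
AVAILABLE_TOOLS = {
    "add_numbers": add_numbers,
    "execute_python": execute_python,
    "execute_terminal": execute_terminal,
}
```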

We then use the functions below (copy, paste, and execute them), which parse the function calls automatically and call the OpenAI-compatible endpoint for any model:
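
As a hedged stand-in for the original helpers, here is a minimal parse-and-dispatch loop. It assumes the `TOOLS` and `AVAILABLE_TOOLS` objects sketched above, a llama-server listening on `http://localhost:8080/v1`, and a helper name (`chat_with_tools`) of our own choosing:

```python
# A sketch of the tool-call loop against an OpenAI-compatible endpoint.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

def chat_with_tools(prompt, model="local", temperature=0.7, top_p=1.0):
    messages = [{"role": "user", "content": prompt}]
    while True:
        response = client.chat.completions.create(
            model=model,           # llama-server serves whichever model it loaded
            messages=messages,
            tools=TOOLS,
            temperature=temperature,
            top_p=top_p,
        )
        message = response.choices[0].message
        # No tool calls means the model has produced its final answer.
        if not message.tool_calls:
            return message.content
        messages.append(message)
        # Execute each requested tool and feed the result back to the model.
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            result = AVAILABLE_TOOLS[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
```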

Now we'll showcase tool calling across several different use cases:

Writing a story:
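
The original prompt isn't shown; one plausible call using the `chat_with_tools` helper sketched above (the prompt is illustrative) might be:

```python
# A plain request: the model should answer directly, without calling any tool.
print(chat_with_tools("Write a short story about a robot learning to cook."))
```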

Mathematical operations:
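
For example (illustrative prompt; the model is expected to route this through the `add_numbers` tool):

```python
print(chat_with_tools("What is 10234 + 29187?"))
```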

Execute generated Python code:
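
For example (illustrative prompt; the model is expected to call the `execute_python` tool):

```python
print(chat_with_tools("Write Python code that prints the first 10 Fibonacci numbers, then run it."))
```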

Execute arbitrary terminal commands:
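
For example (illustrative prompt; the model is expected to call the `execute_terminal` tool):

```python
print(chat_with_tools("List the files in the current directory."))
```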

🔧 GLM-4.7-Flash + GLM-4.7 Tool Calling

We first download GLM-4.7 or GLM-4.7-Flash via some Python code, then launch it via llama-server in a separate terminal (for example, inside tmux). In this example we download the larger GLM-4.7 model:
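
The download snippet isn't reproduced here; one plausible version uses `huggingface_hub`. The repo id and quantization pattern below are placeholders, not confirmed paths:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id and quantization pattern -- substitute the real ones.
snapshot_download(
    repo_id="unsloth/GLM-4.7-GGUF",
    local_dir="GLM-4.7-GGUF",
    allow_patterns=["*Q4_K_M*"],  # fetch a single quantization to save disk space
)
```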

If you ran it successfully, you should see:

Now launch it via llama-server in a new terminal. Use tmux if you want:
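
The exact command isn't shown; a plausible invocation, assuming the placeholder GGUF path from the sketch above, is below. The `--jinja` flag enables llama.cpp's chat-template handling, which tool calling relies on:

```bash
# Placeholder model path -- point this at the GGUF file you downloaded.
llama-server \
    --model GLM-4.7-GGUF/GLM-4.7-Q4_K_M.gguf \
    --jinja \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --port 8080
```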

And you will get:

Now, in a new terminal, run the following Python code (reminder: run the Tool Calling Setup first). We use GLM-4.7's optimal parameters of temperature = 0.7 and top_p = 1.0:
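
As a sketch, reusing the `chat_with_tools` helper from the setup (the helper name and prompt are our own, not the guide's original code):

```python
# GLM-4.7's recommended sampling parameters.
answer = chat_with_tools(
    "Which tools do you have access to?",
    temperature=0.7,
    top_p=1.0,
)
print(answer)
```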

Tool call for mathematical operations with GLM-4.7:
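
For example (illustrative prompt; the model should route this through `add_numbers`):

```python
print(chat_with_tools("What is 88231 + 99234?", temperature=0.7, top_p=1.0))
```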

Tool call to execute generated Python code with GLM-4.7:
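
For example (illustrative prompt; the model should generate code and route it through `execute_python`):

```python
print(chat_with_tools(
    "Write Python code that prints the first 20 prime numbers, then execute it.",
    temperature=0.7,
    top_p=1.0,
))
```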

⚒️ Devstral 2 Tool Calling

We first download Devstral 2 via some Python code, then launch it via llama-server in a separate terminal (for example, inside tmux):
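
As with GLM, the original snippet isn't reproduced here; a plausible version via `huggingface_hub`, with a placeholder repo id and quantization pattern:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id and quantization pattern -- substitute the real ones.
snapshot_download(
    repo_id="unsloth/Devstral-2-GGUF",
    local_dir="Devstral-2-GGUF",
    allow_patterns=["*Q4_K_M*"],
)
```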

If you ran it successfully, you should see:

Now launch it via llama-server in a new terminal. Use tmux if you want:
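
Again, a plausible invocation assuming the placeholder path from the download sketch above, with `--jinja` enabling the chat-template handling that tool calling needs:

```bash
# Placeholder model path -- point this at the GGUF file you downloaded.
llama-server \
    --model Devstral-2-GGUF/Devstral-2-Q4_K_M.gguf \
    --jinja \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --port 8080
```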

If it succeeded, you will see the following:

We then call the model with the following message, using Devstral's suggested parameter of temperature = 0.15 only. Reminder: run the Tool Calling Setup first.
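
A sketch using the same `chat_with_tools` helper (the prompt is illustrative, not the guide's original message):

```python
print(chat_with_tools(
    "List all files in the current directory and explain what each one is.",
    temperature=0.15,  # Devstral's suggested temperature
))
```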
