GLM-4.7-Flash: How To Run Locally

Run & fine-tune GLM-4.7-Flash locally on your device!

GLM-4.7-Flash is Z.ai’s new 30B MoE reasoning model built for local deployment, delivering best-in-class performance for coding, agentic workflows, and chat. It activates ~3.6B parameters per token, supports 200K context, and leads SWE-Bench, GPQA, and reasoning/chat benchmarks.

GLM-4.7-Flash runs on 24GB RAM/VRAM/unified memory (32GB for full precision), and you can now fine-tune it with Unsloth. To run GLM-4.7-Flash with vLLM, see the GLM-4.7-Flash in vLLM section below.


GLM-4.7-Flash GGUF to run: unsloth/GLM-4.7-Flash-GGUF

⚙️ Usage Guide

After speaking with the Z.ai team, we recommend using their GLM-4.7 sampling parameters:

Default Settings (Most Tasks):

  • temperature = 1.0

  • top_p = 0.95

  • repeat penalty = disabled or 1.0

Terminal Bench, SWE-Bench Verified:

  • temperature = 0.7

  • top_p = 1.0

  • repeat penalty = disabled or 1.0

  • For general use-case: --temp 1.0 --top-p 0.95

  • For tool-calling: --temp 0.7 --top-p 1.0

  • If using llama.cpp, set --min-p 0.01 as llama.cpp's default is 0.05

  • Sometimes you'll need to experiment to find which settings work best for your use-case.

  • Maximum context window: 202,752

  • Use --jinja for llama.cpp variants

🖥️ Run GLM-4.7-Flash

Depending on your use-case, you will need different settings. Note that some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions not divisible by 128, so parts can’t be quantized to lower bits.

Because this guide uses the 4-bit quant, you will need around 18GB of RAM/VRAM/unified memory. We recommend using at least 4-bit precision for best performance.

Llama.cpp Tutorial (GGUF):

Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):

1. Obtain the latest llama.cpp from GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
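A typical build sequence looks like this (based on llama.cpp's standard CMake instructions; the exact packages and flags may differ on your system):

```bash
# Build llama-cli and llama-server with CUDA support (set -DGGML_CUDA=OFF for CPU-only)
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```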

2. You can pull the model directly from Hugging Face and run it, increasing the context up to 200K as your RAM/VRAM allows. Follow our usage guide here if not using llama.cpp.

You can also try Z.ai's recommended GLM-4.7 sampling parameters:

  • For general use-case: --temp 1.0 --top-p 0.95

  • For tool-calling: --temp 0.7 --top-p 1.0

  • Remember to disable repeat penalty!

Follow this for general instruction use-cases:
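For example (the quant tag after the colon and the context size are illustrative choices, not requirements):

```bash
# General instruction use-case: pull the UD-Q4_K_XL quant straight from Hugging Face
./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    --repeat-penalty 1.0
```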

Follow this for tool-calling use-cases:
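The same command with Z.ai's tool-calling sampling settings (again, the context size is an illustrative choice):

```bash
# Tool-calling use-case: temperature 0.7, top_p 1.0, repeat penalty disabled
./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
    --jinja \
    --ctx-size 16384 \
    --temp 0.7 --top-p 1.0 --min-p 0.01 \
    --repeat-penalty 1.0
```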

3. Download the model via the snippet below (after installing it with pip install huggingface_hub). You can choose UD-Q4_K_XL or other quantized versions.
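A minimal download sketch (the allow_patterns filter assumes Unsloth's usual file naming; swap it for another quant if you prefer):

```python
# Download only the UD-Q4_K_XL files from the GGUF repo
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-4.7-Flash-GGUF",
    local_dir="unsloth/GLM-4.7-Flash-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],  # e.g. use "*Q8_0*" for the 8-bit quant instead
)
```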

4. Then run the model in conversation mode:
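For instance, with Z.ai's default sampling settings (the context size below is a conservative example):

```bash
# Interactive chat with the downloaded 4-bit quant
./llama.cpp/llama-cli \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    --repeat-penalty 1.0
```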

Also, adjust the context window as required, up to 202,752 tokens.

Reducing repetition and looping

You can use Z.ai's recommended parameters and get great results:

  • For general use-case: --temp 1.0 --top-p 0.95

  • For tool-calling: --temp 0.7 --top-p 1.0

  • If using llama.cpp, set --min-p 0.01 as llama.cpp's default is 0.05

  • Remember to disable repeat penalty! Or set --repeat-penalty 1.0

We added "scoring_func": "sigmoid" to config.json for the main model - see.

🐦Flappy Bird Example with UD-Q4_K_XL

As an example, we ran the following long conversation using UD-Q4_K_XL via ./llama.cpp/llama-cli --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf --fit on --temp 1.0 --top-p 0.95 --min-p 0.01 --jinja:

which rendered the following Flappy Bird game in HTML form:

Flappy Bird Game in HTML

We also took some screenshots (the 4-bit quant works).

🦥 Fine-tuning GLM-4.7-Flash

Unsloth now supports fine-tuning of GLM-4.7-Flash; however, you will need transformers v5. The 30B model does not fit on a free Colab GPU, but you can use our notebook. 16-bit LoRA fine-tuning of GLM-4.7-Flash will use around 60GB VRAM:
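A minimal LoRA setup sketch with Unsloth (the repo id and hyperparameters below are illustrative assumptions, not the notebook's exact values):

```python
# 16-bit LoRA fine-tuning sketch for GLM-4.7-Flash with Unsloth
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/GLM-4.7-Flash",  # assumed repo id
    max_seq_length=4096,
    load_in_4bit=False,  # 16-bit LoRA as described above (~60GB VRAM)
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    # Attention + MLP projections only; the MoE router is left untrained (see the note below)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```

You can then train the model with your usual SFT setup (for example TRL's SFTTrainer), as in our other fine-tuning notebooks.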

When fine-tuning MoEs, it is usually not a good idea to train the router layer, so we disable it by default. If you want to maintain the model's reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples, with at least 75% reasoning and roughly 25% non-reasoning examples so the model retains its reasoning capabilities.

🦙Llama-server serving & deployment

To deploy GLM-4.7-Flash for production, we use llama-server. In a new terminal (for example inside tmux), deploy the model via:
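For example (port 8001 and the 16K context below are arbitrary choices; raise the context as your memory allows):

```bash
# Serve the 4-bit quant over an OpenAI-compatible HTTP API
./llama.cpp/llama-server \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --min-p 0.01 \
    --host 0.0.0.0 --port 8001
```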

Then, in a new terminal, after running pip install openai, do:
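Something like the following (the port matches the llama-server example above; the API key is a placeholder since llama-server does not check it unless configured to):

```python
# Query the local llama-server endpoint through the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="unsloth/GLM-4.7-Flash-GGUF",  # llama-server serves one model; the name is informational
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```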

This will print the model's response.

💻GLM-4.7-Flash in vLLM

You can now use our new FP8 Dynamic quant of the model for fast, high-quality inference. First install vLLM from nightly:
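For example (the nightly wheel index below is taken from vLLM's install docs; double-check the current instructions):

```bash
# Install the latest vLLM nightly build
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```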

Then serve Unsloth's dynamic FP8 version of the model. We enable FP8 for the KV cache to reduce its memory usage by 50%, and the example below uses 4 GPUs. If you have 1 GPU, use CUDA_VISIBLE_DEVICES='0' and set --tensor-parallel-size 1 or remove this argument. To disable FP8, remove --quantization fp8 --kv-cache-dtype fp8.
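A serving sketch along those lines (the FP8 repo name and context length are assumptions; substitute the actual Unsloth FP8 Dynamic upload):

```bash
# Serve the FP8 Dynamic quant across 4 GPUs with an FP8 KV cache
CUDA_VISIBLE_DEVICES='0,1,2,3' vllm serve unsloth/GLM-4.7-Flash-FP8-Dynamic \
    --tensor-parallel-size 4 \
    --quantization fp8 --kv-cache-dtype fp8 \
    --max-model-len 200000 \
    --served-model-name glm-4.7-flash
```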

You can then call the served model via the OpenAI API:
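For instance (vLLM listens on port 8000 by default, and --served-model-name above sets the model id):

```python
# Query the vLLM server through the OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "Summarize what a MoE model is in two sentences."}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```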

vLLM GLM-4.7-Flash Speculative Decoding

We found that using the MTP (multi-token prediction) module from GLM-4.7-Flash makes generation throughput drop from 13,000 tokens/s on 1× B200 to 1,300 tokens/s (10× slower)! On Hopper, it should hopefully be fine.

With MTP speculative decoding: only 1,300 tokens/s throughput on 1× B200 (130 tokens/s decoding per user)

Without MTP: 13,000 tokens/s throughput on 1× B200 (still 130 tokens/s decoding per user)

🔨Tool Calling with GLM-4.7-Flash

See the Tool Calling LLMs Guide for more details on how to do tool calling. In a new terminal (if using tmux, detach with CTRL+B then D), we create some tools, such as adding 2 numbers, executing Python code, executing Linux commands, and much more, as sketched below:
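For illustration, here are two toy tools and their OpenAI-style schemas (the function names and schemas are our own sketch, not from the guide):

```python
# Two example tools: adding numbers and executing Python code (demo only; exec is unsandboxed)
import contextlib
import io

def add_two_numbers(a: float, b: float) -> float:
    """Add two numbers and return the result."""
    return a + b

def execute_python(code: str) -> str:
    """Run a short Python snippet and return whatever it prints."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

# OpenAI-style tool schemas describing the functions above
TOOLS = [
    {"type": "function", "function": {
        "name": "add_two_numbers",
        "description": "Add two numbers together.",
        "parameters": {"type": "object",
                       "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                       "required": ["a", "b"]}}},
    {"type": "function", "function": {
        "name": "execute_python",
        "description": "Execute Python code and return its stdout.",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}},
]
AVAILABLE_TOOLS = {"add_two_numbers": add_two_numbers, "execute_python": execute_python}
```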

We then use functions like the ones below (copy, paste, and execute them), which parse the tool calls automatically and call the OpenAI endpoint for any model:
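A hedged sketch of such a loop, assuming the llama-server endpoint from the deployment section (port 8001) and the TOOLS / AVAILABLE_TOOLS helpers defined above:

```python
# Generic tool-calling loop against any OpenAI-compatible endpoint
import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")

def chat_with_tools(prompt: str, model: str = "unsloth/GLM-4.7-Flash-GGUF") -> str:
    messages = [{"role": "user", "content": prompt}]
    while True:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=TOOLS,
            temperature=0.7,  # Z.ai's recommended tool-calling settings
            top_p=1.0,
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content  # no more tools requested: return the final answer
        messages.append(message)
        # Execute every requested tool call and feed the results back to the model
        for call in message.tool_calls:
            fn = AVAILABLE_TOOLS[call.function.name]
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(fn(**args)),
            })

print(chat_with_tools("What is 1923 * 847? Use a tool."))
```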

After launching GLM-4.7-Flash via llama-server as shown above (or see the Tool Calling LLMs Guide for more details), we can then make some tool calls:

Tool call for mathematical operations with GLM-4.7-Flash

Tool Call to execute generated Python code for GLM-4.7-Flash

Benchmarks

GLM-4.7-Flash is the best-performing 30B model across all benchmarks except AIME 25.
