
Commit facbd80

Merge pull request #138 from unslothai/fix-magistral-inference
fix magistral inference
2 parents 95d57db + 4b016ec commit facbd80

5 files changed: +85 -75 lines changed
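What changed: the user message now uses Magistral's structured content format (`[{"type": "text", ...}]` instead of a bare string), and `apply_chat_template` tokenizes in one step (`tokenize = True`, `return_dict = True`) rather than rendering the prompt to a string that was then re-tokenized with a plain `tokenizer(...)` call. The same fix is applied to both notebooks, the original template, and both generated Python scripts. Consolidated from the `+` lines in the diffs below, a minimal sketch of the new inference path; it assumes `model` and `tokenizer` are already loaded (e.g. via Unsloth's `FastLanguageModel.from_pretrained`) on a CUDA device:

# Minimal sketch assembled from the "+" lines below; assumes `model` and
# `tokenizer` are already loaded on a CUDA device.
from transformers import TextStreamer

messages = [
    # Structured content: a list of typed parts instead of a bare string.
    {"role": "user", "content": [{"type": "text", "text": "Solve (x + 2)^2 = 0."}]},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,               # tokenize directly; no intermediate string
    add_generation_prompt = True,  # must add for generation
    return_tensors = "pt",
    return_dict = True,            # returns input_ids + attention_mask
).to("cuda")

_ = model.generate(
    **inputs,                      # replaces the old **tokenizer(text, ...) re-tokenization
    max_new_tokens = 1024,
    temperature = 0.7, top_p = 0.95,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

With `return_dict = True`, `apply_chat_template` returns a `BatchEncoding` carrying `input_ids` and `attention_mask`, so it can be splatted straight into `model.generate`; the old `tokenize = False` path rendered the chat template to text and then tokenized that text a second time.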

nb/Kaggle-Magistral_(24B)-Reasoning-Conversational.ipynb

Lines changed: 8 additions & 6 deletions
@@ -1056,17 +1056,19 @@
 ],
 "source": [
 "messages = [\n",
-" {\"role\" : \"user\", \"content\" : \"Solve (x + 2)^2 = 0.\"}\n",
+" {\"role\" : \"user\", \"content\" : [{\"type\": \"text\", \"text\": \"Solve (x + 2)^2 = 0.\"}]}\n",
 "]\n",
-"text = tokenizer.apply_chat_template(\n",
+"inputs = tokenizer.apply_chat_template(\n",
 " messages,\n",
-" tokenize = False,\n",
+" tokenize = True,\n",
 " add_generation_prompt = True, # Must add for generation\n",
-")\n",
+" return_tensors = \"pt\",\n",
+" return_dict = True,\n",
+").to(\"cuda\")\n",
 "\n",
 "from transformers import TextStreamer\n",
 "_ = model.generate(\n",
-" **tokenizer(text, return_tensors = \"pt\").to(\"cuda\"),\n",
+" **inputs,\n",
 " max_new_tokens = 1024, # Increase for longer outputs!\n",
 " temperature = 0.7, top_p = 0.95,\n",
 " streamer = TextStreamer(tokenizer, skip_prompt = True),\n",
@@ -6419,4 +6421,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 0
-}
+}

nb/Magistral_(24B)-Reasoning-Conversational.ipynb

Lines changed: 8 additions & 6 deletions
@@ -1056,17 +1056,19 @@
 ],
 "source": [
 "messages = [\n",
-" {\"role\" : \"user\", \"content\" : \"Solve (x + 2)^2 = 0.\"}\n",
+" {\"role\" : \"user\", \"content\" : [{\"type\": \"text\", \"text\": \"Solve (x + 2)^2 = 0.\"}]}\n",
 "]\n",
-"text = tokenizer.apply_chat_template(\n",
+"inputs = tokenizer.apply_chat_template(\n",
 " messages,\n",
-" tokenize = False,\n",
+" tokenize = True,\n",
 " add_generation_prompt = True, # Must add for generation\n",
-")\n",
+" return_tensors = \"pt\",\n",
+" return_dict = True,\n",
+").to(\"cuda\")\n",
 "\n",
 "from transformers import TextStreamer\n",
 "_ = model.generate(\n",
-" **tokenizer(text, return_tensors = \"pt\").to(\"cuda\"),\n",
+" **inputs,\n",
 " max_new_tokens = 1024, # Increase for longer outputs!\n",
 " temperature = 0.7, top_p = 0.95,\n",
 " streamer = TextStreamer(tokenizer, skip_prompt = True),\n",
@@ -6419,4 +6421,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 0
-}
+}

original_template/Magistral_(24B)-Reasoning-Conversational.ipynb

Lines changed: 7 additions & 5 deletions
@@ -1033,17 +1033,19 @@
 ],
 "source": [
 "messages = [\n",
-" {\"role\" : \"user\", \"content\" : \"Solve (x + 2)^2 = 0.\"}\n",
+" {\"role\" : \"user\", \"content\" : [{\"type\": \"text\", \"text\": \"Solve (x + 2)^2 = 0.\"}]}\n",
 "]\n",
-"text = tokenizer.apply_chat_template(\n",
+"inputs = tokenizer.apply_chat_template(\n",
 " messages,\n",
-" tokenize = False,\n",
+" tokenize = True,\n",
 " add_generation_prompt = True, # Must add for generation\n",
-")\n",
+" return_tensors = \"pt\",\n",
+" return_dict = True,\n",
+").to(\"cuda\")\n",
 "\n",
 "from transformers import TextStreamer\n",
 "_ = model.generate(\n",
-" **tokenizer(text, return_tensors = \"pt\").to(\"cuda\"),\n",
+" **inputs,\n",
 " max_new_tokens = 1024, # Increase for longer outputs!\n",
 " temperature = 0.7, top_p = 0.95,\n",
 " streamer = TextStreamer(tokenizer, skip_prompt = True),\n",

python_scripts/Kaggle-Magistral_(24B)-Reasoning-Conversational.py

Lines changed: 31 additions & 29 deletions
@@ -7,34 +7,34 @@
 # <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
 # <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
 # </div>
-#
+#
 # To install Unsloth your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
-#
+#
 # You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)
-#
+#

 # ### News

-#
+#
 # Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).
-#
+#
 # [gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!
-#
+#
 # Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.
-#
+#
 # Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).
-#
+#
 # Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).
-#
+#

 # # ### Installation
-#
+#
 # # In[ ]:
-#
-#
+#
+#
 # get_ipython().run_cell_magic('capture', '', 'import os\n\n!pip install pip3-autoremove\n!pip install torch torchvision torchaudio xformers --index-url https://download.pytorch.org/whl/cu128\n!pip install unsloth\n!pip install transformers==4.56.2\n!pip install --no-deps trl==0.22.2\n')
-#
-#
+#
+#
 # # ### Unsloth

 # In[ ]:
@@ -240,17 +240,19 @@ def generate_conversation(example):


 messages = [
-    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
+    {"role" : "user", "content" : [{"type": "text", "text": "Solve (x + 2)^2 = 0."}]}
 ]
-text = tokenizer.apply_chat_template(
+inputs = tokenizer.apply_chat_template(
     messages,
-    tokenize = False,
+    tokenize = True,
     add_generation_prompt = True, # Must add for generation
-)
+    return_tensors = "pt",
+    return_dict = True,
+).to("cuda")

 from transformers import TextStreamer
 _ = model.generate(
-    **tokenizer(text, return_tensors = "pt").to("cuda"),
+    **inputs,
     max_new_tokens = 1024, # Increase for longer outputs!
     temperature = 0.7, top_p = 0.95,
     streamer = TextStreamer(tokenizer, skip_prompt = True),
@@ -260,7 +262,7 @@ def generate_conversation(example):
 # <a name="Save"></a>
 # ### Saving, loading finetuned models
 # To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.
-#
+#
 # **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

 # In[21]:
@@ -287,7 +289,7 @@ def generate_conversation(example):


 # ### Saving to float16 for VLLM
-#
+#
 # We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

 # In[ ]:
@@ -316,12 +318,12 @@ def generate_conversation(example):

 # ### GGUF / llama.cpp Conversion
 # To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
-#
+#
 # Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
 # * `q8_0` - Fast conversion. High resource use, but generally acceptable.
 # * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
 # * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
-#
+#
 # [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

 # In[24]:
@@ -358,22 +360,22 @@ def generate_conversation(example):


 # Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.
-#
+#
 # And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
-#
+#
 # Some other links:
 # 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
 # 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
 # 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
 # 6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!
-#
+#
 # <div class="align-center">
 # <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
 # <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
 # <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
-#
+#
 # Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
 # </div>
-#
+#
 # This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
-#
+#
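The save-related prose in this script changed only in trailing whitespace, but since it documents the export paths, here is a minimal sketch of the three options it names: LoRA adapters, merged float16 for vLLM, and GGUF for llama.cpp. It assumes Unsloth's save helpers (the local counterparts of the `push_to_hub_merged` / `push_to_hub_gguf` methods the text names); the output paths and the `hf_...` token are illustrative placeholders:

# Output names are hypothetical; only the method and option names come from the notebook text.

# 1) LoRA adapters only (small files; NOT the full model)
model.save_pretrained("magistral_lora")            # local save
tokenizer.save_pretrained("magistral_lora")
# model.push_to_hub("your-name/magistral_lora", token = "hf_...")  # online save

# 2) Merged float16 weights for vLLM ("merged_16bit"; "merged_4bit" for int4)
model.save_pretrained_merged("magistral_merged", tokenizer, save_method = "merged_16bit")

# 3) GGUF for llama.cpp (defaults to q8_0; q4_k_m / q5_k_m also supported)
model.save_pretrained_gguf("magistral_gguf", tokenizer, quantization_method = "q4_k_m")
# model.push_to_hub_gguf("your-name/magistral_gguf", tokenizer, quantization_method = "q4_k_m", token = "hf_...")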

python_scripts/Magistral_(24B)-Reasoning-Conversational.py

Lines changed: 31 additions & 29 deletions
@@ -7,34 +7,34 @@
 # <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
 # <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
 # </div>
-#
+#
 # To install Unsloth your local device, follow [our guide](https://docs.unsloth.ai/get-started/install-and-update). This notebook is licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
-#
+#
 # You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)
-#
+#

 # ### News

-#
+#
 # Unsloth's [Docker image](https://hub.docker.com/r/unsloth/unsloth) is here! Start training with no setup & environment issues. [Read our Guide](https://docs.unsloth.ai/new/how-to-train-llms-with-unsloth-and-docker).
-#
+#
 # [gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) is now supported with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which creates kernels!
-#
+#
 # Introducing [Vision](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) and [Standby](https://docs.unsloth.ai/basics/memory-efficient-rl) for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.
-#
+#
 # Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).
-#
+#
 # Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).
-#
+#

 # # ### Installation
-#
+#
 # # In[ ]:
-#
-#
+#
+#
 # get_ipython().run_cell_magic('capture', '', 'import os, re\nif "COLAB_" not in "".join(os.environ.keys()):\n !pip install unsloth\nelse:\n # Do this only in Colab notebooks! Otherwise use pip install unsloth\n import torch; v = re.match(r"[0-9]{1,}\\.[0-9]{1,}", str(torch.__version__)).group(0)\n xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")\n !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo\n !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer\n !pip install --no-deps unsloth\n!pip install transformers==4.56.2\n!pip install --no-deps trl==0.22.2\n')
-#
-#
+#
+#
 # # ### Unsloth

 # In[ ]:
@@ -240,17 +240,19 @@ def generate_conversation(example):


 messages = [
-    {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
+    {"role" : "user", "content" : [{"type": "text", "text": "Solve (x + 2)^2 = 0."}]}
 ]
-text = tokenizer.apply_chat_template(
+inputs = tokenizer.apply_chat_template(
     messages,
-    tokenize = False,
+    tokenize = True,
     add_generation_prompt = True, # Must add for generation
-)
+    return_tensors = "pt",
+    return_dict = True,
+).to("cuda")

 from transformers import TextStreamer
 _ = model.generate(
-    **tokenizer(text, return_tensors = "pt").to("cuda"),
+    **inputs,
     max_new_tokens = 1024, # Increase for longer outputs!
     temperature = 0.7, top_p = 0.95,
     streamer = TextStreamer(tokenizer, skip_prompt = True),
@@ -260,7 +262,7 @@ def generate_conversation(example):
 # <a name="Save"></a>
 # ### Saving, loading finetuned models
 # To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.
-#
+#
 # **[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

 # In[21]:
@@ -287,7 +289,7 @@ def generate_conversation(example):


 # ### Saving to float16 for VLLM
-#
+#
 # We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

 # In[ ]:
@@ -316,12 +318,12 @@ def generate_conversation(example):

 # ### GGUF / llama.cpp Conversion
 # To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.
-#
+#
 # Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
 # * `q8_0` - Fast conversion. High resource use, but generally acceptable.
 # * `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
 # * `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
-#
+#
 # [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

 # In[24]:
@@ -358,22 +360,22 @@ def generate_conversation(example):


 # Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp.
-#
+#
 # And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
-#
+#
 # Some other links:
 # 1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
 # 2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
 # 3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
 # 6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!
-#
+#
 # <div class="align-center">
 # <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
 # <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
 # <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>
-#
+#
 # Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
 # </div>
-#
+#
 # This notebook and all Unsloth notebooks are licensed [LGPL-3.0](https://github.com/unslothai/notebooks?tab=LGPL-3.0-1-ov-file#readme).
-#
+#
