System Info
Hi, this surfaced through TRL, but it seems like a lower-level issue.
I'm fine-tuning a variant of Qwen3 (Intern-S1-mini) without the vision tower, so it's effectively Qwen3-8B. While comparing attention implementations (SDPA vs. Flash Attention 2), I've been getting strange results: downstream test accuracy differs between the two, with FA2 worse. The gap is accentuated by gradient accumulation. I'm not sure of the best way to share a reproduction, since my current code wraps the HF Trainer for my own convenience.
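For context, one common mechanism behind SDPA/FA2 mismatches is that the two kernels reduce the same math in different orders and precisions, and floating-point addition is not associative, so small per-step discrepancies are expected and can compound across gradient-accumulation steps. A minimal pure-Python sketch of order-dependent rounding (not the actual kernels, just the rounding effect):

```python
import struct

def to_fp32(x):
    """Round a Python float to float32 precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Same values, two summation orders, fp32 rounding after each step.
vals = [1e8, 1.0, -1e8, 1.0]   # exact sum is 2.0

fwd = 0.0
for v in vals:
    fwd = to_fp32(fwd + v)

rev = 0.0
for v in reversed(vals):
    rev = to_fp32(rev + v)

print(fwd, rev)  # → 1.0 0.0 — identical math, different results
```

Neither order recovers the exact sum of 2.0, and the two orders disagree with each other, which is why bitwise-equal losses across attention backends should not be expected; the question here is why the discrepancy grows into a measurable accuracy gap.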
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Here are the current values of my config:

```json
{
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "optim": "paged_adamw_8bit",
    "evaluation_strategy": "epoch",
    "weight_decay": 0.1,
    "gradient_checkpointing": true,
    "use_liger_kernel": true,
    "num_train_epochs": 1,
    "learning_rate": 8e-05,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 0,
    "warmup_ratio": 0.1,
    "report_to": "wandb",
    "run_name": "finetune_Tox_internlm_Intern-S1-mini",
    "logging_steps": 1,
    "logging_strategy": "steps",
    "save_strategy": "no",
    "remove_unused_columns": false,
    "seed": 42,
    "completion_only_loss": false,
    "dataset_text_field": "text",
    "packing": false,
    "padding_free": false,
    "loss_type": "nll"
}
```
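A minimal sketch of the training setup, assuming TRL's `SFTTrainer`/`SFTConfig` (my real code wraps the HF Trainer; the field names come from the config dump above, while the repo id and `train_dataset` are placeholders):

```python
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# "flash_attention_2" vs. "sdpa" is the only thing toggled between runs.
model = AutoModelForCausalLM.from_pretrained(
    "internlm/Intern-S1-mini",           # placeholder repo id
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
)

args = SFTConfig(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,       # gap shrinks when this is 1
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    use_liger_kernel=True,
    num_train_epochs=1,
    learning_rate=8e-05,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    dataset_text_field="text",
    packing=False,
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,         # placeholder dataset
)
trainer.train()
```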
Expected behavior
Both attention implementations should yield comparable test accuracy, with no dependence on the gradient accumulation setting.