I am training gemma3-12b-it on a standard preference dataset. When I run `accelerate launch train.py` in full precision, the training curve looks reasonable. However, when I switch to fp16, the logs suddenly show loss=0, grad_norm=0, reward=nan, and so on. Are multimodal models restricted to full-precision training?
```
from datasets import load_dataset
from trl import RewardTrainer, RewardConfig, DPOConfig, DPOTrainer
from peft import LoraConfig, TaskType
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gemma-3-12b-it"
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained(model_name)
train_dataset = load_dataset("json", data_files="training_data.json", split="train")
tokenizer.pad_token = tokenizer.eos_token

def process_training_data(example):
    example["prompt"] = example.pop("input")
    example['rejected'] = example['rejected'][0]
    return example

train_dataset = train_dataset.map(process_training_data)

training_args = DPOConfig(
    dataloader_pin_memory=False,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    logging_steps=10,
    # fp16=True
)
training_args.optimize_cuda_cache=True

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ]
)

trainer = DPOTrainer(model=model,
                     args=training_args,
                     processing_class=tokenizer,
                     train_dataset=train_dataset,
                     peft_config=peft_config)
trainer.train()
```
Perhaps this is a mixed-precision training issue?
opened 07:58AM - 23 Jul 23 UTC
closed 08:03AM - 31 Aug 23 UTC
### System Info
pytorch 1.13.1
transformers==4.31.0
### Who can help?
… Hi @sgugger ,
I used transformers 4.31.0 to train a Llama model with LoRA. I observed some problems with --fp16 training, and I'm not sure whether it is a bug in Trainer.py:
My model looks like this:
```
class MyModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.model_name = model_name
        self.base_model = LlamaForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
        self.base_model = get_peft_model(self.base_model, lora_config)
        self.other_modules = nn.Linear(4096, 4096)
```
I used the Trainer to train the model with the following command line:
`torchrun --nproc_per_node=4 main.py --max_steps 100000 --fp16`
I find that the model's gradients (in self.optimizer in the Trainer) are fp32, not fp16. Is that correct?
Also, I find that no gradient scaling is performed during training, since self.do_grad_scaling is always False (because self.sharded_ddp is None and args.half_precision_backend is always "auto"). The current trainer.py does not set up args.half_precision_backend and the scaler correctly when self.sharded_ddp is None. Are these observations expected? I'm a little confused about why setting up args.half_precision_backend and the scaler requires sharded_ddp. During training, I often see the loss become NaN, and I'm not sure whether that is because no gradient scaling is performed and half_precision_backend is not set up correctly.
Below are my grad_norm values (before gradient clipping) with and without --fp16. (My base model here is "JackFram/llama-160m", for debugging.) **The results are significantly different.**
Without --fp16:
Step 1: grad_norm=0.059
Step 5: grad_norm=0.054
Step 10: grad_norm=0.048
Step 15: grad_norm=0.050
Step 20: grad_norm=0.050
With --fp16:
Step 1: grad_norm=nan
Step 5: grad_norm=129.88
Step 10: grad_norm=126.98
Step 15: grad_norm=149.58
Step 20: grad_norm=80.7
```
import torch

def compute_grad_norm(optimizer):  # the function to compute grad_norm
    total_norm = 0.0
    for group in optimizer.param_groups:
        for param in group['params']:
            if param.grad is not None:
                # L2 norm of this parameter's gradient
                param_norm = param.grad.data.norm(2)
                total_norm += param_norm.item() ** 2
    # global norm = sqrt of the sum of squared per-parameter norms
    total_norm = torch.sqrt(torch.tensor(total_norm))
    return total_norm
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
### Expected behavior
`do_grad_scaling=True` when `--fp16` is enabled, and the loss should rarely become NaN.
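For context on why missing gradient scaling could matter under fp16, here is a minimal, dependency-free sketch. It is not the Trainer's actual code path: it only emulates fp16 rounding with the standard library's `struct` half-precision format. The gradient value `1e-8` and the scale `2**16` are illustrative assumptions, and `to_fp16` is a hypothetical helper.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

tiny_grad = 1e-8          # assumed gradient magnitude, below fp16's smallest subnormal
loss_scale = 2.0 ** 16    # assumed scale factor

# Without scaling, the value underflows to zero once stored in fp16.
unscaled = to_fp16(tiny_grad)

# With scaling, the value fits in fp16; unscaling happens in higher precision.
scaled = to_fp16(tiny_grad * loss_scale)
recovered = scaled / loss_scale

print(unscaled)   # underflowed value
print(recovered)  # close to the original gradient
```

A real scaler such as `torch.cuda.amp.GradScaler` additionally skips the optimizer step and lowers the scale when it detects inf/nan gradients; this sketch only shows the underflow side.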
Could you check the dtype of the LoRA parameters after model initialization? Specifically, are they float16 or float32?
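One way to run that check is sketched below. `summarize_param_dtypes` is a hypothetical helper, not part of PEFT or TRL. It assumes LoRA parameter names contain `lora_` (as PEFT's `lora_A`/`lora_B` weights do) and accepts anything shaped like the output of `model.named_parameters()`.

```python
from collections import Counter

def summarize_param_dtypes(named_parameters, name_filter="lora_"):
    """Count parameters by (dtype, trainable) among names containing name_filter."""
    counts = Counter()
    for name, param in named_parameters:
        if name_filter in name:
            counts[(str(param.dtype), bool(param.requires_grad))] += 1
    return dict(counts)

# Intended usage with the trainer from the original post
# (assumes the model and trainer have already been built):
# print(summarize_param_dtypes(trainer.model.named_parameters()))
```

Passing `name_filter=""` includes every parameter, which also shows the dtype of the frozen base weights.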