Long wait time between evaluate and save (checkpoint creation)

I’m experimenting with Whisper fine-tuning and I’m seeing unreasonably long wait times between the end of the evaluation phase and the creation of the checkpoint.

During this period, I don’t see any meaningful work in Windows Task Manager: no GPU CUDA/copy activity, CPU at ~13%, and no disk activity (only the core running the main thread shows slightly higher usage).

In the following example, I use an augmented Common Voice language set with eval_steps=save_steps=600. I’m using Trainer.train() on GPU.

Here are the durations on an i7-8900K and RTX 3090 for those 600 steps:

Train Phase: 0:21:43
Eval Phase: 0:08:28
Extra Wait : 0:13:15
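To put those numbers in proportion, the three durations above can be added up to see how much of each 600-step cycle is lost to the wait. This is only arithmetic on the reported figures; the helper name `parse_hms` is made up for illustration.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Durations reported above for one 600-step cycle.
phases = {
    "train": parse_hms("0:21:43"),
    "eval": parse_hms("0:08:28"),
    "wait": parse_hms("0:13:15"),
}

cycle = sum(phases.values(), timedelta())
wait_share = phases["wait"] / cycle  # timedelta / timedelta gives a float

print(f"cycle length: {cycle}")                 # 0:43:26
print(f"share spent waiting: {wait_share:.1%}")  # 30.5%
```

So roughly 30% of the wall-clock time per cycle goes to the unexplained wait.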

After the wait, the process continues. In case it is relevant, I’m using the following settings:

    per_device_train_batch_size = 64,
    gradient_accumulation_steps = 1,
    per_device_eval_batch_size = 16,
    eval_accumulation_steps = 1,

    optim = "adamw_torch",
    tf32 = True,
    fp16 = True,

    gradient_checkpointing = True,
    predict_with_generate = True,

...etc...

Where should I look? What can be the culprit?
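One generic way to find out what a seemingly idle Python process is doing is to dump its thread stacks at regular intervals with the standard-library `faulthandler` module. The sketch below is not specific to Trainer; the function names `start_stack_dumps` and `stop_stack_dumps` and the 60-second default are assumptions chosen for illustration.

```python
import faulthandler
import sys


def start_stack_dumps(interval_s=60.0, log_file=None):
    """Dump the traceback of every thread each `interval_s` seconds.

    `log_file` must stay open and be backed by a real file descriptor;
    it defaults to sys.stderr.
    """
    target = log_file if log_file is not None else sys.stderr
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=target)


def stop_stack_dumps():
    """Cancel the periodic dumps started by start_stack_dumps()."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `start_stack_dumps()` before `Trainer.train()` and `stop_stack_dumps()` afterwards would print the main thread's current stack once a minute. If the same frames keep showing up during the wait, that function is the place to look.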

Here is the CPU usage during the wait period. Besides the Python process, I have browsers and VS Code open.

[Image: CPU usage during the wait period]

I’m having the same problem. Did you solve it in some way?

Thanks a lot in advance.

No, I lived with it. But I haven’t trained anything for a while, so newer versions may be better, I hope.

When posting this, I forgot that this period is the backpropagation, which seems to run on a single core (I think it is the last logical core). I wouldn’t have expected it to take so long, though.
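A guess like this can be checked by timing the suspected single-core steps directly. A minimal sketch, assuming the candidate is the metrics function passed to the trainer as `compute_metrics`: the `timed` decorator and the dummy function body are illustrative only, and the returned value is a placeholder, not a real result.

```python
import functools
import time


def timed(label):
    """Decorator that prints how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                print(f"[{label}] {func.__name__} took {elapsed:.2f} s")
        return wrapper
    return decorator


# Hypothetical stand-in for the real metrics function; replace the body
# with the actual decoding and scoring code.
@timed("metrics")
def compute_metrics(eval_prediction):
    time.sleep(0.05)  # pretend to decode predictions and score them
    return {"placeholder_metric": 0.0}
```

If the printed duration accounts for most of the 13 minutes, the wait is metric computation on the CPU rather than anything in the save itself. If it is small, the time is going somewhere else.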

Just a guess, without looking into details…

It did not get better, I’m afraid.
I opened another topic, hoping someone more experienced can explain.

I solved it by implementing the training with PyTorch Lightning.

This example pushes the model to the Hub, but you can just as easily save a classic .ckpt checkpoint in torch format to reload later.

@FDM1, wrong thread?

No, no. If you use PyTorch Lightning for training, it is faster both at training and at saving the model.
The link is an example implementation for a custom dataset.

Oh, never used it, I’ll look at it. Thank you for sharing.

Hi @FDM1, can you point me to a resource I can use to learn how to switch to it?
I am trying to fine-tune Whisper and saving checkpoints takes 2 hours!
How do I implement this?

Thanks in advance.