Hi. I’m struggling with the date format to use in my datasets to fine-tune QwenX.X and Gemma-X.X models. The models always infer the wrong dates after fine-tuning. I tried several formats, shown in the examples below, to no avail. I would appreciate any guidance on the right date format to use, or on how to discover it for a given model.
Examples of the date formats I tested, none of which worked:
user: When did XYZ work for ABC. Assistant: From January 2021 till December 2021.
user: When did XYZ work for ABC. Assistant: From 01, 2021 till 12, 2021.
user: When did XYZ work for ABC. Assistant: From 01-01-2021 till 12-01- 2021.
user: When did XYZ work for ABC. Assistant: From 01-01-2021 till 12-01- 2021 .
user: When did XYZ work for ABC. Assistant: From January 2021 01-01-2021 till December 2021 12-01- 2021 .
The issue is usually consistency and how the training examples are formatted. Avoid mixing natural language, commas, spaces, and different date styles in the same dataset. Also avoid adding fake exact days if your source data only has month/year.
I’d use one clear format everywhere, for example:
From 2021-01 to 2021-12
or, if you need exact dates:
From 2021-01-01 to 2021-12-01
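If it helps, here is a minimal sketch of how existing answers could be rewritten into one format before training. It assumes the dates appear either as an English month name plus a year, or as MM-DD-YYYY with possible stray spaces, as in the examples above. The name normalize_dates and both regexes are only my illustration, not part of any library. Month/year values stay as YYYY-MM, so no day gets invented.

```python
import re

MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# "January 2021" -> "2021-01" (month and year only, no day is invented)
MONTH_YEAR = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b", re.IGNORECASE)

# "12-01- 2021" -> "2021-12-01" (ASSUMES MM-DD-YYYY; tolerates stray spaces)
NUMERIC = re.compile(r"\b(\d{2})\s*-\s*(\d{2})\s*-\s*(\d{4})\b")


def normalize_dates(text: str) -> str:
    """Rewrite the supported date styles in `text` as ISO-style dates."""
    text = NUMERIC.sub(lambda m: f"{m.group(3)}-{m.group(1)}-{m.group(2)}", text)
    text = MONTH_YEAR.sub(
        lambda m: f"{m.group(2)}-{MONTHS[m.group(1).lower()]:02d}", text
    )
    return text


print(normalize_dates("From January 2021 till December 2021."))
# From 2021-01 till 2021-12.
print(normalize_dates("From 01-01-2021 till 12-01- 2021."))
# From 2021-01-01 till 2021-12-01.
```

If your numeric dates are actually DD-MM-YYYY, swap groups 1 and 2 in the NUMERIC replacement.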
Thanks @Vultieris, I appreciate your valuable feedback. I’ll test it. Cheers!
Unfortunately, it didn’t work out. My fine-tuned model is still making up invalid dates. I’m beginning to think the issue lies outside the date format itself. I’ll keep digging!
I should add that I have seen this issue across the board while fine-tuning several QwenX.X and LlamaX.X models using SFT. I traced the tokenization of the dates and displayed the trainer’s batch input-ids right before training started. They matched the dates found in the dataset.
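For anyone who wants to repeat that kind of check, a round-trip test could be sketched roughly as below. It assumes a tokenizer object exposing encode and decode, as Hugging Face tokenizers do. WhitespaceTokenizer is a hypothetical stand-in so the snippet runs without downloading a model, and dates_survive_round_trip is just an illustrative name.

```python
import re

# Matches YYYY-MM and YYYY-MM-DD
ISO_DATE = re.compile(r"\d{4}-\d{2}(?:-\d{2})?")


class WhitespaceTokenizer:
    """Hypothetical stand-in; replace with a real tokenizer for an actual check."""

    def __init__(self):
        self.vocab = {}
        self.inverse = []

    def encode(self, text):
        ids = []
        for token in text.split(" "):
            if token not in self.vocab:
                self.vocab[token] = len(self.inverse)
                self.inverse.append(token)
            ids.append(self.vocab[token])
        return ids

    def decode(self, ids):
        return " ".join(self.inverse[i] for i in ids)


def dates_survive_round_trip(tokenizer, text):
    """Return True if the dates in `text` are unchanged after encode -> decode."""
    decoded = tokenizer.decode(tokenizer.encode(text))
    return ISO_DATE.findall(text) == ISO_DATE.findall(decoded)


tok = WhitespaceTokenizer()
print(dates_survive_round_trip(tok, "From 2021-01 to 2021-12"))
# True
```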