T5 decoder predicting tokens even after hitting end of sequence token, i.e </s>

hwaseem04 · February 24, 2024, 10:23am

I am using a T5 model for a seq2seq task. I made sure to replace padding tokens with -100 in the labels. Below is my tokenization function:

max_source_length = 90
max_target_length = 90
def tokenization_function(batch):
    # Tokenize the source texts, padded/truncated to a fixed length.
    model_inputs = tokenizer(batch['user_request'], padding="max_length", max_length=max_source_length, truncation=True, return_tensors="pt")
    # Tokenize the target texts the same way.
    labels = tokenizer(batch['command'], padding="max_length", max_length=max_target_length, truncation=True, return_tensors="pt")
    model_inputs["decoder_attention_mask"] = labels['attention_mask']
    labels = labels["input_ids"]
    # Replace padding token IDs with -100 so the loss ignores those positions.
    labels[labels == tokenizer.pad_token_id] = -100
    model_inputs["labels"] = labels
    return model_inputs

tokenized_dataset = dataset.map(tokenization_function, batched=True, batch_size=1024)
tokenized_dataset
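
For context on the -100 replacement: PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, which defaults to -100, so padded label positions do not contribute to training. The following is a minimal plain-Python sketch of that idea, not the library's implementation. `masked_nll` is a hypothetical helper, and the log-probabilities are made-up illustration values.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def masked_nll(log_probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index.

    log_probs: list of dicts mapping token ID -> log-probability, one per position.
    labels: list of target token IDs, with ignore_index at padded positions.
    Returns 0.0 when every position is ignored (a choice made for this sketch).
    """
    kept = [-lp[y] for lp, y in zip(log_probs, labels) if y != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0


# Made-up example: two real target tokens followed by one padded position.
example_log_probs = [
    {1041: math.log(0.5)},
    {1: math.log(0.25)},
    {0: math.log(0.9)},  # never read, because its label is -100
]
example_labels = [1041, 1, -100]
print(masked_nll(example_log_probs, example_labels))
```

Because masked positions never enter the loss, the model receives no training signal about what to output there.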

After training, I run inference using the script below:

import torch

with torch.no_grad():
    for step, batch in enumerate(eval_dataloader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Forward pass with labels; logits holds one prediction per label position.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        pred = torch.argmax(outputs['logits'], dim=-1)
        for i, p in enumerate(pred):
            # Cut each prediction at the first end-of-sequence token (ID 1), if any.
            eos_positions = torch.where(p == 1)[0]
            if eos_positions.size(0) != 0:
                seq = p[:eos_positions[0]].reshape(1, -1)
            else:
                seq = p.reshape(1, -1)
            pred_text = tokenizer.batch_decode(seq)
            print(batch['command'][i])
            print(pred_text[0])
            print()
        break  # only inspect the first batch

For instance, pred[0] has the following value after applying argmax:

tensor([ 1041,   834,  6583,   283, 26479,  3876,   834,  6583,     3,  4254,
        25528,    16, 10646,   834,  5540,  5839,   804,   834,  5540, 15959,
         3856,    15,    44, 15959,  3138,     1,     1,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,
         1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041,  1041],
       device='cuda:0')

Shouldn't the decoder's autoregression stop after predicting ID 1? As far as I understand, 1 corresponds to the end-of-sequence token '</s>'. Instead, why am I getting 1041 until the max length of 90 is reached? Is this the desired output? What should I do to stop my prediction right after the </s> token is predicted?

I am a beginner with language models, so please feel free to point out any other issues in the snippets.

cc: @nielsr

nielsr · February 24, 2024, 11:06am

Hi,

At inference time, it’s recommended to use the generate() method which takes care of autoregressive generation.

See my notebooks regarding fine-tuning T5 for a seq2seq task: Transformers-Tutorials/T5 at master · NielsRogge/Transformers-Tutorials · GitHub. They include an inference section.
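
To make the difference concrete, here is a minimal sketch in plain Python. It is not the Transformers implementation: `toy_next_token` and its transition table are hypothetical stand-ins for a trained decoder, and only the IDs 0 (`<pad>`) and 1 (`</s>`) come from this thread. A single forward pass with labels yields one prediction per label position and has no notion of stopping, whereas a generation loop feeds each prediction back in and checks for the end-of-sequence ID after every step.

```python
EOS_ID = 1  # T5's </s> token ID, as in the thread
PAD_ID = 0  # T5's <pad> token ID, also used as the decoder start token

# Hypothetical stand-in for a trained decoder: maps the previous token to the next one.
TOY_TRANSITIONS = {PAD_ID: 1041, 1041: 834, 834: 6583, 6583: EOS_ID}


def toy_next_token(prev_token):
    """Return the toy model's most likely next token given the previous token.

    Contexts the toy model never learned (anything after </s>, or masked
    label positions) fall back to an arbitrary token, here 1041.
    """
    return TOY_TRANSITIONS.get(prev_token, 1041)


def teacher_forced_predictions(labels):
    """One forward pass: predict a token for every label position.

    The decoder input is the labels shifted right by one, so a prediction
    is produced at every position, including those after </s>.
    """
    decoder_inputs = [PAD_ID] + labels[:-1]
    return [toy_next_token(tok) for tok in decoder_inputs]


def greedy_generate(max_length):
    """Autoregressive loop: feed each prediction back in and stop at </s>."""
    output = [PAD_ID]  # decoder start token
    while len(output) < max_length:
        next_token = toy_next_token(output[-1])
        output.append(next_token)
        if next_token == EOS_ID:
            break
    return output


print(teacher_forced_predictions([1041, 834, 6583, 1, -100, -100, -100]))
print(greedy_generate(max_length=10))
```

In the first call the positions after `</s>` still receive predictions, which in this toy are the arbitrary fallback token. In the second call the loop ends as soon as `</s>` is produced.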

hwaseem04 · February 25, 2024, 8:45am

Thank you. I no longer get the arbitrary token value after the end of sequence, i.e. I am not getting 1041. But I am still getting 0s, which correspond to padding tokens.

I generated generated_ids using: generated_ids = model.generate(input_ids, do_sample=False, max_length=max_target_length)

One sample from generated_ids:

tensor([    0,  1041,   834,  6583,   283, 26479,  3876,   834,  6583,     3,
         3463,     4,   382,    16, 10646,   834,  5540,  5839,   804,   834,
         5540, 15959,  3138,  1499,     1,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0], device='cuda:0')

After decoding with tokenizer.batch_decode(generated_ids), I get:

<pad> action_para MOVE component_para TEXT intial_state none final_state swapped text</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>

Ideally the padding values should go away, right? Am I missing something?

PS: Apologies, it works when I add skip_special_tokens=True in generated_ids = model.generate(input_ids, do_sample=False, max_length=max_target_length, skip_special_tokens=True) as in your example notebooks. Thank you.

nielsr · February 26, 2024, 2:46pm

No, that seems correct: the model has generated the end-of-sequence token (ID 1), after which generation stops. One usually also passes skip_special_tokens=True to the batch_decode method in order to skip special tokens (such as end-of-sequence or padding tokens):

generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
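
If you need the trimmed token IDs rather than decoded text, the same clean-up can be sketched on plain Python lists. `trim_generated` is a hypothetical helper, not a Transformers function, and the IDs 0 and 1 are the T5 padding and end-of-sequence IDs seen in this thread.

```python
PAD_ID = 0  # <pad>, which T5 also uses as the decoder start token
EOS_ID = 1  # </s>


def trim_generated(ids, pad_id=PAD_ID, eos_id=EOS_ID):
    """Drop the leading decoder start token and everything from the first </s> on."""
    if ids and ids[0] == pad_id:
        ids = ids[1:]
    if eos_id in ids:
        ids = ids[:ids.index(eos_id)]
    return ids


# Shortened version of the sample above: start token, content, </s>, padding.
sample = [0, 1041, 834, 6583, 283, 1, 0, 0, 0]
print(trim_generated(sample))  # [1041, 834, 6583, 283]
```

For decoded text, skip_special_tokens=True remains the simpler route.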
