Target {} is out of bounds

Hi,
I am following this fantastic notebook to fine-tune a multi-class classifier.

Context:

  1. I am using my own dataset.
  2. The dataset is a CSV file with two columns, text and label.
  3. Labels are all numbers.
  4. I have 7 labels.
  5. When loading the pre-trained model, I am assigning num_labels=7:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",num_labels=7)

When training, I am receiving this error:

/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   2844     if size_average is not None or reduce is not None:
   2845         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 2846     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
   2847 
   2848 

IndexError: Target 7 is out of bounds.

I have tried changing the number of labels to 2 and 5, but that didn’t solve the issue; I still get the out-of-bounds error.
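
For what it is worth, lowering num_labels cannot clear this error: cross_entropy expects every target to lie in 0..num_labels-1, so a label value of 7 is only valid when num_labels is at least 8. Below is a minimal sketch of a sanity check that could be run before training. It is plain Python; the function name and the sample labels are made up for illustration and are not part of the original notebook.

```python
def find_out_of_range_labels(labels, num_labels):
    """Return the sorted label values that fall outside 0..num_labels-1."""
    return sorted({label for label in labels if not 0 <= label < num_labels})


# Hypothetical labels numbered 1..7, as in a CSV that starts counting at 1.
sample_labels = [1, 2, 3, 4, 5, 6, 7, 2, 5]

bad = find_out_of_range_labels(sample_labels, num_labels=7)
print(bad)  # [7]
```

The label column of the training split (for example tokinized_jobs["train"]["label"]) could be passed in as labels. An empty result means every label is a valid class index for that num_labels.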

Training arguments:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokinized_jobs["train"],
    eval_dataset=tokinized_jobs["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

And here is what the tokenized data looks like:

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 1598
    })
    test: Dataset({
        features: ['attention_mask', 'input_ids', 'label', 'text', 'token_type_ids'],
        num_rows: 400
    })
})

Sample:

{
'attention_mask': [1, 1, 1, 1, 1, 1, 1],
 'input_ids': [101, 1015, 1011, 2095, 3325, 6871, 102],
 'label': 2,
 'text': '1-year experience preferred',
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0]
}

I tried it on Colab with GPU and TPU.

Any idea what the issue is?

I found the solution. It was an indexing issue with my labels.
My labels ran from 1 to 8; I changed them to 0…7, and that fixed the issue for me.
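
A minimal sketch of that relabelling, assuming the labels are available as a plain Python list. The names build_label_mapping, remap_labels and label2id are illustrative and not taken from the original notebook.

```python
def build_label_mapping(labels):
    """Map each distinct label value to a contiguous id starting at 0."""
    return {value: idx for idx, value in enumerate(sorted(set(labels)))}


def remap_labels(labels, label2id):
    """Replace every original label with its zero-based id."""
    return [label2id[value] for value in labels]


# Hypothetical 1-based labels standing in for the CSV's label column.
original = [1, 3, 2, 7, 5, 4, 6, 1]

label2id = build_label_mapping(original)
zero_based = remap_labels(original, label2id)
num_labels = len(label2id)

print(zero_based)  # [0, 2, 1, 6, 4, 3, 5, 0]
print(num_labels)  # 7
```

Deriving num_labels from the mapping keeps it in step with the data, and inverting label2id turns predicted ids back into the original label values.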

Credit to this answer on Stack Overflow.

Hope this will help someone in the future.

Thanks, this worked for me

How do we go about changing the labels for a semantic segmentation mask? I’ve attempted np.where() and converting back to a PIL image, with no success.
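
One possible approach, sketched under the assumption that the mask is (or can be converted to) a 2-D NumPy integer array: collect the distinct class values and replace each pixel with the index of its value. The function name remap_mask and the sample mask values are made up for illustration.

```python
import numpy as np


def remap_mask(mask):
    """Return (remapped mask, original values) with classes renumbered 0..N-1."""
    values = np.unique(mask)                   # sorted distinct class values
    remapped = np.searchsorted(values, mask)   # index of each pixel's value
    # The uint8 cast assumes the mask has at most 256 distinct classes.
    return remapped.astype(np.uint8), values


# Hypothetical mask whose classes are encoded as 1, 2 and 255.
mask = np.array([[1, 1, 2],
                 [2, 255, 255]])

remapped, values = remap_mask(mask)
print(remapped.tolist())  # [[0, 0, 1], [1, 2, 2]]
print(values.tolist())    # [1, 2, 255]
```

With Pillow, np.asarray(image) and Image.fromarray(remapped) convert to and from a PIL image. If one of the original values marks pixels to be ignored, it would need separate handling rather than a class index of its own.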

For anyone who has already ensured their labels are zero-indexed, what worked for me was to specify num_labels as a parameter to the model:

model = AutoModelForSequenceClassification.from_pretrained("FacebookAI/roberta-base", num_labels=52)

Thanks, this one saved me: num_labels = 52

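I used AutoConfig:

from transformers import AutoConfig, RobertaForSequenceClassification

# model_id is the checkpoint name or path, defined elsewhere
config = AutoConfig.from_pretrained(model_id)

# Set the number of classes on the config before building the model
config.update({"num_labels": 7})

model = RobertaForSequenceClassification.from_pretrained(model_id, config=config)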