How to pass input to a Reward Model and make sense of its output?

sraj · February 13, 2024, 12:52pm

I am in the process of doing RLHF on LLaMA 2 13B. One of the steps is training a reward model.
I built a custom dataset of text pairs in which one text is better and the other is comparatively worse. It is very similar to the "chosen" and "rejected" example in the official TRL library documentation (Reward Modeling).

The reward model trained successfully (the eval accuracy in the logs was about 67%, but that's a story for a different day).

Now what I would like to do is to actually pass an input and see the output of the Reward model.

However I can’t seem to make any sense of what the reward model outputs.

For example, I built the input as follows:

chosen = "This is the chosen text."
rejected = "This is the rejected text."
test = {"chosen": chosen, "rejected": rejected}

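Then I tried:

import torch
from torch import nn

# Score each text separately; .logits holds the raw output of the reward model.
with torch.no_grad():
    rewards_chosen = model(**tokenizer(chosen, return_tensors="pt")).logits
    rewards_rejected = model(**tokenizer(rejected, return_tensors="pt")).logits

print("reward chosen is", rewards_chosen)
print("reward rejected is", rewards_rejected)

# Pairwise ranking loss: small when the chosen reward exceeds the rejected one.
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)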

Printing the loss wasn't helpful: I do not see any trend, even if I swap rewards_chosen and rewards_rejected in the formula.
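
For reference, here is what the formula computes when each reward is a single scalar. This is a minimal sketch in plain Python, not the poster's setup: the function names and the reward values are hypothetical, chosen only for illustration. The quantity sigmoid(r_chosen - r_rejected) can be read as the Bradley-Terry probability that the first text is preferred, and the loss is its negative log.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the first text is preferred: sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)) for one pair of scalar rewards."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Hypothetical scalar rewards, not real model output.
r_chosen, r_rejected = 1.2, 0.4

print(preference_probability(r_chosen, r_rejected))  # above 0.5: first text preferred
print(pairwise_loss(r_chosen, r_rejected))           # loss in the original order
print(pairwise_loss(r_rejected, r_chosen))           # loss with the arguments swapped
```

Under these assumptions, swapping the two arguments increases the loss by exactly r_chosen - r_rejected, so a trend should appear whenever the two rewards differ and are true scalars. A loss of log(2), about 0.693, means the model scores both texts equally.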

The outputs themselves did not yield any insight either. I do not understand how to interpret rewards_chosen and rewards_rejected. In some examples the chosen reward is bigger, and in others it is smaller (shouldn't it always be higher?).
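
One way to read the 67% figure, assuming the logged eval accuracy is the fraction of pairs in which the chosen text receives the higher reward: roughly one pair in three is expected to come out the other way, so individual examples where the rejected text scores higher are consistent with that number. The sketch below uses a hypothetical helper name and made-up reward values.

```python
def pairwise_accuracy(reward_pairs):
    """Fraction of (chosen, rejected) reward pairs where the chosen reward is higher."""
    wins = sum(1 for r_chosen, r_rejected in reward_pairs if r_chosen > r_rejected)
    return wins / len(reward_pairs)


# Hypothetical scalar rewards for three pairs, not real model output.
pairs = [(1.3, 0.2), (-0.4, 0.1), (0.9, 0.5)]

# The chosen text wins in the first and third pair, and loses in the second.
print(pairwise_accuracy(pairs))
```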

I tried rewards_chosen > rewards_rejected, but that is not helpful either, since it outputs tensor([[ True, False]]).
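
A two-element result such as tensor([[ True, False]]) means each logits tensor has shape (1, 2): the classification head emits two values per text, so the comparison runs element by element. That is what a sequence classification model produces when it has two labels, which is the Transformers default if num_labels is not set. Whether that applies here is an assumption, since the model-loading code is not shown. A reward model for pairwise comparison usually has a single output (num_labels=1), giving one scalar per text. The helper below is a sketch with a hypothetical name, scalar_reward, that makes the shape check explicit; model and tokenizer are assumed to be the objects used in the post.

```python
import torch


def scalar_reward(model, tokenizer, text: str) -> float:
    """Return a single float reward for `text`, or raise if the head is not scalar."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits
    # A scalar reward head gives shape (batch, 1); (batch, 2) means two labels.
    if logits.shape[-1] != 1:
        raise ValueError(
            f"expected one reward per text, got logits of shape {tuple(logits.shape)}"
        )
    return logits[0, 0].item()
```

With a scalar head, scalar_reward(model, tokenizer, chosen) > scalar_reward(model, tokenizer, rejected) is a plain Python bool, and the text with the larger value is the one the model prefers.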

How do we figure out what the output of the reward model means, and how do we know which string it prefers?

DarshanDeshpande · March 8, 2024, 9:20pm

Did you ever find the answer? For some reason, I am stuck with the 67% accuracy problem too (after extensive LoRA hyperparameter tuning), and maybe that is why the outputs show no pattern.
