Trying to recreate `model.greedy_search()` for custom decoding of LLM output, but I am getting a different decoded output

DenseLance · January 28, 2024, 11:17am

I have recreated model.greedy_search() in 2 different ways, with the main difference being the size of input_ids.

Model Initialization

import torch
import transformers

# USER CONFIGURATIONS
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "EleutherAI/gpt-neo-1.3B"

bnb_config = transformers.BitsAndBytesConfig(
   load_in_4bit = True,
   bnb_4bit_quant_type = "nf4",
   bnb_4bit_use_double_quant = True,
   bnb_4bit_compute_dtype = torch.bfloat16
)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, cache_dir = "./Models/")
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, quantization_config = bnb_config, cache_dir = "./Models/")

Generation Configurations

# GENERATION INPUTS
num_gen_tokens = 20
prompt = "ewrcewkr oewrkcl ewrkewr\n"
input_ids = tokenizer(prompt, return_tensors = "pt").to(device).input_ids.squeeze() # batched = False

# CONFIRM PRE-TRAINED CONFIGURATIONS
model.generation_config.pad_token_id = model.generation_config.eos_token_id
assert tokenizer.bos_token_id == model.generation_config.bos_token_id
assert tokenizer.eos_token_id == model.generation_config.eos_token_id
assert not ((model.generation_config.eos_token_id is None) ^ (model.generation_config.pad_token_id is None))

Variation 1: My code, where input_ids.shape = Size(num_input_tokens)

output_ids = input_ids.clone().detach()
model.eval()
with torch.no_grad():
    for _ in range(num_gen_tokens):
        if model.generation_config.eos_token_id is None or output_ids[-1] != model.generation_config.eos_token_id:
            outputs = model(output_ids)
            next_token_logits = outputs.logits[-1] # only consider the logits output based on last token of input
            next_tokens = next_token_logits.argmax(dim = -1).unsqueeze(dim = -1)
            output_ids = torch.cat((output_ids, next_tokens), dim = -1)

print(tokenizer.decode(output_ids))

Variation 2: My code, where input_ids.shape = Size(1, num_input_tokens)

output_ids = input_ids.clone().detach().unsqueeze(dim = 0)
model.eval()
with torch.no_grad():
    for _ in range(num_gen_tokens):
        if model.generation_config.eos_token_id is None or output_ids[:, -1] != model.generation_config.eos_token_id:
            outputs = model(output_ids)
            next_token_logits = outputs.logits[:, -1] # only consider the logits output based on last token of input
            next_tokens = next_token_logits.argmax(dim = -1).unsqueeze(dim = -1)
            output_ids = torch.cat((output_ids, next_tokens), dim = -1)

print(tokenizer.decode(output_ids.squeeze()))

Variation 3: HuggingFace API, where input_ids.shape = Size(1, num_input_tokens)

output = model.greedy_search(input_ids.clone().detach().unsqueeze(dim = 0), stopping_criteria = transformers.StoppingCriteriaList([transformers.MaxLengthCriteria(max_length = num_gen_tokens + input_ids.size(dim = -1))])).squeeze() # same token budget as Variations 1 and 2
print(tokenizer.decode(output))

In most cases, the generated tokens should be the same across all 3 methods. However, I found 2 cases that seem to violate this (only the prompt was changed; all other variables stayed the same):

Case 1: prompt = "ewrcewkr oewrkcl ewrkewr\n", Variation 2 seems to be the odd one out

Variation 1’s output

ewrcewkr oewrkcl ewrkewr

I am a very simple person. I love to read, watch movies, and play video games

Variation 2’s output

ewrcewkr oewrkcl ewrkewr

I am a very simple person. I am very easy going and I like to be around people

Variation 3’s output

ewrcewkr oewrkcl ewrkewr

I am a very simple person. I love to read, watch movies, and play video games

Case 2: prompt = tokenizer.bos_token + "ewrcewkr oewrkcl ewrkewr\n", Variation 3 seems to be the odd one out

Variation 1’s output

<|endoftext|>ewrcewkr oewrkcl ewrkewr

wewrcewkr oewrkcl ewrkewr

(a

Variation 2’s output

<|endoftext|>ewrcewkr oewrkcl ewrkewr

wewrcewkr oewrkcl ewrkewr

(a

Variation 3’s output

<|endoftext|>ewrcewkr oewrkcl ewrkewr

The following is a list of the most common words in the English language.

The most

I have 3 questions regarding the difference in outputs (as seen above):

1. What is the expected shape of the input to model.forward()? Is it Size(1, num_input_tokens) or Size(num_input_tokens)? If the input has Size(1, num_input_tokens), outputs.logits has shape torch.Size([1, num_output_tokens, num_tokenizer_tokens]). If the input has Size(num_input_tokens), outputs.logits has shape torch.Size([num_output_tokens ** 2, num_tokenizer_tokens]).
2. Does my code correctly model how the LLM decodes the output via greedy search?
3. What is causing the difference in decoded output across the 3 methods?
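
To make the indexing in question 1 concrete, here is a small sketch on a dummy tensor. No model is involved and the sizes are made up; it only shows how the Variation 1 index (`[-1]`) and the Variation 2 index (`[:, -1]`) behave when the logits carry a batch dimension.

```python
import torch

# Made-up sizes, for illustration only.
batch_size, num_tokens, vocab_size = 1, 5, 11
logits = torch.randn(batch_size, num_tokens, vocab_size)  # batched layout

# Variation 2 style: last position of every sequence in the batch.
last_batched = logits[:, -1]

# Variation 1 style applied to a batched tensor: this selects the last
# *sequence* in the batch, so all positions are still present.
last_unbatched_style = logits[-1]

print(last_batched.shape)          # torch.Size([1, 11])
print(last_unbatched_style.shape)  # torch.Size([5, 11])
```

So the two index expressions are only interchangeable if the tensor layout matches the one each variation assumes.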

Thank you in advance.

DenseLance · January 28, 2024, 11:28am

By the way, I have also recreated model.beam_search(), and found no issues yet after testing for a while. Seems like the problem is only with model.greedy_search(), despite the algorithm being deterministic.
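
A side note on determinism: greedy search is deterministic for fixed logits, but the logits themselves can differ slightly between machines, dtypes, or input shapes. The sketch below is not a diagnosis of the cases above, only an illustration (with made-up logit values) of how a near-tie that is resolved in float32 can vanish after rounding to bfloat16, the compute dtype used in the quantization config.

```python
import torch

# Two made-up logits that differ by 0.001.
logits_fp32 = torch.tensor([10.0, 10.001])
choice_fp32 = logits_fp32.argmax().item()  # index 1 is strictly larger in float32

# bfloat16 keeps about 8 significant bits, so both values round to 10.0.
rounded = logits_fp32.to(torch.bfloat16).float()
gap_after_rounding = (rounded[1] - rounded[0]).item()

print(choice_fp32, gap_after_rounding)
```

Once the gap is zero, which token wins depends on tie-breaking, and a single different token changes everything generated after it.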

DenseLance · February 8, 2024, 3:44pm

I figured out the answers to my own questions, so I’ll share my own findings here.

Q1 and Q2: The expected input shape is Size(batch_size, num_input_tokens), which means the input is always batched, even for a single sequence. Size(num_input_tokens) does not work for other models such as Llama2 because of how model.forward() processes the data under the hood. Variation 1 is therefore not recommended; Variation 2 is the way to go.
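
For anyone who wants to try the batched loop without downloading a model, here is a minimal sketch. `ToyLM` and `greedy_decode` are hypothetical names I made up for illustration; `ToyLM` is just an embedding plus a linear head that returns an object with a `.logits` attribute, not a real language model. The loop uses `view(-1, 1)` to force the (batch, 1) shape and stops once every sequence in the batch has just produced EOS, which is a simplification of what the library does.

```python
import torch
from types import SimpleNamespace


class ToyLM(torch.nn.Module):
    """Hypothetical stand-in for a causal LM; only the output shape matters."""

    def __init__(self, vocab_size=16, hidden=8):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, input_ids):
        # (batch, seq) -> logits of shape (batch, seq, vocab)
        return SimpleNamespace(logits=self.head(self.embed(input_ids)))


def greedy_decode(model, input_ids, num_gen_tokens, eos_token_id=None):
    # Always work on batched input, as in Variation 2.
    assert input_ids.dim() == 2, "expected shape (batch_size, num_input_tokens)"
    output_ids = input_ids.clone()
    model.eval()
    with torch.no_grad():
        for _ in range(num_gen_tokens):
            next_token_logits = model(output_ids).logits[:, -1]          # (batch, vocab)
            next_tokens = next_token_logits.argmax(dim=-1).view(-1, 1)   # (batch, 1)
            output_ids = torch.cat((output_ids, next_tokens), dim=-1)
            # Simplified stop rule: every sequence just emitted EOS.
            if eos_token_id is not None and (next_tokens == eos_token_id).all():
                break
    return output_ids


torch.manual_seed(0)
toy = ToyLM()
prompt_ids = torch.tensor([[1, 2, 3]])  # shape (1, 3)
generated = greedy_decode(toy, prompt_ids, num_gen_tokens=4)
print(generated.shape)  # torch.Size([1, 7])
```

Swapping `toy` for a real causal LM should work the same way as long as its forward pass accepts (batch, seq) token IDs and returns `.logits`.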

Q3: I tried to replicate the issue on Colab (I had originally used my local machine’s GPU) and could not: there, case 1’s variation 2 produces the same output as variation 3, and case 2’s variation 3 produces the same output as variation 2. The code snippets for variations 2 and 3 that I shared above are correct as they are. However, you might want to replace unsqueeze(dim = -1) with view(-1, 1), because during one of my trials one of the tensors had its last dimension expanded for some reason (which was wrong!).
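
A quick sketch of why view(-1, 1) is the more defensive choice. The `unexpected` tensor below is my assumption about what a stray extra dimension might look like; I did not capture the actual tensor from that trial.

```python
import torch

expected = torch.tensor([7, 3])        # shape (batch,), what argmax normally returns
unexpected = torch.tensor([[7], [3]])  # shape (batch, 1), assumed stray extra dimension

# unsqueeze always adds a dimension, so the stray one is carried along.
unsqueezed_ok = expected.unsqueeze(dim=-1)
unsqueezed_bad = unexpected.unsqueeze(dim=-1)

# view(-1, 1) normalizes both inputs to (batch, 1).
viewed_ok = expected.view(-1, 1)
viewed_bad = unexpected.view(-1, 1)

print(unsqueezed_ok.shape, unsqueezed_bad.shape)  # torch.Size([2, 1]) torch.Size([2, 1, 1])
print(viewed_ok.shape, viewed_bad.shape)          # torch.Size([2, 1]) torch.Size([2, 1])
```

With the (batch, 1, 1) result, the torch.cat onto a 2-D output_ids would fail, whereas the view(-1, 1) result concatenates cleanly in both cases.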
