Accidental Attention Anchoring? Repeated phrase in SFT dataset drastically improved context adherence

Hi everyone,

I am currently training an LLM using Supervised Fine-Tuning (SFT) to implement a Chain of Thought (CoT) / reasoning structure.

During my first training run, I did not verify every pipeline stage, so a meta-cognitive sentence that Google Gemini generated while writing the data-preparation script was accidentally injected into the training data. Crucially, this exact phrase was repeated in every single training example in the dataset, placed right between the user prompt and the final response.

Surprisingly, instead of breaking the model, this unexpected phrase seems to act like a powerful attention anchor. It forced the model to maintain context alignment and structure—a behavior that I have completely lost and cannot reproduce now that I have “cleaned” the pipeline.

Here is the exact structure of the anomaly so you can see what the model absorbed:

User Prompt: “What is bacterAE and how is it used??? and what is python?”

The Injected “Anchor” (Repeated in every single line of the dataset): “I must provide an exact, technical, and structured response based on the shrimp dataset data.”

Final Expected Output: “BacterAE is a bacterial supplement used to accelerate and stabilize the nitrogen cycle… [Technical explanation] … Python is a high-level programming language…”

What I am trying to do now:

I am trying to replicate this phenomenon intentionally. I am formatting the dataset using the following sequence:

[User Prompt] → [Anchoring Phrase] → [Final Answer].

My question for the community:

1. Am I doing something fundamentally wrong by chaining Prompt -> Anchor -> Response directly in the dataset?

Thanks in advance!

Hmm…?


Short answer

No — chaining:

<User Prompt> -> <Anchoring Phrase> -> <Final Answer>

is not fundamentally wrong.

But it is only safe if you are clear about one crucial detail:

Is the anchor part of the input/context, or is it part of the supervised target/output?

Those two cases are very different.

The clean version is:

<User Prompt + Anchor/Instruction> -> <Final Answer>

The risky version is:

<User Prompt> -> <Anchor + Final Answer>

In the first version, the anchor conditions the model.
In the second version, the model is trained to generate the anchor itself.

Your accidental sentence likely worked because it became an SFT-learned fixed control prefix: a repeated instruction-like cue that pushed the model into a precise, technical, structured, context-grounded answer mode.

I would not primarily call it “attention anchoring” yet. A safer term is:

SFT-learned fixed control prefix
or
template-induced behavior cue

That does not make the phenomenon uninteresting. It makes it easier to debug and reproduce.


What probably happened

Your accidental training sequence looked like this:

User: What is bacterAE and how is it used??? and what is python?

I must provide an exact, technical, and structured response based on the shrimp dataset data.

BacterAE is a bacterial supplement used to accelerate and stabilize the nitrogen cycle...
Python is a high-level programming language...

From the model’s point of view, this was not “noise.” It was a highly consistent token pattern:

messy user question -> fixed instruction bridge -> ideal structured answer

That fixed bridge did several useful things at once:

Function | What the anchor did
Instruction | It told the model to be exact, technical, structured, and dataset-grounded.
Delimiter | It marked the transition between the user prompt and the assistant answer.
Style control | It normalized the desired answer style across all rows.
Domain cue | It reminded the model to use the shrimp/aquarium dataset frame.
Generation cue | It taught the model that high-quality answers begin after this phrase.

This is very plausible under SFT. A causal language model learns token continuations, including repeated templates, role markers, delimiters, and formatting habits.

Hugging Face’s chat template documentation is useful background here: chat messages like system, user, and assistant are ultimately serialized into a token sequence. The model does not see your abstract dataset schema; it sees tokens.


Why the cleaned pipeline lost the behavior

Your first run probably trained the model on this distribution:

<User Prompt> -> <Anchor> -> <Answer>

Then the cleaned pipeline used something closer to:

<User Prompt> -> <Answer>

That is not just “removing noise.” It is a template distribution shift.

If the model learned that the anchor is the transition cue into “structured technical answer mode,” removing it can degrade behavior.

This is consistent with broader prompt-template findings. The paper Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates shows that fine-tuning and inference prompt templates can strongly affect behavior. The topic there is alignment preservation, but the engineering lesson applies here too:

Fine-tuned models are sensitive to the exact format they were trained and tested with.

So your cleaned model may not be worse because the dataset is cleaner. It may be worse because you removed a useful learned control cue.


Why I would not overclaim “attention anchoring”

There is real literature on “attention sinks,” especially StreamingLLM: Efficient Streaming Language Models with Attention Sinks. But your case is probably different.

Attention sink literature | Your case
Often about initial tokens | Your phrase is between prompt and answer.
Often about semantically unimportant sink tokens | Your phrase is semantically strong.
Mostly about long-context / KV-cache stability | Your effect is about SFT behavior and output structure.
Mechanistic attention claim | Your evidence is behavioral.

So attention may be involved internally, but the better first explanation is:

The repeated phrase became a learned control cue through SFT.

A cleaner name would be:

  • SFT-learned fixed control prefix
  • fixed instruction prelude
  • template-induced behavior cue
  • accidental system-prompt prelude
  • benign trigger-like conditioning

My preferred label is SFT-learned fixed control prefix.


Why your exact phrase was powerful

The anchor was not arbitrary:

I must provide an exact, technical, and structured response based on the shrimp dataset data.

It contained several strong signals:

Anchor fragment | Likely learned effect
“I must provide” | Compliance / obligation framing
“exact” | More specific, less casual answers
“technical” | Domain vocabulary and explanatory style
“structured” | Organized answer format
“based on the shrimp dataset data” | Dataset/domain grounding

That is basically a system prompt written in first person.

The closest benign research analogue is Task-Agnostic Prefix Prompt, which found that a fixed prefix prepended to every input can improve instruction-following. Your case differs because the phrase was inside the SFT data, not merely used at inference, but the principle is similar:

fixed phrase -> more stable instruction-following behavior

Another related example is Zero-Shot Chain-of-Thought, where a simple phrase like “Let’s think step by step” can shift model behavior. Your anchor is not the same as CoT prompting, but it is another example of a phrase acting as a behavior-mode cue.


The most important diagnostic: did the anchor receive loss?

This is the key question.

You need to inspect the final tokenized examples and labels.

Case A: anchor is context only

This is usually the clean version.

Prompt/context:
<User Prompt>
<Anchor>

Target/loss:
<Final Answer>

Meaning:

The model sees the anchor, but is not trained to generate it.

This makes the anchor behave like a system instruction or fixed prompt prefix.

Case B: anchor is target text

This is risky.

Prompt/context:
<User Prompt>

Target/loss:
<Anchor>
<Final Answer>

Meaning:

The model is trained to generate the anchor before the answer.

That can make the effect stronger, but also more brittle.

Possible failures:

  • the model starts every answer with the anchor
  • the model becomes dependent on the exact phrase
  • the model leaks the phrase into responses
  • the model overuses “shrimp dataset” even for unrelated questions
  • the model learns the template instead of the behavior
  • the model repeats prompt/assistant headers or continues into fake examples

Case C: full sequence receives loss

This is the riskiest.

<User Prompt>
<Anchor>
<Final Answer>

If every token receives loss, the model may learn to reproduce the entire training template, including prompt-like text.
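
To make the three cases concrete, here is a minimal sketch of how the label mask differs between them. It is not your pipeline: `toy_tokenize`, `build_labels`, and the segment names are hypothetical helpers, and a whitespace split stands in for a real tokenizer. The only real convention it relies on is that a label of -100 is ignored by the loss.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): one 'token' per whitespace-separated word."""
    return text.split()


def build_labels(segments, supervised):
    """Return (tokens, labels) for an ordered list of (name, text) segments.

    Tokens from segments named in `supervised` keep the token as their label;
    all other tokens get IGNORE_INDEX and therefore contribute no loss.
    """
    tokens, labels = [], []
    for name, text in segments:
        for tok in toy_tokenize(text):
            tokens.append(tok)
            labels.append(tok if name in supervised else IGNORE_INDEX)
    return tokens, labels


segments = [
    ("user", "What is bacterAE?"),
    ("anchor", "I must provide an exact response."),
    ("answer", "BacterAE is a bacterial supplement."),
]

cases = {
    "A (anchor as context)": {"answer"},
    "B (anchor as target)": {"anchor", "answer"},
    "C (full-sequence loss)": {"user", "anchor", "answer"},
}

for case_name, supervised in cases.items():
    tokens, labels = build_labels(segments, supervised)
    n_loss = sum(label != IGNORE_INDEX for label in labels)
    print(f"{case_name}: {n_loss}/{len(tokens)} tokens receive loss")
```

In a real run the same comparison should be made on the actual `input_ids` and `labels` produced by your collator, because the chat template and masking settings decide which case you are in.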


How this connects to TRL / SFTTrainer

The TRL SFTTrainer docs are directly relevant.

Important modes:

TRL option | Meaning
assistant_only_loss=True | Train only on assistant response tokens in conversational data.
completion_only_loss=True | Train only on completion tokens in prompt-completion data.

Important caveat: assistant-only loss depends on correct chat-template support and assistant-token masks. See the TRL docs and related discussion around generation markers such as {% generation %} / {% endgeneration %}.

The practical rule:

Do not trust the raw JSON. Inspect the final input_ids and labels.

Use a diagnostic like this:

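def inspect_token_labels(tokenizer, input_ids, labels, max_tokens=400):
    """Print each token with a flag showing whether it contributes to the loss."""
    for i, (token_id, label_id) in enumerate(zip(input_ids[:max_tokens], labels[:max_tokens])):
        token = tokenizer.decode([token_id])
        # Labels equal to -100 are ignored by the loss, so those tokens are context only.
        receives_loss = label_id != -100
        print(f"{i:04d} | loss={str(receives_loss):5s} | {repr(token)}")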

The clean result should look conceptually like this:

system/user/anchor tokens -> loss=False
assistant answer tokens   -> loss=True

If anchor tokens have loss=True, you are training the model to output the anchor.

That may reproduce the accident, but it is not the cleanest design.


Is Prompt -> Anchor -> Response wrong?

No, if you mean this:

prompt = <User Prompt> + <Anchor>
completion = <Final Answer>

Example prompt-completion shape:

{
  "prompt": "User: <user_prompt>\n\nInstruction: Provide a precise, technical, and structured response using the shrimp/aquarium dataset context when relevant.\n\nAssistant:",
  "completion": "<final_answer>"
}

That is reasonable.

But this is brittle:

{
  "prompt": "User: <user_prompt>\n\nAssistant:",
  "completion": "I must provide an exact, technical, and structured response based on the shrimp dataset data.\n\n<final_answer>"
}

because the anchor becomes part of the supervised output.


Better design: move the anchor into the system role

For chat/instruct models, I would convert the accidental sentence into a proper system message.

Recommended format:

{
  "messages": [
    {
      "role": "system",
      "content": "Provide precise, technical, and structured answers. Use the shrimp/aquarium dataset context when relevant. If the question is unrelated to that context, answer normally and do not invent dataset-specific claims."
    },
    {
      "role": "user",
      "content": "<user_prompt>"
    },
    {
      "role": "assistant",
      "content": "<final_answer>"
    }
  ]
}

Why this is better:

Content | Proper place
Behavioral policy | system
User question | user
Supervised answer | assistant

This separates three things that your accidental phrase fused together:

  1. behavior instruction
  2. answer-boundary cue
  3. answer content

The system message should handle behavior.
The chat template should handle role boundaries.
The assistant message should contain the supervised answer.
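
If your existing rows still carry the accidental sentence, a small converter can move them into this layout. The sketch below assumes a hypothetical legacy row with `user`, `anchor`, and `answer` fields; those field names and the `to_chat_record` helper are illustrative, not something taken from your pipeline.

```python
SYSTEM_INSTRUCTION = (
    "Provide precise, technical, and structured answers. "
    "Use the shrimp/aquarium dataset context when relevant. "
    "If the question is unrelated to that context, answer normally "
    "and do not invent dataset-specific claims."
)


def to_chat_record(row, system_instruction=SYSTEM_INSTRUCTION):
    """Convert a legacy row (hypothetical schema) into a chat-format record.

    The legacy "anchor" field is intentionally dropped: its behavioral content
    now lives in the system message instead of between prompt and answer.
    """
    return {
        "messages": [
            {"role": "system", "content": system_instruction},
            {"role": "user", "content": row["user"].strip()},
            {"role": "assistant", "content": row["answer"].strip()},
        ]
    }


legacy_row = {
    "user": "What is bacterAE and how is it used??? and what is python?",
    "anchor": "I must provide an exact, technical, and structured response "
              "based on the shrimp dataset data.",
    "answer": "BacterAE is a bacterial supplement... "
              "Python is a high-level programming language...",
}

record = to_chat_record(legacy_row)
for message in record["messages"]:
    print(f'{message["role"]:>9}: {message["content"][:60]}')
```

Dropping the anchor here is a design choice for the clean run; for the ablation runs that keep the original phrase you would build those variants separately.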

See Hugging Face chat templates for why this matters.


Rewrite the anchor

I would not keep the original phrase exactly.

Original:

I must provide an exact, technical, and structured response based on the shrimp dataset data.

Better:

Provide a precise, technical, and structured answer. Use the provided shrimp/aquarium dataset context when relevant. If the question is unrelated to that context, answer normally. Do not invent dataset-specific claims.

Why this is better:

Problem in original | Fix
First-person meta-cognitive wording | Use direct system instruction.
Over-anchors to “shrimp dataset” | Add “when relevant.”
May leak into outputs | Keep it in system/context, masked from loss.
Exact phrase repeated everywhere | Use paraphrases.
Handles domain questions but not general questions | Explicitly allow normal answers for unrelated questions.

Your example is mixed-domain:

What is bacterAE and how is it used??? and what is python?

The model should route the two parts differently:

Subquestion | Desired behavior
“What is BacterAE?” | Use shrimp/aquarium/domain context.
“What is Python?” | Give a normal general programming-language answer.

If the anchor says only “based on the shrimp dataset data,” the model may force even Python into the dataset frame. That is not what you want.


Use paraphrased anchors/system prompts

Do not use one exact sentence in every row forever.

Use a small set of equivalent system messages:

Provide a precise, technical, and structured answer. Use the dataset context when relevant.
Answer clearly and technically. Ground domain-specific claims in the provided context.
Use the shrimp/aquarium context for relevant questions. For unrelated questions, answer normally.
Give a structured response. If the dataset does not support a claim, say so clearly.

This teaches the model the behavior, not only the exact trigger string.

If only the original phrase works, you trained a trigger.
If paraphrases work too, you trained the semantic behavior.
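
One simple way to do this is to sample a paraphrase per row with a fixed seed, so the dataset stays reproducible. This is only a sketch: the `assign_system_prompts` helper and the `user` / `answer` row fields are hypothetical, and uniform sampling is an assumption, not a recommendation backed by measurements.

```python
import random

SYSTEM_PARAPHRASES = [
    "Provide a precise, technical, and structured answer. Use the dataset context when relevant.",
    "Answer clearly and technically. Ground domain-specific claims in the provided context.",
    "Use the shrimp/aquarium context for relevant questions. For unrelated questions, answer normally.",
    "Give a structured response. If the dataset does not support a claim, say so clearly.",
]


def assign_system_prompts(rows, paraphrases=SYSTEM_PARAPHRASES, seed=0):
    """Build chat records, picking one system paraphrase per row.

    A seeded random.Random keeps the assignment reproducible across runs.
    """
    rng = random.Random(seed)
    records = []
    for row in rows:
        records.append({
            "messages": [
                {"role": "system", "content": rng.choice(paraphrases)},
                {"role": "user", "content": row["user"]},
                {"role": "assistant", "content": row["answer"]},
            ]
        })
    return records


rows = [
    {"user": "What is BacterAE used for?", "answer": "<final_answer_1>"},
    {"user": "What is Python?", "answer": "<final_answer_2>"},
    {"user": "bacterAE??? python???", "answer": "<final_answer_3>"},
]

for record in assign_system_prompts(rows):
    print(record["messages"][0]["content"])
```

Holding one paraphrase out of training and using it only at evaluation time gives a direct test of whether the model learned the meaning or the string.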


CoT-specific note

Your anchor is not actually Chain-of-Thought by itself.

It says:

exact + technical + structured + dataset-grounded

It does not necessarily say:

reason step by step

So the anchor likely improved:

  • structure
  • answer organization
  • context adherence
  • technical tone
  • domain grounding

It may not have improved true reasoning unless your target answers themselves showed the reasoning structure you wanted.

For CoT-like behavior, examples matter. The model learns the answer distribution.

If you want visible reasoning, the target answers need consistent reasoning traces.
If you want hidden/internal reasoning discipline but concise final answers, train structured final answers without necessarily exposing long reasoning traces.


Recommended clean training formats

Option A: chat-format SFT

Use this for chat/instruct models.

{
  "messages": [
    {
      "role": "system",
      "content": "Provide precise, technical, and structured answers. Use the provided dataset context when relevant. If the question is unrelated to the dataset, answer normally and avoid unsupported dataset-specific claims."
    },
    {
      "role": "user",
      "content": "What is bacterAE and how is it used??? and what is python?"
    },
    {
      "role": "assistant",
      "content": "BacterAE is a bacterial supplement used to support beneficial microbial activity in aquarium systems. It is commonly used to help accelerate and stabilize the nitrogen cycle by encouraging the establishment of bacteria that process waste compounds.\n\nPython is a high-level programming language used for scripting, automation, web development, data analysis, machine learning, and general-purpose software development."
    }
  ]
}

Training intention:

system/user tokens -> no loss
assistant answer   -> loss

Possible TRL config:

from trl import SFTConfig

training_args = SFTConfig(
    assistant_only_loss=True,
)

But verify that the chat template supports assistant masking. See TRL SFTTrainer.


Option B: prompt-completion SFT

Use this if your pipeline is simpler and explicitly separates prompt and completion.

{
  "prompt": "Instruction: Provide a precise, technical, and structured answer. Use the shrimp/aquarium dataset context when relevant. If the question is unrelated, answer normally.\n\nUser: What is bacterAE and how is it used??? and what is python?\n\nAssistant:",
  "completion": "BacterAE is a bacterial supplement used to support beneficial microbial activity in aquarium systems...\n\nPython is a high-level programming language..."
}

Training intention:

prompt     -> no loss
completion -> loss

Possible TRL config:

from trl import SFTConfig

training_args = SFTConfig(
    completion_only_loss=True,
)

The key rule:

Anchor/instruction belongs in the prompt.
Answer belongs in the completion.

Minimal ablation plan

Do not only compare:

anchor vs no anchor

That will not tell you why the effect happened.

Run this matrix:

Run | Training format | Loss target | What it tests
A | User -> Assistant | Assistant only | Clean baseline
B | System instruction + User -> Assistant | Assistant only | Clean version of the anchor
C | User + original anchor -> Assistant | Assistant only | Anchor as input context
D | User -> original anchor + Assistant | Anchor + assistant | Reproduces possible accidental supervised-anchor behavior
E | User + neutral delimiter -> Assistant | Assistant only | Boundary/delimiter effect
F | User + random marker -> Assistant | Assistant only | Pure repeated-token cue
G | System instruction paraphrases + User -> Assistant | Assistant only | Semantic generalization

Then evaluate each model with:

Inference condition | Purpose
No anchor | Does it work without the cue?
Original anchor | Does the exact phrase reproduce the effect?
Paraphrased anchor | Did it learn meaning or exact text?
Neutral delimiter | Is a separator enough?
Random delimiter | Is this just repeated-token conditioning?
Wrong-domain anchor | Does it blindly obey domain framing?
Proper system instruction | Best production condition

Interpretation:

Result | Likely meaning
Original works, paraphrases fail | Exact trigger dependence
Original and paraphrases work | Semantic instruction effect
Neutral delimiter helps | Boundary-marker effect
Random marker helps | Repeated-token cue effect
System-message version works | It was basically a misplaced system prompt
Supervised-anchor version works best | The original bug may have trained the anchor as target text
No-anchor condition collapses | Strong template dependence
Wrong-domain anchor causes bad framing | Domain over-conditioning
Model repeats the anchor | Anchor was likely target-side or insufficiently masked
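
For bookkeeping, the run-by-condition grid and part of the interpretation logic can be sketched in a few lines. Everything here is illustrative: the condition names, the `build_eval_grid` and `interpret` helpers, and the idea of reducing each condition to a single pass/fail flag are assumptions you would replace with your own scoring.

```python
from itertools import product

TRAINING_RUNS = ["A", "B", "C", "D", "E", "F", "G"]
INFERENCE_CONDITIONS = [
    "no_anchor",
    "original_anchor",
    "paraphrased_anchor",
    "neutral_delimiter",
    "random_delimiter",
    "wrong_domain_anchor",
    "system_instruction",
]


def build_eval_grid(runs, conditions):
    """One empty result slot per (training run, inference condition) pair."""
    return [
        {"run": run, "condition": condition, "score": None}
        for run, condition in product(runs, conditions)
    ]


def interpret(passed):
    """Map pass/fail flags for one model to some of the likely meanings above."""
    findings = []
    if passed["original_anchor"] and not passed["paraphrased_anchor"]:
        findings.append("exact trigger dependence")
    if passed["original_anchor"] and passed["paraphrased_anchor"]:
        findings.append("semantic instruction effect")
    if passed["neutral_delimiter"]:
        findings.append("boundary-marker effect")
    if passed["random_delimiter"]:
        findings.append("repeated-token cue effect")
    if not passed["no_anchor"]:
        findings.append("strong template dependence")
    return findings


grid = build_eval_grid(TRAINING_RUNS, INFERENCE_CONDITIONS)
print(f"{len(grid)} evaluation cells to fill")

example = {
    "no_anchor": False,
    "original_anchor": True,
    "paraphrased_anchor": False,
    "neutral_delimiter": False,
    "random_delimiter": False,
}
print(interpret(example))
```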

Evaluation set

Create a held-out evaluation set with separate buckets:

Bucket | Example | What it tests
In-domain supported | “What is BacterAE used for?” | Domain grounding
In-domain unsupported | “What exact universal dose is best?” | Refusal to invent
General-only | “What is Python?” | Avoiding domain over-conditioning
Mixed-domain | “What is BacterAE and what is Python?” | Routing
Messy prompt | “bacterAE??? python???” | Robustness
Multi-part prompt | “Define it, explain use, list cautions” | Completeness
Anchor paraphrase | Equivalent system instruction | Semantic generalization
No anchor | No instruction cue | Cue dependence
Wrong-domain anchor | “based on legal dataset” | Blind cue obedience
Neutral delimiter | “Now answer.” | Delimiter effect

Score these separately:

Metric | What to check
Structure | Is the answer organized?
Technical accuracy | Are definitions correct?
Context adherence | Does it use relevant context correctly?
Hallucination control | Does it avoid unsupported claims?
Routing | Does it separate domain-specific and general subquestions?
Completeness | Does it answer every subquestion?
Anchor leakage | Does it repeat the anchor?
Paraphrase robustness | Does it work with equivalent instructions?
Exact-trigger dependence | Does it require the exact phrase?

Do not judge only by “it looks better.” A structured hallucination can look very convincing.
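
Most of these metrics need human or model-based grading, but anchor leakage is easy to measure automatically. A minimal sketch, assuming you have a list of generated outputs as plain strings; `normalize`, `leaks_anchor`, and `anchor_leakage_rate` are hypothetical helper names, and the punctuation-insensitive match is a deliberate simplification.

```python
import re

ANCHOR = (
    "I must provide an exact, technical, and structured response "
    "based on the shrimp dataset data."
)


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variations still match."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def leaks_anchor(output, anchor=ANCHOR):
    """True if the normalized anchor appears as a whole-word span in the output."""
    return f" {normalize(anchor)} " in f" {normalize(output)} "


def anchor_leakage_rate(outputs, anchor=ANCHOR):
    """Fraction of outputs that reproduce the anchor (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(leaks_anchor(o, anchor) for o in outputs) / len(outputs)


outputs = [
    "BacterAE is a bacterial supplement used in shrimp tanks.",
    "I must provide an exact, technical, and structured response based on "
    "the shrimp dataset data.\n\nBacterAE is a bacterial supplement.",
    "Python is a high-level programming language.",
    "The available context does not establish one universal dose.",
]

print(f"anchor leakage rate: {anchor_leakage_rate(outputs):.2f}")
```

An exact-span check like this misses partial or reworded leaks, so treat a low rate as a lower bound on leakage.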


Pipeline checks

Before trusting any result, inspect these.

1. Final rendered text

Check the text after:

  • formatting
  • chat-template application
  • tokenization
  • packing
  • truncation
  • BOS/EOS insertion

You want to know whether the final rendered example is actually:

<|system|>
...
<|user|>
...
<|assistant|>
...

or:

User prompt
Anchor
Answer

or accidentally:

<|user|>
User prompt
Anchor
Answer

The last case is bad because the answer may be inside the user span.

2. Labels

Print token labels. Confirm which tokens receive loss.

Expected clean behavior:

system instruction tokens -> loss=False
user prompt tokens        -> loss=False
anchor tokens             -> loss=False, if used as context
assistant answer tokens   -> loss=True

3. EOS behavior

Make sure the model learns to stop. If EOS handling is inconsistent, the model may continue into fake user turns, repeat the anchor, or generate additional examples.

4. Truncation

If long examples are truncated, the anchor may remain but the answer may be cut off. That can teach bad continuation behavior.

5. Packing

When debugging this phenomenon, disable packing at first. Packing can hide cross-example boundary problems.
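
Checks 2 to 4 can be partly automated on a single unpacked example. The sketch below works on plain integer lists, so it makes no assumption about your trainer; `find_subsequence` and `check_example` are hypothetical helpers, and the token ids in the demo are made up. One caveat: the anchor's token ids inside a full sequence can differ from the ids you get by tokenizing the anchor alone (for example because of a leading space), so take `anchor_ids` from a rendered example if the search finds nothing.

```python
IGNORE_INDEX = -100  # label value ignored by the loss


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    if not pattern:
        return -1
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def check_example(input_ids, labels, anchor_ids, eos_id):
    """Return a list of human-readable problems found in one tokenized example."""
    problems = []
    if len(input_ids) != len(labels):
        return ["input_ids and labels differ in length"]

    supervised = [i for i, label in enumerate(labels) if label != IGNORE_INDEX]
    if not supervised:
        problems.append("no token receives loss")
    elif input_ids[supervised[-1]] != eos_id:
        problems.append("last supervised token is not EOS")

    start = find_subsequence(input_ids, anchor_ids)
    if start != -1:
        span = range(start, start + len(anchor_ids))
        if any(labels[i] != IGNORE_INDEX for i in span):
            problems.append("anchor tokens receive loss")
    return problems


# Made-up ids: 1=BOS, 10-11=user, 20-22=anchor, 30-31=answer, 2=EOS.
input_ids = [1, 10, 11, 20, 21, 22, 30, 31, 2]
anchor_ids = [20, 21, 22]
clean_labels = [IGNORE_INDEX] * 6 + [30, 31, 2]
leaky_labels = [IGNORE_INDEX] * 3 + [20, 21, 22, 30, 31, 2]

print(check_example(input_ids, clean_labels, anchor_ids, eos_id=2))
print(check_example(input_ids, leaky_labels, anchor_ids, eos_id=2))
print(check_example(input_ids[:-1], clean_labels[:-1], anchor_ids, eos_id=2))
```

The third call simulates truncation: the answer loses its EOS, which is the situation that tends to teach the model to keep generating.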


Common mistakes

Mistake 1: thinking “cleaner” always means better

Removing the anchor made the dataset cleaner, but also removed a useful behavioral cue.

The right move is not to preserve the bug.
The right move is to convert the useful part into a clean system instruction.

Mistake 2: training the anchor as output by accident

If the anchor is in the completion, the model may learn to output it.

Mistake 3: using one exact anchor everywhere

One exact repeated phrase can create exact-trigger dependence.

Mistake 4: overusing “based on the shrimp dataset”

Use “when relevant.” Otherwise the model may force unrelated questions into the dataset frame.

Mistake 5: confusing structure with reasoning

A structured answer is not necessarily a reasoned answer. If you want reasoning behavior, your answer targets need to consistently demonstrate the reasoning structure.

Mistake 6: changing chat templates between training and inference

Wrong or mismatched chat templates can degrade results sharply. Use the model’s intended template and keep train/eval/inference consistent unless you are deliberately testing template dependence.


My recommended next run

Use this production-style format:

{
  "messages": [
    {
      "role": "system",
      "content": "Provide precise, technical, and structured answers. Use the shrimp/aquarium dataset context when relevant. If the question is unrelated to that context, answer normally. Do not invent dataset-specific claims."
    },
    {
      "role": "user",
      "content": "<user_prompt>"
    },
    {
      "role": "assistant",
      "content": "<final_answer>"
    }
  ]
}

Then verify:

system/user tokens -> loss=False
assistant answer   -> loss=True

Also add mixed-domain and “not enough information” examples:

Question:
What is BacterAE, and what is Python?

Expected behavior:
Answer BacterAE using the domain context.
Answer Python as a normal programming-language question.
Do not force Python into the shrimp dataset frame.

Question:
What exact dose of BacterAE is optimal for every tank?

Expected behavior:
Say that the available context does not establish one universal dose.
Explain that dosing depends on tank size, stocking, filtration, and product instructions.

Bottom line

Your observation is real and plausible.

The repeated phrase probably worked because it was:

  • semantically aligned with the desired output
  • repeated in every example
  • placed immediately before the answer
  • acting as a delimiter
  • acting like a system prompt
  • possibly included in the loss, depending on your pipeline

So:

You are not fundamentally wrong to test Prompt -> Anchor -> Response.

But the production version should usually be:

<System instruction> + <User prompt> -> <Assistant answer>

not:

<User prompt> -> <accidental meta-cognitive sentence> -> <Assistant answer>

The most important practical check is:

Did the anchor tokens have label=-100?

If yes, the anchor was conditioning context.
If no, the model was trained to generate the anchor.

That distinction is the difference between a clean control prefix and a brittle template artifact.

