PTQ INT8 via TFLiteConverter — encoder-decoder seq2seq model loses encoder context entirely after conversion

I’m trying to deploy a seq2seq encoder-decoder model on an embedded target that only accepts INT8 TFLite models. The conversion via `TFLiteConverter` completes without errors, but the resulting model is completely broken at inference, which suggests the converter is not handling the encoder-decoder architecture correctly under full INT8 quantization.

Environment

  • tensorflow 2.13, transformers 4.40
  • macOS (conversion) → embedded Linux with INT8 hardware delegate (inference)

Problem

Converting a fused encoder-decoder seq2seq model to INT8 using TFLiteConverter with the following setup:

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_output_type = tf.float32

Conversion completes without errors, but the model generates repeated tokens for any input (BLEU drops from 23.9 to 0.04). The decoder stops using encoder context entirely from the first inference step.

EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8 is not viable — TILE op unsupported at runtime.

Question

Is this a known limitation of TFLiteConverter PTQ for encoder-decoder architectures? Is there a recommended calibration strategy or converter configuration for fused encoder-decoder graphs with cross-attention?

Open to any working approach to move forward.

Reproducible notebook available on request.

For now, it seems to be a known complicated failure mode:


Answer

Yes — this is a known failure class, but I would phrase it carefully.

I would not describe it as:

TFLiteConverter officially does not support encoder-decoder seq2seq PTQ.

That is too broad.

A more accurate statement is:

Full INT8 post-training quantization with TFLiteConverter is not a robust, well-documented deployment path for a fused autoregressive encoder-decoder Transformer graph. Conversion success only proves that the graph was lowered to a TFLite flatbuffer; it does not prove that encoder-decoder conditioning survived quantization.

In this case, the symptoms are much stronger than ordinary quantization degradation:

  • BLEU drops from 23.9 to 0.04.
  • The model emits repeated tokens for essentially any input.
  • The decoder appears to ignore the encoder from the first decoding step.
  • The INT16 activations / INT8 weights path is not deployable because the target runtime rejects TILE.

That combination strongly suggests that full INT8 PTQ has damaged the encoder-memory / decoder cross-attention path. The converted model is structurally valid, but semantically broken.


Why conversion success is misleading

TFLite conversion answers a graph-lowering question:

Can this TensorFlow graph be represented as a TFLite model using the requested operator set?

It does not answer the more important deployment question:

Does the quantized model preserve the numerical behavior required for autoregressive seq2seq generation?

The LiteRT/TFLite full-integer quantization path uses a representative dataset to estimate ranges for variable tensors such as model inputs, outputs, and intermediate activations.

For image classifiers, “representative data” often means representative images. For seq2seq generation, that is not enough. The representative dataset must exercise the real generation states:

  • source token lengths,
  • source attention masks,
  • decoder prefix lengths,
  • decoder masks,
  • forced BOS / language tokens,
  • early decoding,
  • middle decoding,
  • near-EOS decoding,
  • cross-attention activation ranges,
  • final-logit ranges.

If the converter only calibrates a narrow graph path, it can choose bad INT8 scales for tensors that are critical during real decoding.

That is how you can get:

conversion succeeds
+
runtime does not crash
+
outputs are completely wrong

Why this looks like cross-attention failure

An encoder-decoder Transformer has two main parts:

source tokens
→ encoder
→ encoder hidden states
→ decoder cross-attention
→ decoder hidden states
→ logits
→ output tokens

The decoder uses the source input mainly through cross-attention. If that path is corrupted, the decoder still has:

  • target-side token embeddings,
  • decoder self-attention,
  • learned language-model priors,
  • LM-head bias,
  • BOS / forced-token priors,
  • common-token frequency bias.

So the model can still generate tokens. But the output becomes weakly conditioned or unconditioned. Typical symptoms are:

  • same-ish output for different inputs,
  • repeated tokens,
  • generic high-priority tokens,
  • early collapse,
  • near-zero BLEU,
  • first-step logits that barely change across source inputs.

That matches the described behavior.

The most suspicious region is:

encoder_hidden_states
→ cross-attention key/value projections
→ attention score/value path
→ cross-attention output projection
→ residual / LayerNorm-adjacent tensors
→ LM-head logits

The first decoding step is especially diagnostic. At step 1, the decoder has almost no target-side history. If the INT8 model is already source-insensitive at step 1, the problem is probably not beam search, repetition penalty, EOS handling, or long-run generation logic. It is likely the encoder-memory path or the first decoder cross-attention block.


Is this a known limitation?

In practical terms, yes.

The exact sentence “TFLiteConverter PTQ does not support encoder-decoder seq2seq” is not the usual official wording. But the documented pieces line up:

  • TFLite full-integer PTQ depends on representative activation calibration: LiteRT post-training integer quantization.
  • TFLite provides a Quantization Debugger specifically because full-integer quantization can produce unexpectedly poor or completely wrong results.
  • Hugging Face’s Optimum TFLite exporter overview lists mostly encoder-style architectures such as BERT, RoBERTa, DistilBERT, MobileBERT, MPNet, and related models. It does not present full autoregressive encoder-decoder generation as the obvious happy path.
  • Optimum’s TFLite export guide notes that static input shapes need to be specified.
  • Hugging Face’s Optimum ONNX export docs describe encoder-decoder export using separate encoder and decoder pieces, because the encoder runs once and the decoder runs repeatedly during autoregressive generation.
  • ONNX Runtime’s quantization guide says dynamic quantization is generally recommended for RNNs and Transformer-based models, while static quantization is generally recommended for CNNs.

That last point is especially relevant. Your hardware requires a static full-INT8-style artifact, but Transformer generation is one of the model families where static activation calibration is most fragile.

So the practical answer is:

This is a known class of PTQ failure: a valid full-INT8 TFLite model can be generated, but the quantized activations can destroy the conditioning path that makes encoder-decoder generation work.


Why Transformers are hard for generic INT8 PTQ

Transformer quantization is difficult mainly because the activations are difficult.

The literature around Transformer quantization, including SmoothQuant, LLM.int8, and I-BERT, repeatedly points to activation outliers and attention/LayerNorm sensitivity.

Plain TFLiteConverter PTQ is much more generic than these methods. It does not automatically perform SmoothQuant-style activation smoothing, LLM.int8-style outlier routing, or I-BERT-style integer Transformer operator redesign.

That matters because a fused encoder-decoder generation graph contains exactly the fragile pieces:

MatMul / BatchMatMul
Softmax
LayerNorm-adjacent tensors
residual additions
attention masks
cross-attention K/V projections
final vocabulary projection

A single bad scale around cross-attention can make the decoder appear source-blind.
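To make that concrete, here is a minimal NumPy sketch of per-tensor affine INT8 quantization. The tensor values and the outlier magnitude are made up for illustration, and the plain min/max range used here is a simplification rather than the converter's exact calibration algorithm. It only shows how one inflated range can collapse an entire tensor to a single INT8 code.

```python
import numpy as np

def fake_quantize(x, calib_min, calib_max):
    """Simulate per-tensor affine INT8 quantization for a calibrated range.

    Returns the INT8 codes and their dequantized float values.
    """
    scale = (calib_max - calib_min) / 255.0
    zero_point = int(round(-128 - calib_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale * (q.astype(np.float32) - zero_point)

# Hypothetical activation tensor whose useful signal lives in [-0.5, 0.5].
signal = np.linspace(-0.5, 0.5, 1024, dtype=np.float32)

# Calibrated range matches the signal: the INT8 range is well used.
q_good, _ = fake_quantize(signal, -0.5, 0.5)

# Calibrated range inflated by one outlier of magnitude 200:
# every signal value rounds to the same INT8 code.
q_bad, deq_bad = fake_quantize(signal, -200.0, 200.0)

levels_good = len(np.unique(q_good))
levels_bad = len(np.unique(q_bad))
print(levels_good, levels_bad)
```

If the tensor in question is a cross-attention key or value, the second case means every source position looks identical to the decoder.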


Why the fused graph is probably the wrong deployment shape

A fused encoder-decoder generation graph is the least debuggable shape for this problem.

Seq2seq inference naturally looks like this:

1. Run encoder once.
2. Repeatedly run decoder for each output token.
3. Select the next token outside the model.
4. Stop on EOS or max length.

The usual deployment structure is therefore:

encoder model:
  input_ids, attention_mask
  → encoder_hidden_states

decoder-step model:
  decoder_input_ids, decoder_attention_mask, encoder_hidden_states, encoder_attention_mask
  → next-token logits

Then the host application runs greedy search, beam search, EOS handling, and repetition logic outside the model.

This is also the shape used by common seq2seq export/deployment tooling. For example, Hugging Face’s Optimum ONNX export guide discusses decoder export with past key/value reuse because the decoder runs repeatedly during autoregressive generation.

A fused graph often hides too much:

encoder
decoder
decoder loop
mask updates
shape operations
possibly beam expansion
possibly TILE
token selection
EOS handling

That makes all of these harder:

  • calibration,
  • static-shape control,
  • operator support,
  • delegate partitioning,
  • cross-attention inspection,
  • first-step source-sensitivity testing,
  • quantized boundary debugging.

For this case, I would not keep pushing the fused graph as the primary production path.


Recommended working approach

The most realistic path forward is:

encoder_int8.tflite
+
decoder_step_int8.tflite
+
host-side generation loop

Do not export generate() as one fused TFLite graph unless there is no alternative.

Target layout

encoder.tflite

inputs:
  input_ids: int32
  attention_mask: int32

outputs:
  encoder_hidden_states: int8

decoder_step.tflite

inputs:
  decoder_input_ids: int32
  decoder_attention_mask: int32
  encoder_hidden_states: int8
  encoder_attention_mask: int32

outputs:
  logits: int8

Host-side decoding:

encoder_states = run_encoder(input_ids, attention_mask)

decoder_ids = [decoder_start_token_id]

for step in range(max_new_tokens):
    logits = run_decoder_step(
        decoder_input_ids=decoder_ids,
        decoder_attention_mask=make_decoder_mask(decoder_ids),
        encoder_hidden_states=encoder_states,
        encoder_attention_mask=attention_mask,
    )

    next_id = select_next_token(logits)
    decoder_ids.append(next_id)

    if next_id == eos_token_id:
        break
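
The loop above leaves `make_decoder_mask` and `select_next_token` undefined. A minimal sketch of both follows, assuming a static decoder length, greedy decoding, and logits shaped `(1, max_len, vocab)`. The extra `max_len` and `step` parameters are hypothetical additions for the static-shape case and are not part of the loop as written.

```python
import numpy as np

def make_decoder_mask(decoder_ids, max_len):
    """Build a (1, max_len) int32 mask: 1 for real prefix tokens, 0 for padding."""
    if len(decoder_ids) > max_len:
        raise ValueError("decoder prefix is longer than the static decoder length")
    mask = np.zeros((1, max_len), dtype=np.int32)
    mask[0, :len(decoder_ids)] = 1
    return mask

def select_next_token(logits, step):
    """Greedy choice: argmax over the vocabulary at the last real position.

    Assumes logits has shape (1, max_len, vocab); `step` is the index of the
    last non-padding decoder token.
    """
    return int(np.argmax(logits[0, step]))
```

With a static-shape decoder, `decoder_ids` must also be padded to `max_len` before each call, and `step` is `len(decoder_ids) - 1`.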

This structure gives you a way to test each boundary:

FP32 encoder → FP32 decoder
INT8 encoder → FP32 decoder
FP32 encoder → INT8 decoder
INT8 encoder → INT8 decoder
INT8 encoder → INT8 decoder on hardware delegate

That isolates whether the failure comes from:

  • encoder quantization,
  • decoder quantization,
  • the encoder-output / decoder-input boundary,
  • cross-attention,
  • logits,
  • or the delegate.

The hard part: quantized encoder/decoder boundary

If you split the graph, the encoder output and decoder input may have different quantization parameters.

Example:

encoder output:
  scale_e
  zero_point_e

decoder encoder_hidden_states input:
  scale_d
  zero_point_d

You cannot blindly pass raw INT8 bytes from the encoder output into the decoder input unless the quantization parameters match.

If they differ, you need an explicit requantization bridge:

real_value = scale_e * (q_e - zero_point_e)
q_d = round(real_value / scale_d + zero_point_d)
q_d = clamp(q_d, -128, 127)

This boundary is important. A broken boundary can produce exactly the same symptom as broken cross-attention: the decoder runs but receives meaningless encoder memory.
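
A minimal NumPy sketch of that requantization bridge, assuming per-tensor quantization on both sides. The function name `requantize` is illustrative only.

```python
import numpy as np

def requantize(q_e, scale_e, zero_point_e, scale_d, zero_point_d):
    """Map INT8 encoder outputs onto the decoder input's quantization grid."""
    # Dequantize with the encoder output parameters.
    real_value = scale_e * (np.asarray(q_e).astype(np.float64) - zero_point_e)
    # Quantize again with the decoder input parameters.
    q_d = np.round(real_value / scale_d) + zero_point_d
    # Saturate to the INT8 range.
    return np.clip(q_d, -128, 127).astype(np.int8)
```

The scale and zero point on each side can be read from the `"quantization"` entry of `interpreter.get_output_details()` for the encoder and `interpreter.get_input_details()` for the decoder. If the parameters already match, the function returns its input unchanged, which makes it safe to leave in place.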

For debugging, temporarily test these variants:

FP32 encoder output → FP32 decoder
INT8 encoder output → dequantized float → FP32 decoder
FP32 encoder output → quantized decoder input → INT8 decoder
INT8 encoder output → requantized decoder input → INT8 decoder

Only the final variant is close to strict deployment, but the intermediate variants tell you where the information is lost.


Calibration strategy

The representative dataset must cover actual generation states.

Do not calibrate only source inputs.

Do not calibrate only BOS.

Do not calibrate only full teacher-forced targets if deployment uses step-by-step decoding.

A better calibration set should include multiple decoder prefixes per source example.

Bad calibration pattern

def representative_dataset():
    for batch in source_batches:
        yield {
            "input_ids": batch["input_ids"],
            "attention_mask": batch["attention_mask"],
        }

That may calibrate the encoder path but not the decoder cross-attention behavior used during generation.

Better calibration pattern

def representative_dataset():
    for src_text, tgt_text in calibration_pairs:
        src = source_tokenizer(
            src_text,
            max_length=SRC_LEN,
            padding="max_length",
            truncation=True,
            return_tensors="np",
        )

        tgt = target_tokenizer(
            tgt_text,
            max_length=TGT_LEN,
            padding=False,
            truncation=True,
            return_tensors="np",
        )

        target_ids = tgt["input_ids"][0]

        for prefix_len in [1, 2, 4, 8, 16, 32]:
            if prefix_len > len(target_ids):
                continue

            decoder_prefix = target_ids[:prefix_len]
            decoder_prefix = pad_to_length(
                decoder_prefix,
                length=DECODER_PREFIX_LEN,
                pad_id=target_pad_id,
            )

            yield {
                "input_ids": src["input_ids"].astype("int32"),
                "attention_mask": src["attention_mask"].astype("int32"),
                "decoder_input_ids": decoder_prefix[None, :].astype("int32"),
                "decoder_attention_mask": (decoder_prefix[None, :] != target_pad_id).astype("int32"),
            }

The exact input names must match the SavedModel signature.
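
The `pad_to_length` helper used above is not defined in the snippet. One possible version is sketched below, assuming right-padding and a 1-D array of token IDs. It raises instead of silently truncating, so `DECODER_PREFIX_LEN` needs to be at least as large as the longest prefix length in the loop.

```python
import numpy as np

def pad_to_length(ids, length, pad_id):
    """Right-pad a 1-D sequence of token IDs to a fixed length (int32)."""
    ids = np.asarray(ids, dtype=np.int32)
    if ids.shape[0] > length:
        raise ValueError("sequence is longer than the static target length")
    padded = np.full((length,), pad_id, dtype=np.int32)
    padded[: ids.shape[0]] = ids
    return padded
```

One caveat: the mask in the calibration snippet is derived from `!= target_pad_id`, so it will be wrong for models whose decoder start token equals the pad token. In that case build the mask from the prefix length instead.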

Calibration coverage checklist

Include:

short source examples
normal source examples
long source examples
max-length source examples
padding-heavy examples
near-no-padding examples
rare names and numerals
punctuation-heavy examples
domain-specific examples
BOS / forced decoder-start token
early decoder prefix
middle decoder prefix
near-EOS decoder prefix

A useful rule of thumb:

200 source examples × 5 decoder prefixes

is usually more informative than:

1000 source examples × only BOS

because the former covers more activation regimes.


Converter configuration advice

There is probably no single converter flag that fixes this.

Still, I would run these baselines.

1. Float TFLite baseline

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
tflite_model = converter.convert()

with open("<model_float>.tflite", "wb") as f:
    f.write(tflite_model)

If this fails, stop. The issue is export/lowering, not INT8.

2. Dynamic-range baseline

converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()

with open("<model_dynamic_range>.tflite", "wb") as f:
    f.write(tflite_model)

If dynamic-range quantization works while full INT8 fails, weights are probably not the main problem. The problem is activation quantization.

3. Full INT8 baseline

converter = tf.lite.TFLiteConverter.from_saved_model("<saved_model_dir>")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()

with open("<model_int8>.tflite", "wb") as f:
    f.write(tflite_model)

Be careful with:

converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

For text models, token IDs are categorical integer indices, not numeric image/audio activations. input_ids and masks often remain int32. Do not blindly force token IDs to INT8.

Always inspect the converted model:

interpreter = tf.lite.Interpreter(model_path="<model_int8>.tflite")
interpreter.allocate_tensors()

print("Inputs:")
for item in interpreter.get_input_details():
    print(item["name"], item["dtype"], item["shape"], item["quantization"])

print("Outputs:")
for item in interpreter.get_output_details():
    print(item["name"], item["dtype"], item["shape"], item["quantization"])

If the final logits are INT8, the host decoder must respect the output tensor’s scale and zero point.

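For greedy argmax, taking the argmax over the raw INT8 values gives the same result as dequantizing first, provided all logits share one scale and zero point (per-tensor quantization), because dequantization is then a monotonic map. Ties are more likely at INT8 resolution, though. For beam search, length penalty, temperature, top-k, or probability arithmetic, dequantization or careful fixed-point handling is safer.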
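
A minimal sketch of the host-side handling, assuming per-tensor quantized logits. The helper names are illustrative, and the scale and zero point would come from the `"quantization"` field printed by the inspection code above.

```python
import numpy as np

def dequantize(q, scale, zero_point):
    """Convert INT8 tensor values back to float using the tensor's parameters."""
    return scale * (np.asarray(q).astype(np.float32) - zero_point)

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
```

For beam search, the scores to accumulate would be `log_softmax(dequantize(q_logits, scale, zero_point))` rather than the raw INT8 values.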


About inference_output_type=tf.float32

This line is suspicious:

converter.inference_output_type = tf.float32

It is not necessarily the root cause of the collapse, but it is worth testing without it.

If the target is a strict INT8 hardware delegate, leaving a float output can create an awkward quantize/dequantize boundary or a partially non-integer interface. That may be acceptable for debugging, but it is not ideal for a strict integer deployment.

However, the repeated-token collapse is more likely caused by an internal activation/cross-attention quantization problem than by the output type alone.

I would test both:

# Debug-friendly interface
converter.inference_output_type = tf.float32

and:

# Strict integer numeric output, if compatible with your graph interface
converter.inference_output_type = tf.int8

Then compare:

full INT8 CPU output
full INT8 delegate output
first-step source sensitivity
BLEU

Why 16x8 is useful even though it is not deployable

The experimental mode:

tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8

is useful diagnostically because it tests whether INT8 activations are the problem.

If 16x8 improves quality but the runtime rejects TILE, the interpretation is:

The model likely needs more activation precision.
The target delegate cannot execute the more accurate path.

LiteRT documents 16-bit activations with 8-bit weights as an option that can help when activations are sensitive, but optimized kernel/delegate support is more limited than ordinary INT8.

So the TILE problem is not surprising. It is a runtime/delegate support failure, not proof that plain INT8 PTQ should work.


The first diagnostic test I would run

Before doing more BLEU evaluation, run a first-step source-sensitivity test.

Pick two very different source sentences:

source A: "The committee approved the budget after three hours of debate."
source B: "The patient developed a fever after the second injection."

Use the same decoder prefix:

decoder_input_ids = [decoder_start_token_id]

Compare:

FP32(source A, BOS) → logits_A_fp32
FP32(source B, BOS) → logits_B_fp32

INT8(source A, BOS) → logits_A_int8
INT8(source B, BOS) → logits_B_int8

Healthy behavior:

FP32 logits differ by source.
INT8 logits also differ by source.

Broken source-blind behavior:

FP32 logits differ by source.
INT8 logits are nearly identical across sources.

Example helper:

import numpy as np

def topk_ids(logits, k=10):
    flat = np.asarray(logits).reshape(-1)
    return np.argsort(flat)[-k:][::-1]

def compare_logits(logits_a, logits_b, k=10):
    a = np.asarray(logits_a).reshape(-1).astype(np.float64)
    b = np.asarray(logits_b).reshape(-1).astype(np.float64)

    top_a = topk_ids(a, k)
    top_b = topk_ids(b, k)

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    return {
        "argmax_a": int(top_a[0]),
        "argmax_b": int(top_b[0]),
        "same_argmax": bool(top_a[0] == top_b[0]),
        "topk_overlap": len(set(top_a.tolist()) & set(top_b.tolist())),
        "cosine": float(cosine),
        "range_a": float(a.max() - a.min()),
        "range_b": float(b.max() - b.min()),
        "top_a": top_a.tolist(),
        "top_b": top_b.tolist(),
    }

This is more informative than BLEU.

BLEU tells you the model is broken. First-step source sensitivity tells you whether the encoder context is already gone at the first decoder step.
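
The four-way comparison can be reduced to a single verdict. The sketch below is self-contained and uses plain cosine similarity. The 0.999 threshold is an arbitrary assumption to tune per model, since raw logits often share a large common component.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flattened logit tensors."""
    a = np.asarray(a, dtype=np.float64).reshape(-1)
    b = np.asarray(b, dtype=np.float64).reshape(-1)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def source_sensitivity_report(fp32_a, fp32_b, int8_a, int8_b, blind_threshold=0.999):
    """Flag the case where FP32 reacts to the source but INT8 does not."""
    fp32_cos = cosine(fp32_a, fp32_b)
    int8_cos = cosine(int8_a, int8_b)
    return {
        "fp32_cosine": fp32_cos,
        "int8_cosine": int8_cos,
        "int8_same_argmax": bool(np.argmax(int8_a) == np.argmax(int8_b)),
        # Source-blind: INT8 logits nearly identical although FP32 logits differ.
        "looks_source_blind": int8_cos >= blind_threshold and fp32_cos < blind_threshold,
    }
```

Pass in the first-step logits for source A and source B from each model, dequantized if the INT8 model returns INT8 outputs.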


Use Quantization Debugger

Use TFLite’s Quantization Debugger to identify where error first explodes.

Inspect these tensors first:

encoder output
decoder layer 0 cross-attention query
decoder layer 0 cross-attention key
decoder layer 0 cross-attention value
decoder layer 0 cross-attention scores
decoder layer 0 cross-attention output
post-cross-attention residual
final decoder hidden state
LM-head logits

Look for:

Observation | Likely meaning
Encoder hidden states saturated | Encoder output quantization is bad
Cross-attention K/V nearly constant | Source memory is destroyed
Attention scores nearly constant | Decoder cannot select source positions
Attention scores extreme | Softmax collapses
Cross-attention output near zero | Source signal is muted
Residual dominates attention output | Encoder signal is drowned
Logits almost identical across sources | Decoder is source-blind
Logits saturated | Final projection/output scale problem

Selective quantization can also be useful diagnostically. For example, leave one region float and see whether BLEU recovers:

leave encoder output float
leave cross-attention K/V projections float
leave attention score path float
leave post-cross-attention residual float
leave LM head float

This may not be deployable on an INT8-only delegate, but it can identify the tensor group that kills the model.
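
Once the debugger has dumped its per-layer statistics to CSV, ranking layers by RMSE relative to the quantization scale is a quick way to find where the error starts. The sketch below uses only the standard library. The column names (`op_name`, `tensor_name`, `scale`, `mean_squared_error`) are assumed to match the debugger's default dump, so check them against your actual file.

```python
import csv
import io
import math

def rank_by_rmse_over_scale(csv_text, top_n=5):
    """Return the top_n layers with the largest RMSE / scale ratio.

    Each result is a tuple (ratio, tensor_name, op_name), worst first.
    """
    ranked = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        scale = float(row["scale"])
        if scale <= 0:
            continue  # skip rows without a usable scale
        rmse = math.sqrt(float(row["mean_squared_error"]))
        ranked.append((rmse / scale, row["tensor_name"], row["op_name"]))
    ranked.sort(reverse=True)
    return ranked[:top_n]
```

For a tensor whose only error is uniform rounding, this ratio sits near 1/sqrt(12), roughly 0.29. Layers far above that are the ones to inspect, starting with the cross-attention tensors listed earlier.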


Full diagnostic ladder

Run the same evaluation set through these variants:

Variant | Purpose | Interpretation
Original FP32 TensorFlow / Transformers | Reference | Should reproduce BLEU around 23.9
Float TFLite CPU | Export/lowering check | If bad, quantization is not the first problem
Dynamic-range TFLite CPU | Weight-quantization check | If good, weights are not the main issue
Full INT8 TFLite CPU | Quantization check | If bad, calibration/numerics are failing
Full INT8 TFLite delegate | Runtime check | If CPU good but delegate bad, runtime/delegate is failing
16x8 TFLite CPU, if possible | Activation-precision check | If better, INT8 activations are the bottleneck

The key split is:

float TFLite bad
→ export/lowering/fused-graph issue

float TFLite good, INT8 CPU bad
→ quantization/calibration issue

INT8 CPU good, INT8 delegate bad
→ delegate/operator/kernel issue

16x8 better than INT8
→ activation precision issue
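
The same split can be written as a small triage function over measured BLEU scores. The dictionary keys and the 0.9 "good enough" ratio are assumptions for illustration, not fixed conventions.

```python
def diagnose(bleu, ok_ratio=0.9):
    """Map BLEU scores per variant to the most likely failure category.

    Expects at least "fp32", "float_tflite" and "int8_cpu" keys. A missing
    variant counts as failing, except "int8_delegate", which is skipped
    when it has not been measured.
    """
    reference = bleu["fp32"]

    def good(name):
        score = bleu.get(name)
        return score is not None and score >= ok_ratio * reference

    if not good("float_tflite"):
        return "export/lowering/fused-graph issue"
    if not good("int8_cpu"):
        if good("int16x8_cpu"):
            return "activation precision issue"
        return "quantization/calibration issue"
    if "int8_delegate" in bleu and not good("int8_delegate"):
        return "delegate/operator/kernel issue"
    return "no failure detected at this threshold"
```

With the numbers from the question, `diagnose({"fp32": 23.9, "float_tflite": 23.7, "int8_cpu": 0.04})` lands in the quantization/calibration branch.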

What to do if split PTQ still fails

If the split encoder/decoder-step model still collapses after proper calibration, the realistic options are:

1. Quantization-aware training

Use QAT if PTQ cannot meet the accuracy target.

Important: do QAT on the deployment-shaped graph, not only on the original training graph.

That means:

same max source length
same decoder-step shape
same masks
same BOS/EOS behavior
same tokenizer
same target delegate constraints
same quantized encoder/decoder boundary

2. Distillation into a quantization-friendly model

If the original architecture is too sensitive, distill into a smaller model designed for the target constraints:

fixed source length
fixed decoder-step shape
simpler attention pattern
no fused generation graph
delegate-supported ops only
QAT or PTQ-aware evaluation from the beginning

3. Runtime change, if possible

If the target can change, use a Transformer-native runtime instead of generic fused TFLite.

CTranslate2 supports many encoder-decoder Transformer families and quantization modes. Even if it cannot be shipped on the final target, it is useful as a sanity check:

If CTranslate2 INT8 works but TFLite INT8 collapses,
the model is probably quantizable,
but the current TFLite path is not preserving it.

4. Requirement change

If the hardware delegate truly requires plain full INT8 TFLite and the model cannot survive that path, the requirement may be incompatible with the model family.

Possible requirement changes:

allow int16 activations
allow selected float fallback
allow a custom op
allow a different runtime
allow a smaller/distilled model
allow server-side inference

What I would not spend time on

Blindly adding more calibration samples

More data does not fix the wrong calibration distribution.

Bad:

1000 source examples × BOS only

Better:

200 source examples × multiple decoder prefixes

Blindly trying converter flags

Converter flags are secondary. The primary issue is graph shape and activation calibration.

Assuming TILE is the root cause of BLEU collapse

TILE explains why the 16x8 path is not viable on the target. It does not by itself explain why full INT8 repeats tokens. These are related deployment constraints, but not the same failure.

Assuming the converter understands generation semantics

The converter lowers tensors and ops. It does not know that a certain tensor is “encoder memory that must preserve source conditioning.”


My final recommendation

I would proceed like this:

  1. Do not keep the fused graph as the main production candidate.
  2. Build a float TFLite baseline and verify it matches the original model.
  3. Build a dynamic-range TFLite baseline.
  4. Split into:
    • encoder.tflite
    • decoder_step.tflite
  5. Calibrate the decoder-step model using real decoder prefixes across multiple timesteps.
  6. Run full INT8 CPU before using the hardware delegate.
  7. Run the first-step source-sensitivity test.
  8. Use Quantization Debugger around encoder output and decoder cross-attention.
  9. Explicitly handle the quantized encoder-output / decoder-input boundary.
  10. If PTQ still collapses, move to QAT or distillation.
  11. If CPU INT8 works but the delegate fails, treat it as a delegate/operator support problem.

The concise diagnosis is:

The converted model is probably not failing because TFLite cannot tokenize, decode, or run the graph at all. It is failing because full INT8 static PTQ has destroyed the numerical path that carries encoder information into decoder cross-attention. The decoder still emits tokens, but it no longer receives useful source context, so it falls back to repeated high-prior tokens and BLEU collapses.


Short answer

  • Yes, this is a known full-INT8 PTQ failure class.
  • No, there is probably not one converter flag that fixes a fused encoder-decoder generation graph.
  • The likely broken region is cross-attention or the encoder-hidden-state boundary.
  • The recommended deployment shape is split encoder + decoder-step, with generation outside TFLite.
  • Calibration must include real decoder prefixes across timesteps, not only source inputs.
  • Use float TFLite, dynamic-range TFLite, full INT8 CPU, and full INT8 delegate as separate baselines.
  • Use first-step source-sensitivity tests and Quantization Debugger before relying only on BLEU.
  • If careful split PTQ still fails, use QAT, distillation, or a different runtime/precision target.

Hi,

Thanks a lot for your initial response, it pointed me in the right direction.

Quick update on what I’ve found after several weeks of testing:

Confirmed: PTQ INT8 via TFLiteConverter is indeed broken on the decoder side of seq2seq Transformer architectures. I reproduced the issue on two separate models (MarianMT Helsinki-NLP/opus-mt-en-fr and t5-small), with the same symptom: the encoder converts cleanly to INT8, but the decoder produces garbage outputs (random tokens, empty strings, or nonsensical translations). FP32 works perfectly on both.

The root cause appears to be miscalibrated quantization scales on the cross-attention layers: the representative dataset only sees encoder inputs, so the decoder’s activations are never properly calibrated.

I’m now exploring QAT as a potential fix, but I’m hitting a wall on the TFLite side specifically. Most documented success stories with optimum + ONNX Runtime run on CPU, while the TFLite export path for seq2seq remains largely undocumented.

If anyone has successfully deployed a quantized seq2seq Transformer to TFLite (not ONNX Runtime), especially on a custom hardware delegate, I’d love to hear about it.

Thanks again.

I can’t find a single real-world example of this working “as-is” through a search…


Are there any real solutions for full-INT8 TFLite seq2seq Transformer deployment?

Short answer: yes, but not as a simple TFLiteConverter flag.

For a Hugging Face-style encoder-decoder Transformer such as T5, MarianMT, BART, mBART, Pegasus, M2M100, or NLLB, the realistic solution is not:

converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

and done.

That path can produce a valid .tflite file while the decoder becomes numerically useless. The likely reason is that the decoder’s cross-attention path is not being calibrated correctly. The encoder can quantize cleanly, while the decoder loses source conditioning and starts producing repeated tokens, random tokens, empty strings, or nonsensical translations.

The most realistic path is:

encoder_int8.tflite
+
decoder_step_int8.tflite
+
host-side generation loop
+
explicit decoder calibration
+
explicit encoder→decoder quantized boundary handling
+
Quantization Debugger
+
possibly decoder-step QAT or a custom attention delegate

There is probably no public turnkey recipe for this exact target today.


Current state of the field

I would summarize the situation like this:

Full-INT8 TFLite deployment of a Hugging Face-style encoder-decoder Transformer decoder is not a mature public path. There are good public resources for TFLite INT8 in general, good public resources for ONNX/CTranslate2 seq2seq deployment, and good research on Transformer quantization. But I could not find a validated public example of a T5/MarianMT/BART-style encoder.tflite + decoder_step.tflite full-INT8 deployment with working decoder cross-attention and custom delegate execution.


Why the fused PTQ path fails

A fused seq2seq graph hides too much.

A seq2seq Transformer naturally runs like this:

1. Run encoder once.
2. Repeatedly run decoder for each generated token.
3. Select the next token outside the model.
4. Stop on EOS or max length.

The decoder uses the source through cross-attention:

decoder hidden state → Q
encoder hidden state → K, V

attention_scores = Q @ K.T
attention_probs = softmax(attention_scores + mask)
context = attention_probs @ V
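
A runnable single-head version of the pseudocode above, as a NumPy sketch. It adds the usual 1/sqrt(d) score scaling, uses the encoder memory directly as K and V (the learned projections are omitted), and models quantization collapse crudely by zeroing the memory. All of these are simplifications for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(q, k, v, mask=None):
    """Single-head cross-attention: decoder queries attend over encoder memory."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask  # additive mask, large negative at padded positions
    return softmax(scores) @ v

rng = np.random.default_rng(0)
query = rng.normal(size=(1, 8))     # one decoder position
memory_a = rng.normal(size=(5, 8))  # encoder states for source A
memory_b = rng.normal(size=(5, 8))  # encoder states for source B

# Healthy path: the context depends on the source.
ctx_a = cross_attention(query, memory_a, memory_a)
ctx_b = cross_attention(query, memory_b, memory_b)

# Collapsed path: every K/V value dequantizes to the same number (zero here).
dead_a = np.zeros_like(memory_a)
dead_b = np.zeros_like(memory_b)
ctx_a_dead = cross_attention(query, dead_a, dead_a)
ctx_b_dead = cross_attention(query, dead_b, dead_b)
```

In the collapsed case the two contexts are identical, so nothing downstream of cross-attention can distinguish source A from source B.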

If INT8 quantization corrupts this path, the decoder can still emit tokens because it still has:

  • decoder token embeddings,
  • decoder self-attention,
  • learned language-model priors,
  • LM-head bias,
  • forced BOS/language-token priors.

But it no longer receives useful source information. That produces:

same-ish output for unrelated sources
repeated tokens
empty strings
random tokens
nonsensical translations
BLEU collapse

That is not ordinary quantization loss. That is source-conditioning failure.


Why encoder-only calibration is insufficient

TFLite full integer quantization depends on representative data to calibrate activation ranges.

For a decoder, representative data must cover the decoder state distribution, not only source inputs.

Bad calibration:

def representative_dataset():
    for source in sources:
        yield {
            "input_ids": source["input_ids"],
            "attention_mask": source["attention_mask"],
        }

That mostly calibrates the encoder path.

A decoder needs calibration samples like:

decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask

and those samples must represent real generation states:

BOS-only prefix
early target prefix
middle target prefix
near-EOS prefix
short source
long source
padding-heavy source
near-no-padding source
names / numbers / rare tokens
domain-specific examples

A better calibration strategy is:

200 source examples × 5 decoder prefixes

not:

1000 source examples × encoder only

The issue is not just dataset size. It is whether the decoder cross-attention tensors are ever exercised with realistic activation ranges.


Solution 1: Split the graph

The first serious solution is to stop trying to deploy the fused graph.

Do not make this the production target:

fused_seq2seq_int8.tflite

Use this instead:

encoder_int8.tflite
decoder_step_int8.tflite
host_generation_loop

This matches the architecture used by mature seq2seq export flows. Hugging Face Optimum’s ONNX path, for example, handles encoder-decoder generation by separating encoder and decoder behavior, including decoder past-key-value reuse for autoregressive generation.

Target layout:

encoder_int8.tflite

inputs:
  input_ids: int32
  attention_mask: int32

outputs:
  encoder_hidden_states: int8

decoder_step_int8.tflite

inputs:
  decoder_input_ids: int32
  decoder_attention_mask: int32
  encoder_hidden_states: int8
  encoder_attention_mask: int32

outputs:
  logits: int8

Host-side generation:

encoder_states = run_encoder(input_ids, attention_mask)

decoder_ids = [decoder_start_token_id]

for step in range(max_new_tokens):
    logits = run_decoder_step(
        decoder_input_ids=decoder_ids,
        decoder_attention_mask=make_decoder_mask(decoder_ids),
        encoder_hidden_states=encoder_states,
        encoder_attention_mask=attention_mask,
    )

    next_id = select_next_token(logits)
    decoder_ids.append(next_id)

    if next_id == eos_token_id:
        break

This does not automatically fix quantization, but it makes the problem debuggable.


Solution 2: Build decoder-specific representative data

The decoder representative dataset must feed the decoder signature directly.

Conceptual decoder calibration:

def representative_decoder_dataset():
    for src_text, tgt_text in calibration_pairs:
        encoder_inputs = tokenize_source(src_text)

        # For debugging:
        #   Use FP32 encoder states.
        #
        # For deployment fidelity:
        #   Use quantized encoder states plus the real encoder→decoder requantization bridge.
        encoder_hidden_states = run_encoder_for_calibration(encoder_inputs)

        target_ids = tokenize_target(tgt_text)

        for prefix_len in [1, 2, 4, 8, 16, 32]:
            if prefix_len > len(target_ids):
                continue

            prefix = target_ids[:prefix_len]
            prefix = pad_to_static_length(prefix, DECODER_LEN)

            yield {
                "decoder_input_ids": prefix.astype("int32"),
                "decoder_attention_mask": make_decoder_mask(prefix).astype("int32"),
                "encoder_hidden_states": encoder_hidden_states,
                "encoder_attention_mask": encoder_inputs["attention_mask"].astype("int32"),
            }
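
The calibration sketch above relies on `pad_to_static_length` and `make_decoder_mask`, which are not defined in this answer. A minimal combined version could look like the following. The name `pad_and_mask`, the batch dimension of 1, and the pad id are assumptions. The mask is built from the true prefix length rather than by comparing against the pad id, because in T5-style models the decoder start token and the pad token share the same id.

```python
import numpy as np

def pad_and_mask(prefix_ids, static_len, pad_token_id):
    """Pad a decoder prefix to a fixed length and build the matching 0/1 mask.

    Returns two int32 arrays of shape (1, static_len).
    """
    n = len(prefix_ids)
    if n > static_len:
        raise ValueError(f"prefix length {n} exceeds static length {static_len}")
    ids = np.full((1, static_len), pad_token_id, dtype=np.int32)
    ids[0, :n] = prefix_ids
    mask = np.zeros((1, static_len), dtype=np.int32)
    mask[0, :n] = 1  # real tokens only, even if a real token equals pad_token_id
    return ids, mask

# Example with made-up ids: start token 0 also happens to be the pad id.
ids, mask = pad_and_mask([0, 5, 7], static_len=6, pad_token_id=0)
print(ids)
print(mask)
```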

If your SavedModel has multiple signatures, the representative dataset can conceptually be split by signature:

def representative_dataset():
    for batch in encoder_calibration_batches:
        yield (
            "encode",
            {
                "input_ids": batch["input_ids"],
                "attention_mask": batch["attention_mask"],
            },
        )

    for batch in decoder_calibration_batches:
        yield (
            "decode",
            {
                "decoder_input_ids": batch["decoder_input_ids"],
                "decoder_attention_mask": batch["decoder_attention_mask"],
                "encoder_hidden_states": batch["encoder_hidden_states"],
                "encoder_attention_mask": batch["encoder_attention_mask"],
            },
        )

The key idea:

The decoder must be calibrated as a decoder, not as a side effect of encoder input calibration.


Solution 3: Handle the encoder→decoder quantized boundary

If the encoder and decoder are separate TFLite models, the boundary can break the model even if both models are individually valid.

The encoder output and decoder input may have different quantization parameters:

encoder output:
  scale_e
  zero_point_e

decoder encoder_hidden_states input:
  scale_d
  zero_point_d

You cannot blindly pass raw int8 bytes from the encoder output into the decoder input unless the quantization parameters match.

If they differ, requantize:

real_value = scale_e * (q_e - zero_point_e)
q_d = round(real_value / scale_d + zero_point_d)
q_d = clamp(q_d, -128, 127)
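
The same bridge as a NumPy sketch. The function name `requantize_int8` is hypothetical, and `np.round` rounds halves to even, which may differ from the rounding mode your runtime or delegate uses, so treat it as a reference for debugging rather than a bit-exact model of the hardware.

```python
import numpy as np

def requantize_int8(q_e, scale_e, zero_point_e, scale_d, zero_point_d):
    """Map int8 codes from the encoder's (scale, zero_point) to the decoder's."""
    real_value = scale_e * (q_e.astype(np.float64) - zero_point_e)
    q_d = np.round(real_value / scale_d) + zero_point_d
    return np.clip(q_d, -128, 127).astype(np.int8)

# Made-up parameters for illustration only.
encoder_codes = np.array([-128, 0, 126], dtype=np.int8)
bridged = requantize_int8(encoder_codes, scale_e=0.5, zero_point_e=0,
                          scale_d=1.0, zero_point_d=-10)
print(bridged)
```

When both sides report identical parameters the function is an identity, which makes it a cheap sanity check to leave in the host loop during bring-up.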

Deployment-style boundary test matrix:

| Encoder | Boundary | Decoder | Meaning |
|---|---|---|---|
| FP32 | float | FP32 | Split-graph reference |
| INT8 | dequantized float | FP32 | Tests encoder quality |
| FP32 | quantized to decoder input | INT8 | Tests decoder quality |
| INT8 | requantized | INT8 | Full deployment-like path |

If this boundary is wrong, the symptom can look exactly like broken cross-attention:

decoder runs
but receives meaningless encoder memory

Solution 4: Move cross-attention K/V projection to the encoder side

This is an architecture-level workaround.

Normally, each decoder layer computes K/V from encoder hidden states:

K_i = W_k_i(encoder_hidden_states)
V_i = W_v_i(encoder_hidden_states)

Instead, make the encoder-side artifact produce precomputed cross-attention memory:

encoder_int8.tflite

outputs:
  cross_k_layer_0
  cross_v_layer_0
  cross_k_layer_1
  cross_v_layer_1
  ...

Then make the decoder consume those tensors directly:

decoder_step_int8.tflite

inputs:
  decoder_input_ids
  decoder_attention_mask
  cross_k_layer_0
  cross_v_layer_0
  cross_k_layer_1
  cross_v_layer_1
  ...

Why this can help:

  • K/V are computed once, not every decoder step.
  • K/V become explicit graph outputs/inputs.
  • You can inspect their quantization parameters directly.
  • You can design the encoder→decoder boundary around K/V instead of generic hidden states.
  • The decoder graph becomes more predictable.

Tradeoff:

num_decoder_layers × 2 tensors

You get more interface complexity, but also much more control.

This is one of the most promising workarounds if the failure is specifically cross-attention K/V scale mismatch.
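
A small NumPy sketch of the idea, assuming single-head attention, random weights, and hypothetical helper names (`precompute_cross_kv`, `attend`). It only shows that the FP32 math is unchanged when the K/V projections move to the encoder side; it says nothing about how the quantized versions will behave.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def precompute_cross_kv(encoder_hidden, layer_weights):
    """Encoder side: one (K, V) pair per decoder layer, computed once per source."""
    return [(encoder_hidden @ w_k, encoder_hidden @ w_v) for w_k, w_v in layer_weights]

def attend(q, k, v):
    """Decoder side: attention over K/V that were handed in as inputs."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d_model, src_len, num_layers = 8, 5, 2
encoder_hidden = rng.normal(size=(src_len, d_model))
layer_weights = [
    (rng.normal(size=(d_model, d_model)), rng.normal(size=(d_model, d_model)))
    for _ in range(num_layers)
]

cross_kv = precompute_cross_kv(encoder_hidden, layer_weights)
q = rng.normal(size=(1, d_model))

matches = []
for (w_k, w_v), (k, v) in zip(layer_weights, cross_kv):
    inside_decoder = attend(q, encoder_hidden @ w_k, encoder_hidden @ w_v)
    from_encoder_side = attend(q, k, v)
    matches.append(bool(np.allclose(inside_decoder, from_encoder_side)))
print(matches)
```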


Solution 5: Use first-step source-sensitivity testing

Before relying on BLEU, test whether the decoder still sees the source.

Use two unrelated inputs:

source A: The committee approved the budget after a long debate.
source B: The patient developed a fever after the second injection.
decoder prefix: decoder_start_token_id

Compare first-step logits:

FP32(source A, BOS) vs FP32(source B, BOS)
INT8(source A, BOS) vs INT8(source B, BOS)

Healthy behavior:

FP32 logits differ across sources.
INT8 logits also differ across sources.

Broken behavior:

FP32 logits differ across sources.
INT8 logits are nearly identical across sources.

Minimal helper:

import numpy as np

def topk_ids(logits, k=10):
    flat = np.asarray(logits).reshape(-1)
    return np.argsort(flat)[-k:][::-1]

def compare_logits(logits_a, logits_b, k=10):
    a = np.asarray(logits_a).reshape(-1).astype(np.float64)
    b = np.asarray(logits_b).reshape(-1).astype(np.float64)

    top_a = topk_ids(a, k)
    top_b = topk_ids(b, k)

    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    return {
        "argmax_a": int(top_a[0]),
        "argmax_b": int(top_b[0]),
        "same_argmax": bool(top_a[0] == top_b[0]),
        "topk_overlap": len(set(top_a.tolist()) & set(top_b.tolist())),
        "cosine": float(cosine),
        "range_a": float(a.max() - a.min()),
        "range_b": float(b.max() - b.min()),
        "top_a": top_a.tolist(),
        "top_b": top_b.tolist(),
    }

This test is more diagnostic than BLEU.

BLEU tells you output quality is bad. First-step source sensitivity tells you whether the decoder lost encoder context immediately.
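
The healthy/broken distinction can be turned into a rough automatic check on top of the dictionaries returned by `compare_logits`. The function name and both thresholds below are assumptions, not calibrated values; tune them against a few known-good FP32 runs before trusting the verdict.

```python
def source_blind_verdict(fp32_cmp, int8_cmp, cosine_threshold=0.999, k=10):
    """Classify the INT8 decoder from two compare_logits-style result dicts.

    fp32_cmp: FP32(source A) vs FP32(source B)
    int8_cmp: INT8(source A) vs INT8(source B)
    """
    fp32_sensitive = fp32_cmp["cosine"] < cosine_threshold
    if not fp32_sensitive:
        # The reference model itself does not separate these two sources,
        # so the pair cannot tell us anything about the INT8 model.
        return "inconclusive"
    int8_identical = (
        int8_cmp["cosine"] >= cosine_threshold and int8_cmp["topk_overlap"] == k
    )
    return "source-blind" if int8_identical else "source-sensitive"

# Made-up numbers for illustration.
print(source_blind_verdict({"cosine": 0.62, "topk_overlap": 3},
                           {"cosine": 0.9999, "topk_overlap": 10}))
```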


Solution 6: Use Quantization Debugger and selective rescue

Use the official Quantization Debugger to locate the first catastrophic tensor.

Start with decoder layer 0:

decoder embedding output
decoder self-attention output
cross-attention Q
cross-attention K
cross-attention V
QK^T attention scores
attention probabilities
attention_probs @ V context
cross-attention output projection
post-cross-attention residual
LM-head logits

Interpretation table:

| Observation | Likely cause |
|---|---|
| K/V nearly constant | Encoder memory destroyed |
| Q/K scales incompatible | Dot product corrupted |
| Attention scores flat | Source selection lost |
| Attention scores extreme | Softmax collapse |
| Context vector near zero | Cross-attention muted |
| Residual dominates context | Source signal drowned |
| Logits same across source inputs | Decoder source-blind |
| Logits saturated | Output scale problem |
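
A few of these observations can be turned into numbers once you have dumped the intermediate tensors. The two helpers below are hypothetical and independent of the Quantization Debugger: one summarizes raw int8 codes (few distinct codes suggests a nearly constant tensor, a high saturated fraction suggests a scale problem), the other reports how flat or collapsed the attention probabilities are.

```python
import numpy as np

def int8_code_stats(q):
    """Summarize raw int8 codes of one tensor."""
    q = np.asarray(q)
    return {
        "distinct_codes": int(np.unique(q).size),
        "saturated_fraction": float(np.mean((q == -128) | (q == 127))),
    }

def attention_entropy_ratio(probs):
    """Mean row entropy divided by log(src_len).

    Close to 1: attention is flat. Close to 0: attention collapsed onto one position.
    """
    p = np.clip(np.asarray(probs, dtype=np.float64), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=-1)
    return float(entropy.mean() / np.log(p.shape[-1]))

# Made-up tensors for illustration.
stats = int8_code_stats(np.array([127, 127, -128, 0], dtype=np.int8))
flat = attention_entropy_ratio(np.full((2, 4), 0.25))
collapsed = attention_entropy_ratio(np.eye(4))
print(stats, flat, collapsed)
```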

Selective quantization is useful diagnostically:

leave K/V projections float
leave QK^T score path float
leave Softmax path float
leave cross-attention output projection float
leave post-cross-attention residual float
leave LM head float

If leaving a region float restores BLEU, that region is the failure point.

This may not be deployable on a strict INT8 delegate, but it tells you what must be fixed.


Solution 7: Decoder-step QAT

If explicit decoder PTQ still fails, QAT is the next real TFLite-native option.

Do not begin with the fused generation graph.

Begin with:

decoder_step_qat_model

Inputs:

decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask

Target:

next target token

Training objective:

teacher-forced next-token prediction

Prefix sampling:

BOS
BOS + token 1
BOS + tokens 1..3
middle prefix
near-EOS prefix

The QAT graph must match deployment:

same source length
same decoder prefix length
same masks
same decoder_start_token_id behavior
same encoder_hidden_states boundary
same logits output convention
same supported operator set

Important caveat:

Transformer attention QAT in TensorFlow/TFLite is not necessarily turnkey.

There are public issues around QAT support for MultiHeadAttention, which is a warning that you may need a custom Keras decoder-step implementation, custom QuantizeConfig, or manual fake-quant insertion.

Possible implementation routes:

custom decoder-step Keras model
custom QuantizeConfig
manual FakeQuant insertion
rewrite attention into quantizable primitives
train a smaller deployment-specific decoder
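
For the manual fake-quant route, the forward pass of a fake-quant node is simply quantize followed by dequantize. The NumPy sketch below shows only that forward behavior with a hypothetical function name; real QAT also needs a gradient rule (typically a straight-through estimator) and learned or tracked ranges, which TensorFlow provides through ops such as `tf.quantization.fake_quant_with_min_max_vars`.

```python
import numpy as np

def fake_quant_int8(x, scale, zero_point=0):
    """Forward pass of fake quantization: snap values to the int8 grid, stay in float."""
    q = np.clip(np.round(np.asarray(x, dtype=np.float64) / scale) + zero_point, -128, 127)
    return (q - zero_point) * scale

# Made-up values: a small activation survives, a large one saturates at 127 * scale.
out = fake_quant_int8([0.26, 100.0], scale=0.5)
print(out)
```

Training through this node is what lets the decoder weights adapt to the rounding and saturation it will see at inference time.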

Solution 8: Custom delegate or custom op for quantized cross-attention

If you own the hardware delegate, the most robust engineering solution may be to stop relying on generic TFLite decomposition for attention.

Implement quantized cross-attention as a delegate-supported fused subgraph or custom op.

A real quantized cross-attention implementation needs to control:

Q projection scale
K projection scale
V projection scale
QK^T accumulation scale
mask representation
Softmax approximation range
attention_probs scale
attention_probs @ V accumulation
context output scale
output projection scale
residual merge scale

This is much harder than “support INT8 matmul.”

Attention contains:

FULLY_CONNECTED
RESHAPE
TRANSPOSE
BATCH_MATMUL
ADD / mask
SOFTMAX
BATCH_MATMUL
FULLY_CONNECTED
ADD / residual
possibly LayerNorm-adjacent behavior

If a hardware vendor says “we support INT8 matrix multiplication,” that is not enough. Cross-attention requires correct scale propagation through the whole attention block.
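
To make "scale propagation" concrete, here is a sketch of just the QK^T step with symmetric int8 inputs (zero_point = 0, an assumption that keeps the example short). The int32 accumulator carries a real-valued scale of `scale_q * scale_k`; interpreting it with any other scale silently distorts every attention score. Function names and the random data are hypothetical.

```python
import numpy as np

def quantize_sym(x, scale):
    """Symmetric int8 quantization with zero_point = 0."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def int8_scores(q_int8, k_int8, scale_q, scale_k):
    """Integer QK^T with int32 accumulation; returns (accumulator, accumulator scale)."""
    acc = q_int8.astype(np.int32) @ k_int8.astype(np.int32).T
    return acc, scale_q * scale_k

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 16))
k = rng.normal(size=(6, 16))
scale_q = np.abs(q).max() / 127
scale_k = np.abs(k).max() / 127

acc, acc_scale = int8_scores(quantize_sym(q, scale_q), quantize_sym(k, scale_k),
                             scale_q, scale_k)
exact = q @ k.T
approx = acc * acc_scale      # correct scale: close to the float scores
wrong = acc * scale_q         # wrong scale: off by a factor of 1 / scale_k

print("max error, correct scale:", np.abs(approx - exact).max())
print("max error, wrong scale:  ", np.abs(wrong - exact).max())
```

The same bookkeeping has to hold at the softmax input, the `attention_probs @ V` accumulation, the output projection, and the residual add, which is why the list above is so long.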


Solution 9: Allow a precision exception if possible

If product constraints can change, the most natural accuracy fix is:

INT8 weights
+
INT16 or float activations for attention-sensitive paths

LiteRT documents a 16x8 mode (16-bit activations with 8-bit weights). This can help when activations are sensitive to quantization, but runtime/delegate support is often limited.

If 16x8 improves quality but fails due to TILE or another unsupported op, the diagnostic meaning is still useful:

The model probably needs more activation precision.
The current delegate cannot execute the more accurate path.

Possible compromise:

INT8 encoder
INT8 FFN/projections
INT16 or float cross-attention score path
INT8 output projection

This is not pure full-INT8, but it is often closer to what Transformer quantization actually needs.
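
A quick way to see why activation precision matters is to quantize a tensor with one outlier at 8 and at 16 bits. The sketch below uses synthetic data and a per-tensor symmetric scale; both are assumptions, and real attention activations will have their own distribution.

```python
import numpy as np

def quant_rmse(x, num_bits):
    """RMSE after symmetric per-tensor quantization to num_bits and back."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return float(np.sqrt(np.mean((q * scale - x) ** 2)))

rng = np.random.default_rng(0)
activations = rng.normal(size=4096)
activations[0] = 80.0  # a single outlier stretches the quantization range

err8 = quant_rmse(activations, 8)
err16 = quant_rmse(activations, 16)
print("int8 RMSE: ", err8)
print("int16 RMSE:", err16)
```

With 8 bits the outlier forces a step size so coarse that many ordinary values round to zero; with 16 bits the same range still leaves fine resolution for the bulk of the tensor.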


Solution 10: Distill or redesign the model for the target

If full-INT8 TFLite is absolutely mandatory and QAT/custom delegate work is too expensive, the best product path may be to change the model.

Options:

smaller encoder-decoder Transformer
fewer decoder layers
smaller hidden size
shorter max source length
fixed decoder-step window
reduced vocabulary
domain-specific translation model
non-autoregressive model if task allows
RNN/Conv seq2seq model if task allows

Train with deployment constraints from the beginning:

static shapes
teacher-forced decoder-step training
QAT during fine-tuning
delegate-supported ops only
fixed source length
fixed decoder step shape

This is less elegant, but often more robust than trying to force a general-purpose pretrained Transformer decoder into a strict embedded INT8 delegate.


Solution 11: Change runtime if allowed

If TFLite is negotiable, use a Transformer-native runtime.

CTranslate2

CTranslate2 supports many encoder-decoder Transformer families and multiple quantization modes.

This is the easiest way to answer:

Can this model family be quantized usefully at all?

If CTranslate2 INT8 works while TFLite INT8 fails, then the model is not inherently unquantizable. The TFLite path is the issue.

ONNX Runtime

ONNX Runtime has a more mature Transformer quantization story than TFLite for many workloads.

Important caveat:

ONNX Runtime success does not prove full-INT8 TFLite will work.

ONNX Runtime docs generally recommend dynamic quantization for Transformer-based models, while your target requires static full-INT8 behavior. Those are different deployment regimes.


Recommended execution plan

If TFLite is mandatory, I would do this in order.

Step 1: Build split FP32 TFLite

Create:

encoder_fp32.tflite
decoder_step_fp32.tflite

Verify:

split FP32 output ≈ original Transformers output

Do not quantize until this works.


Step 2: Quantize encoder only

Create:

encoder_int8.tflite
decoder_step_fp32.tflite

If quality remains good, the encoder is not the blocker.


Step 3: Quantize decoder with decoder-specific calibration

Create:

decoder_step_int8.tflite

Use representative samples with:

decoder_input_ids
decoder_attention_mask
encoder_hidden_states
encoder_attention_mask

Test:

FP32 encoder + INT8 decoder
INT8 encoder + INT8 decoder

Step 4: Test source sensitivity

Compare first-step logits for two unrelated source sentences.

If INT8 logits are nearly identical, the decoder is source-blind.


Step 5: Debug cross-attention tensors

Use Quantization Debugger around:

Q
K
V
QK^T
Softmax
context
residual
LM head

Find the first catastrophic divergence.


Step 6: Apply one targeted rescue

| Failure location | Targeted fix |
|---|---|
| Encoder output boundary | Explicit requantization bridge |
| K/V projections | Move K/V projection to encoder side |
| QK score path | Custom scale handling or higher precision |
| Softmax | Custom op/delegate or precision exception |
| Residual merge | QAT or scale control |
| LM head | Better calibration or QAT |

Step 7: Try decoder-step QAT

Use teacher-forced target prefixes.

Do not start with the fused model.


Step 8: Validate CPU INT8 before delegate

If CPU INT8 fails, the model is still quantization-broken.

If CPU INT8 works and delegate fails, the problem is delegate support.


What not to do

Do not keep iterating on fused PTQ as the main path.

Do not add only more encoder-side calibration data.

Do not assume inference_output_type=tf.float32 is the root cause.

Do not assume ONNX/CTranslate2 success transfers directly to TFLite.

Do not attempt QAT on the high-level fused Hugging Face model first.

Do not debug the custom delegate until CPU INT8 is correct.


Practical answer

If the question is:

Is there a solution?

The honest answer is:

Yes, in principle. But not as a turnkey TFLiteConverter PTQ workflow.

The most plausible TFLite-native solution is:

1. split encoder and decoder_step
2. calibrate decoder_step explicitly with real decoder prefixes
3. handle encoder→decoder requantization
4. use first-step source-sensitivity tests
5. use Quantization Debugger around cross-attention
6. use decoder-step QAT if PTQ fails
7. add custom delegate support only after CPU INT8 works

The most plausible non-TFLite solution is:

CTranslate2 or ONNX Runtime

The most robust product solution, if TFLite full INT8 is mandatory and QAT still fails, is:

distill or redesign the model for the delegate

Short summary

  • There is probably no simple converter flag that fixes this.
  • Fused full-INT8 PTQ is probably a dead end for this model class.
  • The first real solution is encoder.tflite + decoder_step.tflite.
  • The decoder needs representative calibration with real decoder prefixes.
  • The encoder→decoder quantized boundary must be handled explicitly.
  • Cross-attention K/V may need to move to the encoder side.
  • Use Quantization Debugger to locate the first bad tensor.
  • Decoder-step QAT is the next realistic TFLite-native path.
  • A custom attention delegate may be required for strict embedded INT8.
  • If runtime constraints can change, CTranslate2 or ONNX Runtime is likely a more mature path for Transformer quantization.
  • If constraints cannot change, distillation/redesign may be the most reliable product path.