On the HF website, if you open the model details, it will show the correct (complex and lengthy) chat template. Locally I only get a dumbed down version of it:
{{ if .System }}<|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
Turns out, this is what HF serves via the ollama model registry:
When I look into the gguf myself, the correct tokenizer.chat_template is still there.
This happens for multiple large quantizers, so the question is:
Is this a configuration error made by the quantizers, e.g. @bartowski, or a general HF issue?
The “official” version hosted by Ollama themselves does not seem to have this problem.
This is my first time here, please be gentle. I did research on this topic and didn’t find an answer.
Likely cause: HF’s Ollama registry is serving a lossy template, not the quantizer breaking the GGUF
I think you found a real integration-layer bug, or at least a dangerous fallback in the Hugging Face → Ollama compatibility path.
The short answer is:
This does not look primarily like a @bartowski / quantizer configuration error, assuming your GGUF inspection is correct. If the GGUF still contains the full tokenizer.chat_template, then the quantized file likely preserved the important metadata. The suspicious transformation happens later, when Hugging Face exposes the model through the Ollama-compatible registry endpoint.
The failing boundary appears to be:
GGUF metadata:
tokenizer.chat_template = full / complex / Gemma 4-specific
↓ Hugging Face Ollama compatibility layer
hf.co/v2/<repo>/manifests/<tag>:
application/vnd.ollama.image.template = short generic Go template
↓ Ollama pull/run via hf.co
ollama show --modelfile hf.co/<repo>:<tag>:
TEMPLATE = same short generic Go template
That is why the official Ollama model can behave differently: the official Ollama Gemma 4 path uses Ollama’s own Gemma 4 renderer, while the hf.co/v2 path appears to serve a static Ollama TEMPLATE layer.
A chat template is not just formatting. It is the serialization contract between structured chat messages and the raw token sequence the model actually sees.
A chat model does not literally receive this abstract structure:
It receives a rendered prompt string/token stream, for example with special role markers, turn delimiters, BOS/EOS behavior, tool declarations, image placeholders, thinking markers, and stop tokens. If that rendering is wrong, the model can load successfully but behave strangely.
So when Ollama locally sees only this simplified template:
{{ if .System }}<|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
or the HF registry blob serves:
{{ if .System }}<bos><|turn>system
{{ .System }}<turn|>
{{ end }}{{ if .Prompt }}<|turn>user
{{ .Prompt }}<turn|>
{{ end }}<|turn>model
{{ .Response }}<turn|>
that is not merely a shorter display version. It may be a materially different prompt format.
For simple one-turn text chat, this can appear to work. For Gemma 4, it is very likely incomplete.
Why I would not primarily blame the quantizer
A quantizer-side problem would be likely if one of these were true:
Observation
Likely interpretation
tokenizer.chat_template is missing from the GGUF
bad conversion / incomplete GGUF metadata
tokenizer.chat_template inside the GGUF is already simplified
converter or quantizer likely damaged metadata
the HF GGUF viewer also shows the simplified template
GGUF metadata likely wrong
the repo contains an explicit bad template file
repo packaging issue
the problem appears only in one quantizer’s repo
repo-specific issue more likely
But your evidence is different:
Layer
Your observation
What it suggests
GGUF metadata
full tokenizer.chat_template still exists
quantizer likely preserved the template
HF model details / GGUF view
shows the full complex template
HF can read the correct metadata
hf.co/v2/.../manifests/IQ2_XXS
contains a short application/vnd.ollama.image.template layer
registry compatibility layer is suspicious
template blob
159-byte generic Go template
conversion/selection/fallback likely lost semantics
local Ollama model
sees the simplified template
Ollama is consuming the served template
official Ollama Gemma 4
does not show this same problem
official path uses different rendering/configuration
That points away from “bad quantization” and toward “bad Ollama image template generated by the registry bridge.”
A quantizer can still add a workaround, but that is different from being the root cause.
The key distinction: GGUF metadata vs Ollama image template
The GGUF can contain one template while the Ollama registry image exposes another.
Your GGUF contains:
tokenizer.chat_template
The Ollama-compatible registry manifest contains:
application/vnd.ollama.image.template
Those are not the same artifact.
HF’s Ollama docs say that, by default, the template for ollama run hf.co/<namespace>/<repo> is selected from commonly used templates based on the GGUF’s built-in tokenizer.chat_template. The same docs say that if a repo provides a custom template file, it must be a Go template, not a Jinja template: HF Ollama docs.
So the intended pipeline is roughly:
GGUF tokenizer.chat_template
→ HF template selection / conversion
→ Ollama Go TEMPLATE
→ application/vnd.ollama.image.template
→ Ollama local Modelfile
Your evidence suggests that the pipeline is losing information in the middle.
Why Jinja → Go template conversion is fragile
HF / Transformers chat templates are generally Jinja-style templates. Ollama TEMPLATE uses Go template syntax.
Ollama’s own docs say TEMPLATE is the full prompt template passed to the model and that templates use Go template syntax: Ollama Modelfile Reference.
That means a bridge has to do one of these:
convert the Jinja template into a Go template;
map the Jinja template to a known built-in Go template;
use a custom model-family-specific handler;
fall back to a simpler template;
or fail.
For simple templates, this may be fine. For Gemma 4, Qwen thinking models, multimodal models, and native tool-calling models, this is fragile.
The public HF package @huggingface/ollama-utils is especially relevant because it says it handles conversion of GGUF/Jinja chat templates into the Go format used by Ollama. It also explicitly lists “the converted template is wrong” as a valid reason to add a custom handler/test.
That is almost exactly your case.
Why Gemma 4 is a bad fit for a tiny generic template
Gemma 4 is not just a plain text chat model with user and assistant turns.
The official Ollama Gemma 4 renderer handles Gemma 4-specific prompt rendering in code: Ollama gemma4.go. The renderer deals with things such as:
BOS emission;
system/developer messages;
<|turn> / <turn|> markers;
<|"|> string delimiters;
thinking mode;
tools;
tool declarations;
tool calls;
tool responses;
image tags;
adjacent assistant/model turns;
stripping thinking blocks;
generation-prompt behavior.
Google’s Gemma 4 function-calling docs also show that tools are passed through apply_chat_template() via the tools argument: Google Gemma 4 function calling.
vLLM’s Gemma 4 guide similarly treats Gemma 4 as needing specialized support for reasoning, tool calling, and dynamic multimodal behavior: vLLM Gemma 4 usage guide.
So the official / full behavior surface is closer to:
If that application/vnd.ollama.image.template layer is a simplified fallback, Ollama may simply use the bad template it was given.
So this difference is expected:
ollama run gemma4:<official-tag>
→ official Ollama packaging / renderer path
ollama run hf.co/bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ2_XXS
→ HF-generated Ollama registry image
→ static application/vnd.ollama.image.template
That does not mean the GGUF is bad. It means the wrapper/rendering path is different.
Most likely root cause
My ranking:
1. Most likely: HF Ollama template conversion/selection fallback
HF reads the GGUF template, tries to convert or classify it, and emits a generic Gemma-ish Go template instead of a faithful Gemma 4 template.
Possible mechanisms:
Gemma 4 is not handled by a custom mapping.
The Jinja template is too complex or non-linear.
The converter only supports a subset of the template.
The matcher recognizes the <|turn> markers and chooses a generic template.
Tool/thinking/multimodal branches are dropped.
A fallback template is emitted instead of failing loudly.
This is the most likely explanation.
2. Also possible: a static Go TEMPLATE cannot fully express official Gemma 4 rendering
Ollama’s official Gemma 4 support is renderer code, not just a template string.
Some behavior may be awkward or impossible to express faithfully in a static Go template, especially if the renderer needs to restructure messages, merge tool results, strip thinking, or parse tool calls.
So there are two different levels of fix:
Level
Possible fix
Narrow HF fix
Add a better Gemma 4 mapping/custom handler in the HF Ollama compatibility layer
Better Ollama fix
Let imported Gemma 4 GGUFs use the same Gemma 4 renderer path as official models
Broader ecosystem fix
Support Jinja chat templates directly in Ollama
There is already an Ollama feature request for Jinja chat-template support: ollama/ollama#10222.
3. Possible but less likely: repo-level template override
HF supports a repo-level template file for Ollama, but it must be a Go template: HF Ollama docs.
If the repo contains such a file and it is bad, that could be a repo packaging issue. But from your evidence, the important template is being served as an HF registry layer while the GGUF metadata remains correct.
4. Least likely from your evidence: quantizer damaged the GGUF
This becomes likely only if the GGUF metadata itself is missing, truncated, or simplified.
You said the opposite: the correct tokenizer.chat_template is still there.
How to prove the failing boundary cleanly
Package three artifacts:
the GGUF metadata;
the HF application/vnd.ollama.image.template blob;
llama.cpp’s template wiki says llama_chat_apply_template() uses the template stored in model metadata key tokenizer.chat_template by default and includes a Jinja parser called minja: llama.cpp template wiki.
5. Summarize the proof
Source
Result
GGUF metadata
full Gemma 4 tokenizer.chat_template
HF application/vnd.ollama.image.template blob
short generic Go template
ollama show --modelfile
same short generic Go template
That table makes the issue very clear.
Behavior tests to run
Do not test only hello. A generic template can pass trivial chat while failing important branches.
Test cases that matter:
Test
What it checks
one-turn prompt
baseline behavior
system prompt
system role rendering
multi-turn chat
history loop
assistant-history turn
assistant/model role rendering
thinking on/off
Gemma 4 thinking control
tool declaration
tool schema serialization
tool call
tool-call formatting
tool response
tool-response formatting
image input
multimodal placeholder handling
long answer / stop leak
stop tokens and turn terminators
The simplified template may only pass the first one or two.
Workarounds
Workaround 1: Use the official Ollama Gemma 4 model
For normal local usage, this is the safest workaround:
ollama pull gemma4:26b
ollama run gemma4:26b
Reason: the official Ollama path can use the dedicated Gemma 4 renderer: Ollama Gemma 4 renderer.
Downside: you may not get the exact community quant you wanted.
Workaround 2: Use llama.cpp / vLLM / another direct GGUF path
For testing the exact GGUF quant, use a runtime path that can apply the embedded template more directly.
Is the quantized GGUF itself bad, or is the Ollama wrapper bad?
If the same GGUF behaves better through llama.cpp/vLLM with a correct template, that supports the wrapper/template diagnosis.
Workaround 3: Import the GGUF manually into Ollama
You can bypass the hf.co/v2 registry path:
FROM /absolute/path/to/google_gemma-4-26B-A4B-it-IQ2_XXS.gguf
Then:
ollama create gemma4-local -f Modelfile
ollama show --modelfile gemma4-local
But this is not automatically a full fix. Manual import bypasses the HF registry template blob, but you still need a correct Ollama template or renderer behavior.
Workaround 4: Add a repo-level template file, if a faithful Go template exists
HF allows a repo-level template file for Ollama, but it must be a Go template, not Jinja: HF Ollama docs.
This may help for some models. For Gemma 4, be careful: a partial Go template can fix basic chat while still breaking tools, thinking, images, and parser behavior.
Workaround 5: Render the prompt yourself
For serious application testing, the most controlled workaround is:
use the HF tokenizer/processor;
apply the correct chat template yourself;
send the rendered prompt through a completion-style path;
The GGUF metadata appears correct. The problem appears between GGUF tokenizer.chat_template and the generated Ollama image template layer.
Secondary target: Ollama
Report to Ollama if you can show that imported Gemma 4 GGUFs should use the built-in Gemma 4 renderer but do not, or that the static TEMPLATE mechanism cannot represent official Gemma 4 rendering.
The GGUF metadata appears to contain the full tokenizer.chat_template, but the HF Ollama registry path is serving a simplified application/vnd.ollama.image.template layer. If a faithful Gemma 4 Go template is available, adding a repo-level template file might work around the issue for Ollama users.
That avoids wrongly blaming the quantizer.
Suggested issue body
## Summary
For `bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ2_XXS`, the GGUF metadata appears to contain the full Gemma 4 `tokenizer.chat_template`, but the HF Ollama registry endpoint serves a much shorter `application/vnd.ollama.image.template` layer.
When the model is pulled through Ollama using `hf.co/...`, the local Ollama Modelfile uses this simplified template.
This appears to lose important Gemma 4 chat-template semantics.
## Affected model
- Repo: `bartowski/google_gemma-4-26B-A4B-it-GGUF`
- Tag: `IQ2_XXS`
- Possibly affects other Gemma 4 GGUF repos/tags.
## Steps to reproduce
```sh
REPO="bartowski/google_gemma-4-26B-A4B-it-GGUF"
TAG="IQ2_XXS"
curl -sSf -L \
-H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
"https://hf.co/v2/${REPO}/manifests/${TAG}" \
| jq .
A simplified static template can make the model appear to run while silently degrading chat, tool, thinking, multimodal, or parser behavior.
Evidence to attach
hf-v2-manifest.json
hf-v2-template.txt
ollama-show-modelfile.txt
GGUF metadata snippet showing the full tokenizer.chat_template
---
## What maintainers could do
HF-side fixes:
- Add a Gemma 4-specific mapping/handler in the Ollama compatibility layer.
- Add tests comparing converted Go-template output against expected Gemma 4 rendering.
- Avoid silently serving a simplified fallback when conversion is incomplete.
- Document cases where a model’s Jinja template cannot be faithfully represented as an Ollama Go `TEMPLATE`.
Ollama-side fixes:
- Support Jinja chat templates directly.
- Let imported Gemma 4 GGUFs use equivalent Gemma 4 renderer behavior when metadata identifies the architecture.
- Expose clearer diagnostics when a model falls back to a generic template.
Quantizer/repo-side mitigations:
- Preserve `tokenizer.chat_template` in GGUF metadata.
- Document known Ollama limitations.
- Optionally add a repo-level Go `template` file if a faithful one exists.
---
## Final answer
So, answering your question directly:
**Is this a quantizer configuration error?**
Probably not, assuming the GGUF metadata really contains the full template.
**Is it a general HF issue?**
Most likely yes: more specifically, an issue in the HF Ollama registry / template conversion / template selection layer.
**Is Ollama also involved?**
Yes, structurally. Ollama uses Go templates for `TEMPLATE`, while HF/GGUF templates are generally Jinja-style. Official Gemma 4 in Ollama has a custom renderer, which imported HF GGUFs may not get.
**Why does the official Ollama model behave better?**
Because it can use Ollama’s dedicated Gemma 4 renderer. The HF `hf.co/v2` path appears to provide a generated static template layer instead.
**What should you do now?**
1. Use official Ollama Gemma 4 for normal local usage.
2. Use llama.cpp/vLLM/direct GGUF paths for testing the exact quant.
3. File a precise issue against HF’s Ollama compatibility layer.
4. Include the manifest, template blob, `ollama show --modelfile`, and GGUF metadata.
5. Do not primarily blame the quantizer unless the GGUF metadata itself is wrong.
The best one-sentence report is:
> The GGUF metadata contains the full Gemma 4 `tokenizer.chat_template`, but the HF `hf.co/v2` Ollama registry emits a simplified `application/vnd.ollama.image.template` layer, causing `ollama run hf.co/...` to use a template that does not preserve Gemma 4 chat-template semantics.