The correct answers to those questions may not exist, may vary depending on the type of output you want, or may depend on the software (backend) used to handle GGUF…
The right way is to treat GGUF setup as three separate layers:
- Model file: the .gguf weights and metadata
- Runtime/backend: Ollama, llama.cpp, llama-cpp-python, LM Studio, and similar
- Prompt/rendering layer: template, system prompt, stop strings, context size, and sampling settings
Most “bad GGUF results” come from getting layer 2 or 3 wrong, not from GGUF itself. GGUF is a binary format for inference with GGML-based executors. It is not, by itself, a full packaged runtime configuration. (GitHub)
The short answer
The safest general rule is:
- Do not invent the template
- Do not rely on backend defaults
- Do not compare setups unless quant, template, stop strings, context, and sampling all match
- Inspect what the backend is actually using before you override anything (Hugging Face)
If you follow that rule, GGUF models can perform at the same level as preconfigured models on a given backend. When they do not, the cause is usually one of these:
- wrong chat template
- wrong stop strings
- too-small context
- different backend defaults
- different quantization
- missing model-specific features such as tool or document support in the template. (Hugging Face)
1. The correct workflow
There is not one universal workflow. The correct workflow depends on the backend.
A. Raw llama.cpp workflow
For llama.cpp, the clean path is:
- download a trusted GGUF
- run it directly in llama-cli or llama-server
- let the runtime use the model’s embedded tokenizer.chat_template by default
- set context and sampling explicitly
- benchmark before tuning style. (GitHub)
With llama-server, there is usually no separate model-creation step. You point the server at the GGUF and run it. The server’s documented default is --ctx-size 0, which means “load context size from model metadata.” The server also exposes /apply-template to show exactly how messages are being rendered into a prompt string. (GitHub)
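As a rough illustration of the /apply-template check, the sketch below builds the request with the Python standard library. It assumes a llama-server instance listening at http://localhost:8080 (an assumed address), and that the endpoint accepts a JSON body with a messages field and answers with a prompt field, as the server README describes. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed address of a locally running llama-server; adjust to your setup.
BASE_URL = "http://localhost:8080"


def build_apply_template_request(messages, base_url=BASE_URL):
    """Build (but do not send) a POST request for /apply-template."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/apply-template",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def render_prompt(messages, base_url=BASE_URL):
    """Send the request and return the rendered prompt string.

    Needs a running server; the "prompt" response key is taken from the
    server README and should be verified against your build.
    """
    request = build_apply_template_request(messages, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["prompt"]
```

Calling render_prompt([{"role": "user", "content": "Hello"}]) against a live server should return the exact string the model would see, which you can compare with the format shown on the model card.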
B. llama-cpp-python workflow
For llama-cpp-python, the workflow is similar:
- load the GGUF
- use chat completion rather than manually concatenating prompts
- let the library choose the chat format automatically unless you know you must override it
- enable verbose=True so you can see which chat format was selected
- set sampling explicitly. (GitHub)
The documented precedence is:
- chat_handler
- chat_format
- tokenizer.chat_template from GGUF metadata
- fallback to llama-2 format. (GitHub)
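The precedence described here can be modelled in a few lines. This is only an illustrative model of the documented order, not the library's actual code; resolve_chat_format and its arguments are hypothetical names.

```python
def resolve_chat_format(chat_handler=None, chat_format=None, gguf_metadata=None):
    """Return (source, value) following the documented precedence.

    Illustrative model only: the real selection happens inside
    llama-cpp-python when the model is loaded.
    """
    if chat_handler is not None:
        return ("chat_handler", chat_handler)
    if chat_format is not None:
        return ("chat_format", chat_format)
    template = (gguf_metadata or {}).get("tokenizer.chat_template")
    if template:
        return ("tokenizer.chat_template", template)
    # Nothing else available: the fallback applies, whether or not the
    # model was actually trained on it.
    return ("fallback", "llama-2")
```

The last branch is the risky one: a model with no embedded template silently ends up on llama-2 formatting unless you pass something explicit.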
That precedence order is important because it tells you exactly where mistakes can creep in.
C. Ollama workflow
For Ollama, the workflow is different because Ollama has a packaging layer called the Modelfile.
If you already have a GGUF, the official workflow is:
- create a Modelfile
- set FROM /path/to/file.gguf
- run ollama create my-model
- run ollama run my-model. (Ollama Documentation)
Ollama’s docs describe the Modelfile as the blueprint for a model and document FROM, PARAMETER, TEMPLATE, SYSTEM, ADAPTER, MESSAGE, and REQUIRES. Ollama also lets you inspect any packaged model with ollama show --modelfile, and the API’s show model details endpoint returns the model’s parameters, template, capabilities, and metadata. (Ollama Documentation)
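A minimal sketch of the API-side inspection follows. It assumes Ollama is listening on its usual local address (http://localhost:11434, an assumption here) and that POST /api/show takes a JSON body with a model field; the function names are hypothetical, and only the prompt-layer fields are kept.

```python
import json
import urllib.request

# Assumed default local Ollama address; change it if yours differs.
OLLAMA_URL = "http://localhost:11434"


def build_show_request(model, base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for /api/show."""
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/show",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_show(payload):
    """Keep only the fields that define the prompt and sampling layer."""
    keys = ("template", "parameters", "capabilities")
    return {key: payload.get(key) for key in keys}


def show_model(model, base_url=OLLAMA_URL):
    """Query a running Ollama instance and return the summary."""
    with urllib.request.urlopen(build_show_request(model, base_url), timeout=30) as response:
        return summarize_show(json.load(response))
```

Running show_model("my-model") before and after an edit gives you a quick record of what actually changed in the packaged model.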
2. The most important principle: the template is usually more important than the knobs
This is the core of the whole topic.
Hugging Face’s chat-template guidance states that using a format different from what the model was trained on will usually cause severe, silent performance degradation. That is the strongest general statement available on this subject, and it matches what people see in practice. (Hugging Face)
So when you ask:
“How do I know whether ChatML or LLaMA format is correct?”
The answer is:
You do not guess. You inspect. (Hugging Face)
Use this order:
- embedded template in GGUF metadata
- model card or official model docs
- backend inspection tools such as ollama show --modelfile or /api/show
- manual override only if needed. (GitHub)
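To check the first item in that order without any backend, you can read the metadata straight from the file. The sketch below is a small standard-library reader written against my understanding of the GGUF v2/v3 header layout (little-endian, key/value pairs before the tensor data); treat that layout as an assumption. The gguf Python package that ships with llama.cpp is probably the more robust choice for real use.

```python
import struct

# Scalar value types (type id -> struct format), per the GGUF spec as assumed here.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    length = _read(f, "<Q")  # strings are length-prefixed with a uint64
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8", errors="replace")


def _read_value(f, value_type):
    if value_type in _SCALARS:
        return _read(f, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        element_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, element_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_gguf_metadata(path, wanted=("tokenizer.chat_template",)):
    """Return the requested metadata keys that are present in the file."""
    wanted = set(wanted)
    found = {}
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        if _read(f, "<I") < 2:
            raise ValueError("GGUF version 1 is not handled by this sketch")
        _read(f, "<Q")  # tensor count, not needed here
        for _ in range(_read(f, "<Q")):  # metadata key/value count
            key = _read_string(f)
            value = _read_value(f, _read(f, "<I"))
            if key in wanted:
                found[key] = value
                if len(found) == len(wanted):
                    break
    return found
```

An empty result for tokenizer.chat_template is itself useful information: it tells you the backend will have to fall back to something, so the model card becomes your source of truth.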
In llama.cpp, llama_chat_apply_template() uses the template stored in tokenizer.chat_template by default. In llama-cpp-python, the same metadata is part of the selection chain. In Ollama, TEMPLATE is an explicit Modelfile instruction, and ollama show --modelfile reveals the packaged version. (GitHub)
3. What the “right” template workflow looks like
Best default rule
Start from the model’s own template. Do not replace it unless you have evidence that you should. (GitHub)
How to verify support for special features
If you need RAG documents or tool calling, you must verify that the template actually supports them. Hugging Face’s docs say many templates simply ignore the documents input, and recommend checking the model card or printing the chat template to see whether the relevant key is present. (Hugging Face)
Backend-specific gotcha
In llama.cpp server, only models with a supported chat template work optimally with /v1/chat/completions, and the server says that by default the ChatML template will be used there. It also supports /apply-template, which is the easiest way to check whether the rendered prompt looks correct before generation. (GitHub)
That means for llama.cpp the safe path is:
- use /v1/chat/completions for chat
- inspect /apply-template when debugging
- only force --chat-template or --chat-template-file when metadata/default behavior is wrong. (GitHub)
4. The right way to build a Modelfile
If you are using Ollama with a GGUF, the best Modelfile is usually minimal, not clever.
A good starting pattern is:
FROM /path/to/model.gguf
PARAMETER num_ctx 8192
PARAMETER temperature 0.8
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
SYSTEM """You are a helpful assistant."""
This follows the official structure of FROM, PARAMETER, and SYSTEM, and avoids rewriting TEMPLATE unless you know the model needs a custom one. Ollama documents that templates are model-specific and use Go template syntax. It also shows how to inspect an existing packaged model and copy its template and stop sequences into a new Modelfile only if you need to. (Ollama Documentation)
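If you generate Modelfiles from scripts, a small renderer keeps them minimal by construction. This is a sketch with a hypothetical function name; it only emits FROM, PARAMETER, SYSTEM, and optionally TEMPLATE, and it leaves TEMPLATE out unless you pass one.

```python
def render_modelfile(gguf_path, parameters=None, system=None, template=None):
    """Render a minimal Modelfile as text.

    List values produce one PARAMETER line per item, which is how
    repeated settings such as stop strings are written.
    """
    lines = [f"FROM {gguf_path}"]
    for name, value in (parameters or {}).items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            lines.append(f"PARAMETER {name} {item}")
    if template is not None:
        # Only include TEMPLATE when you have evidence the model needs it.
        lines.append(f'TEMPLATE """{template}"""')
    if system is not None:
        lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"
```

Writing the result to a file named Modelfile and running ollama create my-model in that directory reproduces the manual workflow.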
When to include TEMPLATE
Only do it if one of these is true:
- the imported GGUF is missing the right prompt wrapper
- you are reproducing a known-good packaged model
- you are working with a model card that explicitly requires a specific format
- you need advanced custom tool or system behavior. (Ollama Documentation)
A good practical rule is:
First inspect a stock model with ollama show --modelfile. Then copy only the parts you actually need. (Ollama Documentation)
5. Recommended parameter values
There is no universal “best” preset. But there are good starting values.
Current documented defaults differ by backend:
| Setting | Ollama Modelfile default | llama.cpp server/cli default | llama-cpp-python default | Practical start |
| --- | --- | --- | --- | --- |
| temperature | 0.8 | 0.8 | 0.8 | 0.8 |
| top_k | 40 | 40 | 40 | 40 |
| top_p | 0.9 | 0.95 | 0.95 | 0.9 to 0.95 |
| min_p | 0.0 | 0.05 | 0.05 | 0.0 to 0.05 |
| repeat_penalty | 1.1 | 1.0 | 1.0 | 1.0 to 1.1 |
| repeat_last_n | 64 | 64 | backend-dependent | 64 |
| num_ctx / n_ctx | 2048 | server: 0 = model metadata | library-config dependent | set explicitly |
These values come directly from the current Ollama Modelfile docs, llama.cpp CLI/server docs, and llama-cpp-python API reference. The main lesson is not “copy one number.” The lesson is backend defaults differ, so set them explicitly when you care about reproducibility. (Ollama Documentation)
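One way to make those differences visible is to hold the documented sampling defaults as data and diff them. The numbers below are copied from the values listed in this section and may drift as the backends change; the function name is hypothetical.

```python
# Sampling defaults as listed in this section; re-check them against current docs.
DOCUMENTED_DEFAULTS = {
    "ollama": {"temperature": 0.8, "top_k": 40, "top_p": 0.9,
               "min_p": 0.0, "repeat_penalty": 1.1},
    "llama.cpp": {"temperature": 0.8, "top_k": 40, "top_p": 0.95,
                  "min_p": 0.05, "repeat_penalty": 1.0},
    "llama-cpp-python": {"temperature": 0.8, "top_k": 40, "top_p": 0.95,
                         "min_p": 0.05, "repeat_penalty": 1.0},
}


def differing_defaults(backend_a, backend_b, table=DOCUMENTED_DEFAULTS):
    """Return {setting: (value_a, value_b)} for settings that differ."""
    a, b = table[backend_a], table[backend_b]
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}
```

Anything this function returns for your two backends is a setting you need to pin explicitly before comparing outputs.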
My recommended starting profiles
General chat
- temperature: 0.8
- top_k: 40
- top_p: 0.9 to 0.95
- min_p: 0.0 to 0.05
- repeat_penalty: 1.0 to 1.1
- repeat_last_n: 64
This is close to the documented defaults and works as a neutral baseline. (Ollama Documentation)
Deterministic evaluation
- fixed seed
- stable template
- explicit context
- avoid creative sampling drift
For correctness checks, the important thing is not “a magic low temperature.” It is using the same seed and the same decoding behavior across runs. llama.cpp maintainers and guides emphasize explicit sampling settings when comparing outputs. (GitHub)
Creative writing
- higher temperature
- possibly higher top_p
- keep repeat controls on
This changes style and diversity, but it is not the right mode for verifying setup correctness. (Ollama Documentation)
6. How much do these parameters affect quality?
Not all knobs matter equally.
Highest impact on correctness
- template
- stop strings
- system prompt
- context size
- model-specific template kwargs or tool/document support. (Hugging Face)
These are the settings that can make a correct model look broken.
Medium impact
The sampling settings (temperature, top_k, top_p, min_p, repeat_penalty, seed) affect diversity, conservatism, repetition, and determinism. They matter, but they usually do not explain catastrophic failures the way a wrong template does.
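To make the effect of the sampling knobs concrete, here is a toy sampler over a hand-written logit table. It is a simplified sketch, not any backend's implementation: the filter order (top_k, then top_p, then min_p, then temperature) is an assumption, real runtimes have more stages, and the token names and logits are made up.

```python
import math
import random


def sample_token(logits, rng, temperature=0.8, top_k=40, top_p=0.95, min_p=0.05):
    """Pick one token from a {token: logit} dict (temperature must be > 0)."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((e / total, tok) for tok, e in exps.items()), reverse=True)

    ranked = ranked[:top_k]  # top_k: keep the k most likely tokens
    kept, mass = [], 0.0
    for prob, tok in ranked:  # top_p: smallest prefix whose mass reaches top_p
        kept.append((prob, tok))
        mass += prob
        if mass >= top_p:
            break
    floor = min_p * kept[0][0]  # min_p: threshold relative to the best token
    kept = [(prob, tok) for prob, tok in kept if prob >= floor]

    # Temperature on the survivors: p ** (1/T) is proportional to exp(logit / T).
    weights = [prob ** (1.0 / temperature) for prob, _ in kept]
    return rng.choices([tok for _, tok in kept], weights=weights, k=1)[0]


def generate(logits, seed, n=8, **settings):
    """Draw n tokens with a private, seeded random generator."""
    rng = random.Random(seed)
    return [sample_token(logits, rng, **settings) for _ in range(n)]
```

With the same seed and the same settings, generate returns the same sequence every time, which is the property you want for correctness checks. Changing any one knob changes the candidate set or the weights, so two runs are only comparable when all of them match.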
Separate but major impact
- quantization level
llama.cpp’s quantization docs say quantization reduces size and can speed inference, but may introduce accuracy loss, typically measured with perplexity and related metrics. Ollama’s import docs say the same tradeoff exists when quantizing models in Ollama. (GitHub)
7. What is the ideal context size?
There is no one ideal number. The right value is:
the smallest context that fully covers your real prompts and retrieved material without truncation. (Ollama Documentation)
Backend-specific rules
Ollama
Ollama’s context-length docs say default context length depends on available VRAM, and also say that for best performance you should use the maximum context length for the model and avoid CPU offload, verifying the actual split with ollama ps. (Ollama Documentation)
That means in Ollama you should not just set num_ctx blindly. You should also check whether the model is still fully on GPU.
llama.cpp server
The server default is --ctx-size 0, which means “load from model.” That is a good starting point because it respects the model metadata. (GitHub)
llama.cpp completion tool
The completion README documents a default context of 4096 for that tool, which is one more reason not to assume all backends behave the same. (GitHub)
Practical rule
- For casual chat: moderate context is fine
- For RAG, coding, agents, and long conversations: set more context explicitly
- Then check memory and offload behavior, not just the number you typed. (Ollama Documentation)
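The "smallest context that covers your real prompts" rule is simple arithmetic once you have token counts. The helper below is a hypothetical sketch: it assumes you measured the counts with the model's own tokenizer, and the rounding step of 1024 is an arbitrary convenience, not a backend requirement.

```python
def pick_context_size(prompt_tokens, retrieved_tokens, max_new_tokens,
                      model_max_ctx, step=1024):
    """Return the smallest context (rounded up to `step`) that fits everything."""
    needed = prompt_tokens + retrieved_tokens + max_new_tokens
    if needed > model_max_ctx:
        raise ValueError(
            f"need {needed} tokens but the model supports only {model_max_ctx}"
        )
    rounded = -(-needed // step) * step  # ceiling division, then scale back up
    return min(rounded, model_max_ctx)
```

For example, a 1200-token prompt with 3000 tokens of retrieved text and a 512-token reply budget needs 4712 tokens, which rounds up to 5120. After setting that value, you would still confirm with your backend's tools that the model stays on the GPU.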
8. Standard, proven configurations
There is no universal gold standard. There are only good baseline patterns.
Pattern 1: minimal-change pattern
Use the model’s own template and stay close to backend defaults. This is the safest first run. (GitHub)
Pattern 2: explicit baseline pattern
Pin the settings you care about:
- template
- stop strings
- system prompt
- context
- temperature
- top_k
- top_p
- repeat_penalty
- seed. (Ollama Documentation)
Pattern 3: quantization workflow pattern
If you quantize yourself, use a high-precision GGUF as the master and quantize from that. llama.cpp warns that requantizing an already-quantized model can severely reduce quality compared with quantizing from 16-bit or 32-bit. (GitHub)
9. Common mistakes
1. Guessing the template
This is the biggest mistake. Hugging Face’s guidance explicitly says the wrong chat format causes silent degradation. (Hugging Face)
2. Trusting a backend fallback without checking
llama-cpp-python will fall back to llama-2 formatting if nothing else is available. That is convenient, but it is not proof that the model was trained on that format. (GitHub)
3. Relying on hidden defaults
Ollama, llama.cpp, and llama-cpp-python do not share identical defaults. If you do not set values explicitly, you are not actually comparing like with like. (Ollama Documentation)
4. Using the wrong endpoint for the job
In llama.cpp server, /v1/chat/completions expects chat-style messages and supported chat templates. If you want to inspect the exact rendered prompt, /apply-template is the correct debugging tool. (GitHub)
5. Assuming tools or RAG documents work just because the model is “chat tuned”
Hugging Face’s docs say many templates ignore documents. The same logic applies to tool formatting: support depends on template and runtime, not just on model branding. (Hugging Face)
6. Requantizing a quantized model
llama.cpp warns this can severely reduce quality. (GitHub)
7. Editing an Ollama model before inspecting it
Ollama’s own docs give you ollama show --modelfile and /api/show. Use them first. There are also issue reports showing imported GGUFs may not always carry the same template and parameter behavior as stock packaged models. (Ollama Documentation)
10. How to match preconfigured models on each backend
This is the practical recipe.
To match a known-good Ollama model
- inspect it with ollama show --modelfile or /api/show
- copy the template, stop strings, and parameter values
- use the same quantization level
- use the same context
- only then compare outputs. (Ollama Documentation)
To match a known-good llama.cpp server setup
- inspect /props for chat_template and default generation settings
- inspect /apply-template for the exact rendered prompt
- keep --ctx-size, sampling, and any chat_template_kwargs fixed
- compare on the same endpoint path. (GitHub)
To match a known-good llama-cpp-python setup
- enable verbose=True
- confirm the selected chat_format
- pin the same n_ctx, seed, and sampling values
- avoid changing chat_format unless metadata/model card says you must. (GitHub)
The equality checklist
For two setups to be a fair comparison, all of these should match:
- same base model
- same quant
- same template
- same stop strings
- same system prompt
- same context
- same sampling
- same seed
- same tool/document formatting behavior. (GitHub)
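The checklist is easy to automate if you record each setup as a dictionary. A minimal sketch, with hypothetical key names that you would adapt to whatever your own notes or configs use:

```python
# Hypothetical key names mirroring the checklist; rename to suit your configs.
CHECKLIST = ("base_model", "quant", "template", "stop", "system",
             "n_ctx", "sampling", "seed", "tool_format")


def compare_setups(setup_a, setup_b, keys=CHECKLIST):
    """Return {key: (a, b)} for every checklist item that does not match."""
    return {
        key: (setup_a.get(key), setup_b.get(key))
        for key in keys
        if setup_a.get(key) != setup_b.get(key)
    }
```

An empty result means the comparison is fair on paper; any entry in the result is a likely explanation for an output difference and should be fixed before you judge quality.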
11. How to benchmark whether the configuration is correct
You need several different tests, not one.
A. Prompt-render correctness test
Before measuring quality, inspect the rendered prompt.
- Ollama: ollama show --modelfile or /api/show
- llama.cpp server: /apply-template
- llama-cpp-python: verbose=True and inspect selected chat format. (Ollama Documentation)
If the prompt wrapper is wrong, every downstream benchmark is misleading.
B. Quality test
For the same model family and tokenizer, use llama-perplexity. The tool docs say it measures how well the model predicts the next token and that lower is better, but also warn that perplexity is not directly comparable across different tokenizers and that finetunes can score worse on perplexity while still producing better human-rated outputs. (GitHub)
So use perplexity for:
- same model family
- same tokenizer
- same backend or close backend
- comparing quants or config changes. (GitHub)
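For intuition about what the number means, perplexity is the exponential of the average negative log-probability per token. The sketch below shows that standard definition on a list of natural-log token probabilities; it is not how llama-perplexity is implemented, and the input values are made up.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of each observed token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_negative_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_negative_logprob)
```

A model that assigns probability 0.25 to every correct token has a perplexity of 4, which you can read as "as uncertain as a fair choice among four options". Because the tokens themselves define the unit, the value only compares cleanly between runs that share a tokenizer.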
C. Speed test
Use llama-bench for llama.cpp. Its README is explicitly a performance testing tool and includes examples for generation speed, prompt processing, thread counts, and GPU offload comparisons. For Ollama, the API returns timing metrics such as total_duration, load_duration, prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration. (GitHub)
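Those Ollama timing fields convert to throughput with a division. The sketch below assumes the durations are reported in nanoseconds, as the Ollama API documentation states; the function name and the sample numbers are hypothetical.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(response):
    """Compute prompt and generation throughput from an Ollama response dict."""
    return {
        "prompt_tps": response["prompt_eval_count"] * NANOSECONDS_PER_SECOND
        / response["prompt_eval_duration"],
        "generation_tps": response["eval_count"] * NANOSECONDS_PER_SECOND
        / response["eval_duration"],
    }
```

Keep load_duration out of the throughput figure: it measures model loading, so including it makes the first request after a cold start look much slower than steady-state generation.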
D. Human A/B test
If you care about perceived quality, do at least a small blind comparison on real prompts. llama.cpp community work on blind quant testing used a Bradley–Terry ranking approach to compare quantized variants by human votes. That is not an official benchmark standard, but it is a good reminder that human preference testing often catches differences that raw throughput numbers do not. (GitHub)
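For a small blind test of your own, a Bradley–Terry fit turns pairwise votes into one strength score per variant. The sketch below uses the standard iterative (minorization-maximization) update; it is not the code used in that community work, and the vote format is a hypothetical choice.

```python
def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths from pairwise preference counts.

    wins[(a, b)] is the number of times a was preferred over b.
    Returns {item: strength}, normalized to sum to 1.
    """
    items = sorted({name for pair in wins for name in pair})
    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            total_wins = sum(n for (winner, _), n in wins.items() if winner == i)
            denominator = 0.0
            for j in items:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    # Guard against two zero-strength items meeting each other.
                    denominator += games / max(strength[i] + strength[j], 1e-12)
            updated[i] = total_wins / denominator if denominator else strength[i]
        norm = sum(updated.values())
        if norm == 0:
            break  # no recorded wins at all; keep the previous estimate
        strength = {item: value / norm for item, value in updated.items()}
    return strength
```

With only two variants the result reduces to the raw win rate, so the method earns its keep when you compare three or more quants and not every pair was judged equally often. A variant that never wins is driven to zero strength, which is a known limitation of the plain fit.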
12. A concrete backend-by-backend baseline
Ollama baseline
Use a minimal Modelfile first.
FROM /path/to/model.gguf
PARAMETER num_ctx 8192
PARAMETER temperature 0.8
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
Then inspect with:
ollama show --modelfile mymodel
That is the most stable starting point because it follows Ollama’s documented structure and keeps the prompt layer simple. (Ollama Documentation)
llama.cpp server baseline
Start with model metadata and one clean slot.
llama-server -m model.gguf -c 0 --alias mymodel
Then inspect prompt rendering with /apply-template before changing templates. If you need tool calling or advanced model-specific Jinja behavior, use the server’s documented chat-template options and tool-calling path. (GitHub)
llama-cpp-python baseline
Start from metadata-driven chat formatting.
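from llama_cpp import Llama

# Load the GGUF; verbose=True logs which chat format was selected.
llm = Llama(
    model_path="model.gguf",
    n_ctx=8192,
    verbose=True,
)

# Use chat completion so the library applies the chat format for you.
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.8,
    top_k=40,
    top_p=0.95,
    repeat_penalty=1.0,
)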
This follows the documented chat-format precedence and keeps the first run close to library defaults. (GitHub)
Final recommendation
If you want one rule to remember, use this:
The correct GGUF configuration workflow is: inspect template → run with explicit context and sampling → benchmark prompt rendering first → benchmark quality and speed second → only then customize. (Hugging Face)
And if you want one practical rule for each backend:
- Ollama: start with a minimal Modelfile and inspect with show before overriding TEMPLATE
- llama.cpp: trust metadata first, use /apply-template, and pin ctx and sampling explicitly
- llama-cpp-python: let metadata select the chat format first, and only override if you have evidence. (Ollama Documentation)