Nemotron-Labs-Diffusion-14B-Q8_0-GGUF

Q8_0 GGUF quantization of nvidia/Nemotron-Labs-Diffusion-14B for use with llama.cpp.

This is a tri-mode language model supporting autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation (diffusion drafting plus AR verification with a shared KV cache). Self-speculation is the mode used for the benchmarks below; NVIDIA reports 2.7x to 3.3x single-user speedups over AR decoding for the 8B model (see Highlights).

Performance

RTX 3090 24GB, Q8_0, llama-server with self-speculation (k=4):

| Prompt type     | Speed (t/s) | Temperature |
|-----------------|-------------|-------------|
| Code            | 50          | 0           |
| Prose           | 42          | 0           |
| Code            | 45          | 0.7         |
| Prose           | 42          | 0.7         |
| Short responses | 126-132     | 0           |

For comparison, NVIDIA's Python reference implementation runs at 66 t/s on the 3B model and runs out of memory (OOM) on the 14B model with 24 GB of VRAM.
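The out-of-memory result is consistent with a rough weight-size estimate. The sketch below counts weights only (no KV cache, activations, or tensors kept in higher precision) and assumes a nominal 14e9 parameters, so treat the numbers as ballpark figures. Q8_0 in GGML stores each block of 32 weights as 32 int8 values plus one fp16 scale, which is 34 bytes per block or 8.5 bits per weight.

```python
# Rough weight-only memory estimate. Ignores KV cache, activations, and any
# tensors stored in higher precision. 14e9 is a nominal parameter count.

Q8_0_BLOCK_WEIGHTS = 32      # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 32 + 2    # 32 int8 values + one fp16 scale


def weight_bytes_q8_0(n_params: float) -> float:
    """Bytes needed to store n_params weights in Q8_0."""
    return n_params / Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES


def weight_bytes_bf16(n_params: float) -> float:
    """Bytes needed to store n_params weights in bfloat16 (2 bytes each)."""
    return n_params * 2


def to_gib(n_bytes: float) -> float:
    return n_bytes / 2**30


n_params = 14e9
print(f"Q8_0 weights: {to_gib(weight_bytes_q8_0(n_params)):.1f} GiB")
print(f"BF16 weights: {to_gib(weight_bytes_bf16(n_params)):.1f} GiB")
```

Under these assumptions the BF16 weights alone exceed 24 GiB, while the Q8_0 weights leave headroom for the KV cache on a 24 GB card.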

Quickstart

Requires buun-llama-cpp (a fork of llama.cpp with diffusion support). Build with CUDA, adjusting the nvcc path to match your CUDA installation:

git clone https://github.com/spiritbuun/buun-llama-cpp
cd buun-llama-cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
  -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc \
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build -j$(nproc)

llama-server (recommended)

./build/bin/llama-server \
  -m Nemotron-Labs-Diffusion-14B-Q8_0.gguf \
  --port 8080 --host 0.0.0.0 \
  -ngl 99 -c 4096 -np 1 -t 10 -fa on \
  --reasoning-format none --reasoning off

The server auto-detects diffusion models and enables self-speculation. Use the standard OpenAI-compatible API:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Write quicksort in Python"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'
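The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080",
                       max_tokens=512, temperature=0.7):
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "nemotron",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant's reply text."""
    req = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumes the standard OpenAI-style response shape.
    return body["choices"][0]["message"]["content"]


# Requires the llama-server instance above to be running:
# print(chat("Write quicksort in Python", temperature=0))
```

Building the request separately from sending it makes it easy to inspect the payload without a running server.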

Options

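| Flag / parameter                        | Description                                                        |
|-----------------------------------------|--------------------------------------------------------------------|
| -ngl 99                                 | Offload all layers to GPU                                          |
| -c 4096                                 | Context size                                                       |
| -np 1                                   | Number of parallel slots (1 recommended for max speed)             |
| -fa on                                  | Flash attention                                                    |
| --reasoning-format none --reasoning off | Clean output without think tags                                    |
| temperature (request parameter)         | 0 = greedy (fastest), 0.3-0.7 = good variety, 1.0 = max variety    |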

CLI

./build/bin/llama-diffusion \
  -m Nemotron-Labs-Diffusion-14B-Q8_0.gguf \
  -ngl 99 -c 4096 -fa on \
  --diffusion-self-spec --diffusion-draft-length 4 \
  -p "Write a Python function to merge two sorted arrays" \
  -n 512

Notes

  • Temperature is supported via rejection sampling in the self-speculation pipeline. temp=0 is pure argmax (fastest). Higher temperatures reduce acceptance rates slightly.
  • The model benefits from greedy or low-temperature sampling. NVIDIA's reference implementation uses temp=0.
  • Loop detection is built in: if the model enters a repetitive pattern, generation stops cleanly.
  • Multi-turn conversations are supported with automatic prompt history cleanup.
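To illustrate the first note, here is a sketch of the standard speculative-sampling verification rule: a drafted token x is accepted with probability min(1, p(x)/q(x)), where q is the draft distribution and p the verifier distribution; on rejection, a replacement is drawn from the renormalized residual max(p - q, 0). Whether the fork implements exactly this rule is an assumption, and the function names and dict-based distributions are hypothetical. At temp=0 the check reduces to comparing draft tokens with the verifier's argmax.

```python
import random


def verify_greedy(draft_tokens, target_argmax):
    """temp=0: keep the draft prefix that matches the verifier's argmax.

    At the first mismatch the verifier's token replaces the draft token
    and verification stops.
    """
    accepted = []
    for drafted, target in zip(draft_tokens, target_argmax):
        if drafted != target:
            accepted.append(target)
            break
        accepted.append(drafted)
    return accepted


def verify_sampled(draft_tokens, q_probs, p_probs, rng=random):
    """temp>0: standard speculative-sampling accept/reject step.

    q_probs[i] and p_probs[i] map token -> probability for the draft and
    verifier distributions at position i.
    """
    accepted = []
    for tok, q, p in zip(draft_tokens, q_probs, p_probs):
        ratio = p.get(tok, 0.0) / q[tok]   # q[tok] > 0 because tok was drafted
        if rng.random() < min(1.0, ratio):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0), renormalized
        # implicitly by rng.choices.
        residual = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
        tokens, weights = zip(*residual.items())
        accepted.append(rng.choices(tokens, weights=weights)[0])
        break
    return accepted
```

As temperature rises, the draft and verifier distributions tend to agree less often on the sampled token, which is one way to read the slightly lower acceptance rates mentioned above.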

The remainder of this card is reproduced from NVIDIA's original model card.


Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with a shared KV cache, achieving high acceptance lengths and decoding efficiency. Because switching modes only requires changing the attention pattern, a single model can run efficiently at different concurrency levels across varying deployment scenarios.
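One way to picture "switching the attention pattern" is as swapping the attention mask. The sketch below contrasts a causal mask with a block-wise mask in which positions inside the same block attend to each other in both directions and to all earlier blocks. The block-wise layout is an assumption borrowed from common block-diffusion designs (the `block_length` argument in the Python example further down hints at something similar); the card does not specify the actual pattern.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (AR mode)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_mask(n, block_length):
    """Assumed diffusion-mode pattern: bidirectional attention inside a block,
    causal attention across blocks."""
    return [
        [(j // block_length) <= (i // block_length) for j in range(n)]
        for i in range(n)
    ]


def show(mask):
    """Render a mask as rows of 'x' (may attend) and '.' (masked)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(show(causal_mask(4)))
print()
print(show(block_mask(4, 2)))
```

With `block_length=1` the block-wise mask collapses to the causal mask, which shows how a single set of weights could serve both modes.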

Highlights

  • SOTA 3B, 8B, 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation with a focus on decode efficiency.
  • Generation moves from a memory-bound regime toward a compute-bound regime: model weights are loaded once and reused to compute multiple tokens during generation.
  • Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches:
    • 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
    • 5.9x tokens per forward over Qwen3-8B (no MTP) with the same accuracy.
  • Real-device speed-up across platforms:
    • DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
    • GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost to 1015 tok/sec (4x).
  • Diffusion speedup-of-light analysis shows that throughput can be further doubled (vs. current best) for a single user with better sampling - future research.
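The speed-up factors in the list above can be re-derived from the quoted throughput figures. This is plain arithmetic on numbers copied from the bullets; nothing here is newly measured.

```python
# Throughput figures (tok/sec) copied from the highlights above:
# (faster mode, AR baseline).
figures = {
    "DGX Spark, 8B, w4a16": (112.0, 41.8),
    "GB200, 8B": (850.0, 253.0),
    "GB200, 8B, custom CUDA kernels": (1015.0, 253.0),
}


def speedup(new, baseline):
    """Ratio of two throughputs."""
    return new / baseline


for name, (fast, ar) in figures.items():
    print(f"{name}: {speedup(fast, ar):.2f}x over AR")

# Comparison against the Eagle3 figure on GB200.
print(f"GB200, 8B vs. Eagle3: {speedup(850.0, 360.0):.2f}x")
```

The computed ratios are about 2.68x, 3.36x, and 4.01x, so the quoted 2.7x and 4x match after rounding and the quoted 3.3x sits slightly below the computed value.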

License/Terms of Use

Use of this model is governed by the NVIDIA Nemotron Open Model License.

Chat with Our Model (Python, original weights)

from transformers import AutoModel, AutoTokenizer
import torch

repo_name = "nvidia/Nemotron-Labs-Diffusion-14B"

tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

history = []

user_input = input("User: ").strip()
history.append({"role": "user", "content": user_input})

prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')

# The three calls below are alternatives. Run as written, each one overwrites
# out_ids and nfe, so only the last result is decoded; keep the mode you want.

# Chat in AR mode
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)

# Chat in diffusion LM (dLM) mode
out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id)

# Chat in linear self-speculation mode
out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)

tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
print(f"Model: {tokenized_out}")
print(f"[Num Function Eval (NFE)={nfe}]")
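The NFE value printed above can be turned into an average tokens-per-forward-pass figure, the metric the highlights use to compare decoding modes. This is a small sketch with a hypothetical helper name; it assumes `nfe` is an integer count of forward passes and that the output sequence includes the prompt, as the slicing in the example suggests. Padding after an early EOS would inflate the result.

```python
def tokens_per_forward(num_prompt_tokens: int, num_total_tokens: int, nfe: int) -> float:
    """Average number of new tokens produced per forward pass."""
    if nfe <= 0:
        raise ValueError("nfe must be positive")
    new_tokens = num_total_tokens - num_prompt_tokens
    return new_tokens / nfe


# Made-up numbers for illustration: 512 new tokens in 128 forward passes.
print(tokens_per_forward(20, 532, 128))
```

With the example above this would be called as `tokens_per_forward(prompt_ids.shape[1], out_ids.shape[1], nfe)`; plain AR decoding should come out close to 1.0, and higher values indicate more tokens accepted per pass.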

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Citations

@techreport{fu2026nemotronlabsdiffusion,
  title       = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
  author      = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
  institution = {NVIDIA},
  year        = {2026},
  note        = {Technical report}
}