Instructions for using spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with libraries and local apps.

Libraries

How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with llama-cpp-python:

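# !pip install llama-cpp-python

from llama_cpp import Llama

# Download the GGUF file from the Hub (cached after the first run) and load it.
llm = Llama.from_pretrained(
	repo_id="spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF",
	filename="nemotron-diffusion-14b-Q8_0.gguf",
)

response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

# The result is an OpenAI-style dict; print only the assistant's reply.
print(response["choices"][0]["message"]["content"])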

Local Apps

llama.cpp

How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0
# Run inference directly in the terminal:
llama-cli -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0
# Run inference directly in the terminal:
llama-cli -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

Use Docker

docker model run hf.co/spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

vLLM

How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with Ollama:
```
ollama run hf.co/spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0
```

Unsloth Studio

How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF to start chatting

Pi

How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

Run Hermes

hermes

Docker Model Runner
How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with Docker Model Runner:
```
docker model run hf.co/spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0
```

Lemonade

How to use spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF:Q8_0

Run and chat with the model

lemonade run user.Nemotron-Labs-Diffusion-14B-Q8_0-GGUF-Q8_0

List all available models

lemonade list

Nemotron-Labs-Diffusion-14B-Q8_0-GGUF

Q8_0 GGUF quantization of nvidia/Nemotron-Labs-Diffusion-14B for use with llama.cpp.
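
As a rough guide to how much memory the weights need, the sketch below estimates the size of a Q8_0 file. It relies on the Q8_0 block layout (32 int8 weights plus one fp16 scale, so 34 bytes per 32 weights) and assumes a nominal 14 billion parameters; the exact parameter count and any tensors stored in other types are not accounted for.

```python
# Rough size estimate for Q8_0 weights.
# Q8_0 stores weights in blocks of 32: 32 int8 values plus one fp16 scale,
# i.e. 34 bytes per 32 weights (8.5 bits per weight).
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 32 * 1 + 2


def q8_0_size_bytes(n_params: float) -> float:
    """Approximate bytes needed to store n_params weights in Q8_0."""
    return n_params / BLOCK_WEIGHTS * BLOCK_BYTES


# Nominal "14B"; the exact parameter count is an assumption.
n_params = 14e9
size = q8_0_size_bytes(n_params)
print(f"{size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Under these assumptions the weights come to a little under 15 GB, which leaves some headroom for the KV cache and activations on a 24GB card.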

This is a tri-mode language model supporting autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation (diffusion drafting plus AR verification with a shared KV cache). Self-speculation is faster than standard AR decoding; measured throughput is listed under Performance.

Performance

RTX 3090 24GB, Q8_0, llama-server with self-speculation (k=4):

Prompt type	Temperature	Speed (tokens/s)
Code	0	50
Prose	0	42
Code	0.7	45
Prose	0.7	42
Short responses	0	126-132

For comparison, NVIDIA's Python reference implementation runs at 66 t/s on the 3B model and runs out of memory (OOM) on the 14B model with 24GB of VRAM.
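
To check throughput on other hardware, a small timing helper is enough. The sketch below is illustrative only: `timed` and `tokens_per_second` are hypothetical helper names, and it assumes you can obtain the number of generated tokens from your client (for example from an OpenAI-style `usage.completion_tokens` field, if the server reports one).

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed(fn):
    """Call fn() and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# At the 50 t/s quoted for code prompts, a 512-token reply takes about 10 s.
print(f"{512 / 50:.2f} s")
```

Wall-clock timing includes prompt processing and HTTP overhead, so short replies will read lower than decode-only figures.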

Quickstart

Requires buun-llama-cpp (fork of llama.cpp with diffusion support). Build with CUDA:

git clone https://github.com/spiritbuun/buun-llama-cpp
cd buun-llama-cpp
# CMAKE_CUDA_COMPILER below is this machine's nvcc; adjust the path for your CUDA install.
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
  -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc \
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build -j$(nproc)

llama-server (recommended)

./build/bin/llama-server \
  -m Nemotron-Labs-Diffusion-14B-Q8_0.gguf \
  --port 8080 --host 0.0.0.0 \
  -ngl 99 -c 4096 -np 1 -t 10 -fa on \
  --reasoning-format none --reasoning off

The server auto-detects diffusion models and enables self-speculation. Use the standard OpenAI-compatible API:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron",
    "messages": [{"role": "user", "content": "Write quicksort in Python"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'
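
The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the server from the previous step is listening on `localhost:8080`, and `build_chat_request` and `chat` are illustrative helper names, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the server command


def build_chat_request(prompt, max_tokens=512, temperature=0.7, base_url=BASE_URL):
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "nemotron",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write quicksort in Python"))` prints the reply; pass `temperature=0` for greedy decoding.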

Options

Flag or parameter	Description
`-ngl 99`	Offload all layers to GPU
`-c 4096`	Context size in tokens
`-np 1`	Number of parallel slots (1 recommended for max speed)
`-t 10`	Number of CPU threads
`-fa on`	Enable flash attention
`--reasoning-format none --reasoning off`	Clean output without think tags
`temperature` (request field)	0 = greedy (fastest), 0.3-0.7 = good variety, 1.0 = max variety

CLI

./build/bin/llama-diffusion \
  -m Nemotron-Labs-Diffusion-14B-Q8_0.gguf \
  -ngl 99 -c 4096 -fa on \
  --diffusion-self-spec --diffusion-draft-length 4 \
  -p "Write a Python function to merge two sorted arrays" \
  -n 512

Notes

- Temperature is supported via rejection sampling in the self-speculation pipeline. temp=0 is pure argmax (fastest). Higher temperatures reduce acceptance rates slightly.
- The model benefits from greedy or low-temperature sampling. NVIDIA's reference implementation uses temp=0.
- Loop detection is built in: if the model enters a repetitive pattern, generation stops cleanly.
- Multi-turn conversations are supported with automatic prompt history cleanup.
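
For intuition on the temperature note, here is a sketch of the standard speculative-sampling acceptance rule: a drafted token `x` is accepted with probability `min(1, p[x] / q[x])`, and on rejection a replacement is drawn from the normalized residual `max(0, p - q)`. This is the textbook rule, shown under the assumption that the fork does something similar; it is not taken from the fork's source, and the function names are illustrative.

```python
import random


def accept_or_resample(x, p, q, rng=random):
    """One step of the standard speculative-sampling acceptance rule.

    x: token id proposed by the drafter (assumed sampled from q, so q[x] > 0)
    p: verifier probabilities, a list indexed by token id
    q: drafter probabilities, same indexing
    Returns (token, accepted).
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Rejected: draw from the part of p that q under-covers.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


def expected_acceptance(p, q):
    """Probability that a token drawn from q is accepted: sum of min(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))
```

At temperature 0 both distributions are one-hot, so the rule reduces to an exact argmax match. At higher temperatures the two distributions overlap less than perfectly, which is why the acceptance rate drops.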

The remainder of this card is reproduced from NVIDIA's original model card.

Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with a shared KV cache, achieving high acceptance lengths and decoding efficiency. Because switching modes only requires changing the attention pattern, a single model can run efficiently at different concurrency levels across deployment scenarios.
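
The draft-then-verify loop can be illustrated with a toy greedy version. This sketch is not NVIDIA's implementation: `draft_block` and `ar_next` are hypothetical callables standing in for the diffusion drafting pass and the AR verification pass, and verification is written as a per-token loop for clarity, whereas a real model checks all drafted positions in one forward pass.

```python
from typing import Callable, List


def self_speculate_greedy(
    prompt: List[int],
    draft_block: Callable[[List[int], int], List[int]],
    ar_next: Callable[[List[int]], int],
    k: int,
    max_new_tokens: int,
) -> List[int]:
    """Greedy draft-and-verify: keep drafted tokens while they match AR decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, k)  # k tokens proposed in parallel
        accepted: List[int] = []
        for tok in draft:
            target = ar_next(out + accepted)  # what AR decoding would emit here
            if tok != target:
                accepted.append(target)  # keep the verifier's token, then stop
                break
            accepted.append(tok)
        if not accepted:  # empty draft: fall back to one AR step
            accepted.append(ar_next(out))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens]


# Toy models: the "AR model" counts upward mod 10; the drafter gets the
# third token of every block wrong on purpose.
def toy_ar_next(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int], k: int) -> List[int]:
    block = [(seq[-1] + 1 + i) % 10 for i in range(k)]
    if k >= 3:
        block[2] = (block[2] + 5) % 10
    return block


result = self_speculate_greedy([0], toy_draft, toy_ar_next, k=4, max_new_tokens=8)
print(result)
```

Each round here yields three tokens (two accepted drafts plus one correction) for one verification step, and the output is identical to what plain greedy AR decoding would produce.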

Highlights

- SOTA 3B, 8B, and 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation decoding, with a focus on decoding efficiency.
- Generation moves from a memory-bound regime toward a compute-bound regime: model weights are loaded once and reused to compute multiple tokens during generation.
- Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to multi-token prediction (MTP) approaches:
  - 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
  - 5.9x tokens per forward pass over Qwen3-8B (no MTP) with the same accuracy.
- Real-device speed-up across platforms:
  - DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
  - GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost this to 1015 tok/sec (4x).
- Diffusion speed-of-light analysis shows that single-user throughput could be further doubled (vs. the current best) with better sampling, which is left to future research.
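
The quoted speed-up factors can be checked directly from the throughput figures above. The snippet below only divides the numbers stated in the list; the labels are shorthand for this example.

```python
def speedup(fast: float, baseline: float) -> float:
    """Ratio of two throughputs in tokens per second."""
    return fast / baseline


# (faster tok/sec, baseline tok/sec), taken from the list above.
figures = {
    "DGX Spark, self-speculation vs. AR": (112, 41.8),
    "GB200, self-speculation vs. AR": (850, 253),
    "GB200, self-speculation vs. Eagle3": (850, 360),
    "GB200, custom kernels vs. AR": (1015, 253),
}

for label, (fast, base) in figures.items():
    print(f"{label}: {speedup(fast, base):.2f}x")
```

The ratios come out to roughly 2.68, 3.36, 2.36, and 4.01, in line with the rounded 2.7x, 3.3x, and 4x figures quoted.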

License/Terms of Use

Use of this model is governed by the NVIDIA Nemotron Open Model License.

Chat with Our Model (Python, original weights)

from transformers import AutoModel, AutoTokenizer
import torch

repo_name = "nvidia/Nemotron-Labs-Diffusion-14B"

# trust_remote_code is required because the generation methods live in the model repo.
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

history = []

user_input = input("User: ").strip()
history.append({"role": "user", "content": user_input})

prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')

# The three calls below are alternatives. Each returns (out_ids, nfe) and overwrites
# the previous result, so keep only the mode you want; as written, the last one is printed.

# Chat in AR mode
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)

# Chat in dLM (diffusion) mode
out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id)

# Chat in linear self-speculation mode
out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)

# Strip the prompt tokens and decode only the newly generated part.
reply = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
print(f"Model: {reply}")
print(f"[Num Function Eval (NFE)={nfe}]")
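
The NFE count makes it easy to compare modes by tokens produced per forward pass. The helper below is a small sketch; it assumes `nfe` counts model forward passes and that the number of new tokens is `out_ids.shape[1] - prompt_ids.shape[1]`, matching the slicing used in the script. `tokens_per_forward` is an illustrative name.

```python
def tokens_per_forward(num_new_tokens: int, nfe: int) -> float:
    """Average number of generated tokens per model forward pass."""
    if nfe <= 0:
        raise ValueError("nfe must be positive")
    return num_new_tokens / nfe


# Hypothetical numbers: 512 new tokens produced in 128 forward passes.
print(tokens_per_forward(512, 128))
```

Plain AR decoding produces one token per forward pass, so a value near 1.0 is the baseline, and higher values indicate more parallel decoding. If the output is padded after the end-of-sequence token, count only the tokens up to it.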

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Citations

@techreport{fu2026nemotronlabsdiffusion,
  title       = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
  author      = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
  institution = {NVIDIA},
  year        = {2026},
  note        = {Technical report}
}

Downloads last month: 1,524

Format: GGUF
Model size: 14B params
Architecture: dream
Quantization: 8-bit (Q8_0)

Model tree for spiritbuun/Nemotron-Labs-Diffusion-14B-Q8_0-GGUF

Base model: nvidia/Nemotron-Labs-Diffusion-14B
Quantized (1): this model