Nemotron-Labs-Diffusion-14B-Q8_0-GGUF
Q8_0 GGUF quantization of nvidia/Nemotron-Labs-Diffusion-14B for use with llama.cpp.
This is a tri-mode language model supporting AR decoding, diffusion-based parallel decoding, and self-speculation (diffusion drafting + AR verification with shared KV cache). The self-speculation mode achieves significant speedups over standard AR decoding.
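The draft-then-verify idea behind self-speculation can be illustrated with a toy greedy loop. This is a minimal sketch, not the model's or the fork's implementation: `draft_fn` and `next_token_fn` are hypothetical stand-ins for diffusion drafting and AR verification, and a real implementation would score all drafted positions in one forward pass instead of calling the verifier once per token.

```python
def self_speculate_step(context, draft_fn, next_token_fn, k=4):
    """One greedy draft-and-verify step.

    draft_fn(context, k)   -> list of k proposed tokens (stand-in for diffusion drafting)
    next_token_fn(context) -> token that AR decoding would emit next (stand-in for verification)
    Returns the tokens to append to the context; always at least one.
    """
    draft = draft_fn(context, k)
    accepted = []
    for token in draft:
        expected = next_token_fn(context + accepted)
        if token != expected:
            # First mismatch: keep the AR token instead and stop this step.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    return accepted


# Toy usage: the "AR model" deterministically spells out a fixed sequence.
TARGET = list("abcdefgh")

def toy_next_token(context):
    return TARGET[len(context)]

def toy_good_draft(context, k):
    return TARGET[len(context):len(context) + k]

def toy_bad_draft(context, k):
    draft = toy_good_draft(context, k)
    draft[2] = "?"  # corrupt the third drafted token
    return draft

print(self_speculate_step([], toy_good_draft, toy_next_token))
print(self_speculate_step([], toy_bad_draft, toy_next_token))
```

Because every kept token equals what the verifier would have produced, the output of this greedy variant matches plain AR decoding; the gain comes from accepting several tokens per step when the draft is good.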
Performance
Measured on an RTX 3090 (24 GB) with the Q8_0 model, running llama-server with self-speculation (k=4):
| Prompt type | Speed | Temperature |
|---|---|---|
| Code | 50 t/s | 0 |
| Prose | 42 t/s | 0 |
| Code | 45 t/s | 0.7 |
| Prose | 42 t/s | 0.7 |
| Short responses | 126-132 t/s | 0 |
For comparison, NVIDIA's Python reference implementation runs at 66 t/s on the 3B model and runs out of memory (OOM) on the 14B model with 24 GB of VRAM.
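To put the table in practical terms, throughput converts to wall-clock time by dividing the token count by tokens per second. The sketch below uses the figures from the table; the 512-token response length is an assumption chosen for illustration, and prompt processing time is ignored.

```python
# Throughput figures (tokens/second) copied from the table above,
# keyed by (prompt type, temperature).
SPEEDS_TPS = {
    ("code", 0.0): 50,
    ("prose", 0.0): 42,
    ("code", 0.7): 45,
    ("prose", 0.7): 42,
}


def generation_seconds(num_tokens, tokens_per_second):
    """Estimate decode time, ignoring prompt processing and network overhead."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# 512 tokens is an illustrative response length, not a benchmark setting.
for (kind, temp), tps in SPEEDS_TPS.items():
    print(f"{kind:5s} temp={temp}: {generation_seconds(512, tps):.1f} s")
```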
Quickstart
Requires buun-llama-cpp, a fork of llama.cpp with diffusion support. Build it with CUDA, adjusting the nvcc path below to match your system:
git clone https://github.com/spiritbuun/buun-llama-cpp
cd buun-llama-cpp
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
-DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc \
-DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build -j$(nproc)
llama-server (recommended)
./build/bin/llama-server \
-m Nemotron-Labs-Diffusion-14B-Q8_0.gguf \
--port 8080 --host 0.0.0.0 \
-ngl 99 -c 4096 -np 1 -t 10 -fa on \
--reasoning-format none --reasoning off
The server auto-detects diffusion models and enables self-speculation. Use the standard OpenAI-compatible API:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nemotron",
"messages": [{"role": "user", "content": "Write quicksort in Python"}],
"max_tokens": 512,
"temperature": 0.7
}'
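The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is reachable at localhost:8080 and returns the usual OpenAI-style response shape. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

# Assumption: llama-server from the previous step is listening here.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, max_tokens=512, temperature=0.7):
    """Build the same POST request as the curl example above."""
    payload = {
        "model": "nemotron",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
#   print(chat("Write quicksort in Python"))
```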
Options
| Flag | Description |
|---|---|
| -ngl 99 | Offload all layers to GPU |
| -c 4096 | Context size |
| -np 1 | Number of parallel slots (1 recommended for max speed) |
| -fa on | Flash attention |
| --reasoning-format none --reasoning off | Clean output without think tags |
| temperature | 0 = greedy (fastest), 0.3-0.7 = good variety, 1.0 = max variety |
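When launching the server from a script, it can help to assemble these flags programmatically. A small sketch follows, assuming the binary path and model filename from the Quickstart; the function name `server_command` and its parameters are illustrative, and only flags that appear in this card are used.

```python
import shlex


def server_command(model_path, port=8080, host="0.0.0.0", gpu_layers=99,
                   context=4096, slots=1, threads=10,
                   binary="./build/bin/llama-server"):
    """Assemble the llama-server argument list from the flags shown above."""
    return [
        binary,
        "-m", model_path,
        "--port", str(port), "--host", host,
        "-ngl", str(gpu_layers),   # layers offloaded to the GPU
        "-c", str(context),        # context size
        "-np", str(slots),         # parallel slots
        "-t", str(threads),
        "-fa", "on",               # flash attention
        "--reasoning-format", "none", "--reasoning", "off",
    ]


cmd = server_command("Nemotron-Labs-Diffusion-14B-Q8_0.gguf")
print(shlex.join(cmd))
# To launch the server: subprocess.run(cmd, check=True)
```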
CLI
./build/bin/llama-diffusion \
-m Nemotron-Labs-Diffusion-14B-Q8_0.gguf \
-ngl 99 -c 4096 -fa on \
--diffusion-self-spec --diffusion-draft-length 4 \
-p "Write a Python function to merge two sorted arrays" \
-n 512
Notes
- Temperature is supported via rejection sampling in the self-speculation pipeline. temp=0 is pure argmax (fastest). Higher temperatures reduce acceptance rates slightly.
- The model benefits from greedy or low-temperature sampling. NVIDIA's reference implementation uses temp=0.
- Loop detection is built in: if the model enters a repetitive pattern, generation stops cleanly.
- Multi-turn conversations are supported with automatic prompt history cleanup.
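The first note mentions rejection sampling. The fork's code is not reproduced here, so the sketch below shows the standard speculative-sampling acceptance rule as an illustration, not the fork's actual implementation: a drafted token is kept with probability min(1, p/q), and on rejection a replacement is drawn from the renormalized residual max(0, p - q). All function names and the example logits are assumptions.

```python
import math
import random


def probs_from_logits(logits, temperature):
    """Turn logits into probabilities; temperature 0 means one-hot argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def accept_or_resample(token, target_probs, draft_probs, rng):
    """Verify one drafted token; returns (token_to_emit, was_accepted)."""
    accept_prob = min(1.0, target_probs[token] / draft_probs[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: sample from the part of the target that the draft under-covered.
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    return rng.choices(range(len(residual)), weights=residual, k=1)[0], False


rng = random.Random(0)
target = probs_from_logits([2.0, 1.0, 0.0], temperature=0.7)
draft = probs_from_logits([1.5, 1.5, 0.0], temperature=0.7)
drafted = rng.choices(range(3), weights=draft, k=1)[0]
print(accept_or_resample(drafted, target, draft, rng))
```

At temperature 0 both distributions are one-hot, so the rule reduces to "accept if the drafted token is the verifier's argmax, otherwise emit the argmax". At higher temperatures the draft and target distributions overlap less sharply, which is consistent with the slightly lower acceptance rates noted above.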
The remainder of this card is reproduced from NVIDIA's original model card.
Model Overview
Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with shared KV cache, achieving high acceptance lengths and decoding efficiency. The seamless mode switching by simply changing attention patterns enables high efficiency at different concurrency levels in varying deployment scenarios with one single model.
Highlights
- A SOTA family of 3B, 8B, and 14B dense LMs (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation decoding, with a focus on decoding efficiency.
- Generation moves from a memory-bound regime toward a compute-bound regime. Model weights are loaded once and reused to compute multiple tokens during generation.
- Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches:
  - 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
  - 5.9x tokens per forward over Qwen3-8B (no MTP) with the same accuracy.
- Real-device speed-up across platforms:
  - DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
  - GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost this to 1015 tok/sec (4x).
- Diffusion speedup-of-light analysis shows that single-user throughput can be further doubled (vs. the current best) with better sampling; this is left to future research.
License/Terms of Use
Use of this model is governed by the NVIDIA Nemotron Open Model License.
Chat with Our Model (Python, original weights)
from transformers import AutoModel, AutoTokenizer
import torch
repo_name = "nvidia/Nemotron-Labs-Diffusion-14B"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_name, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)
history = []
user_input = input("User: ").strip()
history.append({"role": "user", "content": user_input})
prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device='cuda')
# The three calls below are alternatives: each one overwrites out_ids and nfe,
# so keep only the decoding mode you want to run.
## Chat in AR Mode
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)
## Chat in dLM Mode
out_ids, nfe = model.generate(prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=tokenizer.eos_token_id)
## Chat in Linear Self-Speculation Mode
out_ids, nfe = model.linear_spec_generate(prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=tokenizer.eos_token_id)
tokenized_out = tokenizer.batch_decode(out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True)[0]
print(f"Model: {tokenized_out}")
print(f"[Num Function Eval (NFE)={nfe}]")
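The NFE value printed above can be turned into an average tokens-per-forward figure, the metric quoted in the Highlights. This is a small sketch under the assumption that NFE counts model forward passes and that out_ids includes the prompt tokens (as the slicing in the script suggests); the helper name and the example numbers are illustrative.

```python
def tokens_per_forward(num_new_tokens, nfe):
    """Average number of generated tokens per model forward pass."""
    if nfe <= 0:
        raise ValueError("nfe must be positive")
    return num_new_tokens / nfe


# With the variables from the script above, this would be:
#   tokens_per_forward(out_ids.shape[1] - prompt_ids.shape[1], int(nfe))
# Illustrative numbers only: 512 new tokens produced in 128 forward passes.
print(tokens_per_forward(512, 128))
```

Plain AR decoding emits one token per forward pass, so values above 1 indicate how much work each pass is amortizing.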
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Citations
@techreport{fu2026nemotronlabsdiffusion,
title = {Nemotron-Labs-Diffusion: A Tri-Mode Language Model Unifying Autoregressive, Diffusion, and Self-Speculation Decoding},
author = {Yonggan Fu and Lexington Whalen and Abhinav Garg and Chengyue Wu and Maksim Khadkevich and Nicolai Oswald and Enze Xie and Daniel Egert and Sharath Turuvekere Sreenivas and Shizhe Diao and Chenhan Yu and Ye Yu and Weijia Chen and Sajad Norouzi and Shiyi Lan and Ligeng Zhu and Jin Wang and Jindong Jiang and Morteza Mardani and Mehran Maghoumi and Song Han and Ante Jukic and Nima Tajbakhsh and Jan Kautz and Pavlo Molchanov},
institution = {NVIDIA},
year = {2026},
note = {Technical report}
}