Instructions to use Qwen/Qwen3.6-35B-A3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3.6-35B-A3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.6-35B-A3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.6-35B-A3B")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.6-35B-A3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Qwen/Qwen3.6-35B-A3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3.6-35B-A3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.6-35B-A3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Qwen/Qwen3.6-35B-A3B

SGLang

How to use Qwen/Qwen3.6-35B-A3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3.6-35B-A3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.6-35B-A3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3.6-35B-A3B" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use Qwen/Qwen3.6-35B-A3B with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3.6-35B-A3B
```

My RTX 3090 ran out of excuses: Qwen3.6-35B-A3B

#37

by Kukedlc - opened Apr 18

Discussion

Kukedlc

Apr 18

edited Apr 18

I've tested every model that has come out since GPT-2. No academic benchmarks, no MMLU, no HumanEval. My benchmark is the 24GB of VRAM on my RTX 3090 and the real tasks I need to solve day to day as a data scientist. Every time a new model dropped, I'd download it, run it, throw my tasks at it, and always end up with the same feeling: cool, but not enough. Years of this.

Qwen3.6-35B-A3B with Unsloth's Q3 quant takes 23GB of VRAM, runs at 120 tok/s, and is the first model to saturate my benchmark. I have about ten different skills I throw at every model I test. Full Power BI dashboards using Microsoft's MCP server with a custom piece of mine for chart generation: nailed it. Causal inference tasks: nailed it. Interactive benchmarks where it has to iterate on what it sees on screen: nailed it. Multi-step web search with cross-constraints: nailed it. I've been running it for three days through OpenCode and so far it hasn't let me down on anything I've thrown at it. Too early to call it a daily driver, but the first impression is stronger than anything I've tested locally before.

To be clear, it's not magic. For several tasks I had to reinforce prompts, adapt my skills to its level of comprehension, and build tools to cover gaps that models like Claude Opus solve one-shot without blinking. But that's exactly what's interesting: the distance between "needs adaptation" and "can't do it" is massive, and this model is firmly on the right side of that line. It's the first time with an open source model that I feel the bottleneck is me writing better prompts and not the model failing to understand what I'm asking. After years of testing everything that came out and enduring the frustration of models that promised a lot in papers and delivered little in the terminal, getting to this point running offline on my desk feels like a point of no return.

Kudos to the Qwen team and Unsloth for making this happen on consumer hardware. My llama.cpp config for anyone wanting to replicate:

llama-server \
  --model Qwen3.6-35B-A3B-UD-Q3_K_M.gguf \
  -ngl 999 -fa on --no-mmap \
  -c 262144 -n 32768 --no-context-shift \
  --jinja --reasoning-format deepseek --reasoning-budget 4096 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --presence-penalty 0.0 \
  --cache-type-k bf16 --cache-type-v bf16 \
  --port 8181
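
For anyone who wants to talk to that server from a script, here is a minimal sketch of a client. It assumes the server from the config above is running locally and relies on llama-server's OpenAI-compatible /v1/chat/completions route. The port 8181 and the sampling values come from the config; the host, the helper names, and the expectation that the reply carries a separate reasoning_content field (because of --reasoning-format deepseek) are my assumptions, not something the post confirms.

```python
import json
import urllib.request

# Port 8181 comes from the config above; "localhost" is an assumption.
BASE_URL = "http://localhost:8181"


def build_chat_request(prompt, max_tokens=512):
    """Build an OpenAI-style chat payload mirroring the sampling flags above."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=600):
    """Send one prompt and return (reasoning, answer) as reported by the server."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    message = body["choices"][0]["message"]
    # "reasoning_content" is assumed to be present only when the server
    # splits thinking from the answer; .get() keeps this safe otherwise.
    return message.get("reasoning_content"), message.get("content")
```

Calling chat("Summarize this CSV schema: ...") would return the thinking text (or None) and the final answer. Nothing is sent until chat() is called.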

ulymp

Apr 19

May I ask why you are not using Q4 quants and the kv cache quantization set to q8_0? In my experience the q8 KV cache quantization doesn't bring any quality loss, but saves quite a lot of memory.
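
To put a rough number on the memory side of that question, here is a back-of-the-envelope sketch. The byte sizes are the storage formats themselves: bf16 uses 2 bytes per value, and q8_0 stores blocks of 32 int8 values plus one 2-byte scale (34 bytes per 32 values). The layer and head counts below are HYPOTHETICAL placeholders, not the real Qwen3.6-35B-A3B dimensions, so only the ratio should be taken at face value.

```python
# Bytes per cached element for the two cache types discussed here.
BYTES_PER_ELEMENT = {
    "bf16": 2.0,       # 16-bit float
    "q8_0": 34 / 32,   # 32 int8 values + one 2-byte scale per block
}


def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_type):
    """Size of the K and V caches in GiB for a given context length."""
    elements = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx  # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# HYPOTHETICAL dimensions, chosen only to make the arithmetic concrete.
dims = dict(n_ctx=262144, n_attn_layers=10, n_kv_heads=2, head_dim=256)

bf16 = kv_cache_gib(cache_type="bf16", **dims)
q8 = kv_cache_gib(cache_type="q8_0", **dims)
print(f"bf16: {bf16:.2f} GiB, q8_0: {q8:.2f} GiB, ratio: {q8 / bf16:.3f}")
```

Whatever the real dimensions are, the ratio is fixed by the formats: a q8_0 cache takes a little over half the memory of a bf16 one.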

pokexpert

Apr 19

My setup is q4_k_m and rotorquant with planar3/turbo3. 262k at q4 power + opencode. Speed is not the best there is, but it's there.

n_tokens = 128963
prompt eval time =    1694.57 ms /  1955 tokens (    0.87 ms per token,  1153.68 tokens per second)
       eval time =  229320.74 ms /  7501 tokens (   30.57 ms per token,    32.71 tokens per second)
      total time =  231015.32 ms /  9456 tokens
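
If you want to compare runs like this one across setups, a small parser helps. The sketch below recomputes tokens per second from the millisecond and token counts in the timing lines above; the regex and the function name are my own, and it assumes the log keeps this exact "time = X ms / N tokens" shape.

```python
import re

# Timing lines copied from the log above.
LOG = """\
prompt eval time =    1694.57 ms /  1955 tokens (    0.87 ms per token,  1153.68 tokens per second)
       eval time =  229320.74 ms /  7501 tokens (   30.57 ms per token,    32.71 tokens per second)
      total time =  231015.32 ms /  9456 tokens
"""

# Captures the stage name, elapsed milliseconds, and token count.
TIMING = re.compile(
    r"^\s*(prompt eval|eval|total) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens",
    re.MULTILINE,
)


def tokens_per_second(log):
    """Return {stage: tokens per second} recomputed from the raw counts."""
    rates = {}
    for stage, millis, tokens in TIMING.findall(log):
        rates[stage] = int(tokens) / (float(millis) / 1000.0)
    return rates


rates = tokens_per_second(LOG)
for stage, rate in rates.items():
    print(f"{stage}: {rate:.2f} tok/s")
```

The "eval" rate is the generation speed, which is the number to compare against the 120 tok/s quoted in the opening post (measured at a much shorter context).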

ghostwithahat

Apr 19

edited Apr 19

Another 3090 user here:

llama-server \
  --min-p 0.0 \
  --jinja \
  --chat-template-file /opt/models/Qwen3.6-35B-A3B-heretic/chat_template.jinja \
  --cache-type-k turbo4 \
  --cache-type-v turbo4 \
  --threads 16 \
  --flash-attn on \
  --model /opt/models/Qwen3.6-35B-A3B-heretic/Qwen3.6-35B-A3B-heretic.IQ4_NL.gguf \
  --ctx-size 262144 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --repeat-penalty 1.0 \
  --repeat-last-n 256 \
  --perf

For normal "chatting", I like the big dense gemma 4 better. But Qwen3.6 seems to work better for agentic use.
BTW: I had a lot of deadlocks with hermes-agent on Qwen3.6. I had to set config.memory.nudge_interval=0 and config.memory.flush_min_turns=0 to fix it.

owao

Apr 19

edited Apr 19

https://unsloth.ai/docs/models/qwen3.6#unsloth-gguf-benchmarks

Using UD-Q5_K_XL (also on a 3090, 131K context, ~75 t/s at 10K and ~65 t/s at 120K), I feel exactly the same! The user prompt sometimes needs to be a bit more explicit, but once the model is on its rails, it goes all the way to the final destination!
Just wanted to add that --chat-template-kwargs '{"preserve_thinking":true}' has been beneficial for me on autonomous agentic tasks (with temperature 0.6 and no presence penalty). Give it a try!

Congrats to the team who designed the training, really great job.

owao

Apr 19

edited about 1 month ago

Also, as a tip: when using llama.cpp with MoE models, you can go with much higher quants (Q5_XL is 26.6GB). The trick is not to set n-gpu-layers (as you of course would for a dense model for max speed), and to let llama.cpp do its own weight-offloading optimization instead. Just set your --ctx-size and that's all. You will get good speed even when the GGUF size exceeds the VRAM size.
I also just discovered that you can set how much VRAM should remain free for other applications, for example --fit-target 2048 for 2GB. The default is 1024, which I found a bit short: you can hit OOM if you happen to use additional VRAM after llama.cpp has done its optimization and loaded the model. So either load the model when you know you won't need any additional VRAM later, or set a margin in advance.
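
A minimal sketch of that launch recipe as a Python helper, for anyone scripting their setup. It only uses flags that appear in this thread (--model, --ctx-size, --fit-target, --port) and deliberately leaves n-gpu-layers unset. The GGUF file name and the helper name are hypothetical, and I have not checked which llama.cpp builds accept --fit-target, so treat this as an illustration rather than a tested command.

```python
import shlex


def llama_server_args(model_path, ctx_size, free_vram_mib=2048, port=8181):
    """Build a llama-server command that leaves layer offloading to llama.cpp."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        # VRAM (in MiB) to keep free for other applications, per the tip above.
        "--fit-target", str(free_vram_mib),
        "--port", str(port),
        # No --n-gpu-layers on purpose: the offloading is left to llama.cpp.
    ]


# Hypothetical file name; substitute the GGUF you actually downloaded.
args = llama_server_args("Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf", ctx_size=131072)
print(shlex.join(args))
```

Passing the list to subprocess.run(args) would start the server. Raising free_vram_mib is the knob to turn if other applications grab VRAM after the model has loaded.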

owao

Apr 19

About the new KV cache performance: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150833469
As @ulymp mentioned, q8 has now greatly improved. AIME25 eval: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357

kingjamz

29 days ago

I just bought a second-hand 3090 this year, so thanks for all the tips and strategies for maximizing performance with large models. I have been using quants for a while: my intro to AI was on an A1000 6GB laptop, and quants were the only thing I could run. When I got the other 18GB of VRAM I thought "just throw the full model at it and wait if needed", but clearly from these posts that's what an amateur would do! Time for me to refine the workflow to maximize performance. My PC also has 64GB of DDR4 memory. Granted, newer memory would also help, but I don't have a thousand bucks for each stick of memory.
Thanks for this post, it makes me want to try harder and not settle.
