My RTX 3090 ran out of excuses: Qwen3.6-35B-A3B
I've been testing every model that has come out since GPT-2. No academic benchmarks, no MMLU, no HumanEval. My benchmark is the 24GB of VRAM on my RTX 3090 and the real tasks I need to solve day to day as a data scientist. Every time a new model dropped, I'd download it, run it, throw my tasks at it, and always end up with the same feeling: cool but not enough. Years of this.
Qwen3.6-35B-A3B with Unsloth's Q3 quant takes 23GB of VRAM, runs at 120 tok/s, and it's the first one that saturated my benchmark. I have about ten different skills I throw at every model I test. Full Power BI dashboards using Microsoft's MCP server with a custom piece of mine for chart generation: nailed it. Causal inference tasks: nailed it. Interactive benchmarks where it has to iterate on what it sees on screen: nailed it. Multi-step web search with cross-constraints: nailed it. I've been running it for three days through OpenCode and so far it hasn't let me down on anything I've thrown at it. Too early to call it a daily driver, but the first impression is stronger than anything I've tested locally before.
To be clear, it's not magic. For several tasks I had to reinforce prompts, adapt my skills to its level of comprehension, and build tools to cover gaps that models like Claude Opus solve one-shot without blinking. But that's exactly what's interesting: the distance between "needs adaptation" and "can't do it" is massive, and this model is firmly on the right side of that line. It's the first time with an open source model where I feel like the bottleneck is me writing better prompts and not the model failing to understand what I'm asking. After years of testing everything that came out and enduring the frustration of models that promised a lot in papers and delivered little in the terminal, getting to this point running offline on my desk feels like a point of no return.
Kudos to the Qwen team and Unsloth for making this happen on consumer hardware. My llama.cpp config for anyone wanting to replicate:
llama-server \
  --model Qwen3.6-35B-A3B-UD-Q3_K_M.gguf \
  -ngl 999 -fa on --no-mmap \
  -c 262144 -n 32768 --no-context-shift \
  --jinja --reasoning-format deepseek --reasoning-budget 4096 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --presence-penalty 0.0 \
  --cache-type-k bf16 --cache-type-v bf16 \
  --port 8181
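For anyone who wants to script against this setup rather than go through OpenCode, here is a minimal sketch of a client, assuming the server above is running on the same machine and reachable on port 8181. llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint; the helper names `build_chat_request` and `chat` are made up for illustration, and the sampling values simply mirror the config above.

```python
import json
import urllib.request

# Assumption: the llama-server from the config above is listening locally on 8181.
BASE_URL = "http://localhost:8181"


def build_chat_request(prompt, max_tokens=512, temperature=0.6, top_p=0.95):
    """Build an OpenAI-style chat completion payload for a single user turn."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def chat(prompt, base_url=BASE_URL, timeout=600):
    """POST the prompt to the server and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs the server to be up):
# print(chat("Explain what a mixture-of-experts model is in two sentences."))
```

The request building is kept separate from the network call so the payload can be inspected or logged without a running server.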
May I ask why you are not using Q4 quants and the kv cache quantization set to q8_0? In my experience the q8 KV cache quantization doesn't bring any quality loss, but saves quite a lot of memory.
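To put a rough number on "quite a lot of memory", here is a back-of-the-envelope sketch. The per-element sizes are the standard ones (2 bytes for bf16/f16; q8_0 packs 32 int8 values plus a 2-byte scale into 34 bytes). The layer count, KV head count, and head dimension below are hypothetical placeholders, not this model's real architecture, so read the actual values from your GGUF metadata before trusting the absolute figures.

```python
# Bytes per cached value for a few llama.cpp cache types.
# q8_0 stores blocks of 32 int8 values plus one 2-byte scale: 34 bytes / 32 values.
BYTES_PER_ELEMENT = {"f16": 2.0, "bf16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_type):
    """Estimate the K+V cache size in GiB for one sequence of n_ctx tokens."""
    # Factor 2 covers both the K and the V tensors.
    elements = 2 * n_ctx * n_attn_layers * n_kv_heads * head_dim
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical architecture values, for illustration only.
shape = dict(n_ctx=262144, n_attn_layers=10, n_kv_heads=2, head_dim=256)

bf16 = kv_cache_gib(cache_type="bf16", **shape)
q8 = kv_cache_gib(cache_type="q8_0", **shape)
print(f"bf16: {bf16:.2f} GiB, q8_0: {q8:.2f} GiB, saved: {1 - q8 / bf16:.0%}")
```

Whatever the real architecture is, the ratio is fixed by the formats: q8_0 needs 34/64 of the bf16 footprint, so the cache shrinks by a bit under half.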
My setup is q4_k_m and rotorquant with planar3/turbo3. 262k at q4 power + opencode. Speed is not the best there is, but it's there:
n_tokens = 128963
prompt eval time = 1694.57 ms / 1955 tokens ( 0.87 ms per token, 1153.68 tokens per second)
eval time = 229320.74 ms / 7501 tokens ( 30.57 ms per token, 32.71 tokens per second)
total time = 231015.32 ms / 9456 tokens
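If you want to compare runs like this across configs, a small parser helps. This is a sketch that assumes the timing lines look exactly like the ones pasted above; `parse_timing` is a made-up helper name, and it recomputes tokens per second from the raw milliseconds and token counts rather than trusting the rounded figure in parentheses.

```python
import re

# Matches lines such as:
#   eval time = 229320.74 ms / 7501 tokens ( ... )
TIMING_RE = re.compile(
    r"^\s*(?P<label>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens"
)


def parse_timing(line):
    """Return (label, tokens_per_second) for a timing line, or None if no match."""
    match = TIMING_RE.match(line)
    if match is None:
        return None
    milliseconds = float(match.group("ms"))
    tokens = int(match.group("tokens"))
    return match.group("label").strip(), tokens / (milliseconds / 1000.0)


log = """\
n_tokens = 128963
prompt eval time = 1694.57 ms / 1955 tokens ( 0.87 ms per token, 1153.68 tokens per second)
eval time = 229320.74 ms / 7501 tokens ( 30.57 ms per token, 32.71 tokens per second)
total time = 231015.32 ms / 9456 tokens
"""

for line in log.splitlines():
    parsed = parse_timing(line)
    if parsed:
        label, tokens_per_second = parsed
        print(f"{label}: {tokens_per_second:.2f} tokens/s")
```

The "eval" line is the generation speed most people quote; "prompt eval" is prompt processing, which is why it is so much faster.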
Another 3090 user here:
llama-server \
  --min-p 0.0 \
  --jinja \
  --chat-template-file /opt/models/Qwen3.6-35B-A3B-heretic/chat_template.jinja \
  --cache-type-k turbo4 \
  --cache-type-v turbo4 \
  --threads 16 \
  --flash-attn on \
  --model /opt/models/Qwen3.6-35B-A3B-heretic/Qwen3.6-35B-A3B-heretic.IQ4_NL.gguf \
  --ctx-size 262144 \
  --n-gpu-layers 99 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --repeat-penalty 1.0 \
  --repeat-last-n 256 \
  --perf
For normal "chatting", I like the big dense Gemma 4 better. But Qwen3.6 seems to work better for agentic use.
BTW: I had a lot of deadlocks with hermes-agent on Qwen3.6. I had to set config.memory.nudge_interval=0 and config.memory.flush_min_turns=0 to fix it.
https://unsloth.ai/docs/models/qwen3.6#unsloth-gguf-benchmarks
Using UD-Q5_K_XL (also on a 3090, 131K context, ~75 t/s at 10K and ~65 t/s at 120K), I feel exactly the same! It sometimes needs a more explicit user prompt, but once it's on its rails, it goes all the way to the final destination!
Just wanted to add that --chat-template-kwargs '{"preserve_thinking":true}' has been beneficial for me on autonomous agentic tasks (with temperature 0.6 and no presence penalty). Give it a try!
Congrats to the team who designed the training, really great job.
Also, as a tip: when using llama.cpp with MoE models, you can go with much higher quants (Q5_XL is 26.6GB). The trick is to not set n-gpu-layers (as you of course would for a dense model for max speed) and let llama.cpp do its own weight-offloading optimization instead. Just set your --ctx-size and that's all. You will get good speed even when the GGUF size exceeds the VRAM size.
I also just discovered that you can set how much VRAM should remain free for other applications with --fit-target, for example --fit-target 2048 for 2GB. The default is 1024, which I found a bit tight: you can hit OOM if you eat up additional VRAM after llama.cpp has done its optimization and loaded the model. So either load the model when you know you won't need any additional VRAM later, or set a margin in advance.
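As a sketch of the tip above, here is one way to assemble such a launch command from a script. It uses only the flags mentioned in this thread (`--model`, `--ctx-size`, `--fit-target`, `--port`) and deliberately leaves out `--n-gpu-layers`. The helper name `llama_server_args` and the model filename are hypothetical; check `llama-server --help` on your build, since the fit options only exist in recent versions.

```python
import shlex


def llama_server_args(model_path, ctx_size, fit_target_mib=None, port=8181):
    """Build a llama-server command line that leaves weight placement to llama.cpp."""
    args = [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--port", str(port),
    ]
    # No --n-gpu-layers on purpose: llama.cpp decides what goes to VRAM.
    if fit_target_mib is not None:
        # VRAM to keep free for other applications, in MiB (2048 = 2GB).
        args += ["--fit-target", str(fit_target_mib)]
    return args


# Hypothetical filename; substitute the GGUF you actually downloaded.
cmd = llama_server_args("Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf", 131072, fit_target_mib=2048)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run` would start the server; printing it first makes it easy to copy the exact command into a terminal.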
About the new kv cache performance: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150833469
As @ulymp mentioned, q8 has now greatly improved. AIME25 eval: https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4150413357
I just bought a second-hand 3090 this year. Thanks for all the tips and strategies for maximizing performance with large models. I have been using quants for a while: my intro to AI was on an A1000 6GB laptop, where quants were the only thing I could run. When I got the other 18GB of VRAM I thought "just throw the full model at it and wait if needed", but clearly from these posts that's what an amateur would do! Time for me to refine the workflow to maximize performance. My PC also has 64GB of DDR4 memory. Granted, newer memory would also help, but I don't have a thousand bucks for each stick of memory.
Thanks for this post, it makes me want to try harder and not settle.