Practical match for 128GB Strix Halo with 2x3090s? (inference for coding)

GotLLMs · May 13, 2026, 3:20pm

Hi guys! I have a 128GB Strix Halo (Minisforum, bought out of curiosity) and a gaming PC with a 3090 / Ryzen 5 8400F / 16+16GB DDR5-6000 RAM on a basic single-GPU mobo.

I love using the Halo for experiments at my coding job, but having two PCs in addition to the work laptop is too much for me. I’m thinking of selling the Halo and upgrading my PC, since:

- power draw and size are not big issues for me

- I’d like to have a PC that’s more fixable and faster for occasional gaming than the Halo

- Would be nice to run small models for simple tasks super fast

- Halo devices are very much in demand where I live and I feel I can sell it for profit while the hype is on and depreciation hasn’t happened

What should I buy apart from another 3090 to match the Halo for bigger models? I know that for small models that fit into 48GB, dual 3090s will be blazing fast compared to the Halo, but what do I need for 80B/120B models?

I was thinking the starting point is an ASRock Z690 Taichi (it can do PCIe x8/x8) with an i7/i9 CPU and 64+ GB of dual-channel DDR5. Can this setup’s fast VRAM + slow RAM match the ‘mid-speed’ shared memory of the Halo?

I chose the Z690 Taichi + i7/i9 because it’s inexpensive and everyday apps/games will run fast on these CPUs, compared to older Threadrippers like the 2950X that are slower than my current Ryzen 5 8400F in single-core. Is it realistic to match the Halo with this setup? Or do I need a mobo with 4+ channel DDR4/5 and x16 PCIe slots, and another 3090?

Also you can talk me out of selling the Halo if using it with the 3090 brings some amazing benefits unavailable with the imagined PC build.

Thank you!

John6666 · May 14, 2026, 12:25pm

Hmm… For now, unlike a single powerful GPU, multi-GPU setups are more complex to use. Depending on the LLM backend, they may not deliver the performance you expect. Additionally, depending on the design of your current PC case, you’ll need to consider heat dissipation and where to install the components:

Practical answer: 128GB Strix Halo vs 2×RTX 3090 for local LLM coding inference

My short answer:

A 2×RTX 3090 desktop is a very realistic upgrade if your real workload is fast local coding inference with 7B–70B models, plus occasional 80B experiments.

It is not a clean replacement for a 128GB Strix Halo box if your real target is dense 100B/120B-class models with useful context length.

The reason is simple but important:

2×RTX 3090 gives you 48GB of extremely fast but split VRAM. Strix Halo gives you up to 128GB of much slower but much more forgiving shared memory.

That distinction matters more than CPU choice, motherboard branding, or theoretical TFLOPS.

1. The core tradeoff

Your proposed desktop would be much faster when the model fits well in GPU memory. But it does not behave like a 128GB unified-memory machine.

Practical comparison

Workload	2×RTX 3090 desktop	128GB Strix Halo
7B–14B coding models	Excellent; one 3090 is already enough	Works, but usually slower
30B–34B coding models	Excellent daily-use zone	Works
70B Q4	Strong target for dual 3090s	Works, usually slower
80B Q4	Possible, but tight; context/KV cache matter a lot	One of the Halo’s better use cases
Dense 100B/120B Q4	Not a clean fit in 48GB VRAM	Much more forgiving
Long-context coding	Good until KV cache eats VRAM	More forgiving due to capacity
Gaming	Far better	Impressive for an APU, but not comparable
CUDA ecosystem	Strong	Not the same experience
Repairability	Standard desktop parts	Worse
Power/heat/noise	Much worse	Much better
“Just load huge quants and experiment”	More tuning required	Better

So the decision is not “which machine is faster?” The 3090 desktop is faster when it fits.

The real decision is:

Do you mostly want speed for models that fit in 24–48GB VRAM, or do you want capacity/flexibility for models that do not?

2. What Strix Halo is good at

AMD lists the Ryzen AI Max+ 395 with:

up to 128GB memory,
256-bit LPDDR5x memory,
up to LPDDR5x-8000 memory speed,
Radeon 8060S integrated graphics,
40 GPU cores,
up to 2900MHz graphics frequency.

Source: AMD Ryzen AI Max+ 395 official specs

Independent Strix Halo LLM testing has reported around 215GB/s observed GPU memory bandwidth from a 256GB/s theoretical LPDDR5x-8000 memory subsystem.

Source: Strix Halo LLM benchmark discussion/results

That bandwidth is much lower than RTX 3090 VRAM bandwidth, but the point of Strix Halo is not peak bandwidth. Its value is that it gives you a relatively large shared memory pool in a compact box.

That makes it good for:

large GGUF experiments,
80B-class quants,
some 100B/120B-class low-bit experiments,
long-context tests,
“will this load?” experimentation,
running a large local model without building a hot dual-GPU tower.

The Halo’s weakness is also clear:

it is slower than real high-end VRAM,
ROCm/Vulkan/AMD local-LLM workflows can be more limited than CUDA,
gaming is not comparable to a 3090 desktop,
repairability and upgradeability are worse,
you cannot just add another GPU or change the memory later.

3. What 2×RTX 3090 is good at

NVIDIA lists RTX 3090 as a 24GB GDDR6X Ampere card with 10,496 CUDA cores.

Source: NVIDIA RTX 3090 official specs

Board-partner specs commonly list RTX 3090 with:

24GB GDDR6X,
384-bit memory bus,
around 936GB/s memory bandwidth,
high board power,
optional NVLink support on many 3090 cards.

Example source: Gigabyte RTX 3090 Gaming OC specs

Two 3090s give you:

24GB VRAM on GPU 0,
24GB VRAM on GPU 1,
48GB total installed VRAM,
very high bandwidth per card,
CUDA,
strong gaming performance,
much better software support for many inference stacks.

But two 3090s do not automatically become one transparent 48GB GPU.

That means dual 3090s are excellent when:

the model can be split cleanly,
the model and KV cache mostly stay in VRAM,
the backend supports multi-GPU well,
PCIe layout is sane,
cooling and power are handled properly.

They are less excellent when:

the model spills heavily to CPU RAM,
the context is long enough to blow up KV cache,
the frontend hides important split/offload knobs,
one card is thermally throttling,
the second GPU is behind a weak chipset link,
you expect dense 120B Q4 to behave like it is on a 128GB accelerator.

4. Why 48GB VRAM is not the same as 128GB shared memory

This is the most important point.

A 2×3090 desktop has two memory tiers:

Fast tier: 24GB + 24GB GDDR6X VRAM.
Slow tier: system DDR5 RAM.

The slow tier can help you load/offload models, but once inference depends heavily on system RAM, the 3090 advantage drops sharply.

Intel’s i7-13700K spec lists:

2 memory channels,
up to DDR5-5600 official memory speed,
89.6GB/s max memory bandwidth.

Source: Intel Core i7-13700K specifications

That is perfectly fine for a gaming/dev desktop, but it is far below RTX 3090 VRAM bandwidth and also below the reported Strix Halo GPU-accessible memory bandwidth.
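The three bandwidth figures being compared here all come from the same arithmetic: bus width in bytes times transfers per second. A minimal sketch, assuming the bus widths and data rates below (the 128-bit dual-channel DDR5 width and the 19500 MT/s GDDR6X rate are added assumptions, not figures quoted in this thread):

```python
def peak_bandwidth_gbs(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000  # MB/s -> GB/s

# Assumptions: dual-channel DDR5 is treated as a 128-bit bus, and the RTX 3090's
# GDDR6X is taken to run at 19500 MT/s (19.5 Gbps per pin) on its 384-bit bus.
systems = {
    "dual-channel DDR5-5600": peak_bandwidth_gbs(128, 5600),
    "Strix Halo LPDDR5x-8000": peak_bandwidth_gbs(256, 8000),
    "RTX 3090 GDDR6X": peak_bandwidth_gbs(384, 19500),
}

for name, gbs in systems.items():
    print(f"{name}: {gbs:.1f} GB/s")
```

These are theoretical peaks. Observed numbers, such as the roughly 215GB/s reported for Strix Halo, come in lower.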

So the desktop does not behave like:

48GB fast VRAM + 128GB RAM = 176GB medium-speed model memory.

It behaves more like:

48GB fast accelerator memory, then a big performance cliff when you spill.

That is why 2×3090 is attractive for 70B Q4, borderline for 80B Q4, and not a clean dense-120B solution.
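The cliff can be illustrated with a deliberately crude model. It assumes token generation is purely memory-bound, that every weight byte of a dense model is read once per token, and that layers run one after another (layer split), so the VRAM tier contributes one card's bandwidth at a time. The 44GB usable-VRAM budget and the model sizes are hypothetical.

```python
VRAM_GBS = 936.0  # per-card RTX 3090 bandwidth quoted above
RAM_GBS = 89.6    # dual-channel DDR5-5600 figure quoted above

def decode_ceiling_tok_s(model_gb: float, vram_budget_gb: float) -> float:
    """Rough upper bound on tokens/s for a dense model split between VRAM and system RAM."""
    in_vram = min(model_gb, vram_budget_gb)
    in_ram = model_gb - in_vram
    # Time per token is the sum of the time to stream each tier once.
    seconds_per_token = in_vram / VRAM_GBS + in_ram / RAM_GBS
    return 1.0 / seconds_per_token

# Hypothetical sizes: one model that fits in the budget, one that spills 26 GB to RAM.
for model_gb in (40.0, 70.0):
    ceiling = decode_ceiling_tok_s(model_gb, vram_budget_gb=44.0)
    print(f"{model_gb:.0f} GB model: ceiling ~{ceiling:.1f} tok/s")
```

Real throughput will be lower than these ceilings, but the shape is the point: under these assumptions the spilled portion dominates the time per token even though it is the smaller part of the model.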

5. Model-size reality

Quantization reduces memory use by storing weights in lower precision. Hugging Face’s Transformers docs describe quantization as a way to reduce memory and compute cost using lower-precision weights/activations, and note support for AWQ, GPTQ, and 8-bit/4-bit bitsandbytes quantization.

Source: Hugging Face Transformers quantization docs

But inference memory is not only model weights. You also need memory for:

KV cache,
activations,
framework/runtime buffers,
temporary prompt-processing memory,
fragmentation/allocator overhead.

BentoML’s LLM memory guide explicitly calls out model weights, activations, and KV cache as major runtime memory consumers.

Source: BentoML: GPU memory and LLM inference

Rough planning table

These are not exact numbers. They are planning estimates.

Model size	FP16/BF16 weights	8-bit weights	Ideal 4-bit weights	Practical Q4-ish estimate	2×3090 judgment
7B	~14GB	~7GB	~3.5GB	~4–6GB	Easy
14B	~28GB	~14GB	~7GB	~8–12GB	Easy
30B–34B	~60–68GB	~30–34GB	~15–17GB	~18–24GB	Good
70B	~140GB	~70GB	~35GB	~38–45GB	Good target
80B	~160GB	~80GB	~40GB	~44–52GB	Borderline
120B	~240GB	~120GB	~60GB	~66–80GB+	Not clean
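The weight-only columns in the table follow from parameters times bits per weight. A small sketch of that arithmetic, assuming 1GB = 10^9 bytes and ignoring everything except the weights (real quant formats carry scales and metadata, which is part of why the practical column is higher):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only size in GB; ignores KV cache, activations and runtime buffers."""
    return params_billion * bits_per_weight / 8

for size in (7, 14, 34, 70, 80, 120):
    fp16, int8, int4 = (weight_gb(size, bits) for bits in (16, 8, 4))
    print(f"{size}B: FP16 ~{fp16:.0f} GB, 8-bit ~{int8:.0f} GB, ideal 4-bit ~{int4:.1f} GB")
```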

The key part is not the weight-only size. The key part is the remaining VRAM after weights.

For coding, context length often matters a lot because prompts may include:

multiple files,
stack traces,
logs,
failing tests,
previous conversation,
dependency details,
generated patches.

A model that fits at 4k context may become unpleasant at 16k or 32k because the KV cache grows.
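KV cache growth is linear in context length, so it is easy to estimate. The sketch below uses the usual formula (keys plus values, per layer, per KV head) with a hypothetical 70B-class configuration using grouped-query attention: 80 layers, 8 KV heads, head dimension 128. Those numbers are assumptions for illustration; check the actual model config before relying on them.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_tokens: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size in GiB: 2 (K and V) x layers x KV heads x head dim x bytes, per token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * ctx_tokens / 2**30

# Hypothetical 70B-class model with grouped-query attention, FP16 cache.
for ctx in (4096, 16384, 32768):
    size = kv_cache_gib(n_layers=80, n_kv_heads=8, head_dim=128, ctx_tokens=ctx)
    print(f"{ctx} tokens: ~{size:.2f} GiB of KV cache")
```

With about 38–45GB already taken by 70B Q4 weights, a cache of this size at 32k context is what pushes a 48GB setup over the edge. Setting `bytes_per_value=1.0` approximates an 8-bit cache.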

6. Model-by-model practical judgment

7B–14B

A 3090 desktop wins easily.

One RTX 3090 is already enough. The second card is mainly useful for:

running another model simultaneously,
keeping embeddings/rerankers/vision on another GPU,
running a game or other GPU task separately,
experimenting without evicting the daily model.

For this model size, Strix Halo is convenient but not the speed winner.

30B–34B

This may be the best daily local-coding zone.

A strong 30B/32B/34B coding model can be more useful in real work than a slow giant model. On a 3090 desktop, this class can be fast enough for interactive use and high-quality enough for code review, debugging, refactoring, and test generation.

If your daily work is coding, this size range may matter more than 120B.

70B Q4

This is the best argument for 2×3090.

A 70B Q4 model is large enough to benefit from the second GPU, but usually not so large that 48GB is hopeless. If weights and KV cache stay mostly in VRAM, dual 3090s should beat Strix Halo by a lot.

Expected behavior:

4k context: likely good.
8k context: likely practical.
16k context: backend/quant/KV-cache settings matter.
32k context: be cautious.
GGUF/llama.cpp: practical.
vLLM/SGLang: attractive if using supported model formats.
Ollama: convenient, but less transparent.

If your real target is fast 70B Q4 coding inference, selling the Halo and building the desktop is rational.

80B Q4

This is the dangerous middle.

An 80B Q4 model can be close enough to 48GB that small details decide everything:

Q4 format,
KV cache precision,
context length,
backend overhead,
split strategy,
batch size,
whether the frontend hides useful knobs.

My practical take:

2×3090 can be an 80B Q4 machine, but I would not call it a comfortable 80B Q4 machine until you test your exact models and context lengths.

This is where the Halo’s value becomes obvious. The Halo may be slower, but it gives you more room to be sloppy.

Dense 100B/120B Q4

This is where dual 3090s stop being a clean replacement.

A dense 120B Q4 model generally wants more than 48GB once you include realistic overhead and KV cache. You can still experiment with:

lower-bit quants,
shorter context,
CPU offload,
MoE models,
llama.cpp tuning,
vLLM/SGLang where supported,
more GPUs,
larger-VRAM professional cards.

But if your goal is:

“I want to casually load dense 120B Q4 models and use them with useful coding context.”

Then 2×3090 is the wrong comfort zone.

The Halo is not necessarily fast here, but it is more forgiving.

MoE models

Mixture-of-Experts models are a separate case.

A huge-looking MoE model may activate only part of the total parameter count per token. That can make some large MoE models much more practical than dense models with the same total parameter count.

But MoE performance is backend-sensitive. Expert placement, active parameters, CPU offload, and runtime support matter a lot. Do not judge MoE models purely by the headline parameter count.

7. Motherboard: is Z690 Taichi good enough?

Yes, electrically it is a reasonable dual-3090 platform.

ASRock lists the Z690 Taichi with:

support for 12th/13th/14th gen Intel Core processors,
4 DDR5 DIMM slots,
dual-channel DDR5,
two PCIe 5.0 x16 physical slots,
support for x16 or x8/x8 operation on the primary PCIe slots,
an additional chipset-connected long slot.

Source: ASRock Z690 Taichi specifications

The important part is the x8/x8 support for the two main GPU slots.

For two RTX 3090s, this is much better than a board where the second long slot is only chipset x4.

PCIe x8/x8 is acceptable

RTX 3090 is PCIe 4.0. PCIe 4.0 x8 is not the same as on-card VRAM bandwidth, but it is a reasonable consumer dual-GPU topology.
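For scale, here is the link arithmetic, assuming the standard PCIe 4.0 figures of 16 GT/s per lane with 128b/130b encoding (these come from the PCIe specification, not from this thread) and ignoring protocol overhead:

```python
def pcie_gbs(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Per-direction PCIe throughput in GB/s, before protocol overhead."""
    return gt_per_s * encoding / 8 * lanes

x8 = pcie_gbs(16, 8)    # PCIe 4.0 x8, the x8/x8 dual-GPU case
x16 = pcie_gbs(16, 16)  # PCIe 4.0 x16, single-GPU case

print(f"PCIe 4.0 x8:  ~{x8:.2f} GB/s per direction")
print(f"PCIe 4.0 x16: ~{x16:.2f} GB/s per direction")
print(f"RTX 3090 VRAM (936 GB/s) is ~{936 / x8:.0f}x faster than the x8 link")
```

This is why the link matters mostly for traffic that has to cross it (model loading, inter-GPU activations, anything offloaded to system RAM), and much less for a model that stays resident in VRAM.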

The bigger problems are usually:

VRAM capacity,
backend split behavior,
GPU-to-GPU communication,
cooling,
power,
card spacing.

Physical spacing matters more than people expect

Two 3090s can be physically difficult.

Check:

card thickness,
slot spacing,
whether the top card can breathe,
case bottom clearance,
front-panel connector conflicts,
PSU cable routing,
NVLink bridge spacing if you care,
memory junction temperatures under sustained load.

A dual-GPU build can look valid on paper and still be unpleasant because the top card suffocates.

8. CPU: i7 or i9?

I would not overspend on the CPU before fixing GPU, RAM, PSU, and cooling.

A modern i7/i9 is fine. The i7-13700K class is already strong for:

gaming,
compiling,
IDE work,
Docker/WSL,
running local services,
feeding the GPUs,
occasional CPU-offload fallback.

The i9 is fine if it is cheap enough or useful for your non-LLM workload. But it does not solve the real problem: insufficient VRAM for 100B/120B dense models.

Intel lists the i7-13700K with 2 memory channels and 89.6GB/s max memory bandwidth. Every LGA1700 desktop CPU is dual-channel, so stepping up to an i9 does not add memory channels.

Source: Intel Core i7-13700K specifications

Why not old Threadripper?

Old Threadripper gives you more memory channels and PCIe lanes, but it is not automatically better for your use case.

It can be worse for:

single-core performance,
gaming,
idle power,
platform age,
motherboard availability,
daily desktop feel.

More memory channels help CPU offload, but CPU offload is already the fallback path. If the goal is fast LLM inference, the usual answer is more usable accelerator memory, not simply more CPU memory bandwidth.

I would only choose Threadripper/EPYC/Threadripper Pro if you are building a real multi-GPU workstation/server with 3–4 GPUs, lots of RAM, and less concern for gaming.

9. RAM: 64GB is not enough

For this machine, I would treat 128GB system RAM as the practical minimum.

Not because system RAM is fast enough to replace VRAM. It is not.

You want 128GB because your machine will also run:

OS,
browser,
IDE,
Docker,
WSL,
model server,
vector DB / RAG tools,
build tools,
test suites,
CPU offload,
large model loading,
multiple model files,
failed experiments that allocate too much memory.

RAM choices

RAM amount	Practical view
32GB	Too low for this project
64GB	Fine for gaming/dev, weak for large-model experiments
96GB	Reasonable compromise with 2×48GB kits
128GB	Best baseline
192GB	Interesting if the board/BIOS/kit are proven stable

ASRock’s Z690 Taichi page lists DDR5 support and modern capacity support, but with any high-capacity DDR5 setup you should verify the board QVL, BIOS version, and real-world stability.

Source: ASRock Z690 Taichi specs

For your purpose, I would rather have stable 128GB than unstable 192GB.

10. PSU and cooling

This part is not optional.

A single RTX 3090 is already a high-power card. Two of them plus a high-end CPU can be a serious sustained load.

Some board-partner 3090 cards list around 350–370W board power and recommend a 750W PSU for a single-card system.

Example source: MSI RTX 3090 Gaming Trio datasheet

PSU recommendation

I would use:

1200W minimum if power-limiting/undervolting,
1600W preferred if using an i9, high sustained load, or less tuning.

Also:

use separate PCIe power cables where possible,
avoid cheap splitters,
avoid questionable old PSUs,
leave transient headroom,
consider power-limiting both 3090s.

For inference, power-limiting is often worthwhile. You may lose less performance than expected while reducing heat, noise, and instability.
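A rough budget shows where the 1200W and 1600W tiers come from. Only the 370W per-card figure is taken from the board-partner range above. The 253W CPU draw, the 100W allowance for everything else, the 280W power limit and the 1.2 headroom factor are illustrative assumptions, not measurements.

```python
def psu_target_watts(gpu_watts: float, n_gpus: int, cpu_watts: float,
                     other_watts: float, headroom: float = 1.2) -> float:
    """Sustained component draw multiplied by a headroom factor for transient spikes."""
    sustained = gpu_watts * n_gpus + cpu_watts + other_watts
    return sustained * headroom

# Stock cards at the upper quoted board power vs. a hypothetical 280 W power limit.
stock = psu_target_watts(gpu_watts=370, n_gpus=2, cpu_watts=253, other_watts=100)
limited = psu_target_watts(gpu_watts=280, n_gpus=2, cpu_watts=253, other_watts=100)

print(f"stock cards:         ~{stock:.0f} W target")
print(f"power-limited cards: ~{limited:.0f} W target")
```

Under these assumptions the power-limited build lands under 1200W and the stock build lands above it, which matches the split between the two tiers. Ampere cards are known for short transient spikes, so a larger headroom factor may be more appropriate.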

Cooling recommendation

Two open-air 3090s can be difficult.

Plan for:

a large airflow case,
strong front intake,
top/rear exhaust,
enough spacing between cards,
GPU memory junction temperature monitoring,
possible thermal pad maintenance,
sane fan curves,
no “sandwiched with no intake” layout.

Sustained inference is not like a short gaming benchmark. It can keep cards hot for a long time.

11. NVLink: useful, not magic

RTX 3090 is one of the few GeForce cards with NVLink support. Puget Systems notes that GeForce RTX 3090, RTX A6000, and RTX A5000 use the newer NVLink bridge generation, not the old 20-series/Quadro RTX bridge generation.

But NVLink does not mean:

every backend sees one 48GB GPU,
VRAM is universally pooled,
120B dense Q4 becomes comfortable,
no software tuning is needed,
PCIe layout no longer matters.

Treat NVLink as optional.

Buy it only if:

both 3090s support it,
the bridge spacing matches,
it is not expensive,
your backend can benefit from peer access / NCCL / tensor-parallel communication.

Do not design the whole system around NVLink. Cooling, power, and VRAM capacity matter more.

12. Backend choice changes the answer

The same hardware can feel excellent or disappointing depending on backend.

Ollama

Ollama is convenient and has improved multi-GPU behavior. The Ollama model-scheduling update says it improved memory management, reduced OOMs, increased GPU utilization, and improved multi-GPU/mismatched-GPU performance.

Ollama is good for:

daily convenience,
local API use,
simple model management,
small/medium models,
70B Q4 tests.

But I would not use only Ollama to decide whether to sell the Halo. It may not expose enough control for borderline 80B/120B testing.

llama.cpp / GGUF direct

This is probably the most important stack if you use GGUF.

llama.cpp has explicit multi-GPU controls such as split modes and tensor split options.

Source: llama.cpp multi-GPU guide

Useful things to test:

--n-gpu-layers
--split-mode
--tensor-split
--main-gpu
--ctx-size
KV cache type
Flash attention
prompt processing speed
generation speed

Example benchmark structure:

CUDA_VISIBLE_DEVICES=0,1 ./llama-bench \
  -m /models/<model>.gguf \
  -p 512,4096,8192,16384 \
  -n 256 \
  -ngl 999

Example layer split:

CUDA_VISIBLE_DEVICES=0,1 ./llama-cli \
  -m /models/<model>.gguf \
  --n-gpu-layers 999 \
  --split-mode layer \
  --tensor-split 24,24 \
  --ctx-size 8192 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -p "<real coding prompt>"

Example tensor split:

CUDA_VISIBLE_DEVICES=0,1 ./llama-cli \
  -m /models/<model>.gguf \
  --n-gpu-layers 999 \
  --split-mode tensor \
  --tensor-split 24,24 \
  --ctx-size 8192 \
  -p "<same real coding prompt>"

For 2×3090, llama.cpp direct testing is more informative than just trying a desktop frontend.

vLLM

vLLM is attractive if you are willing to use a server-style inference stack.

Its docs say that for multi-GPU inference, you set tensor_parallel_size to the desired GPU count.

Source: vLLM parallelism and scaling docs

Example:

CUDA_VISIBLE_DEVICES=0,1 vllm serve <model-id-or-path> \
  --tensor-parallel-size 2 \
  --max-model-len 8192

vLLM is good if:

you use OpenAI-compatible local endpoints,
you want throughput,
you run multiple requests,
you use Hugging Face model formats,
you want explicit tensor parallelism.

But vLLM is not primarily a GGUF desktop-app workflow. Its GGUF docs warn that GGUF support is highly experimental and under-optimized.

Source: vLLM GGUF docs

SGLang

SGLang is another serious local serving option. Its docs say to enable multi-GPU tensor parallelism with --tp 2.

Source: SGLang server arguments

Example:

CUDA_VISIBLE_DEVICES=0,1 python -m sglang.launch_server \
  --model-path <model-id-or-path> \
  --tp 2

SGLang is worth testing if you want a real local inference server and not just a desktop frontend.

13. What I would buy/change

For your stated goal, I would prioritize in this order:

Second RTX 3090 24GB
Motherboard with real CPU-lane x8/x8
128GB DDR5
High-quality 1200W–1600W PSU
Large airflow case
Cooling/thermal-pad/fan plan
Large NVMe storage for models
Optional NVLink bridge
CPU upgrade only after the above is solved

CPU

I would choose:

i7-13700K/KF-class if value matters,
i9-13900K/KF-class only if close in cost or useful outside LLMs,
avoid old Threadripper unless building a dedicated multi-GPU workstation/server.

The CPU is not the main local-LLM bottleneck once the model is GPU-resident.

RAM

I would choose:

128GB DDR5 as baseline,
stability over headline XMP speed,
board-QVL/BIOS-verified kits where possible.

Do not stop at 64GB for this use case.

Motherboard

The Z690 Taichi idea is reasonable because it has the right kind of dual-GPU slot layout. It is not magical, but it is not a bad starting point.

The main question is physical fit and thermals, not whether x8/x8 is theoretically enough.

PSU

I would not build dual 3090s on a marginal PSU.

Use:

1200W minimum if tuned,
1600W if you want comfort.

Case

A dual-3090 build can fail because of airflow even when every spec looks correct. Check the exact GPU cooler sizes before buying the case or board.

14. Should you sell the Halo?

Sell it if your real workload is:

7B–14B fast local assistants,
30B–34B coding models,
70B Q4,
occasional 80B testing,
CUDA experiments,
gaming,
wanting one repairable main PC.

In that case, the 2×3090 desktop is a better fit.

Keep it if your real workload is:

dense 80B/100B/120B experiments,
long-context coding,
large GGUF models,
“load first, optimize later” testing,
a quiet compact inference box,
a second local endpoint.

In that case, the Halo still has a clear role.

Best hybrid workflow

The strongest reason to keep both is not that they combine into one giant machine. It is that they can serve different roles:

3090 desktop: fast small/medium/70B models, CUDA, gaming.
Strix Halo: larger slower models, long-context experiments, large-memory fallback.

That can be very useful for coding. You can use a fast local model for iteration and keep a larger model available for deeper or slower checks.

But if two personal machines plus a work laptop is too much, and the Halo resale window is unusually good, selling it is rational.

15. Benchmark before selling

Do not benchmark only one model and one prompt.

Test at least:

Test	Why
14B coding model	Low-latency daily assistant
30B–34B coding model	Best quality/speed daily zone
70B Q4	Main dual-3090 target
80B Q4	Borderline decision point
100B/120B or large MoE	Decides whether Halo still matters

Test these context sizes:

Context	Why
4k	Basic chat/coding
8k	Realistic coding context
16k	Multi-file work
32k	Stress test

Record:

prompt processing tokens/sec,
generation tokens/sec,
time to first token,
VRAM on GPU 0,
VRAM on GPU 1,
system RAM usage,
whether CPU offload occurs,
GPU utilization,
GPU temperature,
memory junction temperature,
power draw if possible.
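One way to keep the test plan honest is to generate the full model × context matrix up front and fill in a fixed record per run. A minimal sketch; the field names are hypothetical and simply mirror the list above, and the model labels are the placeholder rows from the test table rather than real model names.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

MODELS = ["14B coding", "30B-34B coding", "70B Q4", "80B Q4", "100B/120B or large MoE"]
CONTEXTS = [4096, 8192, 16384, 32768]

@dataclass
class BenchRecord:
    model: str
    context: int
    prompt_tok_s: Optional[float] = None         # prompt processing speed
    gen_tok_s: Optional[float] = None            # generation speed
    ttft_s: Optional[float] = None               # time to first token
    vram_gpu0_gb: Optional[float] = None
    vram_gpu1_gb: Optional[float] = None
    ram_gb: Optional[float] = None
    cpu_offload: Optional[bool] = None
    gpu_util_pct: Optional[float] = None
    gpu_temp_c: Optional[float] = None
    mem_junction_temp_c: Optional[float] = None
    power_w: Optional[float] = None

# One empty record per (model, context) pair, to be filled in after each run.
runs = [BenchRecord(model, ctx) for model, ctx in product(MODELS, CONTEXTS)]
print(f"{len(runs)} runs planned")
```

Leaving unmeasured fields as `None` makes gaps visible, which helps when comparing the Halo and the 3090 box run for run.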

The key question is not:

“Can it technically run?”

The key question is:

“Does it remain pleasant with the prompts I actually use for coding?”

A 70B model that feels responsive is often more useful than a 120B model you avoid using because it is slow and fragile.

16. Final recommendation

I would build the 2×3090 desktop if your goal is:

fast local coding inference, good gaming, CUDA, repairability, and one main PC.

I would keep the Halo if your goal is:

frequent dense 80B/100B/120B experiments, long-context tests, and large-memory convenience.

My concrete answer:

Z690 Taichi + modern i7/i9 + 128GB DDR5 + 2×RTX 3090 is realistic and sensible for 7B–70B local coding inference.

It is a borderline 80B Q4 machine.

It is not a comfortable dense 120B Q4 machine.

If you sell the Halo, sell it with the correct expectation:

You are trading large-memory flexibility for much higher speed, better gaming, CUDA, and repairability.

That is a reasonable trade if 70B and below are your real daily targets.

It is a bad trade if the Halo’s main value to you is casually loading huge 100B/120B-class quants that do not fit neatly into 48GB VRAM.

Short summary

2×RTX 3090 is excellent for 7B–70B.
80B Q4 is possible but needs testing.
Dense 120B Q4 is not a clean fit in 48GB VRAM.
Z690 Taichi-style x8/x8 is acceptable.
Use 128GB RAM, not 64GB.
Prioritize PSU, cooling, and GPU spacing over i9 bragging rights.
Do not buy old Threadripper unless you want a true multi-GPU workstation/server.
NVLink is optional, not magic.
Ollama is convenient; llama.cpp direct is better for testing; vLLM/SGLang are better if you accept server-style inference.
Sell the Halo if your daily use is fast 30B/70B coding plus gaming.
Keep the Halo if large-memory 80B/100B/120B experiments are genuinely frequent.

GotLLMs · May 14, 2026, 1:35pm

Thanks! I’ve read all that :) I have also read tons of similar text on this subject from different silicon minds, and they do not all share the same opinion.

Right now my Halo is on Windows/Ollama. I guess I’ll go Ubuntu/llama.cpp, try to get the best speeds and learn how to benchmark, then rent a server with 2x3090 and compare results.

GotLLMs · May 19, 2026, 1:50pm

So I rented a server with dual 3090s and tried to run some models. I picked a MoE one that gets offloaded and a dense one that does not.

Results (output tokens):

Qwen3.6-27B-Q8_0 (fits in 3090s):

- Halo: 7.8 t/s

- 2x3090: 24 t/s

gpt-oss-120b-Q4_K_M (does not fit in 3090s, gets offloaded):

- Halo: 56 t/s

- 2x3090: 8.8 t/s

Somehow this experiment did not make the choice clearer. I see people online posting way better results for gpt-oss on 2x3090s, maybe I didn’t know how to run it well.

I ran it with

root@vm6388:~# ./llama.cpp/build2/bin/llama-cli \
  -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  -c 128000 \
  -fa on \
  -ngl 23 \
  -sm row \
  -ts 1,1

Also, since the rental was a VM, I wasn’t able to see the mobo or the memory channel count, just the CPU: a Xeon Gold 6246.

I have a feeling that I can replace the Halo with 2x 3090s with the right tweaking. Am I right?

John6666 · May 21, 2026, 12:41am

Hmm…

There are two different questions mixed together here:

Can 2x3090s beat Strix Halo for dense coding models?
Can 2x3090s replace Strix Halo for gpt-oss-120b, especially at large context?

For the first one, your own result already says “probably yes”.

For the second one, I think the current test is not enough to say that.

Your Qwen result looks normal to me:

Qwen3.6-27B-Q8_0
Halo:   7.8 tok/s
2x3090: 24 tok/s

That is the easy case. A dense-ish model fits in fast NVIDIA VRAM, so the 3090 box wins. I would expect that for many 20B-34B coding models, and probably many 70B Q4 cases too, depending on context/KV.

But I would not read the gpt-oss-120b result the same way:

gpt-oss-120b-Q4_K_M
Halo:   56 tok/s
2x3090: 8.8 tok/s

gpt-oss-120b is not a normal dense 120B model. The model card says it is 117B total parameters, but only 5.1B active parameters per token. It is a MoE model, and the MoE weights are MXFP4. It is described as fitting on a single 80GB GPU.

That changes the problem.

For a dense model, a rough mental model is:

Can I keep most/all weights in fast VRAM?
If yes, NVIDIA dGPU probably wins.

For a big MoE model, the mental model is more like:

Which tensors are always active?
Which routed experts are active only some of the time?
Which parts are on GPU?
Which parts are on CPU/system memory?
How much KV cache is allocated?
How much PCIe traffic happens per token?
Does the backend understand this placement well?

That is why I would be cautious here. Strix Halo is not beating 3090 VRAM on raw bandwidth. It is much slower than GDDR6X in that sense. But it has a large unified memory pool. For a huge MoE model with offload-like behavior, that can matter more than the simple VRAM bandwidth comparison suggests.
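A back-of-the-envelope check makes both Halo numbers look plausible. The sketch assumes token generation is memory-bandwidth-bound, uses the roughly 215GB/s observed Strix Halo bandwidth quoted earlier in the thread, and assumes about 1 byte per parameter for Q8_0 and about 0.5 byte per parameter for the 4-bit MoE weights. Those byte-per-parameter figures are approximations, not measured file sizes.

```python
HALO_OBSERVED_GBS = 215.0  # observed Strix Halo bandwidth quoted earlier in the thread

def bandwidth_bound_tok_s(gb_read_per_token: float, bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s if each token requires streaming this many GB once."""
    return bandwidth_gbs / gb_read_per_token

# Dense 27B at ~1 byte/param (assumption): all ~27 GB are read for every token.
dense_27b = bandwidth_bound_tok_s(27.0, HALO_OBSERVED_GBS)

# MoE with 5.1B active params at ~0.5 byte/param (assumption): only ~2.55 GB per token.
moe_active = bandwidth_bound_tok_s(5.1 * 0.5, HALO_OBSERVED_GBS)

print(f"dense 27B Q8 ceiling on Halo:     ~{dense_27b:.1f} tok/s (reported: 7.8)")
print(f"gpt-oss-120b MoE ceiling on Halo: ~{moe_active:.0f} tok/s (reported: 56)")
```

Both reported results sit below their ceilings, which is what a bandwidth-bound explanation predicts. The same arithmetic shows why the 3090 box suffers once the active experts have to be fetched over system RAM or PCIe instead of VRAM.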

So I would summarize your current numbers this way:

Dense model that fits in fast VRAM:
  2x3090 wins hard. Expected.

Huge MoE model with large memory footprint:
  not obvious. Halo may be a very good fit.

Your current 2x3090 gpt-oss result:
  probably not a fair upper bound yet.

The main reason I would not trust the 8.8 tok/s number as the final answer is the command:

./llama.cpp/build2/bin/llama-cli \
  -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  -c 128000 \
  -fa on \
  -ngl 23 \
  -sm row \
  -ts 1,1

Several things there make the 2x3090 setup look worse than it might be.

First:

-sm row

The current llama.cpp multi-GPU docs describe row as deprecated. They describe layer as the default and most compatible mode, and tensor as experimental but intended for lower token-generation latency where the model/backend/interconnect cooperate.

Source: github.com/ggml-org/llama.cpp, docs/multi-gpu.md (master branch). Excerpt:

# Using multiple GPUs with llama.cpp

This guide explains how to run [llama.cpp](https://github.com/ggml-org/llama.cpp) across more than one GPU. It covers the split modes, the command-line flags that control them, the limitations you need to know about, and ready-to-use recipes for `llama-cli` and `llama-server`.

The CLI arguments listed here are the same for both tools - or most llama.cpp binaries for that matter.

---

## When you need multi-GPU

Reach for multi-GPU when one of these is true:

- **The model doesn't fit in a single GPU's VRAM.** By spreading the weights across two or more GPUs the whole model can stay on accelerators. Otherwise part of the model will need to be run off of the comparatively slower system RAM.
- **You want more throughput.** By distributing the computation across multiple GPUs, each individual GPU has to do less work. This can result in better prefill and/or token generation performance, depending on the split mode and interconnect speed vs. the speed of an individual GPU.

---

## The split modes

Set with `--split-mode` / `-sm`.

So I would not use row as the baseline for deciding whether the 2x3090 machine can replace the Halo.

Second:

-ngl 23

That is not a “try to use as much VRAM as possible” setting. It limits how many layers are offloaded to GPU. For a first baseline I would try:

--n-gpu-layers 999

or:

--n-gpu-layers all

and only then back down if it does not fit.

Third:

-c 128000

That is a brutal starting point for 48GB total VRAM. It means you are not only testing model throughput. You are also testing huge KV cache pressure, offload behavior, and memory placement all at once.

I would not start at 128k context. I would sweep context size:

-c 8192
-c 16384
-c 32768
-c 65536
-c 128000

If the 2x3090 setup is fine at 8k/16k/32k and then collapses at 64k/128k, that tells you something very different from “2x3090 is slow”.

Fourth, the rental VM is a big unknown.

For multi-GPU inference, topology can matter a lot:

PCIe layout
P2P availability
NCCL availability
NUMA placement
CPU memory bandwidth
virtualization overhead

On a rented VM, you may not know whether the two 3090s are attached in a sane way. I would at least check:

nvidia-smi topo -m
./llama-cli --list-devices

and watch the llama.cpp logs for things like:

NCCL is unavailable, multi GPU performance will be suboptimal

or any sign that much more of the model is on CPU than expected.

The other important point is that gpt-oss-120b should probably be treated as a MoE placement problem, not just a normal -ngl problem.

This guide explains the general idea well:

For MoE offload, the interesting idea is:

Always-active tensors:
  use them every token
  highest priority to keep on GPU

Routed experts:
  huge part of the model
  only a subset used per token
  may be more reasonable to offload partly to CPU/system memory

So for gpt-oss-120b, I would test MoE-aware options if your llama.cpp build supports them:

--cpu-moe

or sweep:

--n-cpu-moe 32
--n-cpu-moe 30
--n-cpu-moe 28
--n-cpu-moe 26
--n-cpu-moe 24

I would not assume those are the correct values. I would sweep and measure.

A more useful first baseline might look like this:

./llama-cli \
  -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  -c 8192 \
  -fa on \
  --n-gpu-layers 999 \
  --split-mode layer \
  --tensor-split 1,1

Then test MoE placement:

./llama-cli \
  -m /root/gpt-oss-120b-Q4_K_M-00001-of-00002.gguf \
  -c 8192 \
  -fa on \
  --n-gpu-layers 999 \
  --split-mode layer \
  --tensor-split 1,1 \
  --n-cpu-moe 28

Then sweep:

--n-cpu-moe 32
--n-cpu-moe 30
--n-cpu-moe 28
--n-cpu-moe 26
--n-cpu-moe 24

Then raise context:

-c 8192
-c 16384
-c 32768
-c 65536
-c 128000

If tensor split is supported for this model/build, I would test it too, but separately:

--split-mode tensor

I would not mix all variables at once.

My preferred order would be:

1. layer split, small context, max GPU layers
2. layer split, small context, MoE offload sweep
3. layer split, context sweep
4. tensor split, small context
5. tensor split, MoE offload sweep
6. tensor split, context sweep
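
As a sketch of that order, the script below only prints the commands for one split mode, changing one variable per phase. Nothing is executed. MODEL is a placeholder path, the function name is made up, and the flags are the same ones used in the commands above, so check them against your own build first.

```shell
#!/bin/sh
# Dry run: print the test plan for one split mode, one variable per phase.
# MODEL is a placeholder path; nothing is executed here.
MODEL=${MODEL:-/path/to/model.gguf}
BASE="./llama-cli -m $MODEL -fa on --n-gpu-layers 999 --tensor-split 1,1"

print_plan() {
  split_mode=$1   # e.g. layer
  echo "# baseline: small context, max GPU layers"
  echo "$BASE --split-mode $split_mode -c 8192"
  echo "# MoE offload sweep at small context"
  for n in 32 30 28 26 24; do
    echo "$BASE --split-mode $split_mode -c 8192 --n-cpu-moe $n"
  done
  echo "# context sweep"
  for ctx in 16384 32768 65536 128000; do
    echo "$BASE --split-mode $split_mode -c $ctx"
  done
}

print_plan layer
```

Run print_plan with another split mode only if your build supports it, and keep those results in a separate table.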

And I would measure prompt processing and token generation separately.

For coding use, both matter:

Prompt processing:
  big prompts, repo snippets, logs, RAG, long context

Token generation:
  how fast the answer streams back

If one setup has great token generation but terrible prompt processing, it may feel bad for coding with long files. If another setup has slower generation but handles large context smoothly, it may be better for actual work.
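
One way to get the two numbers separately is llama-bench, which ships with llama.cpp and reports prompt processing and token generation as separate tests. The sketch below only builds the command line. MODEL is a placeholder path and the prompt and generation lengths are arbitrary examples, so confirm the flags with ./llama-bench --help on your build.

```shell
#!/bin/sh
# Build a llama-bench command that times prompt processing (-p) and
# token generation (-n) as separate tests. MODEL is a placeholder path.
MODEL=${MODEL:-/path/to/model.gguf}

bench_cmd() {
  n_prompt=$1   # prompt tokens to process
  n_gen=$2      # tokens to generate
  echo "./llama-bench -m $MODEL -ngl 999 -sm layer -p $n_prompt -n $n_gen"
}

bench_cmd 8192 256
```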

So I would benchmark something like:

same prompt
same context size
same output token count
same sampling
same llama.cpp commit
same quant
same backend
same command except the variable under test
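
A tiny guard can help enforce the last rule. This is a hypothetical helper, not part of llama.cpp: it takes two run descriptions written as space-separated key=value lists, with the same keys in the same order, and counts how many settings differ.

```shell
#!/bin/sh
# Count how many "key=value" settings differ between two run descriptions.
# Both arguments must list the same keys in the same order.
count_changed() {
  changed=0
  set_b=$2
  for kv_a in $1; do
    kv_b=${set_b%% *}   # first item of the second list
    set_b=${set_b#* }   # drop that item from the second list
    [ "$kv_a" = "$kv_b" ] || changed=$(( changed + 1 ))
  done
  echo "$changed"
}

# Only the MoE offload value differs between these two runs.
count_changed "ctx=8192 sm=layer ncmoe=28" "ctx=8192 sm=layer ncmoe=26"
```

If the count is anything other than 1, the two runs are not a clean comparison.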

For example:

# 8k baseline
-c 8192

# 32k realistic coding/RAG-ish baseline
-c 32768

# 128k stress test
-c 128000

I would also separate these model classes:

Class A:
  dense model that fits in one 3090

Class B:
  dense model that needs both 3090s but mostly stays in VRAM

Class C:
  huge MoE / offload-heavy / high-context model
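
A very rough way to bucket a model by file size, assuming two 24 GB cards. This only looks at whether the weights fit and ignores dense vs MoE. HEADROOM_GB (space left for KV cache and buffers) is a guess rather than a measured value, and the sizes in the loop are illustrative.

```shell
#!/bin/sh
# Rough bucket for a model by GGUF size, assuming two 24 GB cards.
# HEADROOM_GB per card is an assumption, not a measured value.
HEADROOM_GB=4

classify_model() {
  size_gb=$1   # model file size in whole GB
  if [ $(( size_gb + HEADROOM_GB )) -le 24 ]; then
    echo "A"   # fits in one card
  elif [ $(( size_gb + 2 * HEADROOM_GB )) -le 48 ]; then
    echo "B"   # needs both cards but stays in VRAM
  else
    echo "C"   # needs offload to system memory
  fi
}

for size in 18 40 63; do
  echo "${size} GB -> class $(classify_model "$size")"
done
```

Long context moves a model toward the next class, so treat the result as a starting guess and confirm with a real run.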

Your results already show why this matters.

For Class A / B, 2x3090 is likely excellent.

For Class C, Strix Halo may be surprisingly good.

That is the real point here. The systems are not exact substitutes.

The 3090 box is basically:

fast VRAM
CUDA ecosystem
great for dense models
great for gaming
great for smaller/faster coding models
annoying when the model exceeds VRAM
multi-GPU complexity when using both cards

The Halo box is basically:

much larger unified memory
less raw bandwidth than 3090 VRAM
much easier for very large models
possibly very good for MoE/offload-heavy workloads
not as strong for dense models that fit in NVIDIA VRAM

That matches your numbers.

So if your real daily workload is:

Qwen / Llama / DeepSeek Coder style dense models
20B-34B
70B Q4
gaming
CUDA tools

then I would lean toward the 3090 box.

If your real daily workload is:

gpt-oss-120b
large context
MoE experiments
"make the huge model fit without fighting placement all day"

then I would keep the Halo unless a tuned 2x3090 test proves otherwise.

If you can keep both, I think the cleanest split is:

Halo:
  gpt-oss-120b
  huge context
  MoE/offload experiments
  large-memory local inference

3090 box:
  dense coding models
  fast small/medium models
  CUDA backends
  gaming

If you want to replace the Halo, I would want the 2x3090 box to pass a more controlled gpt-oss-120b test first.

Something like:

No row split
Smaller context first
--n-gpu-layers 999/all
MoE offload sweep
Topology check
Prompt processing and generation measured separately
Same prompt / output length / sampling
Bare metal if possible, not unknown rental VM topology

If after that the 2x3090 setup gets close to or beats the Halo in your real gpt-oss-120b use case, then replacing the Halo becomes much more reasonable.

If it still loses badly, then I would not treat that as a surprise. It would just mean that gpt-oss-120b is landing in the exact niche where Strix Halo’s large unified memory is useful.

One more way to phrase it:

2x3090 is probably the better dense-model machine.

Strix Halo may be the better "large weird model" machine.

gpt-oss-120b is a large weird model.

So my answer to your original question would be:

Yes for many coding models, not proven for gpt-oss-120b.

And based on the numbers you posted, I would not sell the Halo yet.
