I’m building an agentic AI application on very limited hardware: a GTX 1660 Super (Turing, 6GB VRAM). I plan to run a single LLM per agent (not multiple models simultaneously) to stay within VRAM limits.
What I’ve tried so far:
- llama-3.2-3b-instruct (4-bit) → poor results
- SmolLM3-3B (no quantization) → good results, but it saturates the 6GB of VRAM, leaving nothing for computation
- SmolLM3-3B (4-bit) → better than Llama, but still not good enough for my needs
- Planning to test Qwen3-4B-Thinking and Phi-3-mini-128k-instruct next
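For context, here is the back-of-the-envelope math I'm using for weight memory. It is only a rough sketch: it assumes exactly 3 billion parameters, counts the weights alone, and ignores the KV cache, activations, the CUDA context, and the per-block scale overhead that real 4-bit and 8-bit formats add. The helper name `weight_vram_gib` is my own.

```python
def weight_vram_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1024**3


# Assumed parameter count: 3e9 (a nominal "3B" model).
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"3B params @ {label}: {weight_vram_gib(3e9, bits):.2f} GiB")
```

At 16-bit this comes to roughly 5.6 GiB for the weights alone, which is consistent with the unquantized SmolLM3-3B filling the card.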
My problem: All these models are multilingual. That’s overkill for my use case. I suspect those extra language capabilities waste parameter capacity and VRAM that could otherwise improve English performance or reduce model size.
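To put a rough number on that suspicion, this is how I'm estimating the share of parameters that sits in the token embedding table. The figures below are hypothetical round numbers (a 128,000-token vocabulary versus a 32,000-token one, hidden size 3072, 3 billion total parameters, tied input/output embeddings), not the actual config of any model I listed. It also only counts the embedding table; it says nothing about how capacity inside the transformer layers is spent across languages.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = True) -> int:
    """Parameters in the token embedding (and output head, if untied)."""
    tables = 1 if tied else 2
    return vocab_size * hidden_size * tables


def embedding_share(vocab_size: int, hidden_size: int,
                    total_params: float, tied: bool = True) -> float:
    """Fraction of all parameters that belongs to the embedding table(s)."""
    return embedding_params(vocab_size, hidden_size, tied) / total_params


# Hypothetical values for illustration only.
HIDDEN = 3072
TOTAL = 3e9
for vocab in (128_000, 32_000):
    share = embedding_share(vocab, HIDDEN, TOTAL)
    print(f"vocab={vocab}: {embedding_params(vocab, HIDDEN):,} params, "
          f"{share:.1%} of the model")
```

Under these assumptions the large vocabulary accounts for about 13% of the parameters, against about 3% for the small one.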
My request: Can you recommend a 2B–4B parameter LLM that is English-only (or covers at most 2–3 languages) and works well with 4-bit or 8-bit quantization on 6GB of VRAM? I’m looking for something that prioritizes English instruction-following, reasoning, and agentic tasks (tool use, planning, memory) over multilingual coverage.
Bonus points if:
- The model is known to be quantization-friendly (GPTQ, AWQ, or llama.cpp compatible)
- There are quantized versions available on HF already
- It has good benchmark scores (MMLU, GSM8K) compared to SmolLM3 or Llama-3.2-3B
What I don’t need:
- Translation capabilities
- Support for non-Latin scripts
- Massive vocabulary covering rare Unicode characters
Thank you!