Need English-only (or minimal multilingual) 2B-4B LLM for Agentic AI on GTX 1660 Super (6GB VRAM) – quantization friendly

azhak1 · May 15, 2026, 5:46pm

I’m building an Agentic AI application with very limited hardware: GTX 1660 Super (Turing, 6GB VRAM). I plan to run a single LLM per agent (not multiple models simultaneously) to stay within VRAM limits.

What I’ve tried so far:

llama-3.2-3b-instruct (4-bit) → poor results
SmolLM3-3B (no quantization) → good results but saturates 6GB VRAM, nothing left for computation
SmolLM3-3B (4-bit) → better than Llama, but still not good enough for my needs
Planning to test Qwen3-4B-Thinking and Phi-3-mini-128k-instruct next
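
As a rough sanity check on the VRAM numbers above, here is a minimal sketch that estimates the memory taken by the weights alone at different precisions. The 3e9 parameter count is a nominal assumption, not an exact figure for any model listed, and the helper name `weight_memory_gib` is made up for illustration. It ignores the KV cache, activations, and the CUDA context, all of which need extra room.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Assumption: a nominal 3B-parameter model (real checkpoints differ slightly).
N_PARAMS = 3e9

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = weight_memory_gib(N_PARAMS, bits)
    print(f"{label:>9}: {gib:.2f} GiB for weights only")
```

Under these assumptions, unquantized 16-bit weights alone come close to the 6GB limit, which is consistent with the unquantized run leaving no headroom. Real 4-bit formats also store scales and other metadata, so their effective bits per weight are somewhat higher than 4.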

My problem: All these models are multilingual. That’s overkill for my use case. I suspect those extra language capabilities waste parameter capacity and VRAM that could otherwise improve English performance or reduce model size.
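
One way to put a number on that suspicion is to estimate how much of a model's parameter budget sits in the token embedding table, since that is the part that grows directly with vocabulary size. The sketch below is only an estimate: the vocabulary size, hidden size, and total parameter count are illustrative assumptions for a roughly 3B-class model with a large multilingual vocabulary, and the real values should be read from the model's own config. The function name `embedding_share` is hypothetical.

```python
def embedding_share(vocab_size: int, hidden_size: int,
                    total_params: float, tied: bool = True):
    """Return (embedding parameter count, fraction of total parameters).

    If input and output embeddings are not tied, the table is counted twice.
    """
    emb_params = vocab_size * hidden_size
    if not tied:
        emb_params *= 2
    return emb_params, emb_params / total_params


# Assumptions (illustrative only; check the model's config for real values):
VOCAB_SIZE = 128_256     # large multilingual-style vocabulary
HIDDEN_SIZE = 3_072
TOTAL_PARAMS = 3.2e9

emb, share = embedding_share(VOCAB_SIZE, HIDDEN_SIZE, TOTAL_PARAMS)
print(f"embedding params: {emb / 1e6:.0f}M ({share:.1%} of the model)")
print(f"at 4 bits/weight: {emb * 0.5 / 2**20:.0f} MiB")
```

This only measures the embedding table. It says nothing about how much capacity in the transformer layers themselves is spent on non-English text, which cannot be read off from the parameter counts.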

My request: Can you recommend a 2B–4B parameter LLM that is English-only (or covers at most 2–3 languages) and works well with 4-bit or 8-bit quantization on 6GB VRAM? I’m looking for something that prioritizes English instruction-following, reasoning, and agentic tasks (tool use, planning, memory) over multilingual coverage.

Bonus points if:

The model is known to be quantization-friendly (GPTQ, AWQ, or llama.cpp compatible)
There are quantized versions available on HF already
It has good benchmark scores (MMLU, GSM8K) compared to SmolLM3 or Llama-3.2-3B

What I don’t need:

Translation capabilities
Support for non-Latin scripts
Massive vocabulary covering rare Unicode characters

Thank you!

azhak1 · May 15, 2026, 6:28pm

Thank you so much @Uzer-namo-2024

It’s a great help, and there is some interesting new stuff here that will be fun to try.

I’ll try it out and share my feedback here.

John6666 · May 15, 2026, 9:12pm

I also personally recommend these models:
