Current State and Future of "Integer-Only" LLM Inference (Non-Floating Point)

rorungbin · April 13, 2026, 11:30am

Hi everyone,

I’ve been following the developments in LLM quantization (GPTQ, AWQ, GGUF, etc.), but I noticed that most of these still rely on floating-point operations (FP16/BF16) for certain parts of the inference, like activations or non-linear layers (LayerNorm, Softmax).

I’m particularly interested in “Integer-Only Inference”—where the entire computational graph runs on INT8 or even lower, without any de-quantization to FP during the process. I believe this would be a game-changer for deploying LLMs on edge devices or specialized integer-only accelerators (like certain DSPs or older NPUs).

I have a few questions for the community:

1. Current Trends: Are there any production-ready frameworks or specific model architectures that support fully integer-only inference right now? I’ve seen papers like I-LLM (2024), but how close are we to seeing this integrated into libraries like transformers or vLLM?
2. Accuracy vs. Efficiency: For those who have experimented with W8A8 or W4A4 integer-only paths, how significant is the accuracy degradation compared to traditional FP16/INT8 hybrid methods?
3. Future Outlook: Do you think “Integer-only” will become a mainstream standard, or will the industry lean more towards newer formats like FP8 or MX (Microscaling) formats which are supported by the latest Blackwell/Hopper architectures?

I’d love to hear from anyone working on quantization kernels or edge deployment. Any insights or links to recent repositories would be greatly appreciated!

John6666 · April 14, 2026, 12:42pm

For now, here’s a rundown of the major options available:

Of the options below, the one I have actually tried myself is BitNet. It is very compact, fast, and produces decent results. However, and this is not limited to BitNet, mainstream frameworks lean on the GPU's hardware floating-point support, where floating-point paths are simpler and faster, so native integer computation has not been widely adopted there. That said, it might be possible to fine-tune a model using a major framework and then convert it for an integer-based runtime. bitnet.cpp also supports some Llama-based models.

Here is the current picture as of April 2026.

Integer-only LLM inference is real, but it is not the mainstream default yet. Most of what is popularly called “quantized LLM inference” today is instead one of the following: weight-only quantization, mixed weight/activation quantization, or low-precision floating-point serving such as FP8 or FP4-family formats. The hard part is not just quantizing linear layers. It is also keeping normalization, softmax, scaling, and dynamic range management inside a low-bit path without losing too much accuracy or giving the speedup back through dequantization overhead. That is exactly the gap papers like I-BERT and I-LLM were written to close. (Hugging Face)

Why this is harder than it looks

A lot of discussions mix up three different things.

First, there is quantized storage: the checkpoint is saved in INT4, INT8, GGUF, or another compact format. Second, there is quantized compute for the heavy GEMMs: the large matrix multiplications use low-bit kernels, but some other operators still run in FP16/BF16/FP32. Third, there is true integer-only inference: the whole forward graph, including rescaling, normalization, and softmax-like pieces, stays in integer arithmetic with no float fallback. Most mainstream stacks are strong on the first two. Very few are broadly production-ready for the third across arbitrary LLM checkpoints. (Hugging Face)
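
To make the difference between the second and third category concrete, here is a minimal sketch of the rescaling step after an integer matmul. In a hybrid path the accumulator is usually multiplied by a float scale. In an integer-only path that scale is converted offline into an integer multiplier plus a right shift, so the runtime never touches a float. The function names and the scale value 0.0123 are my own illustrative assumptions, not taken from any particular library.

```python
def quantize_multiplier(real_scale, bits=31):
    """Offline step: encode a positive float scale (< 1) as (multiplier, shift)
    so that real_scale ~= multiplier / 2**shift. Floats appear only here."""
    shift = bits
    multiplier = round(real_scale * (1 << shift))
    return multiplier, shift


def requantize(acc, multiplier, shift):
    """Runtime step: rescale an integer accumulator with integer multiply,
    add, and shift only, then clamp to the INT8 range."""
    rounded = (acc * multiplier + (1 << (shift - 1))) >> shift  # round half up
    return max(-128, min(127, rounded))


def int_dot(a, b):
    """Integer dot product, standing in for an INT8 GEMM with a wide accumulator."""
    return sum(x * y for x, y in zip(a, b))


x_q = [12, -7, 100, 3]   # hypothetical INT8 activations
w_q = [5, 9, -2, 40]     # hypothetical INT8 weights
# Hypothetical combined scale s_x * s_w / s_out, fixed before inference starts.
multiplier, shift = quantize_multiplier(0.0123)
y_q = requantize(int_dot(x_q, w_q), multiplier, shift)
print(y_q)
```

The design point is that every float is consumed before inference begins. Whether a given runtime actually does this for all operators, and not only for the GEMMs, is what separates the third category from the second.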

That distinction matters because GGUF is a file format, not a guarantee about arithmetic semantics. Hugging Face describes GGUF as a single-file format used to store models for inference with GGML-family executors. That makes it valuable for deployment and portability, but it does not by itself imply “the whole graph runs in integers.” (Hugging Face)

1. Current trends: what actually exists right now

Mainstream libraries

In official Hugging Face transformers, the documented quantization paths are AWQ, GPTQ, and 8-bit / 4-bit bitsandbytes support. Those are useful and widely adopted, but the docs do not present a general-purpose full integer-only LLM path. They present a practical low-bit deployment stack. (Hugging Face)

In core vLLM, the stable docs expose INT8 W8A8 and INT4 W4A16 paths, and the broader LLM Compressor docs focus on FP8, INT8, INT4, NVFP4, and MXFP4. That tells you where the center of gravity is today: mixed low-precision serving, not universal integer-only execution. There is an important backend-specific exception: vllm-ascend now advertises W4A4 support, which shows that full low-bit activation paths are getting closer to deployment on specific hardware stacks, but that is still not the same as a universal integer-only mode in mainstream vLLM. (vLLM)

So the direct answer to your first question is:

For arbitrary Hugging Face checkpoints, there is not yet a broadly production-ready, first-class integer-only path in mainstream transformers or core vLLM. What exists today is mostly hybrid, backend-specific, or architecture-specific. (Hugging Face)

Architecture-specific and kernel-specific paths

The clearest real example today is BitNet + bitnet.cpp. Microsoft’s open BitNet b1.58 2B4T model is described as a native 1-bit LLM with W1.58A8 inference, and both the model card and the Hugging Face docs explicitly warn that standard transformers execution does not contain the specialized kernels needed to realize the architecture’s real efficiency benefits. In other words, BitNet is important not only because it is low-bit, but because it shows that model + runtime co-design matters. (Hugging Face)
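
As a rough illustration of why W1.58A8 is attractive for integer hardware, here is a small sketch of ternary weights combined with integer activations. It assumes absmean-style rounding of weights to {-1, 0, +1}, which is how I understand the BitNet b1.58 recipe. The helper names are hypothetical and this is not the bitnet.cpp kernel.

```python
def absmean_ternary(weights, eps=1e-8):
    """Offline: round weights to {-1, 0, +1} after dividing by their mean
    absolute value (an absmean-style scheme, assumed here)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def ternary_dot(w_t, x_q):
    """Runtime: with ternary weights the dot product needs no multiplies,
    only integer additions and subtractions."""
    acc = 0
    for w, x in zip(w_t, x_q):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
    return acc


w_t, w_scale = absmean_ternary([0.9, -0.05, -1.2, 0.4])  # hypothetical weights
x_q = [10, 50, -20, 3]                                   # hypothetical INT8 activations
print(w_t, ternary_dot(w_t, x_q))
```

A standard framework would typically run such a layer through a general matmul, which is why the efficiency benefit only shows up with kernels written for this weight format.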

A second important project is T-MAC. Its repo and paper are highly relevant to your use case because they focus on mixed-precision matrix multiplication without dequantization, using lookup tables to avoid the usual “dequantize to higher precision, then compute” penalty. The paper frames the problem almost exactly the way you did: low-bit models often still depend on indirect dequantize-heavy execution, and that overhead is especially painful on CPUs and edge devices. (GitHub)
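
The lookup-table idea can be sketched in a few lines. This is a simplified 1-bit illustration under my own assumptions (weights in {-1, +1}, groups of 4, vector length divisible by 4), not T-MAC's actual kernel: for each group of activations, precompute the partial sum for every possible weight bit pattern once, then let each weight row index into those tables instead of dequantizing and multiplying.

```python
GROUP = 4  # activations per lookup table (illustrative choice)


def build_luts(x_q):
    """For each group of GROUP activations, tabulate the partial dot product
    for every GROUP-bit weight pattern (bit 1 means +1, bit 0 means -1)."""
    luts = []
    for start in range(0, len(x_q), GROUP):
        chunk = x_q[start:start + GROUP]
        table = []
        for pattern in range(1 << GROUP):
            total = 0
            for i, x in enumerate(chunk):
                total += x if (pattern >> i) & 1 else -x
            table.append(total)
        luts.append(table)
    return luts


def pack_row(signs):
    """Pack a row of +1/-1 weights into one GROUP-bit table index per group."""
    packed = []
    for start in range(0, len(signs), GROUP):
        idx = 0
        for i, s in enumerate(signs[start:start + GROUP]):
            if s > 0:
                idx |= 1 << i
        packed.append(idx)
    return packed


def lut_dot(packed_row, luts):
    """Dot product as table lookups plus integer additions."""
    return sum(table[idx] for idx, table in zip(packed_row, luts))


x_q = [3, -5, 7, 2, -1, 4, 0, 6]       # hypothetical integer activations
row = [1, -1, -1, 1, 1, 1, -1, -1]     # hypothetical 1-bit weight row
luts = build_luts(x_q)                 # built once, reused for every weight row
print(lut_dot(pack_row(row), luts))
```

The tables depend only on the activations, so their cost is amortized across all rows of the weight matrix. That amortization is what makes the approach pay off.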

Research closest to your definition

If your definition is strict — the entire forward graph stays in integers — then I-LLM is one of the most directly relevant papers. Its core claim is that previous LLM PTQ methods still needed floating-point work for quantize/dequantize and nonlinear operators like RMSNorm and Softmax, and that a proper integer-only solution needs integer-only matmul, integer-only softmax/exponent approximations, and integer-only normalization. That is much closer to what you are asking about than GPTQ, AWQ, or most GGUF deployments. The caveat is that I-LLM is still best understood as research with strong systems direction, not yet a standard, turnkey mode in the dominant serving libraries. (arXiv)
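
To show what "integer-only normalization" can mean in practice, here is a minimal RMSNorm sketch that uses only integer arithmetic and an integer square root. It is my own simplified construction, not I-LLM's operator: the learned gain is omitted, and the output is a fixed-point integer with FRAC_BITS fractional bits (an assumed parameter). Note that the input scale cancels out, since RMSNorm divides by a quantity proportional to the input.

```python
import math

FRAC_BITS = 16  # fractional bits of the fixed-point output (assumption)


def int_rmsnorm(x_q):
    """Integer-only x / rms(x). Returns fixed-point integers that represent
    the normalized values multiplied by 2**FRAC_BITS."""
    n = len(x_q)
    sum_sq = sum(v * v for v in x_q)
    if sum_sq == 0:
        return [0] * n
    # 1 / rms = sqrt(n / sum_sq); scale by 2**(2 * FRAC_BITS) before the
    # integer square root so the result carries FRAC_BITS fractional bits.
    inv_rms_fx = math.isqrt((n << (2 * FRAC_BITS)) // sum_sq)
    return [v * inv_rms_fx for v in x_q]


x_q = [10, -20, 30, 40]  # hypothetical INT8 activations
y_fx = int_rmsnorm(x_q)
print([v / (1 << FRAC_BITS) for v in y_fx])  # float conversion for display only
```

A real integer-only pipeline would follow this with an integer requantization step back to INT8 rather than the float conversion used here for printing. Softmax is harder because the exponential also needs an integer approximation.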

2. Accuracy vs efficiency: how bad is the drop?

W8A8: already practical

For W8A8, the reference point is still SmoothQuant. It showed that 8-bit weights plus 8-bit activations can be made accurate enough for large LLMs by smoothing activation outliers, and reported up to 1.56× speedup and 2× memory reduction with negligible loss in accuracy. That is one reason W8A8 moved into real systems faster than stricter integer-only schemes: it solves much of the memory and throughput problem without forcing every awkward operator into a pure integer formulation. (GitHub)
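
The smoothing step can be sketched as follows. As I understand the SmoothQuant paper, each input channel j gets a factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by s_j and the matching weight rows are multiplied by it, so the product is unchanged while the activation outliers shrink. The data and helper names below are hypothetical, and the sketch assumes no channel is all zeros.

```python
def matmul(a, b):
    """Plain float matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth(x, w, alpha=0.5):
    """Move quantization difficulty from activations to weights.
    x has shape (tokens, in_channels); w has shape (in_channels, out_channels)."""
    n_in = len(w)
    s = []
    for j in range(n_in):
        x_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        s.append((x_max ** alpha) / (w_max ** (1 - alpha)))
    x_s = [[row[j] / s[j] for j in range(n_in)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n_in)]
    return x_s, w_s, s


# Hypothetical activations with one outlier channel (index 1).
x = [[0.5, 40.0, -0.3], [-0.2, -35.0, 0.4]]
w = [[0.1, -0.2], [0.05, 0.02], [-0.3, 0.25]]
x_s, w_s, s = smooth(x, w)
print(max(abs(v) for row in x for v in row),
      max(abs(v) for row in x_s for v in row))
```

Because the factors can be folded into the weights and the preceding layer offline, the runtime cost is essentially zero, which is part of why this approach was easy to adopt.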

vLLM’s stable INT8 W8A8 documentation is another sign that W8A8 has crossed from “interesting paper result” into “supported infrastructure,” at least on the right hardware. (vLLM)

Why activations are the hard part

A major reason integer-only is harder than weight-only quantization is activation outliers. Hugging Face’s optimum-quanto docs say per-tensor activation quantization to INT8 can cause serious errors when tensors contain large outliers, often collapsing most values to zero except the outliers. That is precisely why techniques like SmoothQuant exist. It is also why ordinary “just lower the bit-width” thinking breaks down once you try to quantize the whole graph. (GitHub)
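
The collapse is easy to reproduce with a toy example. The following sketch uses symmetric per-tensor INT8 quantization on hypothetical activation values. With a single large outlier, the scale becomes so coarse that every ordinary value rounds to zero.

```python
def quantize_per_tensor(values):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    scale = max(abs(v) for v in values) / 127
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


typical = [0.12, -0.08, 0.3, -0.21, 0.05]  # hypothetical ordinary activations
with_outlier = typical + [100.0]           # same values plus one outlier

q_clean, _ = quantize_per_tensor(typical)
q_outlier, _ = quantize_per_tensor(with_outlier)
print(q_clean)
print(q_outlier)
```

Per-channel or per-token scales, smoothing, and rotations are all different ways of preventing one value from dictating the scale for everything else.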

W4A4: much better than before, but still not boring

There is older evidence showing that plain W4A4 is hard for decoder-only language models. A widely cited 2023 study found that W4A4 caused a significant accuracy drop for decoder-only models, even though it worked much better for encoder-only and encoder-decoder architectures. That paper is still important because it explains why naive “just do everything in 4 bits” failed for autoregressive LLMs for a while. (arXiv)

Then the next wave of work changed the picture:

I-LLM reports W4A4 with negligible loss of accuracy by carefully redesigning the integer-only path. (arXiv)
QuaRot reports end-to-end 4-bit quantization of weights, activations, and KV cache, with at most 0.47 WikiText-2 perplexity loss on LLaMA-2-70B and 99% of zero-shot performance retained. (arXiv)
SpinQuant says learned rotations reduce the full-precision gap on zero-shot reasoning to 2.9 points on LLaMA-2-7B under 4-bit weights, activations, and KV cache. (arXiv)
FlatQuant reports less than 1% accuracy drop for W4A4 on LLaMA-3-70B, while also claiming strong efficiency gains from fusing its transformations. (OpenReview)
COMET argues that practical W4A4KV4 serving is possible with a mixed-precision activation strategy and optimized W4Ax kernels. (arXiv)
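
Several of the methods above rely on rotations, so here is a minimal sketch of that idea. It uses a fixed normalized Hadamard matrix, which is my own simplification and not the implementation of QuaRot or SpinQuant (SpinQuant learns its rotations). Rotating activations and weights by the same orthogonal matrix leaves the dot product unchanged while spreading an outlier across all coordinates. The vector length must be a power of two, and the data is hypothetical.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(vec):
    """Multiply by the normalized Hadamard matrix, which is orthogonal."""
    n = len(vec)
    h = hadamard(n)
    norm = math.sqrt(n)
    return [sum(h[i][j] * vec[j] for j in range(n)) / norm for i in range(n)]


x = [0.1, -0.2, 50.0, 0.3, 0.0, -0.1, 0.2, 0.1]   # one large outlier
w = [0.3, 0.1, -0.2, 0.4, 0.5, -0.6, 0.7, -0.1]   # hypothetical weights

x_rot, w_rot = rotate(x), rotate(w)
dot = sum(a * b for a, b in zip(x, w))
dot_rot = sum(a * b for a, b in zip(x_rot, w_rot))
print(max(abs(v) for v in x), max(abs(v) for v in x_rot))
print(dot, dot_rot)
```

After rotation the values have a much flatter range, which is what makes a 4-bit grid usable. In practice the weight-side rotation can be applied offline.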

So the right summary is:

W8A8 is already practical. W4A4 is now credible. But W4A4 is still far more fragile, transformation-dependent, and kernel-dependent than W8A8. (arXiv)

Why papers and practice still diverge

A key systems lesson is that low-bit math is not enough by itself. QServe points this out very clearly: it says state-of-the-art INT4 methods can lose 20–90% runtime to dequantizing either weights or partial sums on GPUs. That is a major reason many “4-bit” systems do not deliver the speedups people expect. If the runtime keeps reconstructing higher precision internally, the theoretical gain shrinks fast. (arXiv)

That is also why your focus on true integer-only is well-placed. The real question is not only “can I compress the checkpoint?” It is “does the execution path stay low-bit all the way through?” (arXiv)

3. Future outlook: integer-only vs FP8 / MX / FP4

I do not think one format wins everywhere.

Datacenter direction

For the latest datacenter GPUs, the trend is clearly toward FP8 and FP4-family microscaled formats, not toward strict integer-only arithmetic as the universal standard. NVIDIA’s TensorRT docs are explicit: INT4 block quantization supports weight-only quantization, while FP4 block quantization supports both weights and activations. Their architecture docs also say INT4 is used for weight-only quantization and requires dequantization before compute. That is a strong signal about where server inference is going. (NVIDIA Docs)

The vLLM LLM Compressor docs point the same way. For Hopper, they recommend W8A8-FP8. For Blackwell, they recommend NVFP4 or MXFP4 for maximum compression, with FP8 as a balance point. That is not an anti-integer statement. It is a hardware reality statement: on the newest server GPUs, low-precision floating-point formats increasingly line up best with the available fast kernels. (vLLM)

The OCP MX standard matters here too. The spec was published in 2023 and defines interoperable microscaling formats designed to improve energy efficiency across datacenter and endpoint AI. The fact that the industry aligned around a standard family that includes MXFP8, MXFP6, MXFP4, and MXINT8 is another sign that the ecosystem wants microscaled low-precision formats, not only classical integer quantization. (Open Compute Project)
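
For intuition about what "microscaling" means, here is a deliberately simplified sketch: values are split into blocks of 32, and each block shares one power-of-two scale while the elements are stored in a narrow type (8-bit integers here). This is only an MX-like illustration under my own assumptions. It does not reproduce the spec's element encodings, exponent selection, or rounding rules.

```python
import math

BLOCK = 32  # elements sharing one scale (block size assumed from the MX spec)


def mx_like_quantize(values):
    """Per block: pick a shared power-of-two scale, then store INT8 elements."""
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        max_abs = max(abs(v) for v in chunk)
        # Smallest exponent whose scale keeps the largest element within 127.
        exp = 0 if max_abs == 0 else math.ceil(math.log2(max_abs / 127))
        scale = 2.0 ** exp
        q = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((exp, q))
    return blocks


def mx_like_dequantize(blocks):
    """Reconstruct approximate values from (shared exponent, elements) pairs."""
    return [q * 2.0 ** exp for exp, qs in blocks for q in qs]


small = [0.0013 * i for i in range(BLOCK)]  # hypothetical small-magnitude block
large = [3.0 * i for i in range(BLOCK)]     # hypothetical large-magnitude block
blocks = mx_like_quantize(small + large)
restored = mx_like_dequantize(blocks)
print([exp for exp, _ in blocks])
```

Because each block gets its own exponent, a large value in one block does not destroy the resolution of a neighboring block, which is the same outlier problem that per-tensor integer quantization struggles with.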

Edge and CPU/NPU direction

This is where your thesis looks strongest.

On edge-class devices, CPUs, older NPUs, and DSP-like accelerators, the value of integer-only or near-integer-native execution is much higher. There, avoiding dequantization and keeping kernels simple can matter more than following the newest FP8/FP4 tensor-core path. That is exactly the space T-MAC targets, and it is also why BitNet is interesting: both projects are effectively betting that native low-bit arithmetic plus specialized kernels matters a lot more off the bleeding edge of server GPUs. (GitHub)

Even vLLM’s own scheme guide hints at this split. In the same guidance that recommends FP8 or FP4-family formats for Hopper and Blackwell, it recommends W4AINT8 on Arm. That is exactly the sort of compromise that makes sense on edge and client hardware: very low-bit weights, integer activations, and hardware fit over theoretical purity. (vLLM)
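
A scheme with 4-bit weights and INT8 activations can be sketched as below. This is a generic illustration, not vLLM's or any Arm library's kernel: signed 4-bit weights are packed two per byte, unpacked with sign extension, and multiplied against INT8 activations with a wide integer accumulator. The function names and the low-nibble-first packing order are my own assumptions, and per-group scales are left out for brevity.

```python
def pack_int4(weights):
    """Pack signed 4-bit values (-8..7) two per byte, low nibble first.
    Assumes an even number of weights."""
    packed = bytearray()
    for lo, hi in zip(weights[0::2], weights[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_nibble(nibble):
    """Sign-extend a 4-bit two's-complement value to a Python int."""
    return nibble - 16 if nibble >= 8 else nibble


def w4a8_dot(packed, x_q):
    """Integer dot product of packed INT4 weights with INT8 activations."""
    acc = 0
    for i, byte in enumerate(packed):
        acc += unpack_nibble(byte & 0xF) * x_q[2 * i]
        acc += unpack_nibble(byte >> 4) * x_q[2 * i + 1]
    return acc


w_q = [-8, 7, 3, -1]      # hypothetical INT4 weights
x_q = [10, -20, 30, 40]   # hypothetical INT8 activations
packed = pack_int4(w_q)
print(len(packed), w4a8_dot(packed, x_q))
```

The unpacking here is a cheap integer widening, not a conversion to floating point, which is why this kind of scheme fits integer-centric CPUs well.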

So my forecast is:

Cloud / datacenter: FP8 and FP4-family microscaled formats likely become the mainstream low-precision standard. (NVIDIA Docs)
Edge / CPU / NPU / DSP-style deployments: integer-only or near-integer-native inference remains strategically important, especially when paired with model/runtime co-design. (GitHub)

4. What is actually production-ready today?

For general open-weight LLM deployment, the production-ready pieces today are mostly:

transformers quantization backends such as AWQ, GPTQ, and bitsandbytes 4/8-bit, (Hugging Face)
vLLM stable INT8 W8A8 and INT4 W4A16 paths, plus LLM Compressor for broader mixed-precision schemes, (vLLM)
backend-specific stacks like TensorRT, ONNX Runtime, and OpenVINO, which all support useful low-bit modes, but which are still mostly weight-only or mixed precision rather than “integer-only everywhere.” (NVIDIA Docs)

For true or near-true integer-centric inference, the most credible production-ish systems today are architecture-specific and kernel-specific, not universal:

BitNet + bitnet.cpp for native 1-bit-style models, (GitHub)
T-MAC for dequantization-free low-bit GEMM on CPU/NPU-style targets, (GitHub)
hardware-specific offshoots like vllm-ascend W4A4, which are meaningful signs of progress but not yet a general answer for all models and all runtimes. (vLLM)

5. What I would watch closely

If you want a practical watch list, I would follow these:

Research and algorithms

I-LLM for the clearest research definition of integer-only LLM inference. (arXiv)
SmoothQuant for the most deployable W8A8 baseline. (GitHub)
QuaRot, SpinQuant, FlatQuant, and COMET for the current W4A4 / W4A4KV4 frontier. (arXiv)

Systems and kernels

BitNet / bitnet.cpp for native low-bit model/runtime co-design. (GitHub)
T-MAC for dequantization-free low-bit CPU/NPU kernels. (GitHub)
QServe for the systems reality check that dequantization overhead can erase much of the theoretical INT4 win. (arXiv)

Mainstream deployment stacks

vLLM quantization + LLM Compressor docs for what is actually landing in deployable serving software, (vLLM)
TensorRT, ONNX Runtime, and OpenVINO quantization docs to see what production backends really support today and where “INT4” still means weight-only rather than full integer-only execution. (NVIDIA Docs)

6. How I would answer your three questions directly

1. Are there production-ready frameworks or architectures for fully integer-only inference right now?

For arbitrary LLM checkpoints in mainstream libraries, not really. The official docs in transformers and core vLLM emphasize useful quantization paths, but not a general integer-only mode. The most promising real systems today are either architecture-specific like BitNet, or kernel-specific like T-MAC, or backend-specific like certain W4A4 paths in vllm-ascend. (Hugging Face)

2. How significant is the accuracy degradation for W8A8 or W4A4 integer-only paths?

For W8A8, the answer is often “small enough to deploy,” especially with SmoothQuant-style handling of outliers. For W4A4, the answer is now “sometimes surprisingly small, but still much more fragile.” The best recent papers show strong results, but W4A4 is still much more dependent on careful transformations and optimized kernels than W8A8. (arXiv)

3. Will integer-only become the mainstream standard?

My view is no, not everywhere. In datacenters, the likely mainstream is FP8 plus FP4-family microscaled formats because that is what current server hardware accelerates well. On edge and integer-centric hardware, though, integer-only or near-integer-native inference still looks strategically important. So I expect a split future, not one universal winner. (NVIDIA Docs)

7. One practical recommendation

If your immediate goal is deploying existing open LLMs today, I would target W8A8 or a strong weight-only INT4 / mixed-precision path first. Those are where the ecosystem is stable enough to be productive now. If your real goal is edge research or integer-centric deployment, then I would spend most of my attention on I-LLM, BitNet, and T-MAC, because that is where the field is actually trying to solve the “no float fallback” problem instead of only shrinking checkpoints. (arXiv)

The cleanest conclusion is this:

Integer-only LLM inference is no longer a fringe idea, but in 2026 it is still mostly a specialized systems direction rather than the mainstream default. For general-purpose serving, the world is still dominated by hybrid quantization and, on the newest GPUs, increasingly by FP8 and microscaled FP4-family formats. For edge hardware and integer-centric accelerators, though, integer-only or near-integer-native inference still looks like one of the most promising long-term bets. (vLLM)
