LUQ: Layerwise Ultra-Low-Bit Quantization
for Multimodal LLMs

Sub-4-bit compression of multimodal LLMs by selectively quantizing only the layers robust to it — the low-entropy ones. 31–40% smaller than 4-bit, with minimal accuracy loss.

1 UIUC  ·  2 UC Los Angeles  ·  3 HP Inc.  Equal contribution · work done during an internship at HP Inc.

Published at TMLR 2026 (Transactions on Machine Learning Research)
LUQ Qwen-2.5-VL running on a laptop. The LUQ-quantized model generates noticeably faster than the 4-bit and 16-bit versions of the same model on the same device, while fitting in a fraction of the memory — see the deployment numbers below.

Abstract

Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision, its effectiveness for multimodal LLMs (MLLMs) remains unexplored. In this paper, we present the first method for ultra-low-bit (<4-bit) quantization of MLLMs. Our analysis reveals that multimodal tokens and the intermediate layer activations they produce exhibit significantly higher entropy than text tokens, indicating greater functional complexity that makes MLLMs less tolerant of ultra-low-bit quantization. However, this entropy varies significantly across layers, with some layers producing lower-entropy activation distributions that we empirically show can better tolerate ultra-low-bit quantization. Existing PTQ methods optimize weight quantization within each layer but apply the same target precision uniformly, ignoring this variation in complexity across layers. Building on this insight, we propose LUQ: Layerwise Ultra-Low Bit Quantization, which characterizes each transformer layer's functional complexity via its output activation entropy and selectively applies ultra-low-bit quantization to layers encoding simpler, more compressible functions. We also show that multimodal calibration (image and text tokens) boosts VQA performance in the ultra-low-bit regime. Evaluated on LLaVA-1.5 and Qwen2.5-VL across 9 VQA benchmarks, LUQ models use 40% and 31% less memory, respectively, than their 4-bit counterparts while exhibiting less than 10% degradation on MME.

TL;DR

LUQ is a method for extreme (<4-bit) compression of multimodal LLMs. By selectively quantizing only the layers that are robust to it — those with low activation entropy — it cuts memory sharply while preserving vision-language performance.

Method

LUQ rests on a single observation: not all layers are equally fragile under quantization. Three ideas turn that into a compression recipe.

01 · Diagnose

Multimodal activations carry higher entropy than text — but that entropy varies sharply by layer. We measure per-layer activation entropy from multimodal calibration tokens to find the resilient layers.
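The diagnosis step can be illustrated with a small sketch. The page does not say which entropy estimator is used, so the version below assumes a histogram-based Shannon entropy over the flattened activation values of each layer; the function names and the bin count are hypothetical, and pure Python is used only to keep the example self-contained.

```python
import math
from typing import Sequence


def activation_entropy(values: Sequence[float], num_bins: int = 256) -> float:
    """Shannon entropy (in bits) of a fixed-width histogram of activation values.

    Assumption: a histogram estimator; the source does not specify one.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant activation carries no information
    counts = [0] * num_bins
    scale = num_bins / (hi - lo)
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) * scale), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def rank_layers_by_entropy(layer_values: Sequence[Sequence[float]]) -> list[int]:
    """Return layer indices ordered from lowest to highest activation entropy."""
    entropies = [activation_entropy(v) for v in layer_values]
    return sorted(range(len(entropies)), key=lambda i: entropies[i])


# Toy activations: layer 0 is spread out, layer 1 is sharply peaked.
layers = [
    [0.0, 0.1, 0.9, 1.0, 0.5, 0.3, 0.7, 0.2],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
]
print(rank_layers_by_entropy(layers))  # the peaked layer (index 1) comes first
```

With real models the per-layer values would come from forward passes over the calibration tokens, and a tensor library's histogram routine would replace the Python loop.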

02 · Quantize low-entropy first

Layers are quantized to ultra-low bit-width in ascending entropy order, iterating until a memory or performance budget is met. High-entropy layers are kept at higher precision.
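As a rough illustration of this greedy loop, the sketch below adds layers in ascending entropy order and stops once a performance floor would be violated. The `score_fn` callback (which would quantize the given layers and evaluate the model) and the `min_score` threshold are hypothetical stand-ins; a memory target could be handled the same way by stopping once the model is small enough.

```python
from typing import Callable, Sequence


def greedy_select(
    entropies: Sequence[float],
    score_fn: Callable[[list[int]], float],
    min_score: float,
) -> list[int]:
    """Pick layers to quantize, lowest entropy first, while the score stays >= min_score."""
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    chosen: list[int] = []
    for layer in order:
        candidate = chosen + [layer]
        # score_fn is assumed to quantize `candidate` and return a benchmark score.
        if score_fn(candidate) < min_score:
            break
        chosen = candidate
    return chosen


# Toy stand-in for evaluation: each quantized layer costs its entropy in score.
toy_entropies = [3.0, 1.0, 2.0, 5.0]


def toy_score(quantized: list[int]) -> float:
    return 100.0 - sum(toy_entropies[i] for i in quantized)


print(greedy_select(toy_entropies, toy_score, min_score=96.0))  # [1, 2]
```

In the toy run, layers 1 and 2 are accepted (scores 99 and 97), and adding layer 0 would drop the score to 94, so the loop stops there.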

03 · Mixed-token calibration

Calibrating PTQ with a mix of image and text tokens (ratio α) rather than text alone measurably boosts VQA accuracy in the ultra-low bit regime.
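One way to picture the mixing ratio is the sketch below, which builds a calibration set in which a fraction α of the samples is multimodal and the rest is text-only. This is an assumption about how α is defined: the page only states that image and text tokens are mixed at ratio α, so mixing at the sample level, the function name, and the seed argument are all illustrative.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def mix_calibration(
    text_samples: Sequence[T],
    image_samples: Sequence[T],
    alpha: float,
    total: int,
    seed: int = 0,
) -> list[T]:
    """Draw `total` calibration samples, a fraction `alpha` of them multimodal."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    n_image = round(alpha * total)
    n_text = total - n_image
    # random.sample draws without replacement from each pool.
    mixed = rng.sample(list(image_samples), n_image) + rng.sample(list(text_samples), n_text)
    rng.shuffle(mixed)
    return mixed


text_pool = [("text", i) for i in range(100)]
image_pool = [("image", i) for i in range(100)]
calib = mix_calibration(text_pool, image_pool, alpha=0.25, total=40)
print(sum(1 for kind, _ in calib if kind == "image"))  # 10
```

Setting `alpha=0.0` recovers text-only calibration, which is the baseline the mixed setting is compared against.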

The LUQ pipeline. (i) Generate multimodal calibration tokens. (ii) Extract layerwise activations from the MLLM. (iii) Compute each layer's activation entropy to identify quantization-resilient (low-entropy) layers. (iv) Iteratively quantize the lowest-entropy layers to ultra-low bit-width until a performance/memory budget is met. (v) Combine the layers — now at different precisions — into the final compressed model.

Why it works

The observations behind LUQ, measured on Qwen-2.5-VL-7B.

[Figure: activation entropy vs. layer depth for multimodal and text-only tokens; multimodal is consistently higher.]
Multimodal > text entropy. Activations produced by multimodal tokens have consistently higher entropy than purely text tokens, helping explain why MLLMs are harder to quantize than text-only LLMs.
[Figure: activation entropy plotted across 27 layers, showing large variation between layers.]
Entropy varies by layer. Intermediate-activation entropy swings substantially with layer depth: some layers are far lower-entropy than others. This spread is exactly what LUQ exploits.
[Figure: TextVQA accuracy vs. number of quantized layers; lowest-entropy-first stays high, highest-entropy-first collapses.]
Order matters. Quantizing low-entropy layers first preserves TextVQA accuracy; quantizing high-entropy layers first causes a steep decline. The gap holds across the number of layers quantized, validating LUQ's selection rule.
[Figure: TextVQA accuracy vs. mixing ratio α; accuracy rises with α, then saturates.]
Mix in image tokens. Even a small mixing ratio α > 0 of multimodal calibration tokens improves TextVQA performance, with gains saturating as α grows.

Results

Across 9 VQA benchmarks, LUQ delivers a far better performance-vs-size trade-off than standard PTQ, which collapses below 3 bits.

40% less memory than 4-bit on LLaVA-1.5, at comparable accuracy
31% less memory than 4-bit on Qwen-2.5-VL
<10% MME degradation vs. the FP16 baseline
Method                Avg. Bits  MME Per.  MME Cog.  MMBench  TextVQA  VQAv2  GQA   POPE  ChartQA  DocQA  MathVista

LLaVA-1.5 7B Backbone
FP16 (Baseline)       16         1510      350       63.4     58.2     78.5   62.0  83.2  -        -      23.6
GPTQ                  4          1450      347       58.2     56.8     76.3   61.4  76.0  -        -      20.1
AWQ                   4          1456      349       59.8     56.7     76.6   61.5  76.7  -        -      20.6
GPTQ                  3          1346      273       31.2     54.1     73.5   58.8  70.5  -        -      16.4
GPTQ*                 2          0.0       0.0       0.0      0.0      0.0    0.0   0.0   -        -      0.0
BiLLM                 1.08       561       39        7.4      15.6     37.2   22.7  25.5  -        -      3.5
LUQ 16-layer (Ours)   2.54       1365      257       46.7     53.4     74.9   58.2  74.5  -        -      18.7

Qwen-2.5-VL 7B Backbone
FP16 (Baseline)       16         1695      640       82.6     84.9     83.5   60.5  86.1  87.3     95.7   68.2
GPTQ                  4          1638      610       80.2     84.2     82.6   60.1  84.8  84.1     93.4   44.8
AWQ                   4          1645      620       80.9     84.6     82.7   60.5  85.6  84.5     93.5   46.1
GPTQ                  3          319       131       34.7     79.5     81.5   53.4  82.9  61.0     89.2   21.0
GPTQ*                 2          0.0       0.0       0.0      0.0      0.0    0.0   0.0   0.0      0.0    0.0
BiLLM                 1.08       638       42        9.7      26.3     39.5   4.3   70.7  3.7      20.3   15.1
LUQ 12-layer (Ours)   2.75       1640      600       63.7     81.9     79.7   52.9  84.7  68.6     90.5   41.7

All scores higher-is-better. * indicates models that produced incoherent/gibberish output. ChartQA and DocQA are omitted for LLaVA-1.5, whose FP16 baseline is too low for a meaningful quantization comparison. LUQ values are means over 3 runs.
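The Avg. Bits column for the LUQ rows can be sanity-checked with a short calculation. The sketch below rests on assumptions that are not stated on this page: that the language backbones have 32 (LLaVA-1.5 7B) and 28 (Qwen-2.5-VL 7B) transformer layers, that quantized layers sit at about 1.08 bits (the BiLLM row), and that the remaining layers stay at 4 bits. Under those assumptions the weighted average reproduces the reported values.

```python
def average_bits(num_layers: int, num_low: int, low_bits: float, high_bits: float) -> float:
    """Layer-weighted average bit-width for an inter-layer mixed-precision model."""
    return (num_low * low_bits + (num_layers - num_low) * high_bits) / num_layers


# Assumed layer counts and bit-widths; see the note above.
llava_bits = average_bits(num_layers=32, num_low=16, low_bits=1.08, high_bits=4.0)
qwen_bits = average_bits(num_layers=28, num_low=12, low_bits=1.08, high_bits=4.0)

print(round(llava_bits, 2))  # 2.54
print(round(qwen_bits, 2))   # 2.75
```

The same formula gives a quick estimate of where a different layer count would land on the trade-off curve, treating all layers as equally sized.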

Performance vs. memory trade-off

Progressively quantizing more layers (lowest entropy first) traces a trade-off frontier. LUQ stays close to FP16 well into the sub-3-bit regime, where GPTQ and AWQ collapse.

[Figure: MME score vs. average bit-width for LLaVA-1.5; LUQ degrades gracefully while GPTQ/AWQ collapse below 3 bits.]
LLaVA-1.5 7B. MME score vs. average bit-width.
[Figure: MME score vs. average bit-width for Qwen-2.5-VL; LUQ holds near FP16 down to low bit-widths.]
Qwen-2.5-VL 7B. MME score vs. average bit-width.

Real-world deployment

Because LUQ uses inter-layer (not intra-layer) mixed precision, each layer runs with a single homogeneous kernel — so existing engines like llama.cpp can run it directly. On Qwen-2.5-VL 7B, that translates into real speedups on commodity CPUs.

45× faster than FP16 on an Intel i7 laptop (9.0 vs 0.2 tok/s)
1.9× faster than standard 4-bit (Q4_K_M) on the same laptop
3.4 GB memory footprint: ~23% below 4-bit, ~4× below FP16
Model configuration       Intel i7-13620H laptop (tok/s)  AMD Threadripper workstation (tok/s)  Memory
FP16 (Baseline)           0.2                             4.9                                   14.5 GB
Q4_K_M (Standard 4-bit)   4.8                             14.1                                  4.4 GB
LUQ (Mixed Precision)     9.0                             18.7                                  3.4 GB

Generation throughput (tokens/sec) via llama.cpp, averaged over 10 runs. Ultra-low-bit layers map to IQ1_M (~1.75 bpw) and high-precision layers to Q4_K_M (4-bit).
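The headline deployment figures follow directly from the reported laptop throughput and memory numbers. The short calculation below only re-derives them from those values; the dictionary and variable names are illustrative.

```python
# Reported values for Qwen-2.5-VL 7B on the Intel i7-13620H laptop.
tok_per_s = {"FP16": 0.2, "Q4_K_M": 4.8, "LUQ": 9.0}
mem_gb = {"FP16": 14.5, "Q4_K_M": 4.4, "LUQ": 3.4}

speedup_vs_fp16 = tok_per_s["LUQ"] / tok_per_s["FP16"]
speedup_vs_q4 = tok_per_s["LUQ"] / tok_per_s["Q4_K_M"]
mem_saving_vs_q4 = 1.0 - mem_gb["LUQ"] / mem_gb["Q4_K_M"]
mem_ratio_vs_fp16 = mem_gb["FP16"] / mem_gb["LUQ"]

print(f"{speedup_vs_fp16:.0f}x faster than FP16")         # 45x
print(f"{speedup_vs_q4:.1f}x faster than Q4_K_M")         # 1.9x
print(f"{mem_saving_vs_q4:.0%} less memory than Q4_K_M")  # 23%
print(f"{mem_ratio_vs_fp16:.1f}x smaller than FP16")      # 4.3x
```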

BibTeX

If you find our work useful, please consider citing:

@article{bhatnagar2026luq,
    title   = {LUQ: Layerwise Ultra-Low Bit Quantization for
               Multimodal Large Language Models},
    author  = {Bhatnagar, Shubhang and Xu, Andy and Tan, Kar-Han
               and Ahuja, Narendra},
    journal = {Transactions on Machine Learning Research (TMLR)},
    year    = {2026},
    url     = {https://openreview.net/forum?id=3eK6U6ZiSp}
}