LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Shubhang Bhatnagar¹,†
Andy Xu²,†
Kar Han Tan³
Narendra Ahuja¹

† Equal contribution and work done as an intern at HP Inc.

¹ University of Illinois at Urbana-Champaign
² University of California, Los Angeles
³ HP Inc.

arXiv: https://arxiv.org/abs/2509.23729

A demonstration of the LUQ-quantized Qwen 2.5 VL running on a laptop, showing that it runs significantly faster than the 4-bit and 16-bit versions of the model on the same device.

Abstract

Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens, and the intermediate layer activations they produce, exhibit significantly higher statistical variance and entropy than text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens vary significantly across layers, with some layers having lower-entropy activation distributions. We empirically show that such layers can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. We also show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.

TL;DR:

We introduce LUQ, a method for extreme (<4-bit) compression of Multimodal LLMs. By selectively quantizing only the layers that are robust to it (low activation entropy), we significantly reduce memory usage while preserving performance on vision-language tasks.

Method Overview

An overview of our LUQ pipeline: (i) Generate multimodal calibration tokens. (ii) Extract layerwise activations from the MLLM. (iii) Calculate the activation entropy for each layer to identify quantization-resilient layers (lower entropy). (iv) Iteratively quantize the lowest-entropy layers to an ultra-low bit-width until a performance or memory budget is met. (v) Combine the layers, now quantized to different precisions, into the final compressed model.
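Steps (iii) and (iv) of the pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the histogram-based entropy estimator, the helper names `activation_entropy` and `select_layers`, the synthetic Gaussian "activations", the 1-bit and 4-bit widths, and the assumption that all layers are the same size are all hypothetical choices made for illustration. Only a memory budget (expressed as a target average bit-width) is modeled; the performance-based stopping criterion is omitted.

```python
import math
import random


def activation_entropy(values, lo, hi, num_bins=64):
    """Shannon entropy (in bits) of a fixed-range histogram of activation values."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so values on or beyond the range edges land in the outermost bins.
        idx = min(max(int((v - lo) / width), 0), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def select_layers(entropies, low_bits, high_bits, target_avg_bits):
    """Give low_bits to layers in ascending-entropy order until the mean
    bit-width reaches the target (assumes equally sized layers)."""
    bits = [high_bits] * len(entropies)
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    for i in order:
        if sum(bits) / len(bits) <= target_avg_bits:
            break
        bits[i] = low_bits
    return bits


# Synthetic stand-in for per-layer calibration activations (hypothetical spreads).
rng = random.Random(0)
spreads = [0.2, 1.0, 0.5, 2.0]
layers = [[rng.gauss(0.0, s) for _ in range(2000)] for s in spreads]

# A shared histogram range keeps entropies comparable across layers.
lo = min(min(a) for a in layers)
hi = max(max(a) for a in layers)
entropies = [activation_entropy(a, lo, hi) for a in layers]

bits = select_layers(entropies, low_bits=1, high_bits=4, target_avg_bits=2.5)
for i, (h, b) in enumerate(zip(entropies, bits)):
    print(f"layer {i}: entropy {h:.2f} bits -> {b}-bit weights")
```

In this toy setup the two layers with the narrowest activation distributions receive the low bit-width, which mirrors the idea that lower-entropy layers are quantized first. A real pipeline would replace the synthetic lists with activations recorded on multimodal calibration data and would hand the chosen layers to an actual PTQ routine.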

Performance vs. Memory Trade-off

LUQ allows for a graceful trade-off between performance and model size. By progressively quantizing more layers (starting from the lowest entropy), we can achieve different compression rates. The plots below show that LUQ consistently achieves a better performance-vs-compression frontier than standard PTQ methods like GPTQ and AWQ, which suffer catastrophic performance collapse at sub-3-bit compression.

[Figure: performance vs. memory trade-off plots, one panel each for LLaVA 1.5 7B and Qwen 2.5 VL 7B (the latter measured on MME).]
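The compression side of this trade-off can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not the paper's accounting: it assumes a hypothetical model of 32 equally sized layers, a 1-bit low precision and a 4-bit high precision, and it ignores everything outside those layers (embeddings, vision encoder, quantization metadata). The function names are made up for this example.

```python
def average_bits(num_low, num_layers, low_bits, high_bits):
    """Mean weight bit-width when num_low of num_layers equally sized layers
    use low_bits and the remaining layers use high_bits."""
    if not 0 <= num_low <= num_layers:
        raise ValueError("num_low must be between 0 and num_layers")
    return (num_low * low_bits + (num_layers - num_low) * high_bits) / num_layers


def saving_vs_reference(avg_bits, reference_bits=4.0):
    """Fractional size reduction relative to a uniform reference bit-width."""
    return 1.0 - avg_bits / reference_bits


NUM_LAYERS = 32  # hypothetical layer count, chosen only for illustration

# Sweep the number of low-bit layers to trace out the compression axis.
for k in range(0, NUM_LAYERS + 1, 8):
    avg = average_bits(k, NUM_LAYERS, low_bits=1.0, high_bits=4.0)
    print(f"{k:2d} low-bit layers -> {avg:.2f} avg bits, "
          f"{saving_vs_reference(avg):.1%} smaller than uniform 4-bit")
```

Each additional low-bit layer moves the model one step along the compression axis, which is why the layerwise scheme yields a range of operating points rather than a single fixed bit-width.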

Comparison to State-of-the-Art Methods

We compare LUQ to state-of-the-art PTQ methods on 9 VQA benchmarks for LLaVA-1.5 7B and Qwen 2.5 VL 7B models. The table below shows that LUQ achieves a much better trade-off between performance and model size. For LLaVA-1.5, LUQ is 40% smaller than 4-bit models with comparable accuracy. For Qwen 2.5 VL, LUQ provides a 31.5% memory reduction while maintaining strong performance.

Method	Avg. Bits	MME Perception	MME Cognition	MM Bench	Text VQA	VQAv2	GQA	POPE	Chart QA	Doc QA	Math Vista
LLaVA-1.5 7B Backbone
FP16 (Baseline)	16	1510	350	63.4	58.2	78.5	62.0	83.2	-	-	23.6
GPTQ	4	1450	347	58.2	56.8	76.3	61.4	76.0	-	-	20.1
AWQ	4	1456	349	59.8	56.7	76.6	61.5	76.7	-	-	20.6
GPTQ	3	1346	273	31.2	54.1	73.5	58.8	70.5	-	-	16.4
GPTQ*	2	0	0	0.0	0.0	0.0	0.0	0.0	-	-	0.0
BiLLM	1.08	561	39	15.6	7.4	37.2	22.7	25.5	-	-	3.5
LUQ (Ours)	2.54	1365	257	53.4	46.7	74.9	58.2	74.5	-	-	18.7
Qwen 2.5 VL Backbone
FP16 (Baseline)	16	1695	640	84.9	82.6	83.5	60.5	86.1	87.3	95.7	68.2
GPTQ	4	1638	610	84.2	80.2	82.6	60.1	84.8	84.1	93.4	44.8
AWQ	4	1645	620	80.9	84.6	82.7	60.5	85.6	84.5	93.5	46.1
GPTQ	3	319	131	34.7	79.5	81.5	53.4	82.9	61.0	89.2	21.0
GPTQ*	2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
BiLLM	1.08	638	42	9.7	26.3	39.5	4.3	70.7	3.7	20.3	15.1
LUQ (Ours)	2.75	1640	600	63.7	81.9	79.7	52.9	84.7	68.6	90.5	41.7

* indicates models with incoherent/gibberish output.
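As a worked check on how to read the table, the relative drop of LUQ against the FP16 baseline can be computed directly from the MME perception column. The scores below are copied from the table; the helper name `relative_drop` is introduced here only for illustration.

```python
# MME perception scores copied from the table above: (FP16 baseline, LUQ).
MME_PERCEPTION = {
    "LLaVA-1.5 7B": (1510, 1365),
    "Qwen 2.5 VL": (1695, 1640),
}


def relative_drop(baseline, quantized):
    """Fraction of the baseline score lost after quantization."""
    return (baseline - quantized) / baseline


drops = {model: relative_drop(b, q) for model, (b, q) in MME_PERCEPTION.items()}
for model, drop in drops.items():
    print(f"{model}: {drop:.1%} drop in MME perception vs. FP16")
```

On these two rows the drop comes out to roughly 9.6% for LLaVA-1.5 7B and 3.2% for Qwen 2.5 VL.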

Citation

If you find our work useful, please consider citing:

@misc{bhatnagar2025luqlayerwiseultralowbit,
    title={LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models}, 
    author={Shubhang Bhatnagar and Andy Xu and Kar-Han Tan and Narendra Ahuja},
    year={2025},
    eprint={2509.23729},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2509.23729}, 
}
