Papers: Quantization
FP8-LM: Training FP8 Large Language Models (arXiv:2310.18313)
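FP8-LM trains LLMs with FP8 tensors for weights, gradients, and optimizer states. As a minimal sketch of the core building block, the helpers below do per-tensor scaled quantize/dequantize into PyTorch's float8_e4m3fn dtype (PyTorch >= 2.1); the function names are mine, and this is not the paper's full mixed-precision recipe.

    import torch

    def fp8_quantize(x: torch.Tensor):
        # Per-tensor scaling into FP8 E4M3 (sketch, not the paper's full training scheme).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max           # 448.0
        scale = x.abs().max().clamp(min=1e-12) / fp8_max          # per-tensor scale
        x_fp8 = (x / scale).to(torch.float8_e4m3fn)
        return x_fp8, scale

    def fp8_dequantize(x_fp8, scale):
        return x_fp8.to(torch.float32) * scale

    w = torch.randn(4096, 4096)
    w_fp8, s = fp8_quantize(w)
    err = (fp8_dequantize(w_fp8, s) - w).abs().mean()
    print(f"mean abs roundtrip error: {err.item():.5f}")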
LLM-FP4: 4-Bit Floating-Point Quantized Transformers (arXiv:2310.16836)
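LLM-FP4 quantizes weights and activations to 4-bit floating point. The sketch below rounds values onto an E2M1-style FP4 grid with a per-channel scale; the paper additionally searches the exponent bias per channel, which is omitted here, and the helper name is illustrative.

    import torch

    # Nonnegative magnitudes representable by an E2M1-style FP4 format (sign handled separately).
    FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

    def fp4_quantize(w: torch.Tensor):
        # Round each value to the nearest FP4 grid point after per-output-channel scaling.
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP4_GRID.max()
        mag = (w / scale).abs()
        idx = (mag.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)   # nearest grid point
        return torch.sign(w) * FP4_GRID[idx] * scale

    w = torch.randn(8, 16)
    print((fp4_quantize(w) - w).abs().mean())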
TEQ: Trainable Equivalent Transformation for Quantization of LLMs (arXiv:2310.10944)
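TEQ learns a per-channel equivalent transformation: weights are multiplied by a scale and activations divided by it, so the full-precision output is unchanged, and the scale is trained so that the quantized layer matches the original output. A hedged sketch with round-to-nearest INT4 and illustrative hyperparameters:

    import torch

    def rtn_int4(w):
        # Symmetric 4-bit round-to-nearest with a straight-through estimator for gradients.
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 7.0
        q = (w / scale).round().clamp(-8, 7) * scale
        return w + (q - w).detach()                     # forward: q, backward: identity

    def teq_scales(w, x, steps=200, lr=1e-2):
        # Learn a per-input-channel scale s so that quant(w * s) applied to (x / s)
        # reproduces the full-precision output.
        log_s = torch.zeros(w.shape[1], requires_grad=True)
        opt = torch.optim.Adam([log_s], lr=lr)
        y_ref = x @ w.t()
        for _ in range(steps):
            s = log_s.exp()
            y = (x / s) @ rtn_int4(w * s).t()
            loss = torch.nn.functional.mse_loss(y, y_ref)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return log_s.exp().detach()

    w = torch.randn(64, 128)        # [out_features, in_features]
    x = torch.randn(256, 128)       # calibration activations [tokens, in_features]
    s = teq_scales(w, x)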
ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (arXiv:2309.16119)
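ModuLoRA finetunes 3-bit models by combining LoRA adapters with an arbitrary, pluggable ("modular") quantizer for the frozen base weights. A sketch of that layer structure, with a toy 3-bit round-to-nearest standing in for the external quantizer; class and function names are mine:

    import torch
    import torch.nn as nn

    class QuantizedLinearWithLoRA(nn.Module):
        # Frozen, quantizer-agnostic base weight plus trainable LoRA adapters.
        def __init__(self, weight, quantize, dequantize, r=16, alpha=32.0):
            super().__init__()
            self.packed = quantize(weight)             # opaque quantized representation
            self.dequantize = dequantize
            out_f, in_f = weight.shape
            self.lora_a = nn.Parameter(torch.randn(r, in_f) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(out_f, r))
            self.scaling = alpha / r

        def forward(self, x):
            w = self.dequantize(self.packed)           # base path, no gradient to base weight
            return x @ w.t() + (x @ self.lora_a.t()) @ self.lora_b.t() * self.scaling

    # Toy "modular quantizer": symmetric 3-bit round-to-nearest.
    def quantize(w):
        s = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 3.0
        return (w / s).round().clamp(-4, 3).to(torch.int8), s

    def dequantize(packed):
        q, s = packed
        return q.float() * s

    layer = QuantizedLinearWithLoRA(torch.randn(32, 64), quantize, dequantize)
    y = layer(torch.randn(4, 64))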
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (arXiv:2306.00978)
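AWQ scales salient weight channels (identified from activation magnitudes) before quantization and folds the inverse scale into the activations, choosing the scaling exponent alpha by a small grid search on output error. A sketch of that search for one linear layer, with illustrative names and simple INT4 round-to-nearest:

    import torch

    def rtn_int4(w):
        s = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 7.0
        return (w / s).round().clamp(-8, 7) * s

    def awq_search(w, x, grid=20):
        # Search a per-input-channel scale s = act_scale**alpha, alpha in [0, 1],
        # minimizing the output error of quant(w * s) applied to (x / s).
        act_scale = x.abs().mean(dim=0).clamp(min=1e-12)     # per-input-channel activation magnitude
        y_ref = x @ w.t()
        best_err, best_s = float("inf"), torch.ones_like(act_scale)
        for i in range(grid + 1):
            alpha = i / grid
            s = act_scale ** alpha
            s = s / (s.max() * s.min()).sqrt()                # normalization, assumed from the reference code
            y = (x / s) @ rtn_int4(w * s).t()
            err = (y - y_ref).pow(2).mean().item()
            if err < best_err:
                best_err, best_s, best_alpha = err, s, alpha
        return best_s, best_alpha

    w = torch.randn(64, 128)
    x = torch.randn(256, 128)
    s, alpha = awq_search(w, x)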
LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning (arXiv:2305.18403)
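LoRAPrune prunes the base model while finetuning LoRA, estimating weight importance from the LoRA parameters and their gradients rather than from gradients of the frozen full weight. The sketch below uses the proxy |(W + BA) * (grad(B)·A + B·grad(A))| aggregated per output channel; this is my reading of that idea, not the paper's exact structured criterion:

    import torch

    def lora_importance(w, a, b):
        # Importance from LoRA gradients only; assumes a backward pass filled a.grad and b.grad.
        merged = w + b @ a
        grad_ba = b.grad @ a + b @ a.grad           # first-order proxy for the gradient w.r.t. W
        return (merged * grad_ba).abs().sum(dim=1)  # one score per output channel

    w = torch.randn(32, 64)                                  # frozen base weight
    a = (torch.randn(8, 64) * 0.01).requires_grad_()         # LoRA A
    b = torch.zeros(32, 8, requires_grad=True)               # LoRA B
    x = torch.randn(16, 64)
    loss = (x @ (w + b @ a).t()).pow(2).mean()
    loss.backward()
    print(lora_importance(w, a, b).topk(5).indices)          # most important output channels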
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (arXiv:2211.10438)
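SmoothQuant migrates activation outliers into the weights through a per-channel smoothing scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha), so both tensors become easier to quantize while the layer output is mathematically unchanged. A minimal sketch (the helper name is mine):

    import torch

    def smooth_scales(x, w, alpha=0.5):
        # Per-input-channel smoothing: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
        act_max = x.abs().amax(dim=0).clamp(min=1e-5)        # over tokens
        w_max = w.abs().amax(dim=0).clamp(min=1e-5)          # over output channels
        return (act_max ** alpha) / (w_max ** (1 - alpha))

    x = torch.randn(256, 128)            # calibration activations [tokens, in_features]
    w = torch.randn(64, 128)             # weight [out_features, in_features]
    s = smooth_scales(x, w)
    x_s, w_s = x / s, w * s              # equivalence: y = x @ w.t() is unchanged
    assert torch.allclose(x @ w.t(), x_s @ w_s.t(), atol=1e-3)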
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (arXiv:2210.17323)
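GPTQ quantizes weight columns one at a time and compensates the remaining columns using the inverse Hessian of the layer reconstruction error. Below is a compact, unblocked rendering of that procedure with symmetric INT4 round-to-nearest; it follows the published algorithm in spirit but is not the optimized implementation:

    import torch

    def quantize_col(w, s):
        # Symmetric 4-bit round-to-nearest for one column, given per-row scales s.
        return (w / s).round().clamp(-8, 7) * s

    def gptq_simplified(w, x, damp=0.01):
        W = w.clone()                                    # [out_features, in_features]
        Q = torch.zeros_like(W)
        H = x.t() @ x                                    # proxy Hessian, [in, in]
        H += damp * torch.diag(H).mean() * torch.eye(H.shape[0])
        L = torch.linalg.cholesky(torch.cholesky_inverse(torch.linalg.cholesky(H)))
        Hinv = L.t()                                     # upper Cholesky factor of H^{-1}
        scales = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 7.0
        for i in range(W.shape[1]):
            q = quantize_col(W[:, i], scales[:, 0])
            Q[:, i] = q
            err = (W[:, i] - q) / Hinv[i, i]
            W[:, i:] -= torch.outer(err, Hinv[i, i:])    # compensate remaining columns
        return Q

    w = torch.randn(64, 128)
    x = torch.randn(512, 128)                            # calibration activations
    q = gptq_simplified(w, x)
    print((x @ q.t() - x @ w.t()).pow(2).mean())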
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (arXiv:2208.07339)
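LLM.int8() performs vector-wise INT8 matmuls but keeps the few "outlier" input features whose activations exceed a threshold in higher precision. A numerical sketch of that mixed decomposition (not the CUDA kernels); names and the threshold value follow my reading of the method:

    import torch

    def int8_vectorwise(a):
        # Absmax row-wise INT8 quantization (one scale per row of `a`).
        scale = a.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 127.0
        return (a / scale).round().clamp(-128, 127).to(torch.int8), scale

    def llm_int8_matmul(x, w, threshold=6.0):
        outlier = x.abs().amax(dim=0) > threshold                    # per input feature
        x_q, sx = int8_vectorwise(x[:, ~outlier])                    # [tokens, regular]
        w_q, sw = int8_vectorwise(w[:, ~outlier])                    # [out, regular]
        y_int8 = (x_q.float() @ w_q.float().t()) * (sx * sw.t())     # dequantized INT8 path
        y_fp = x[:, outlier] @ w[:, outlier].t()                     # outlier path, full precision
        return y_int8 + y_fp

    x = torch.randn(16, 128); x[:, 5] *= 20.0                        # inject an outlier feature
    w = torch.randn(64, 128)
    print((llm_int8_matmul(x, w) - x @ w.t()).abs().mean())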
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs (arXiv:2309.05516)
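This paper (SignRound) learns a small additive offset on the rounding decision, clamped to [-0.5, 0.5], and optimizes it with signed gradient descent against the layer's output error. A hedged sketch with a straight-through estimator for round() and illustrative hyperparameters:

    import torch

    def quant_with_offset(w, v, scale):
        # Weight quantization where the usual round() is biased by a learned offset v in [-0.5, 0.5].
        q = torch.clamp((w / scale + v.clamp(-0.5, 0.5)).round(), -8, 7)
        return q * scale

    def signround(w, x, steps=200, lr=5e-3):
        # Learn the rounding offsets with signed gradient descent: update = lr * sign(grad).
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 7.0
        v = torch.zeros_like(w, requires_grad=True)
        y_ref = x @ w.t()
        for _ in range(steps):
            z = w / scale + v.clamp(-0.5, 0.5)
            q = (z.round() - z).detach() + z               # straight-through estimator for round()
            y = x @ (q.clamp(-8, 7) * scale).t()
            loss = (y - y_ref).pow(2).mean()
            loss.backward()
            with torch.no_grad():
                v -= lr * v.grad.sign()
                v.grad.zero_()
        return quant_with_offset(w, v.detach(), scale)

    w = torch.randn(64, 128)
    x = torch.randn(256, 128)
    wq = signround(w, x)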
arXiv:2502.06786
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design (arXiv:2412.14590)
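MixLLM assigns bit-widths per output feature with a global ranking: the most salient output channels across all layers get 8-bit, the rest 4-bit, paired with an efficient GPU kernel design. The sketch below shows only the assignment step, with plain weight norm as a stand-in for the paper's loss-based saliency estimate; names and the 10% split are illustrative:

    import torch

    def assign_mixed_precision(weights, high_frac=0.1):
        # Globally rank output channels of every layer by a simple saliency proxy and
        # give the top `high_frac` fraction 8-bit, the rest 4-bit.
        saliency, keys = [], []
        for name, w in weights.items():
            saliency.append(w.pow(2).sum(dim=1).sqrt())          # one score per output channel
            keys.extend((name, i) for i in range(w.shape[0]))
        saliency = torch.cat(saliency)
        k = max(1, int(high_frac * saliency.numel()))
        top = set(saliency.topk(k).indices.tolist())
        return {keys[i]: (8 if i in top else 4) for i in range(len(keys))}

    weights = {"layer0.q_proj": torch.randn(64, 128), "layer1.mlp": torch.randn(256, 128)}
    bits = assign_mixed_precision(weights)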