NoWag: Unified Compression for Large Language Models
Best AI papers explained - A podcast by Enoch H. Kang - Tuesdays

We discuss NoWag, a novel framework for compressing large language models (LLMs) while preserving their structure. This unified approach covers both pruning (removing less important connections) and vector quantization (grouping weights and reducing their precision), and guides both with a normalization step informed by weight and activation statistics. Experiments on Llama models show that NoWag significantly outperforms state-of-the-art zero-shot quantization methods while using less calibration data, and achieves competitive pruning results, suggesting a shared underlying principle behind effective LLM compression.
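To make the idea of weight-and-activation-guided, normalized importance scoring concrete, here is a minimal NumPy sketch. The specific scoring rule and normalization below are illustrative placeholders, not NoWag's actual formulation; function names and the calibration setup are hypothetical.

```python
import numpy as np

def normalized_scores(W, act_norms, eps=1e-8):
    """Illustrative score: scale each weight by the norm of its input
    activations, then normalize per output row (placeholder choice,
    not necessarily NoWag's exact normalization)."""
    scaled = np.abs(W) * act_norms[None, :]                 # weight-activation product
    row_rms = np.sqrt((scaled ** 2).mean(axis=1, keepdims=True)) + eps
    return scaled / row_rms                                 # per-row normalization

def prune_by_score(W, scores, sparsity=0.5):
    """Zero out the weights with the lowest normalized scores."""
    k = int(W.size * sparsity)
    threshold = np.partition(scores.ravel(), k)[k]
    return W * (scores >= threshold)

# Toy example: a 4x8 weight matrix and per-input-channel activation
# norms computed from a small (hypothetical) calibration batch.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
calib_acts = rng.normal(size=(16, 8))
act_norms = np.linalg.norm(calib_acts, axis=0)

scores = normalized_scores(W, act_norms)
W_pruned = prune_by_score(W, scores, sparsity=0.5)
print(f"retained weights: {np.count_nonzero(W_pruned)} / {W.size}")
```

The same normalized scores could in principle drive vector quantization instead of pruning (e.g., weighting the codebook assignment error), which is the sense in which a single normalization can unify both forms of compression.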