From e509097d200f40afc14e946fad2f71cbfd7b2690 Mon Sep 17 00:00:00 2001
From: Akemi Izuko
Date: Sun, 31 Dec 2023 15:44:32 -0700
Subject: [PATCH] Unix: add quantization methods

---
 src/content/unix/quantization.md | 108 +++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 src/content/unix/quantization.md

diff --git a/src/content/unix/quantization.md b/src/content/unix/quantization.md
new file mode 100644
index 0000000..9a1027c
--- /dev/null
+++ b/src/content/unix/quantization.md
@@ -0,0 +1,108 @@
+---
+title: 'Llama Quantization Methods'
+description: 'A short overview of modern quantization methods in language models'
+updateDate: 'Dec 31 2023'
+heroImage: '/images/tiny-llama-logo.avif'
+---
+
+*"TinyLlama logo" by The TinyLlama project. Licensed under Apache 2.0*
+ +"Llama" refers to a Large Language Model (LLM). "Local llama" refers to a +locally-hosted (typically open source) llama, in contrast to commercially hosted +ones. + +# Quantization Methods + +Quantization is the process of "compressing" a model's weights by changing them +to lower-precision representations. Typically this goes from a 32bit float, to +around 4bits, which is important for low-memory systems. These are then +dynamically cast to Bfloat16 at runtime for inference. + +Quantization saves space but makes inference slower due to the dynamic cast and +loses precision, making models worse. However, the losses are typically +acceptable; a 4-bit quantized 56B model outperforms a 7B unquantized model. For +local llamas, it's even more important as most people don't have computers with +several 100GB of VRAM, making quantization necessary for running these models in +the first place. + +Pre-quantization is the method of quantizing before running. This makes it +possible for people without massive amounts of RAM to obtain quantized models. +Currently, [TheBloke](https://huggingface.co/TheBloke) is the standard +distributor of pre-quantized models. He will often upload pre-quantized versions +of the latest models within a few days. + +There are currently three competing standards for quantization, each with their +own pros and cons. In general, for your own local llamas, you likely want +[GGUF](#gguf). + +## GGUF + +GGUF is currently the most widely used standard amongst local llama enthusiasts. +It allows using the CPU RAM, in addition to the VRAM, to run models. This means +the maximum models size is RAM + VRAM, not just VRAM. As more of the model is +loaded into the RAM, inference becomes slower. + +Background: + - Developed by Meta for [Llama2] and prompted by + [llama.cpp](https://github.com/ggerganov/llama.cpp) to supersede (and + deprecate) GGML. + - Currently the most popular format amongst hobbyists. + +Pros: + - The only format to run on both RAM and VRAM. + - Offloads as much computation as possible onto the GPU. + - Works spectacularly on Apple Silicon's unified memory model. + +Cons: + - Theoretically slower for models that can fit entirely into the VRAM. Not + noticeable in practice. + - Requires an additional conversation step, since base models are typically + released in a safetensors format. [TheBloke](https://huggingface.co/TheBloke) + often prioritizes GGUF quantizations. + - Quantization into GGUF can fail, meaning some bleeding-edge models aren't + available in this format. + +## GPTQ + +GPTQ is the standard for models that are fully loaded into the VRAM. If you have +enough VRAM, this is a good default choice. You'll often see this using the file +extension ".safetensors" and occasionally ".bin" on Huggingface. + +Background: + - Introduced by [this paper](https://arxiv.org/abs/2210.17323) in 2022. + +Pros: + - Very fast, due to running entirely in VRAM. + +Cons: + - Can't run any model that exceeds VRAM capacity. + +## AWQ + +Stands for "Activation-aware Weight Quantization". + +This is bleeding-edge of quantization standards and a direct competitor to GPTQ. +It uses "mixed-quantization", which means it doesn't quantize all the weights. +Leaving the n% most frequently used weights unquantized is primary meant as a +way to avoid the computational cost of casting the weights to Bfloat16 all the +time. However, this also helps with model accuracy, as the most frequently used +weights retain their full precision. 
+
+Background:
+ - Paper released in [June 2023](https://arxiv.org/abs/2306.00978).
+
+Pros:
+ - Doesn't quantize the top n% most-used weights.
+ - Very fast, due to running entirely in VRAM.
+
+Cons:
+ - Not yet supported by most major backends, such as
+   [Ollama](https://ollama.ai). Support has recently been [merged into
+   llama.cpp](https://github.com/ggerganov/llama.cpp/pull/4593).
+ - Slightly bigger file size, as some weights aren't quantized.
+ - Can't run any model that exceeds VRAM capacity.
+ - The format is new, so older models often won't have AWQ pre-quantized
+   versions available.
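+
+## Example: Running a GGUF Model
+
+As a concrete follow-up to the GGUF section, here is a minimal sketch using
+the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) bindings.
+The file name is illustrative (any pre-quantized GGUF, e.g. a Q4_K_M file from
+TheBloke, will do), and `n_gpu_layers` is the knob that controls the
+RAM/VRAM split described above.
+
+```python
+from llama_cpp import Llama
+
+llm = Llama(
+    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # illustrative path
+    n_ctx=2048,        # context window
+    n_gpu_layers=20,   # layers offloaded to VRAM; 0 = CPU only, -1 = offload everything
+)
+
+out = llm("Q: What does 4-bit quantization trade away? A:", max_tokens=128)
+print(out["choices"][0]["text"])
+```
+
+Raise `n_gpu_layers` until the model no longer fits in VRAM; whatever is left
+over runs from CPU RAM, which is exactly the RAM + VRAM behavior described in
+the GGUF section.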