Unix: add quantization methods
parent 272806b28c
commit e509097d20

1 changed file with 108 additions and 0 deletions

src/content/unix/quantization.md (new file, +108)

@@ -0,0 +1,108 @@
---
title: 'Llama Quantization Methods'
description: 'A short overview of modern quantization methods in language models'
updateDate: 'Dec 31 2023'
heroImage: '/images/tiny-llama-logo.avif'
---

<p style="font-size: max(2vh, 10px); margin-top: 0; text-align: right">
"TinyLlama logo" by <a href="https://github.com/jzhang38/TinyLlama">The
TinyLlama project</a>. Licensed under Apache 2.0
</p>

"Llama" refers to a Large Language Model (LLM). "Local llama" refers to a
locally-hosted (typically open source) llama, in contrast to commercially hosted
ones.

# Quantization Methods

Quantization is the process of "compressing" a model's weights by converting
them to lower-precision representations. Typically this goes from a 32-bit
float down to around 4 bits, which is important for low-memory systems. The
quantized values are then dynamically cast back to bfloat16 at runtime for
inference.
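
To make this concrete, here is a minimal sketch of blockwise, symmetric 4-bit
quantization in plain numpy. It is a toy illustration of the idea, not the
exact scheme any particular backend uses (a real implementation would also pack
two 4-bit values into each byte instead of storing them as `int8`):

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Toy symmetric 4-bit quantization with one scale per block."""
    blocks = weights.reshape(-1, block_size)
    # Map the largest magnitude in each block onto the 4-bit range [-7, 7].
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Cast the quantized integers back to floats for inference."""
    return (q.astype(np.float32) * scales).reshape(-1)

weights = np.random.randn(1024).astype(np.float32)  # stand-in for model weights
q, scales = quantize_4bit(weights)
restored = dequantize_4bit(q, scales)
print("max round-trip error:", np.abs(weights - restored).max())
```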
Quantization saves space, but it makes inference slower (due to the dynamic
cast) and loses precision, making models worse. However, the losses are
typically acceptable; a 4-bit quantized 56B model still outperforms an
unquantized 7B model. For local llamas it's even more important, as most people
don't have computers with several hundred GB of VRAM, making quantization
necessary for running these models in the first place.
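
A quick back-of-the-envelope calculation shows why. The numbers below only
count weight storage and ignore the KV cache and other runtime overhead:

```python
def weight_storage_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (32, 16, 4):
    print(f"70B model at {bits:>2}-bit: ~{weight_storage_gb(70, bits):.0f} GB")
# 70B at 32-bit: ~280 GB, at 16-bit: ~140 GB, at 4-bit: ~35 GB
```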
Pre-quantization is the practice of quantizing a model ahead of time, before it
is run. This makes it possible for people without massive amounts of RAM to
obtain already-quantized models. Currently,
[TheBloke](https://huggingface.co/TheBloke) is the de facto distributor of
pre-quantized models. He will often upload pre-quantized versions of the latest
models within a few days of their release.
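
Assuming you want one of these pre-quantized files, downloading it with the
`huggingface_hub` library might look like the sketch below. The repository and
file names are placeholders; check the actual model card for the real ones:

```python
from huggingface_hub import hf_hub_download

# Hypothetical repo and filename, purely for illustration.
path = hf_hub_download(
    repo_id="TheBloke/SomeModel-7B-GGUF",
    filename="somemodel-7b.Q4_K_M.gguf",
)
print("model downloaded to", path)
```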
There are currently three competing standards for quantization, each with their
own pros and cons. In general, for your own local llamas, you likely want
[GGUF](#gguf).

## GGUF

GGUF is currently the most widely used standard amongst local llama enthusiasts.
It allows using the CPU RAM, in addition to the VRAM, to run models. This means
the maximum model size is RAM + VRAM, not just VRAM. As more of the model is
loaded into the RAM, inference becomes slower.
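
The RAM/VRAM split is controlled by how many layers you offload to the GPU. A
minimal sketch using the `llama-cpp-python` bindings, assuming you already have
a GGUF file on disk (the path and layer count below are placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/somemodel-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # layers offloaded to VRAM; the rest stay in CPU RAM
    n_ctx=2048,
)

out = llm("Q: What does GGUF stand for?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```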
Background:
- Introduced by the [llama.cpp](https://github.com/ggerganov/llama.cpp) project
  (around the Llama 2 era) to supersede (and deprecate) GGML.
- Currently the most popular format amongst hobbyists.

Pros:
- The only format to run on both RAM and VRAM.
- Offloads as much computation as possible onto the GPU.
- Works spectacularly on Apple Silicon's unified memory model.

Cons:
- Theoretically slower for models that can fit entirely into the VRAM. Not
  noticeable in practice.
- Requires an additional conversion step, since base models are typically
  released in a safetensors format. [TheBloke](https://huggingface.co/TheBloke)
  often prioritizes GGUF quantizations.
- Quantization into GGUF can fail, meaning some bleeding-edge models aren't
  available in this format.

## GPTQ

GPTQ is the standard for models that are loaded entirely into the VRAM. If you
have enough VRAM, this is a good default choice. You'll often see it distributed
with the file extension ".safetensors" and occasionally ".bin" on Hugging Face.
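
As a sketch of what using such a model looks like: recent versions of the
`transformers` library can load GPTQ-quantized repositories directly, provided
the GPTQ extras are installed. The repository name below is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/SomeModel-7B-GPTQ"  # placeholder repository name

tokenizer = AutoTokenizer.from_pretrained(repo)
# The quantization config stored in the repo tells transformers it's GPTQ.
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```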
Background:
- Introduced by [this paper](https://arxiv.org/abs/2210.17323) in 2022.

Pros:
- Very fast, due to running entirely in VRAM.

Cons:
- Can't run any model that exceeds VRAM capacity.

## AWQ

AWQ stands for "Activation-aware Weight Quantization".

This is the bleeding edge of quantization standards and a direct competitor to
GPTQ. It uses "mixed quantization", which means it doesn't quantize all the
weights. Leaving the n% most frequently used weights unquantized is primarily
meant as a way to avoid the computational cost of casting those weights to
bfloat16 all the time. However, it also helps with model accuracy, as the most
frequently used weights retain their full precision.
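
A toy numpy sketch of this mixed-precision idea, following the description
above rather than the actual AWQ algorithm (which selects salient weights from
activation statistics): keep a small fraction of weights in full precision and
quantize the rest.

```python
import numpy as np

def mixed_quantize(weights: np.ndarray, keep_fraction: float = 0.01):
    """Toy mixed quantization: keep the largest-magnitude weights in float32,
    quantize everything else to 4-bit integers with a single scale."""
    k = max(1, int(len(weights) * keep_fraction))
    keep_idx = np.argsort(np.abs(weights))[-k:]  # stand-in for "salient" weights
    kept = weights[keep_idx].copy()              # these stay full precision

    rest = weights.copy()
    rest[keep_idx] = 0.0
    scale = max(np.abs(rest).max() / 7.0, 1e-12)
    q = np.clip(np.round(rest / scale), -7, 7).astype(np.int8)
    return q, scale, keep_idx, kept

def mixed_dequantize(q, scale, keep_idx, kept):
    out = q.astype(np.float32) * scale
    out[keep_idx] = kept  # restore the unquantized weights exactly
    return out

w = np.random.randn(4096).astype(np.float32)
q, scale, idx, kept = mixed_quantize(w)
print("max error:", np.abs(w - mixed_dequantize(q, scale, idx, kept)).max())
```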
Background:
- Paper released in [June 2023](https://arxiv.org/abs/2306.00978).

Pros:
- Doesn't quantize the top n% most used weights.
- Very fast, due to running entirely in VRAM.

Cons:
- Not yet supported on most major backends: ~~llama.cpp,~~
  [Ollama](https://ollama.ai)... Support has since been
  [merged in llama.cpp](https://github.com/ggerganov/llama.cpp/pull/4593).
- Slightly bigger file size, as some weights aren't quantized.
- Can't run any model that exceeds VRAM capacity.
- The format is new, so older models will often not have AWQ pre-quantization
  done for them.