From e509097d200f40afc14e946fad2f71cbfd7b2690 Mon Sep 17 00:00:00 2001
From: Akemi Izuko
Date: Sun, 31 Dec 2023 15:44:32 -0700
Subject: [PATCH] Unix: add quantization methods

---
 src/content/unix/quantization.md | 108 +++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 src/content/unix/quantization.md

diff --git a/src/content/unix/quantization.md b/src/content/unix/quantization.md
new file mode 100644
index 0000000..9a1027c
--- /dev/null
+++ b/src/content/unix/quantization.md
@@ -0,0 +1,108 @@
+---
+title: 'Llama Quantization Methods'
+description: 'A short overview of modern quantization methods in language models'
+updateDate: 'Dec 31 2023'
+heroImage: '/images/tiny-llama-logo.avif'
+---
+
+*"TinyLlama logo" by The TinyLlama project. Licensed under Apache 2.0*
+ +"Llama" refers to a Large Language Model (LLM). "Local llama" refers to a +locally-hosted (typically open source) llama, in contrast to commercially hosted +ones. + +# Quantization Methods + +Quantization is the process of "compressing" a model's weights by changing them +to lower-precision representations. Typically this goes from a 32bit float, to +around 4bits, which is important for low-memory systems. These are then +dynamically cast to Bfloat16 at runtime for inference. + +Quantization saves space but makes inference slower due to the dynamic cast and +loses precision, making models worse. However, the losses are typically +acceptable; a 4-bit quantized 56B model outperforms a 7B unquantized model. For +local llamas, it's even more important as most people don't have computers with +several 100GB of VRAM, making quantization necessary for running these models in +the first place. + +Pre-quantization is the method of quantizing before running. This makes it +possible for people without massive amounts of RAM to obtain quantized models. +Currently, [TheBloke](https://huggingface.co/TheBloke) is the standard +distributor of pre-quantized models. He will often upload pre-quantized versions +of the latest models within a few days. + +There are currently three competing standards for quantization, each with their +own pros and cons. In general, for your own local llamas, you likely want +[GGUF](#gguf). + +## GGUF + +GGUF is currently the most widely used standard amongst local llama enthusiasts. +It allows using the CPU RAM, in addition to the VRAM, to run models. This means +the maximum models size is RAM + VRAM, not just VRAM. As more of the model is +loaded into the RAM, inference becomes slower. + +Background: + - Developed by Meta for [Llama2] and prompted by + [llama.cpp](https://github.com/ggerganov/llama.cpp) to supersede (and + deprecate) GGML. + - Currently the most popular format amongst hobbyists. + +Pros: + - The only format to run on both RAM and VRAM. + - Offloads as much computation as possible onto the GPU. + - Works spectacularly on Apple Silicon's unified memory model. + +Cons: + - Theoretically slower for models that can fit entirely into the VRAM. Not + noticeable in practice. + - Requires an additional conversation step, since base models are typically + released in a safetensors format. [TheBloke](https://huggingface.co/TheBloke) + often prioritizes GGUF quantizations. + - Quantization into GGUF can fail, meaning some bleeding-edge models aren't + available in this format. + +## GPTQ + +GPTQ is the standard for models that are fully loaded into the VRAM. If you have +enough VRAM, this is a good default choice. You'll often see this using the file +extension ".safetensors" and occasionally ".bin" on Huggingface. + +Background: + - Introduced by [this paper](https://arxiv.org/abs/2210.17323) in 2022. + +Pros: + - Very fast, due to running entirely in VRAM. + +Cons: + - Can't run any model that exceeds VRAM capacity. + +## AWQ + +Stands for "Activation-aware Weight Quantization". + +This is bleeding-edge of quantization standards and a direct competitor to GPTQ. +It uses "mixed-quantization", which means it doesn't quantize all the weights. +Leaving the n% most frequently used weights unquantized is primary meant as a +way to avoid the computational cost of casting the weights to Bfloat16 all the +time. However, this also helps with model accuracy, as the most frequently used +weights retain their full precision. 
+
+Background:
+ - Paper released in [June 2023](https://arxiv.org/abs/2306.00978).
+
+Pros:
+ - Doesn't quantize the top n% most-used weights.
+ - Very fast, due to running entirely in VRAM.
+
+Cons:
+ - Not yet supported by most major backends, such as
+   [Ollama](https://ollama.ai). Support has recently been [merged into
+   llama.cpp](https://github.com/ggerganov/llama.cpp/pull/4593).
+ - Slightly bigger file size, as some weights aren't quantized.
+ - Can't run any model that exceeds VRAM capacity.
+ - The format is new, so older models often won't have AWQ pre-quantized
+   versions available.
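+
+## Example: Running a GGUF Model
+
+As a concrete follow-up to the GGUF section, here is a minimal sketch using
+the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) bindings.
+The file name is illustrative (any pre-quantized GGUF, e.g. a Q4_K_M file from
+TheBloke, will do), and `n_gpu_layers` is the knob that controls the
+RAM/VRAM split described above.
+
+```python
+from llama_cpp import Llama
+
+llm = Llama(
+    model_path="./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # illustrative path
+    n_ctx=2048,        # context window
+    n_gpu_layers=20,   # layers offloaded to VRAM; 0 = CPU only, -1 = offload everything
+)
+
+out = llm("Q: What does 4-bit quantization trade away? A:", max_tokens=128)
+print(out["choices"][0]["text"])
+```
+
+Raise `n_gpu_layers` until the model no longer fits in VRAM; whatever is left
+over runs from CPU RAM, which is exactly the RAM + VRAM behavior described in
+the GGUF section.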