Unix: add quantization methods

Akemi Izuko 2023-12-31 15:44:32 -07:00
parent 272806b28c
commit e509097d20
Signed by: akemi
GPG key ID: 8DE0764E1809E9FC

---
title: 'Llama Quantization Methods'
description: 'A short overview of modern quantization methods in language models'
updateDate: 'Dec 31 2023'
heroImage: '/images/tiny-llama-logo.avif'
---
<p style="font-size: max(2vh, 10px); margin-top: 0; text-align: right">
"TinyLlama logo" by <a href="https://github.com/jzhang38/TinyLlama">The
TinyLlama project</a>. Licensed under Apache 2.0
</p>
"Llama" refers to a Large Language Model (LLM). "Local llama" refers to a
locally-hosted (typically open source) llama, in contrast to commercially hosted
ones.
# Quantization Methods
Quantization is the process of "compressing" a model's weights by converting
them to lower-precision representations. Typically this takes each weight from a
32-bit float down to around 4 bits, which is important for low-memory systems.
The quantized weights are then dynamically cast back up to bfloat16 at runtime
for inference.
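
As a rough illustration, here is a minimal blockwise round-to-integer sketch in
NumPy. This is not the actual GGUF/GPTQ/AWQ algorithm, just the basic "scale,
round, cast back" idea; `float16` stands in for bfloat16 (NumPy has no native
bfloat16), and each 4-bit value is stored in its own `int8` rather than packed
two per byte.

```python
import numpy as np

def quantize_4bit(weights, block_size=64):
    """Blockwise symmetric 4-bit quantization: one float scale per block."""
    w = weights.reshape(-1, block_size)
    # Map each block into the signed 4-bit range [-7, 7]
    scales = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-8)
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales, dtype=np.float16):
    """Cast back to a 16-bit float at inference time."""
    return (q * scales).astype(dtype)

w = np.random.randn(4, 64).astype(np.float32)
q, scales = quantize_4bit(w)
w_hat = dequantize(q, scales)
print(np.abs(w - w_hat).max())  # small reconstruction error
```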
Quantization saves space, but it makes inference slower (due to the dynamic
cast) and loses precision, making models perform worse. The losses are typically
acceptable, though: a 4-bit quantized 56B model still outperforms an unquantized
7B model. For local llamas quantization matters even more, since most people
don't have several hundred gigabytes of VRAM; it's often what makes running
these models possible in the first place.
Pre-quantization means quantizing a model ahead of time and distributing the
already-quantized weights, so people without massive amounts of RAM can simply
download a model that already fits their hardware. Currently,
[TheBloke](https://huggingface.co/TheBloke) is the de-facto distributor of
pre-quantized models, often uploading pre-quantized versions of the latest
models within a few days of release.
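
For example, a pre-quantized GGUF file can be pulled down with the
`huggingface_hub` library. The repository and file names below are illustrative;
check the model card for the quantization level (e.g. Q4_K_M) that fits your
hardware.

```python
from huggingface_hub import hf_hub_download

# Downloads one pre-quantized GGUF file into the local Hugging Face cache.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",  # example repo
    filename="llama-2-7b-chat.Q4_K_M.gguf",   # example 4-bit quantization
)
print(path)  # local path of the downloaded model file
```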
There are currently three competing standards for quantization, each with their
own pros and cons. In general, for your own local llamas, you likely want
[GGUF](#gguf).
## GGUF
GGUF is currently the most widely used standard amongst local llama enthusiasts.
It allows using the CPU RAM, in addition to the VRAM, to run models. This means
the maximum models size is RAM + VRAM, not just VRAM. As more of the model is
loaded into the RAM, inference becomes slower.
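
A minimal sketch of partial offloading with the `llama-cpp-python` bindings,
assuming they are installed with GPU support; the model path and layer count
below are placeholders to adjust for your hardware.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # any local GGUF file
    n_gpu_layers=20,  # offload 20 transformer layers to VRAM, keep the rest in RAM
    n_ctx=4096,       # context window size
)

out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```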
Background:
- Introduced by the [llama.cpp](https://github.com/ggerganov/llama.cpp) project
  to supersede (and deprecate) the older GGML format; it's the usual way to run
  Meta's Llama 2 and similar open models on consumer hardware.
- Currently the most popular format amongst hobbyists.
Pros:
- The only format to run on both RAM and VRAM.
- Offloads as much computation as possible onto the GPU.
- Works spectacularly on Apple Silicon's unified memory model.
Cons:
- Theoretically slower for models that can fit entirely into the VRAM. Not
noticeable in practice.
- Requires an additional conversion step, since base models are typically
  released in the safetensors format. [TheBloke](https://huggingface.co/TheBloke)
  often prioritizes GGUF quantizations.
- Quantization into GGUF can fail, meaning some bleeding-edge models aren't
available in this format.
## GPTQ
GPTQ is the standard for models that are loaded entirely into VRAM. If you have
enough VRAM, this is a good default choice. You'll often see it under the file
extension ".safetensors", and occasionally ".bin", on Hugging Face.
Background:
- Introduced by [this paper](https://arxiv.org/abs/2210.17323) in 2022.
Pros:
- Very fast, due to running entirely in VRAM.
Cons:
- Can't run any model that exceeds VRAM capacity.
## AWQ
AWQ stands for "Activation-aware Weight Quantization". It is the bleeding edge
of quantization standards and a direct competitor to GPTQ. It uses "mixed
quantization", meaning it doesn't quantize all of the weights: roughly the n%
most salient weights (those that see the largest activations) are left
unquantized. This is primarily meant to avoid the computational cost of
constantly casting those weights to bfloat16, but it also helps model accuracy,
since the most important weights retain their full precision.
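
A toy NumPy sketch of that mixed-quantization idea: keep a small,
activation-salient fraction of weights at full precision and 4-bit quantize the
rest. This only illustrates the concept as described above, not the actual AWQ
algorithm.

```python
import numpy as np

def mixed_quantize(weights, act_scale, keep_frac=0.01):
    """Keep the ~keep_frac most salient weights unquantized; 4-bit quantize the rest."""
    # Salience proxy: weight magnitude scaled by average activation magnitude
    salience = np.abs(weights) * act_scale
    k = max(1, int(keep_frac * weights.size))
    threshold = np.partition(salience.ravel(), -k)[-k]
    keep_mask = salience >= threshold

    # Symmetric 4-bit quantization of the remaining weights
    scale = max(np.abs(weights[~keep_mask]).max() / 7.0, 1e-8)
    q = np.clip(np.round(weights / scale), -7, 7) * scale
    out = np.where(keep_mask, weights.astype(np.float16), q.astype(np.float16))
    return out, keep_mask

w = np.random.randn(256, 64).astype(np.float32)
act = np.abs(np.random.randn(64)).astype(np.float32)  # per-column activation scale
w_mixed, mask = mixed_quantize(w, act)
print(mask.mean())  # roughly 1% of weights kept at full precision
```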
Background:
- Paper released in [June 2023](https://arxiv.org/abs/2306.00978).
Pros:
- Doesn't quantize the top n% most used weights.
- Very fast, due to running entirely in VRAM.
Cons:
- Not yet supported on all major backends, e.g. ~~llama.cpp,~~
  [Ollama](https://ollama.ai). Support has since been [merged into
  llama.cpp](https://github.com/ggerganov/llama.cpp/pull/4593).
- Slightly bigger file size, as some weights aren't quantized.
- Can't run any model that exceeds VRAM capacity.
- The format is new, so older models will often not have AWQ pre-quantization
done for them.