Unix: add quantization methods
parent 272806b28c
commit e509097d20

src/content/unix/quantization.md (new file, 108 additions)
@@ -0,0 +1,108 @@
---
title: 'Llama Quantization Methods'
description: 'A short overview of modern quantization methods in language models'
updateDate: 'Dec 31 2023'
heroImage: '/images/tiny-llama-logo.avif'
---

<p style="font-size: max(2vh, 10px); margin-top: 0; text-align: right">
"TinyLlama logo" by <a href="https://github.com/jzhang38/TinyLlama">The
TinyLlama project</a>. Licensed under Apache 2.0
</p>

"Llama" refers to a Large Language Model (LLM). "Local llama" refers to a
|
||||
locally-hosted (typically open source) llama, in contrast to commercially hosted
|
||||
ones.
|
||||
|
||||
# Quantization Methods
|
||||
|
||||
Quantization is the process of "compressing" a model's weights by converting
them to lower-precision representations. Typically this means going from a
32-bit float down to around 4 bits per weight, which is important for
low-memory systems. The quantized weights are then dynamically cast back to
bfloat16 at runtime for inference.

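To get a feel for the numbers, here is a small back-of-the-envelope sketch
(plain Python, no libraries) of how much memory just the weights of a
hypothetical 7B-parameter model take at different precisions. It ignores
activations, the KV cache, and quantization metadata, so real files are a bit
larger:

```python
# Back-of-the-envelope weight-memory estimate at different precisions.
# Ignores activations, the KV cache, and quantization metadata (scales/zeros),
# so real model files are somewhat larger than this.

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3

n_params = 7e9  # a hypothetical 7B-parameter model
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>10}: ~{weights_gib(n_params, bits):.1f} GiB")

# Prints roughly: fp32 ~26.1 GiB, fp16/bf16 ~13.0 GiB,
#                 8-bit ~6.5 GiB, 4-bit ~3.3 GiB
```
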
Quantization saves space, but it makes inference slower (due to the dynamic
cast) and loses precision, making models worse. However, the losses are
typically acceptable: a 4-bit quantized 56B model still outperforms an
unquantized 7B model. For local llamas, quantization matters even more, since
most people don't have computers with hundreds of gigabytes of VRAM; it's what
makes running these models possible in the first place.

Pre-quantization means quantizing a model ahead of time and distributing the
result, rather than every user quantizing it themselves. This makes it possible
for people without massive amounts of RAM to obtain quantized models.
Currently, [TheBloke](https://huggingface.co/TheBloke) is the standard
distributor of pre-quantized models. He will often upload pre-quantized
versions of the latest models within a few days of their release.

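As a sketch of what obtaining such a model looks like in practice, the snippet
below downloads a pre-quantized GGUF file from the Hugging Face Hub; the
repository and file names follow TheBloke's usual naming scheme but are only
examples:

```python
# Sketch: fetching a pre-quantized model file from the Hugging Face Hub.
# The repo_id and filename are examples of TheBloke's naming scheme,
# not a specific recommendation.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",  # example pre-quantized repository
    filename="llama-2-7b.Q4_K_M.gguf",   # example 4-bit quantization variant
)
print(f"Downloaded to {path}")
```
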
There are currently three competing standards for quantization, each with its
own pros and cons. In general, for your own local llamas, you likely want
[GGUF](#gguf).

## GGUF

GGUF is currently the most widely used standard amongst local llama
enthusiasts. It allows using CPU RAM, in addition to VRAM, to run models. This
means the maximum model size is RAM + VRAM, not just VRAM. As more of the model
is loaded into RAM, inference becomes slower.

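A minimal sketch of what this RAM/VRAM split looks like with
[llama-cpp-python](https://github.com/abetlen/llama-cpp-python); the model path
is a placeholder, and `n_gpu_layers` controls how many layers are offloaded to
the GPU while the rest stay in CPU RAM:

```python
# Minimal sketch of GGUF's RAM/VRAM split using llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=20,  # offload 20 layers to VRAM; the rest run from CPU RAM
    n_ctx=2048,       # context window size
)

out = llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```
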
Background:
- Developed by the [llama.cpp](https://github.com/ggerganov/llama.cpp) project
  (prompted by the release of Meta's Llama 2) to supersede (and deprecate) the
  older GGML format.
- Currently the most popular format amongst hobbyists.

Pros:
- The only one of the three formats that can run from both RAM and VRAM.
- Offloads as much computation as possible onto the GPU.
- Works spectacularly well on Apple Silicon's unified memory.

Cons:
- Theoretically slower for models that fit entirely into VRAM. Not noticeable
  in practice.
- Requires an additional conversion step, since base models are typically
  released in the safetensors format.
  [TheBloke](https://huggingface.co/TheBloke) often prioritizes GGUF
  quantizations, though.
- Quantization into GGUF can fail, meaning some bleeding-edge models aren't
  available in this format.

## GPTQ

GPTQ is the standard for models that are loaded entirely into VRAM. If you
have enough VRAM, this is a good default choice. You'll often see GPTQ models
using the file extension ".safetensors", and occasionally ".bin", on Hugging
Face.

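As a rough sketch, a pre-quantized GPTQ model can be loaded through the
`transformers` library (with `optimum` and `auto-gptq` installed); the
repository name below is just an example of TheBloke's naming scheme:

```python
# Sketch: loading a pre-quantized GPTQ model via transformers.
# Assumes the optimum and auto-gptq packages are installed; the repo name is
# only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Llama-2-7B-GPTQ"  # example pre-quantized repository
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
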
Background:
- Introduced by [this paper](https://arxiv.org/abs/2210.17323) in 2022.

Pros:
- Very fast, due to running entirely in VRAM.

Cons:
- Can't run any model that exceeds VRAM capacity.

## AWQ

AWQ stands for "Activation-aware Weight Quantization".

This is the bleeding edge of quantization standards and a direct competitor to
GPTQ. It uses "mixed quantization", which means it doesn't quantize all of the
weights. Leaving the n% most frequently used weights unquantized is primarily
meant as a way to avoid the computational cost of casting those weights back to
bfloat16 all the time. However, it also helps with model accuracy, as the most
frequently used weights retain their full precision.

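Here is a toy sketch of the mixed-quantization idea described above: keep a
small fraction of "important" weights in full precision and quantize the rest
to 4 bits. This is only an illustration of the high-level concept, not the
actual AWQ algorithm:

```python
# Toy illustration of mixed quantization: keep the most "important" weights in
# full precision and 4-bit quantize the rest. Not the actual AWQ algorithm.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)  # stand-in for a weight matrix

keep_fraction = 0.01  # the "n%" of weights left unquantized
cutoff = np.quantile(np.abs(w), 1 - keep_fraction)
important = np.abs(w) >= cutoff

# Naive 4-bit (16-level) uniform quantization for the non-important weights.
rest = w[~important]
lo, hi = rest.min(), rest.max()
scale = (hi - lo) / 15
codes = np.round((rest - lo) / scale).astype(np.uint8)  # 4-bit codes
dequantized = codes.astype(np.float32) * scale + lo

mixed = w.copy()
mixed[~important] = dequantized  # important weights keep full precision

print("mean abs error:", float(np.abs(mixed - w).mean()))
```

In a real format the 4-bit codes would also be packed two per byte with
per-block scales; the point here is just that a small set of weights bypasses
quantization entirely.
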
Background:
- Paper released in [June 2023](https://arxiv.org/abs/2306.00978).

Pros:
- Doesn't quantize the top n% most used weights.
- Very fast, due to running entirely in VRAM.

Cons:
- Not yet supported on most major backends: ~~llama.cpp,~~
  [Ollama](https://ollama.ai)... (Support has since been
  [merged into llama.cpp](https://github.com/ggerganov/llama.cpp/pull/4593).)
- Slightly bigger file size, as some weights aren't quantized.
- Can't run any model that exceeds VRAM capacity.
- The format is new, so older models will often not have an AWQ
  pre-quantization available for them.