Unix: add quantization methods
parent 272806b28c
commit e509097d20

1 changed file with 108 additions and 0 deletions

src/content/unix/quantization.md (new file, +108)

@@ -0,0 +1,108 @@
---
title: 'Llama Quantization Methods'
description: 'A short overview of modern quantization methods in language models'
updateDate: 'Dec 31 2023'
heroImage: '/images/tiny-llama-logo.avif'
---

<p style="font-size: max(2vh, 10px); margin-top: 0; text-align: right">
"TinyLlama logo" by <a href="https://github.com/jzhang38/TinyLlama">The
TinyLlama project</a>. Licensed under Apache 2.0
</p>

"Llama" refers to a Large Language Model (LLM). "Local llama" refers to a
locally-hosted (typically open source) llama, in contrast to commercially hosted
ones.

# Quantization Methods

Quantization is the process of "compressing" a model's weights by converting
them to lower-precision representations. Typically this goes from a 32-bit
float down to around 4 bits, which is important for low-memory systems. The
quantized values are then dynamically cast back to bfloat16 at runtime for
inference.
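
To make this concrete, here is a minimal sketch of blockwise, symmetric 4-bit
quantization in plain numpy. It is a toy illustration of the idea, not the
exact scheme any particular backend uses (a real implementation would also pack
two 4-bit values into each byte instead of storing them as `int8`):

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Toy symmetric 4-bit quantization with one scale per block."""
    blocks = weights.reshape(-1, block_size)
    # Map the largest magnitude in each block onto the 4-bit range [-7, 7].
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Cast the quantized integers back to floats for inference."""
    return (q.astype(np.float32) * scales).reshape(-1)

weights = np.random.randn(1024).astype(np.float32)  # stand-in for model weights
q, scales = quantize_4bit(weights)
restored = dequantize_4bit(q, scales)
print("max round-trip error:", np.abs(weights - restored).max())
```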
Quantization saves space, but it makes inference slower (due to the dynamic
cast) and loses precision, making models worse. However, the losses are
typically acceptable; a 4-bit quantized 56B model still outperforms an
unquantized 7B model. For local llamas it's even more important, as most people
don't have computers with several hundred GB of VRAM, making quantization
necessary for running these models in the first place.
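
A quick back-of-the-envelope calculation shows why. The numbers below only
count weight storage and ignore the KV cache and other runtime overhead:

```python
def weight_storage_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (32, 16, 4):
    print(f"70B model at {bits:>2}-bit: ~{weight_storage_gb(70, bits):.0f} GB")
# 70B at 32-bit: ~280 GB, at 16-bit: ~140 GB, at 4-bit: ~35 GB
```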
Pre-quantization is the practice of quantizing a model ahead of time, before it
is run. This makes it possible for people without massive amounts of RAM to
obtain already-quantized models. Currently,
[TheBloke](https://huggingface.co/TheBloke) is the de facto distributor of
pre-quantized models. He will often upload pre-quantized versions of the latest
models within a few days of their release.
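
Assuming you want one of these pre-quantized files, downloading it with the
`huggingface_hub` library might look like the sketch below. The repository and
file names are placeholders; check the actual model card for the real ones:

```python
from huggingface_hub import hf_hub_download

# Hypothetical repo and filename, purely for illustration.
path = hf_hub_download(
    repo_id="TheBloke/SomeModel-7B-GGUF",
    filename="somemodel-7b.Q4_K_M.gguf",
)
print("model downloaded to", path)
```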
There are currently three competing standards for quantization, each with their
own pros and cons. In general, for your own local llamas, you likely want
[GGUF](#gguf).

## GGUF

GGUF is currently the most widely used standard amongst local llama enthusiasts.
It allows using the CPU RAM, in addition to the VRAM, to run models. This means
the maximum model size is RAM + VRAM, not just VRAM. As more of the model is
loaded into the RAM, inference becomes slower.
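
The RAM/VRAM split is controlled by how many layers you offload to the GPU. A
minimal sketch using the `llama-cpp-python` bindings, assuming you already have
a GGUF file on disk (the path and layer count below are placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/somemodel-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # layers offloaded to VRAM; the rest stay in CPU RAM
    n_ctx=2048,
)

out = llm("Q: What does GGUF stand for?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```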
Background:
- Introduced by the [llama.cpp](https://github.com/ggerganov/llama.cpp) project
  (around the Llama 2 era) to supersede (and deprecate) GGML.
- Currently the most popular format amongst hobbyists.

Pros:
- The only format to run on both RAM and VRAM.
- Offloads as much computation as possible onto the GPU.
- Works spectacularly on Apple Silicon's unified memory model.

Cons:
- Theoretically slower for models that can fit entirely into the VRAM. Not
  noticeable in practice.
- Requires an additional conversion step, since base models are typically
  released in a safetensors format. [TheBloke](https://huggingface.co/TheBloke)
  often prioritizes GGUF quantizations.
- Quantization into GGUF can fail, meaning some bleeding-edge models aren't
  available in this format.

## GPTQ

GPTQ is the standard for models that are loaded entirely into the VRAM. If you
have enough VRAM, this is a good default choice. You'll often see it distributed
with the file extension ".safetensors" and occasionally ".bin" on Hugging Face.
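
As a sketch of what using such a model looks like: recent versions of the
`transformers` library can load GPTQ-quantized repositories directly, provided
the GPTQ extras are installed. The repository name below is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/SomeModel-7B-GPTQ"  # placeholder repository name

tokenizer = AutoTokenizer.from_pretrained(repo)
# The quantization config stored in the repo tells transformers it's GPTQ.
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Quantization is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```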
Background:
- Introduced by [this paper](https://arxiv.org/abs/2210.17323) in 2022.

Pros:
- Very fast, due to running entirely in VRAM.

Cons:
- Can't run any model that exceeds VRAM capacity.

## AWQ

AWQ stands for "Activation-aware Weight Quantization".

This is the bleeding edge of quantization standards and a direct competitor to
GPTQ. It uses "mixed quantization", which means it doesn't quantize all the
weights. Leaving the n% most frequently used weights unquantized is primarily
meant as a way to avoid the computational cost of casting those weights to
bfloat16 all the time. However, it also helps with model accuracy, as the most
frequently used weights retain their full precision.
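
A toy numpy sketch of this mixed-precision idea, following the description
above rather than the actual AWQ algorithm (which selects salient weights from
activation statistics): keep a small fraction of weights in full precision and
quantize the rest.

```python
import numpy as np

def mixed_quantize(weights: np.ndarray, keep_fraction: float = 0.01):
    """Toy mixed quantization: keep the largest-magnitude weights in float32,
    quantize everything else to 4-bit integers with a single scale."""
    k = max(1, int(len(weights) * keep_fraction))
    keep_idx = np.argsort(np.abs(weights))[-k:]  # stand-in for "salient" weights
    kept = weights[keep_idx].copy()              # these stay full precision

    rest = weights.copy()
    rest[keep_idx] = 0.0
    scale = max(np.abs(rest).max() / 7.0, 1e-12)
    q = np.clip(np.round(rest / scale), -7, 7).astype(np.int8)
    return q, scale, keep_idx, kept

def mixed_dequantize(q, scale, keep_idx, kept):
    out = q.astype(np.float32) * scale
    out[keep_idx] = kept  # restore the unquantized weights exactly
    return out

w = np.random.randn(4096).astype(np.float32)
q, scale, idx, kept = mixed_quantize(w)
print("max error:", np.abs(w - mixed_dequantize(q, scale, idx, kept)).max())
```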
Background:
- Paper released in [June 2023](https://arxiv.org/abs/2306.00978).

Pros:
- Doesn't quantize the top n% most used weights.
- Very fast, due to running entirely in VRAM.

Cons:
- Not yet supported on most major backends: ~~llama.cpp,~~
  [Ollama](https://ollama.ai)... Support has since been
  [merged in llama.cpp](https://github.com/ggerganov/llama.cpp/pull/4593).
- Slightly bigger file size, as some weights aren't quantized.
- Can't run any model that exceeds VRAM capacity.
- The format is new, so older models will often not have AWQ pre-quantization
  done for them.