Update gguf quants blog

This commit is contained in:
Akemi Izuko 2024-03-10 18:01:05 -06:00
parent 60cc4129db
commit cc7724eeea
Signed by: akemi
GPG key ID: 8DE0764E1809E9FC
3 changed files with 27 additions and 7 deletions


@ -166,20 +166,23 @@ through GPT4, the llama that remains uncontested in practice.
This is where we currently are! Hence, things are just dates for now. We'll see
how much impact they have in a retrospective:
- **2024-01-22**: Bard with Gemini-Pro defeats all models except GPT4-Turbo in
chatbot arena. This is seen as questionably fair, since Bard has internet
access.
- **2024-01-29**: miqu gets released. This is a suspected Mistral_Medium leak.
Despite only having a 4-bit-quantized version, it's ahead of all current
locallamas.
- **2024-01-30**: Yi-34B is the largest local llama for language-vision. LLaVA 1.6,
built on top of it, sets new records in vision performance.
- **2024-02-08**: Google releases Gemini Advanced, a GPT4 competitor with similar
pricing. Public opinion seems to be that it's quite a bit worse than GPT4,
except it's less censored and much better at creative writing.
- **2024-02-15**: Google releases Gemini Pro 1.5, with 1 million tokens of context!
Third-party testing on r/localllama shows it's effectively able to query
very large codebases, beating out GPT4 (with 32k context) on every test.
- **2024-02-15**: OpenAI releases Sora, a text-to-video model for up to 60s of
video. A huge amount of hype starts up around it "simulating the world", but
it's only open to a very small tester group.
- **2024-02-26**: Mistral releases Mistral-Large and simultaneously removes all
mentions of a commitment to open source from their website. They revert
this change the following day, after community backlash.


@ -36,6 +36,7 @@ anyone looking to get caught up with the field.
- [Guidelines for prompting for characters](https://rentry.org/NG_CharCard)
- [ChatML from OpenAI is quickly becoming the standard for
prompting](https://news.ycombinator.com/item?id=34988748)
- [Chasm - multiplayer text generation game](https://chasm.run/)
#### Training
- [Teaching llama a new language through tuning](https://www.reddit.com/r/LocalLLaMA/comments/18oc1yc/i_tried_to_teach_mistral_7b_a_new_language)


@ -1,7 +1,7 @@
---
title: 'Llama Quantization Methods'
description: 'A short overview of modern quantization methods in language models'
updateDate: 'March 10 2024'
heroImage: '/images/llama/pink-llama.avif'
---
@ -64,6 +64,16 @@ Cons:
- Quantization into GGUF can fail, meaning some bleeding-edge models aren't
available in this format.
Being the most popular local quant, GGUF has several internal versions. The
original GGUF quants (e.g. `Q4_0`, `Q4_1`) quantized all the weights directly to
the same precision. K-quants are more recent and don't quantize uniformly: some
layers are quantized more, some less, and bits can be shared between weights.
For example, `Q4_K_M` means it's a 4-bit K-quant of size variant `M`. In early
2024, I-quants were [also
introduced](https://github.com/ggerganov/llama.cpp/pull/4773) (e.g. `IQ4_S`).
I-quants do some more CPU-heavy work, which means they can run much slower than
K-quants in some cases, but faster in others.
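The naming scheme above can be sketched as a tiny parser. This is a hypothetical
helper (`parse_quant_label` is not part of llama.cpp), just to illustrate how the
labels decompose:

```python
# Rough sketch of llama.cpp quant label structure (assumed from the naming
# scheme described above, not from llama.cpp's own source):
# "Q4_0"/"Q4_1" are legacy quants, "Q4_K_M" is a K-quant with size
# variant M, and "IQ4_S" is an I-quant.
def parse_quant_label(label: str) -> dict:
    parts = label.split("_")
    if label.startswith("IQ"):
        family = "I-quant"
    elif "K" in parts:
        family = "K-quant"
    else:
        family = "legacy"
    bits = int(parts[0].lstrip("IQ"))  # e.g. "Q4" or "IQ4" -> 4
    variant = parts[-1] if len(parts) > 1 and parts[-1] != "K" else None
    return {"bits": bits, "family": family, "variant": variant}

print(parse_quant_label("Q4_K_M"))
```

The `bits` value is only the nominal precision; actual bits-per-weight varies by
layer in K-quants and I-quants, as described above.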
## GPTQ
GPTQ is the standard for models that are fully loaded into VRAM. If you have
@ -105,3 +115,9 @@ Cons:
- Can't run any model that exceeds VRAM capacity.
- The format is new, so older models will often not have AWQ pre-quantization
done for them.
### Sources
- [GGUF quantization
thread](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)
- [GGUF quantization gist with
numbers](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)