Update gguf quants blog
parent 60cc4129db
commit cc7724eeea
3 changed files with 27 additions and 7 deletions

@@ -166,20 +166,23 @@ through GPT4, the llama that remains uncontested in practice.
This is where we currently are! Hence, things are just dates for now. We'll see
how much impact they have in a retrospective:

- **2024-01-22**: Bard with Gemini-Pro defeats all models except GPT4-Turbo in
Chatbot Arena. This is seen as questionably fair, since Bard has internet
access.
- **2024-01-29**: miqu gets released. This is a suspected Mistral-Medium leak.
Despite only having a 4-bit-quantized version, it's ahead of all current
local llamas.
- **2024-01-30**: Yi-34B is the largest local llama for language-vision. LLaVA 1.6,
built on top of it, sets new records in vision performance.
- **2024-02-08**: Google releases Gemini Advanced, a GPT4 competitor with similar
pricing. Public opinion seems to be that it's quite a bit worse than GPT4,
except it's less censored and much better at creative writing.
- **2024-02-15**: Google releases Gemini Pro 1.5, with 1 million tokens of context!
Third-party testing on r/LocalLLaMA shows it's effectively able to query
very large codebases, beating out GPT4 (with 32k context) on every test.
- **2024-02-15**: OpenAI releases Sora, a text-to-video model for up to 60s of
video. A huge amount of hype starts up around it "simulating the world", but
it's only open to a very small tester group.
- **2024-02-26**: Mistral releases Mistral-Large and simultaneously removes all
mentions of a commitment to open source from their website. They revert
this change the following day, after community backlash.

@@ -36,6 +36,7 @@ anyone looking to get caught up with the field.
- [Guidelines for prompting for characters](https://rentry.org/NG_CharCard)
- [ChatML from OpenAI is quickly becoming the standard for
prompting](https://news.ycombinator.com/item?id=34988748) (see the example after this list)
- [Chasm - multiplayer text generation game](https://chasm.run/)
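
As a quick illustration of ChatML: each turn is wrapped in `<|im_start|>` / `<|im_end|>` markers with a role name on the first line, and the model writes the final assistant turn. The system text and question below are just placeholder examples:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Summarize GGUF quantization in one sentence.<|im_end|>
<|im_start|>assistant
```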
#### Training
- [Teaching llama a new language through tuning](https://www.reddit.com/r/LocalLLaMA/comments/18oc1yc/i_tried_to_teach_mistral_7b_a_new_language)

@@ -1,7 +1,7 @@
---
title: 'Llama Quantization Methods'
description: 'A short overview of modern quantization methods in language models'
updateDate: 'March 10 2024'
heroImage: '/images/llama/pink-llama.avif'
---

@@ -64,6 +64,16 @@ Cons:
- Quantization into GGUF can fail, meaning some bleeding-edge models aren't
available in this format.

Being the most popular local quant, GGUF has several internal versions. The
original GGUF quants (e.g. `Q4_0`, `Q4_1`) quantized all the weights directly to
the same precision. K-quants are more recent and don't quantize uniformly: some
layers are quantized more, some less, and bits can be shared between weights.
For example, `Q4_K_M` is a 4-bit K-quant with the `M` (medium) mix of layer
precisions. In early 2024, I-quants were [also
introduced](https://github.com/ggerganov/llama.cpp/pull/4773) (e.g. `IQ4_S`).
I-quants involve more CPU-heavy work, which means they can run much slower than
K-quants in some cases, but faster in others.
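
To make this concrete, here is a rough numpy sketch of the idea behind the original block quants like `Q4_0`: weights are split into fixed-size blocks, each block stores one float scale, and every weight is rounded to a 4-bit integer. The function names and rounding below are illustrative only, not llama.cpp's actual implementation; K-quants and I-quants refine this scheme by mixing precisions across layers and quantizing the block scales themselves.

```python
import numpy as np

def quantize_q4_0_like(weights: np.ndarray, block_size: int = 32):
    """Toy Q4_0-style block quantization (an illustration, not llama.cpp's code)."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block, sized so the largest-magnitude weight fits in 4 bits.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Reconstruct approximate fp32 weights from the 4-bit values and block scales.
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4_0_like(w)
print("mean abs reconstruction error:", np.abs(w - dequantize(q, s)).mean())
```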
## GPTQ

GPTQ is the standard for models that are fully loaded into VRAM. If you have

@@ -105,3 +115,9 @@ Cons:
- Can't run any model that exceeds VRAM capacity.
- The format is new, so older models will often not have AWQ pre-quantization
done for them.

### Sources
- [GGUF quantization
thread](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)
- [GGUF quantization gist with
numbers](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)