Update gguf quants blog

This commit is contained in:
Akemi Izuko 2024-03-10 18:01:05 -06:00
parent 60cc4129db
commit cc7724eeea
Signed by: akemi
GPG key ID: 8DE0764E1809E9FC
3 changed files with 27 additions and 7 deletions


@@ -166,20 +166,23 @@ through GPT4, the llama that remains uncontested in practice.
This is where we currently are! Hence, things are just dates for now. We'll see
how much impact they have in a retrospective:
- **2024-01-22**: Bard with Gemini-Pro defeats all models except GPT4-Turbo in
chatbot arena. This is seen as questionably fair, since Bard has internet
access.
- **2024-01-29**: miqu gets released. This is a suspected Mistral_Medium leak.
Despite only having a 4bit-quantized version, it's ahead of all current
locallamas.
- **2024-01-30**: Yi-34B is the largest local llama for language-vision. LLaVA 1.6
based on top of it sets new records in vision performance.
- **2024-02-08**: Google releases Gemini Advanced, a GPT4 competitor with similar
pricing. Public opinion seems to be that it's quite a bit worse than GPT4,
except it's less censored and much better at creative writing.
- **2024-02-15**: Google releases Gemini Pro 1.5, with 1 million tokens of context!
Third party testing on r/localllama shows it's effectively able to query
very large codebases, beating out GPT4 (with 32k context) on every test.
- **2024-02-15**: OpenAI releases Sora, a text-to-video model for up to 60s of
video. A huge amount of hype starts up around it "simulating the world", but
it's only open to a very small tester group.
- **2024-02-26**: Mistral releases Mistral-Large and simultaneously removes all
the mentions of a commitment to open source from their website. They revert
this change the following day, after the community backlash.


@@ -36,6 +36,7 @@ anyone looking to get caught up with the field.
- [Guidelines for prompting for characters](https://rentry.org/NG_CharCard)
- [ChatML from OpenAI is quickly becoming the standard for
prompting](https://news.ycombinator.com/item?id=34988748)
- [Chasm - multiplayer text generation game](https://chasm.run/)

#### Training
- [Teaching llama a new language through tuning](https://www.reddit.com/r/LocalLLaMA/comments/18oc1yc/i_tried_to_teach_mistral_7b_a_new_language)


@@ -1,7 +1,7 @@
---
title: 'Llama Quantization Methods'
description: 'A short overview of modern quantization methods in language models'
updateDate: 'March 10 2024'
heroImage: '/images/llama/pink-llama.avif'
---
@@ -64,6 +64,16 @@ Cons:
- Quantization into GGUF can fail, meaning some bleeding-edge models aren't
available in this format.

Being the most popular local quant, GGUF has several internal versions. The
original GGUF quants (eg `Q4_0`, `Q4_1`) quantized all the weights directly to
the same precision. K-quants are more recent and don't quantize uniformly. Some
layers are quantized more, some less, and bits can be shared between weights.
For example, `Q4_K_M` means it's a 4-bit K-quant of type `M`. In early 2024,
I-quants were [also
introduced](https://github.com/ggerganov/llama.cpp/pull/4773) (eg `IQ4_S`).
I-quants involve some extra CPU-heavy work, which means they can run much
slower than K-quants in some cases, but faster in others.
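To make the block-quantization idea concrete, here's a toy numpy sketch in the spirit of the original `Q4_0` scheme: every block of 32 weights shares one scale, and each weight is stored as a 4-bit integer. This is an illustration under simplified assumptions, not llama.cpp's actual code; the real implementation packs two 4-bit values per byte and chooses scales slightly differently.

```python
import numpy as np

def quantize_q4_0(weights, block_size=32):
    """Quantize a flat fp32 array: one shared scale per block of 32,
    weights stored as 4-bit integers in [-8, 7] (sketch, not llama.cpp)."""
    blocks = weights.reshape(-1, block_size)
    # Pick each block's scale so its largest-magnitude weight lands on
    # the edge of the 4-bit range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid dividing by zero on all-zero blocks
    quants = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return quants, scales.astype(np.float16)

def dequantize_q4_0(quants, scales):
    # Reconstruction is just int * scale: the per-block scale is the
    # only floating-point value that has to be stored alongside the ints.
    return quants.astype(np.float32) * scales.astype(np.float32)
```

Roughly speaking, a K-quant refines this by mixing bit widths across layers and adding per-sub-block scales, while I-quants add a further codebook-style decoding step, which is where the extra CPU work comes from.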

## GPTQ
GPTQ is the standard for models that are fully loaded into the VRAM. If you have
@@ -105,3 +115,9 @@ Cons:
- Can't run any model that exceeds VRAM capacity.
- The format is new, so older models will often not have AWQ pre-quantization
done for them.

### Sources
- [GGUF quantization
thread](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)
- [GGUF quantization gist with
numbers](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)