Update gguf quants blog

This commit is contained in:
Akemi Izuko 2024-03-10 18:01:05 -06:00
parent 60cc4129db
commit cc7724eeea
Signed by: akemi
GPG key ID: 8DE0764E1809E9FC
3 changed files with 27 additions and 7 deletions


@@ -166,20 +166,23 @@ through GPT4, the llama that remains uncontested in practice.
This is where we currently are! Hence, things are just dates for now. We'll see
how much impact they have in a retrospective:
- **2024-01-22**: Bard with Gemini-Pro defeats all models except GPT4-Turbo in
chatbot arena. This is seen as questionably fair, since Bard has internet
access.
- **2024-01-29**: miqu gets released. This is a suspected Mistral_Medium leak.
Despite only having a 4bit-quantized version, it's ahead of all current
locallamas.
- **2024-01-30**: Yi-34B is the largest local llama for language-vision. LLaVA 1.6
based on top of it sets new records in vision performance.
- **2024-02-08**: Google releases Gemini Advanced, a GPT4 competitor with similar
pricing. Public opinion seems to be that it's quite a bit worse than GPT4,
except it's less censored and much better at creative writing.
- **2024-02-15**: Google releases Gemini Pro 1.5, with 1 million tokens of context!
Third party testing on r/localllama shows it's effectively able to query
very large codebases, beating out GPT4 (with 32k context) on every test.
- **2024-02-15**: OpenAI releases Sora, a text-to-video model for up to 60s of
video. A huge amount of hype starts up around it "simulating the world", but
it's only open to a very small tester group.
- **2024-02-26**: Mistral releases Mistral-Large and simultaneously removes all
the mentions of a commitment to open source from their website. They revert
this change the following day, after the community backlash.


@@ -36,6 +36,7 @@ anyone looking to get caught up with the field.
- [Guidelines for prompting for characters](https://rentry.org/NG_CharCard)
- [ChatML from OpenAI is quickly becoming the standard for
prompting](https://news.ycombinator.com/item?id=34988748)
- [Chasm - multiplayer text generation game](https://chasm.run/)

#### Training
- [Teaching llama a new language through tuning](https://www.reddit.com/r/LocalLLaMA/comments/18oc1yc/i_tried_to_teach_mistral_7b_a_new_language)


@@ -1,7 +1,7 @@
---
title: 'Llama Quantization Methods'
description: 'A short overview of modern quantization methods in language models'
updateDate: 'March 10 2024'
heroImage: '/images/llama/pink-llama.avif'
---
@@ -64,6 +64,16 @@ Cons:
- Quantization into GGUF can fail, meaning some bleeding-edge models aren't
available in this format.

Being the most popular local quant, GGUF has several internal versions. The
original GGUF quants (eg `Q4_0`, `Q4_1`) quantized all the weights directly to
the same precision. K-quants are more recent and don't quantize uniformly. Some
layers are quantized more, some less, and bits can be shared between weights.
For example, `Q4_K_M` means it's a 4-bit K-quant of type `M`. In early 2024,
I-quants were [also
introduced](https://github.com/ggerganov/llama.cpp/pull/4773) (eg `IQ4_S`).
I-quants involve some extra CPU-heavy work, which means they can run much
slower than K-quants in some cases, but faster in others.
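To make the block-quantization idea concrete, here's a toy numpy sketch in the spirit of the original `Q4_0` scheme: every block of 32 weights shares one scale, and each weight is stored as a 4-bit integer. This is an illustration under simplified assumptions, not llama.cpp's actual code; the real implementation packs two 4-bit values per byte and chooses scales slightly differently.

```python
import numpy as np

def quantize_q4_0(weights, block_size=32):
    """Quantize a flat fp32 array: one shared scale per block of 32,
    weights stored as 4-bit integers in [-8, 7] (sketch, not llama.cpp)."""
    blocks = weights.reshape(-1, block_size)
    # Pick each block's scale so its largest-magnitude weight lands on
    # the edge of the 4-bit range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid dividing by zero on all-zero blocks
    quants = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return quants, scales.astype(np.float16)

def dequantize_q4_0(quants, scales):
    # Reconstruction is just int * scale: the per-block scale is the
    # only floating-point value that has to be stored alongside the ints.
    return quants.astype(np.float32) * scales.astype(np.float32)
```

Roughly speaking, a K-quant refines this by mixing bit widths across layers and adding per-sub-block scales, while I-quants add a further codebook-style decoding step, which is where the extra CPU work comes from.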

## GPTQ
GPTQ is the standard for models that are fully loaded into the VRAM. If you have
@@ -105,3 +115,9 @@ Cons:
- Can't run any model that exceeds VRAM capacity.
- The format is new, so older models will often not have AWQ pre-quantization
done for them.

### Sources
- [GGUF quantization
thread](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)
- [GGUF quantization gist with
numbers](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)