From cc7724eeea88795c1cffb649730a9a86e9f4dd3b Mon Sep 17 00:00:00 2001
From: Akemi Izuko
Date: Sun, 10 Mar 2024 18:01:05 -0600
Subject: [PATCH] Update gguf quants blog

---
 src/content/llama/a-history-of-llamas.md | 15 +++++++++------
 src/content/llama/localllama_links.md    |  1 +
 src/content/llama/quantization.md        | 18 +++++++++++++++++-
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/src/content/llama/a-history-of-llamas.md b/src/content/llama/a-history-of-llamas.md
index f671bbf..e4f38f6 100644
--- a/src/content/llama/a-history-of-llamas.md
+++ b/src/content/llama/a-history-of-llamas.md
@@ -166,20 +166,23 @@ through GPT4, the llama that remains uncontested in practice.
 This is where we currently are! Hence, things are just dates for now. We'll see
 how much impact they have in a retrospective:
 
- - 2024-01-22: Bard with Gemini-Pro defeats all models except GPT4-Turbo in
+ - **2024-01-22**: Bard with Gemini-Pro defeats all models except GPT4-Turbo in
   Chatbot Arena. This is seen as questionably fair, since Bard has internet
   access.
- - 2024-01-29: miqu gets released. This is a suspected Mistral_Medium leak.
+ - **2024-01-29**: miqu gets released. This is a suspected Mistral_Medium leak.
   Despite only having a 4-bit-quantized version, it's ahead of all current
   local llamas.
- - 2024-01-30: Yi-34B is the largest local llama for language-vision. LLaVA 1.6,
+ - **2024-01-30**: Yi-34B is the largest local llama for language-vision. LLaVA 1.6,
   built on top of it, sets new records in vision performance.
- - 2024-02-08: Google releases Gemini Advanced, a GPT4 competitor with similar
+ - **2024-02-08**: Google releases Gemini Advanced, a GPT4 competitor with similar
   pricing. Public opinion seems to be that it's quite a bit worse than GPT4,
   except it's less censored and much better at creative writing.
- - 2024-02-15: Google releases Gemini Pro 1.5, with 1 million tokens of context!
+ - **2024-02-15**: Google releases Gemini Pro 1.5, with 1 million tokens of context!
   Third-party testing on r/localllama shows it's effectively able to query
   very large codebases, beating out GPT4 (with 32k context) on every test.
- - 2024-02-15: OpenAI releases Sora, a text-to-video model for up to 60s of
+ - **2024-02-15**: OpenAI releases Sora, a text-to-video model for up to 60s of
   video. A huge amount of hype starts up around it "simulating the world", but
   it's only open to a very small tester group.
+ - **2024-02-26**: Mistral releases Mistral-Large and simultaneously removes all
+   mentions of a commitment to open source from their website. They revert this
+   change the following day, after community backlash.
diff --git a/src/content/llama/localllama_links.md b/src/content/llama/localllama_links.md
index b24d55d..fefa0f5 100644
--- a/src/content/llama/localllama_links.md
+++ b/src/content/llama/localllama_links.md
@@ -36,6 +36,7 @@
 anyone looking to get caught up with the field.
 - [Guidelines for prompting for characters](https://rentry.org/NG_CharCard)
 - [ChatML from OpenAI is quickly becoming the standard for prompting](https://news.ycombinator.com/item?id=34988748)
+ - [Chasm - multiplayer text generation game](https://chasm.run/)
 
 #### Training
 - [Teaching llama a new language through tuning](https://www.reddit.com/r/LocalLLaMA/comments/18oc1yc/i_tried_to_teach_mistral_7b_a_new_language)
diff --git a/src/content/llama/quantization.md b/src/content/llama/quantization.md
index 3199c78..6023483 100644
--- a/src/content/llama/quantization.md
+++ b/src/content/llama/quantization.md
@@ -1,7 +1,7 @@
 ---
 title: 'Llama Quantization Methods'
 description: 'A short overview of modern quantization methods in language models'
-updateDate: 'Dec 31 2023'
+updateDate: 'March 10 2024'
 heroImage: '/images/llama/pink-llama.avif'
 ---
 
@@ -64,6 +64,16 @@ Cons:
 - Quantization into GGUF can fail, meaning some bleeding-edge models aren't
   available in this format.
 
+Being the most popular local quant, GGUF has several internal versions. The
+original GGUF quants (eg `Q4_0`, `Q4_1`) quantized all the weights directly to
+the same precision. K-quants are more recent and don't quantize uniformly. Some
+layers are quantized more, some less, and bits can be shared between weights.
+For example, `Q4_K_M` is a 4-bit K-quant of size `M` (medium). In early 2024,
+I-quants were [also
+introduced](https://github.com/ggerganov/llama.cpp/pull/4773) (eg `IQ4_S`).
+I-quants do more CPU-heavy work, which means they can run much slower than
+K-quants in some cases, but faster in others.
+
 ## GPTQ
 
 GPTQ is the standard for models that are fully loaded into VRAM. If you have
@@ -105,3 +115,9 @@ Cons:
 - Can't run any model that exceeds VRAM capacity.
 - The format is new, so older models will often not have AWQ pre-quantization
   done for them.
+
+### Sources
+ - [GGUF quantization
+   thread](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)
+ - [GGUF quantization gist with
+   numbers](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)
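
To make the quant suffixes the patch describes concrete: the quant type is baked into the GGUF file itself (and, by convention, its filename), so loading a `Q4_K_M` versus an `IQ4_S` model is just a matter of pointing at a different file. Below is a minimal sketch using the `llama-cpp-python` bindings; the model path and filename are assumptions, not part of the patch, and any GGUF file carrying a quant tag works the same way.

```python
# Minimal sketch, assuming llama-cpp-python is installed and a GGUF file
# exists locally. The path below is a placeholder: substitute any GGUF
# whose name carries a quant tag (Q4_0, Q4_K_M, IQ4_S, ...).
from llama_cpp import Llama

# The quant type is recorded inside the GGUF file, so no quant-specific
# flags are needed; swapping Q4_K_M for IQ4_S is only a path change.
llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf")

out = llm("Briefly, what is quantization in LLMs?", max_tokens=64)
print(out["choices"][0]["text"])
```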