From cc7724eeea88795c1cffb649730a9a86e9f4dd3b Mon Sep 17 00:00:00 2001
From: Akemi Izuko
Date: Sun, 10 Mar 2024 18:01:05 -0600
Subject: [PATCH] Update gguf quants blog

---
 src/content/llama/a-history-of-llamas.md | 15 +++++++++------
 src/content/llama/localllama_links.md    |  1 +
 src/content/llama/quantization.md        | 18 +++++++++++++++++-
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/src/content/llama/a-history-of-llamas.md b/src/content/llama/a-history-of-llamas.md
index f671bbf..e4f38f6 100644
--- a/src/content/llama/a-history-of-llamas.md
+++ b/src/content/llama/a-history-of-llamas.md
@@ -166,20 +166,23 @@ through GPT4, the llama that remains uncontested in practice.
 This is where we currently are! Hence, things are just dates for now. We'll see
 how much impact they have in a retrospective:
 
- - 2024-01-22: Bard with Gemini-Pro defeats all models except GPT4-Turbo in
+ - **2024-01-22**: Bard with Gemini-Pro defeats all models except GPT4-Turbo in
   Chatbot Arena. This is seen as questionably fair, since Bard has internet
   access.
- - 2024-01-29: miqu gets released. This is a suspected Mistral_Medium leak.
+ - **2024-01-29**: miqu gets released. This is a suspected Mistral_Medium leak.
   Despite only having a 4-bit-quantized version, it's ahead of all current
   local llamas.
- - 2024-01-30: Yi-34B is the largest local llama for language-vision. LLaVA 1.6,
+ - **2024-01-30**: Yi-34B is the largest local llama for language-vision. LLaVA 1.6,
   built on top of it, sets new records in vision performance.
- - 2024-02-08: Google releases Gemini Advanced, a GPT4 competitor with similar
+ - **2024-02-08**: Google releases Gemini Advanced, a GPT4 competitor with similar
   pricing. Public opinion seems to be that it's quite a bit worse than GPT4,
   except it's less censored and much better at creative writing.
- - 2024-02-15: Google releases Gemini Pro 1.5, with 1 million tokens of context!
+ - **2024-02-15**: Google releases Gemini Pro 1.5, with 1 million tokens of context!
   Third-party testing on r/localllama shows it's effectively able to query
   very large codebases, beating out GPT4 (with 32k context) on every test.
- - 2024-02-15: OpenAI releases Sora, a text-to-video model for up to 60s of
+ - **2024-02-15**: OpenAI releases Sora, a text-to-video model for up to 60s of
   video. A huge amount of hype starts up around it "simulating the world", but
   it's only open to a very small tester group.
+ - **2024-02-26**: Mistral releases Mistral-Large and simultaneously removes all
+   mentions of a commitment to open source from their website. They revert this
+   change the following day, after community backlash.
diff --git a/src/content/llama/localllama_links.md b/src/content/llama/localllama_links.md
index b24d55d..fefa0f5 100644
--- a/src/content/llama/localllama_links.md
+++ b/src/content/llama/localllama_links.md
@@ -36,6 +36,7 @@
 anyone looking to get caught up with the field.
 - [Guidelines for prompting for characters](https://rentry.org/NG_CharCard)
 - [ChatML from OpenAI is quickly becoming the standard for prompting](https://news.ycombinator.com/item?id=34988748)
+ - [Chasm - multiplayer text generation game](https://chasm.run/)
 
 #### Training
 - [Teaching llama a new language through tuning](https://www.reddit.com/r/LocalLLaMA/comments/18oc1yc/i_tried_to_teach_mistral_7b_a_new_language)
diff --git a/src/content/llama/quantization.md b/src/content/llama/quantization.md
index 3199c78..6023483 100644
--- a/src/content/llama/quantization.md
+++ b/src/content/llama/quantization.md
@@ -1,7 +1,7 @@
 ---
 title: 'Llama Quantization Methods'
 description: 'A short overview of modern quantization methods in language models'
-updateDate: 'Dec 31 2023'
+updateDate: 'March 10 2024'
 heroImage: '/images/llama/pink-llama.avif'
 ---
 
@@ -64,6 +64,16 @@ Cons:
 - Quantization into GGUF can fail, meaning some bleeding-edge models aren't
   available in this format.
 
+Being the most popular local quant, GGUF has several internal versions. The
+original GGUF quants (eg `Q4_0`, `Q4_1`) quantized all the weights directly to
+the same precision. K-quants are more recent and don't quantize uniformly. Some
+layers are quantized more, some less, and bits can be shared between weights.
+For example, `Q4_K_M` is a 4-bit K-quant of size `M` (medium). In early 2024,
+I-quants were [also
+introduced](https://github.com/ggerganov/llama.cpp/pull/4773) (eg `IQ4_S`).
+I-quants do more CPU-heavy work, which means they can run much slower than
+K-quants in some cases, but faster in others.
+
 ## GPTQ
 
 GPTQ is the standard for models that are fully loaded into VRAM. If you have
@@ -105,3 +115,9 @@ Cons:
 - Can't run any model that exceeds VRAM capacity.
 - The format is new, so older models will often not have AWQ pre-quantization
   done for them.
+
+### Sources
+ - [GGUF quantization
+   thread](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/)
+ - [GGUF quantization gist with
+   numbers](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)
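
To make the quant suffixes the patch describes concrete: the quant type is baked into the GGUF file itself (and, by convention, its filename), so loading a `Q4_K_M` versus an `IQ4_S` model is just a matter of pointing at a different file. Below is a minimal sketch using the `llama-cpp-python` bindings; the model path and filename are assumptions, not part of the patch, and any GGUF file carrying a quant tag works the same way.

```python
# Minimal sketch, assuming llama-cpp-python is installed and a GGUF file
# exists locally. The path below is a placeholder: substitute any GGUF
# whose name carries a quant tag (Q4_0, Q4_K_M, IQ4_S, ...).
from llama_cpp import Llama

# The quant type is recorded inside the GGUF file, so no quant-specific
# flags are needed; swapping Q4_K_M for IQ4_S is only a path change.
llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf")

out = llm("Briefly, what is quantization in LLMs?", max_tokens=64)
print(out["choices"][0]["text"])
```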