From 9128bf5799962b4df79bb1d5f76b6d5913aacec1 Mon Sep 17 00:00:00 2001
From: Akemi Izuko
Date: Sun, 31 Dec 2023 18:02:54 -0700
Subject: [PATCH] Unix: add a history of llamas

---
 src/content/unix/a-history-of-llamas.md | 167 ++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 src/content/unix/a-history-of-llamas.md

diff --git a/src/content/unix/a-history-of-llamas.md b/src/content/unix/a-history-of-llamas.md
new file mode 100644
index 0000000..380e1fd
--- /dev/null
+++ b/src/content/unix/a-history-of-llamas.md
@@ -0,0 +1,167 @@
---
title: 'A Brief History of Local Llamas'
description: 'A Brief History of Local Llamas'
updateDate: 'Dec 31 2023'
heroImage: '/images/tiny-llama-logo.avif'
---
<figure>
  <img src="/images/tiny-llama-logo.avif" alt="TinyLlama logo" />
  <figcaption>
    "TinyLlama logo" by the TinyLlama project. Licensed under Apache 2.0
  </figcaption>
</figure>

## My Background

I've been taking machine learning courses throughout the "modern history" of
llamas. When ChatGPT was first released, we brought in a guest lecturer on NLP
methods of the time. Since then, I've also taken an NLP course, though not one
focused on deep learning.

Most of my knowledge of this field comes from a few guest lectures and the
indispensable [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) community,
which always has the latest news about local llamas. I became a fan of the
local llama movement in December 2023, so the "important points" covered here
come from a retrospective.

I use the terms "large language model" and "llama" interchangeably throughout
this piece, and I write "open source and locally hosted llama" as "local
llama". Whenever you see a number like 7B, it means the llama has 7 billion
parameters. More parameters generally means a smarter, but bigger, model.

## Modern History

The modern history of llamas begins in 2022. Roughly 90% of the sources that
llama papers cite these days are from 2022 onwards.

Here's a brief timeline:

 1. **March 2022**: The InstructGPT paper appears as a pre-print.
 2. **November 2022**: ChatGPT is released.
 3. **March 2023**: LLaMA (open source) and GPT4 are released.
 4. **July 2023**: LLaMA 2 is released; llama.cpp's GGUF quantization format
    follows shortly after.
 5. **August 2023**: The AWQ quantization paper is released.
 6. **September 2023**: Mistral 7B is released.
 7. **December 2023**: Mixtral 8x7B becomes the first MoE local llama.

#### Early 2022

In March 2022, OpenAI released a paper on improving the conversational ability
of their then-uncontested GPT3. Interest in
[InstructGPT](https://arxiv.org/pdf/2203.02155.pdf) was largely limited to the
academic community. As of writing, InstructGPT remains the last major paper
OpenAI released on this topic.

The remainder of 2022 was mostly focused on text-to-image models. OpenAI's
[DALL-E 2](https://openai.com/dall-e-2) led the way, but the open source
community kept pace with [Stable
Diffusion](https://github.com/Stability-AI/generative-models).
[Midjourney](https://www.midjourney.com/) eventually ended up on top by
mid-2022, and caused a huge amount of controversy with artists by [winning the
Colorado State
Fair's](https://en.wikipedia.org/wiki/Th%C3%A9%C3%A2tre_D%27op%C3%A9ra_Spatial)
fine-art contest.

#### Late 2022

In late November 2022, OpenAI [released
ChatGPT](https://openai.com/blog/chatgpt), which is generally speculated to be
a larger version of InstructGPT. This model is single-handedly responsible for
the NLP boom. ChatGPT revolutionized the corporate perception of chatbots and
AI in general. It was considered a form of disruptive innovation in the search
engine market, and Google went on to announce record layoffs in [January
2023](https://blog.google/inside-google/message-ceo/january-update/).

At this point, the local llama movement practically didn't exist. Front-ends,
especially for chatbot role-play as later exemplified by
[SillyTavern](https://github.com/SillyTavern/SillyTavern), began development,
but they were all still running through OpenAI's ChatGPT API.

#### Early 2023

In March 2023, Meta kick-started the local llama movement by releasing
[LLaMA](https://ai.meta.com/blog/large-language-model-llama-meta-ai/), a 65B
parameter llama that was open source! Benchmarks aside, it was not very good,
and ChatGPT continued to be viewed as considerably better. However, it provided
a strong base for further iteration, and gave the local llama community a much
stronger model than any other local llama at the time.

Around this time, OpenAI released [GPT4](https://openai.com/research/gpt-4), a
model that undeniably broke through all records. In fact, GPT4 remains the best
llama as of December 2023. The original ChatGPT is now referred to as GPT3.5,
to disambiguate it from GPT4. This led much of the local llama community to
continue focusing on building frontends while using GPT4's API for inference.
Nothing open source was even remotely close to GPT4 at this point.

#### Mid 2023

The local llama movement finally took off in mid-2023. In July, Meta released
[LLaMA 2](https://ai.meta.com/blog/llama-2/), which has decent performance even
in its 7B version. Shortly afterwards, the open source
[llama.cpp](https://github.com/ggerganov/llama.cpp) project introduced the GGUF
quantization format, which allows a model to be split across a mix of RAM and
VRAM. This meant many home computers could now run 4-bit quantized 7B models!
Previously, most enthusiasts had to rent cloud GPUs to run their "local"
llamas. Quantizing into GGUF yourself can be an expensive process, so
[TheBloke](https://huggingface.co/TheBloke) on Huggingface emerged as the de
facto source for pre-quantized llamas.

Originally built as a fast way to run LLaMA locally, llama.cpp became the
leader among local llama inference backends. Its support extends far beyond
only running LLaMA 2, and it was naturally the first major backend to run GGUF
quantizations.
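
As a rough illustration of what that RAM/VRAM split looks like in practice,
here is a minimal sketch using the llama-cpp-python bindings to llama.cpp. The
GGUF file name below is hypothetical (any pre-quantized 7B model, for example
from TheBloke's repositories, would do), and the number of offloaded layers is
an assumption that depends entirely on how much VRAM you have.

```python
# Minimal sketch (not from the original post): load a 4-bit GGUF quantization
# of a 7B model with llama-cpp-python, offloading part of the network to VRAM
# while the remaining layers stay in ordinary system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=2048,        # context window size
    n_gpu_layers=20,   # layers offloaded to VRAM; set to 0 for CPU-only
)

result = llm("Q: What happened in the local llama world in 2023? A:", max_tokens=128)
print(result["choices"][0]["text"])
```

That split is what lets a modest GPU plus ordinary system RAM serve a model
that would not fit in VRAM alone.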

In addition, the [Activation-aware Weight
Quantization](https://arxiv.org/abs/2306.00978) (AWQ) paper was released around
this time. It uses activation-aware, mixed-precision quantization to improve
both the speed and quality of quantized models. The gains are especially
noticeable for heavily quantized models, such as the 4-bit quantizations that
had become the standard in the local llama community by this point. At the
time, however, AWQ had no support in the popular backends.

In late September 2023, out of nowhere came a French startup with a 7B model
that leapt well ahead of LLaMA 2 7B. [Mistral
7B](https://mistral.ai/news/announcing-mistral-7b/) remained the best local
llama until mid-December 2023. A huge amount of fine-tuning work, particularly
on character-creation and code-assistant models, was done on top of Mistral 7B.

#### Late 2023

It's worth noting that at this point, there was *still* no local competitor to
GPT3.5. That all changed on December 11th 2023, when Mistral released [Mixtral
8x7B](https://mistral.ai/news/mixtral-of-experts/). Mixtral was the first major
local llama to use the technique GPT4 is speculated to use: a mixture of
experts (MoE). While about 1/40th the speculated size of GPT4 and 1/3rd of
GPT3.5, Mixtral goes toe-to-toe with GPT3.5 in both benchmarks and user
reviews. This was hailed as a landmark achievement by the local llama
community, demonstrating that open source models are able to compete with
commercially developed ones. The achievement was amplified by comparisons
against Google's freshly unveiled
[Gemini](https://blog.google/technology/ai/google-gemini-ai/) models in the
same week: r/LocalLLaMA reviews generally suggest Mixtral pulls ahead of Gemini
in conversational tasks.
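
The mixture-of-experts idea itself is easy to sketch: a small router picks a
couple of expert feed-forward blocks per token, so only a fraction of the total
parameters run on any single forward pass. Below is a toy illustration of top-2
routing in PyTorch; the dimensions, expert count, and layer shapes are made up
for the example and are not Mixtral's actual architecture.

```python
# Toy sketch of a mixture-of-experts layer with top-2 routing.
# All sizes are illustrative; this is not the real Mixtral implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    def __init__(self, dim=512, hidden=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, picked = scores.topk(2, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(2):                   # each token runs only 2 of the 8 experts
            for idx, expert in enumerate(self.experts):
                mask = picked[:, slot] == idx
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

layer = TopTwoMoE()
tokens = torch.randn(4, 512)                    # a tiny batch of token embeddings
print(layer(tokens).shape)                      # torch.Size([4, 512])
```

The appeal is that the total parameter count can grow while per-token compute
stays close to that of a much smaller dense model, which is the trade-off
Mixtral leans on.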

Apple unexpectedly joined the local llama movement by open-sourcing their
[Ferret model](https://github.com/apple/ml-ferret) in mid-December. This model
builds upon LLaVA, previously the leading multi-modal llama for images in the
local llama community.

In very late December, llama.cpp [merged
AWQ](https://github.com/ggerganov/llama.cpp/pull/4593) support. In the coming
year, I expect AWQ to largely replace GPTQ (an earlier quantization format) in
the local llama community, though GGUF will remain more popular in general.

Just 3 days from the end of the year,
[TinyLlama](https://github.com/jzhang38/TinyLlama) released the
3-trillion-token checkpoint of their 1.1B model. This minuscule model sets a
new lower bound for the number of parameters required to make a capable llama,
enabling even more users to locally host a llama. In practice, TinyLlama goes
toe-to-toe with Microsoft's closed-source [Phi-2
2.7B](https://huggingface.co/microsoft/phi-2), released just a few weeks
earlier. This is a huge win for the open source community, demonstrating how
open source models can get ahead of commercial ones.

#### Going Forward

With the release of Mixtral, the local llama community is hoping 2024 will be
the turning point where open source models finally break ahead of commercial
models. However, as of writing, it's very unclear how the community will get
past GPT4, the llama that remains uncontested in practice.