From 9128bf5799962b4df79bb1d5f76b6d5913aacec1 Mon Sep 17 00:00:00 2001
From: Akemi Izuko
Date: Sun, 31 Dec 2023 18:02:54 -0700
Subject: [PATCH] Unix: add a history of llamas

---
 src/content/unix/a-history-of-llamas.md | 167 ++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 src/content/unix/a-history-of-llamas.md

diff --git a/src/content/unix/a-history-of-llamas.md b/src/content/unix/a-history-of-llamas.md
new file mode 100644
index 0000000..380e1fd
--- /dev/null
+++ b/src/content/unix/a-history-of-llamas.md
@@ -0,0 +1,167 @@
---
title: 'A Brief History of Local Llamas'
description: 'A Brief History of Local Llamas'
updateDate: 'Dec 31 2023'
heroImage: '/images/tiny-llama-logo.avif'
---
<figure>
  <img src="/images/tiny-llama-logo.avif" alt="TinyLlama logo" />
  <figcaption>
    "TinyLlama logo" by the TinyLlama project. Licensed under Apache 2.0
  </figcaption>
</figure>

## My Background

I've been taking machine learning courses throughout the "modern history" of
llamas. When ChatGPT was first released, we brought in a guest lecturer on NLP
methods of the time. Since then, I've also taken an NLP course, though not one
focused on deep learning.

Most of my knowledge of this field comes from a few guest lectures and the
indispensable [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) community,
which always has the latest news about local llamas. I became a fan of the
local llama movement in December 2023, so the "important points" covered here
come from a retrospective.

I use the terms "large language model" and "llama" interchangeably throughout
this piece, and I write "open source and locally hosted llama" as "local
llama". Whenever you see a number like 7B, it means the llama has 7 billion
parameters. More parameters generally means a smarter, but bigger, model.

## Modern History

The modern history of llamas begins in 2022. Roughly 90% of the sources that
llama papers cite these days are from 2022 onwards.

Here's a brief timeline:

 1. **March 2022**: The InstructGPT paper appears as a pre-print.
 2. **November 2022**: ChatGPT is released.
 3. **March 2023**: LLaMA (open source) and GPT4 are released.
 4. **July 2023**: LLaMA 2 is released; llama.cpp's GGUF quantization format
    follows shortly after.
 5. **August 2023**: The AWQ quantization paper is released.
 6. **September 2023**: Mistral 7B is released.
 7. **December 2023**: Mixtral 8x7B becomes the first MoE local llama.

#### Early 2022

In March 2022, OpenAI released a paper on improving the conversational ability
of their then-uncontested GPT3. Interest in
[InstructGPT](https://arxiv.org/pdf/2203.02155.pdf) was largely limited to the
academic community. As of writing, InstructGPT remains the last major paper
OpenAI released on this topic.

The remainder of 2022 was mostly focused on text-to-image models. OpenAI's
[DALL-E 2](https://openai.com/dall-e-2) led the way, but the open source
community kept pace with [Stable
Diffusion](https://github.com/Stability-AI/generative-models).
[Midjourney](https://www.midjourney.com/) eventually ended up on top by
mid-2022, and caused a huge amount of controversy with artists by [winning the
Colorado State
Fair's](https://en.wikipedia.org/wiki/Th%C3%A9%C3%A2tre_D%27op%C3%A9ra_Spatial)
fine-art contest.

#### Late 2022

In late November 2022, OpenAI [released
ChatGPT](https://openai.com/blog/chatgpt), which is generally speculated to be
a larger version of InstructGPT. This model is single-handedly responsible for
the NLP boom. ChatGPT revolutionized the corporate perception of chatbots and
AI in general. It was considered a form of disruptive innovation in the search
engine market, and Google went on to announce record layoffs in [January
2023](https://blog.google/inside-google/message-ceo/january-update/).

At this point, the local llama movement practically didn't exist. Front-ends,
especially for chatbot role-play as later exemplified by
[SillyTavern](https://github.com/SillyTavern/SillyTavern), began development,
but they were all still running through OpenAI's ChatGPT API.

#### Early 2023

In March 2023, Meta kick-started the local llama movement by releasing
[LLaMA](https://ai.meta.com/blog/large-language-model-llama-meta-ai/), a 65B
parameter llama that was open source! Benchmarks aside, it was not very good,
and ChatGPT continued to be viewed as considerably better. However, it provided
a strong base for further iteration, and gave the local llama community a much
stronger model than any other local llama at the time.

Around this time, OpenAI released [GPT4](https://openai.com/research/gpt-4), a
model that undeniably broke through all records. In fact, GPT4 remains the best
llama as of December 2023. The original ChatGPT is now referred to as GPT3.5,
to disambiguate it from GPT4. This led much of the local llama community to
continue focusing on building frontends while using GPT4's API for inference.
Nothing open source was even remotely close to GPT4 at this point.

#### Mid 2023

The local llama movement finally took off in mid-2023. In July, Meta released
[LLaMA 2](https://ai.meta.com/blog/llama-2/), which has decent performance even
in its 7B version. Shortly afterwards, the open source
[llama.cpp](https://github.com/ggerganov/llama.cpp) project introduced the GGUF
quantization format, which allows a model to be split across a mix of RAM and
VRAM. This meant many home computers could now run 4-bit quantized 7B models!
Previously, most enthusiasts had to rent cloud GPUs to run their "local"
llamas. Quantizing into GGUF yourself can be an expensive process, so
[TheBloke](https://huggingface.co/TheBloke) on Huggingface emerged as the de
facto source for pre-quantized llamas.

Originally built as a fast way to run LLaMA locally, llama.cpp became the
leader among local llama inference backends. Its support extends far beyond
only running LLaMA 2, and it was naturally the first major backend to run GGUF
quantizations.
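
As a rough illustration of what that RAM/VRAM split looks like in practice,
here is a minimal sketch using the llama-cpp-python bindings to llama.cpp. The
GGUF file name below is hypothetical (any pre-quantized 7B model, for example
from TheBloke's repositories, would do), and the number of offloaded layers is
an assumption that depends entirely on how much VRAM you have.

```python
# Minimal sketch (not from the original post): load a 4-bit GGUF quantization
# of a 7B model with llama-cpp-python, offloading part of the network to VRAM
# while the remaining layers stay in ordinary system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=2048,        # context window size
    n_gpu_layers=20,   # layers offloaded to VRAM; set to 0 for CPU-only
)

result = llm("Q: What happened in the local llama world in 2023? A:", max_tokens=128)
print(result["choices"][0]["text"])
```

That split is what lets a modest GPU plus ordinary system RAM serve a model
that would not fit in VRAM alone.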

In addition, the [Activation-aware Weight
Quantization](https://arxiv.org/abs/2306.00978) (AWQ) paper was released around
this time. It uses activation-aware, mixed-precision quantization to improve
both the speed and quality of quantized models. The gains are especially
noticeable for heavily quantized models, such as the 4-bit quantizations that
had become the standard in the local llama community by this point. At the
time, however, AWQ had no support in the popular backends.

In late September 2023, out of nowhere came a French startup with a 7B model
that leapt well ahead of LLaMA 2 7B. [Mistral
7B](https://mistral.ai/news/announcing-mistral-7b/) remained the best local
llama until mid-December 2023. A huge amount of fine-tuning work, particularly
on character-creation and code-assistant models, was done on top of Mistral 7B.

#### Late 2023

It's worth noting that at this point, there was *still* no local competitor to
GPT3.5. That all changed on December 11th 2023, when Mistral released [Mixtral
8x7B](https://mistral.ai/news/mixtral-of-experts/). Mixtral was the first major
local llama to use the technique GPT4 is speculated to use: a mixture of
experts (MoE). While about 1/40th the speculated size of GPT4 and 1/3rd of
GPT3.5, Mixtral goes toe-to-toe with GPT3.5 in both benchmarks and user
reviews. This was hailed as a landmark achievement by the local llama
community, demonstrating that open source models are able to compete with
commercially developed ones. The achievement was amplified by comparisons
against Google's freshly unveiled
[Gemini](https://blog.google/technology/ai/google-gemini-ai/) models in the
same week: r/LocalLLaMA reviews generally suggest Mixtral pulls ahead of Gemini
in conversational tasks.
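
The mixture-of-experts idea itself is easy to sketch: a small router picks a
couple of expert feed-forward blocks per token, so only a fraction of the total
parameters run on any single forward pass. Below is a toy illustration of top-2
routing in PyTorch; the dimensions, expert count, and layer shapes are made up
for the example and are not Mixtral's actual architecture.

```python
# Toy sketch of a mixture-of-experts layer with top-2 routing.
# All sizes are illustrative; this is not the real Mixtral implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    def __init__(self, dim=512, hidden=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, picked = scores.topk(2, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for slot in range(2):                   # each token runs only 2 of the 8 experts
            for idx, expert in enumerate(self.experts):
                mask = picked[:, slot] == idx
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

layer = TopTwoMoE()
tokens = torch.randn(4, 512)                    # a tiny batch of token embeddings
print(layer(tokens).shape)                      # torch.Size([4, 512])
```

The appeal is that the total parameter count can grow while per-token compute
stays close to that of a much smaller dense model, which is the trade-off
Mixtral leans on.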

Apple unexpectedly joined the local llama movement by open-sourcing their
[Ferret model](https://github.com/apple/ml-ferret) in mid-December. This model
builds upon LLaVA, previously the leading multi-modal llama for images in the
local llama community.

In very late December, llama.cpp [merged
AWQ](https://github.com/ggerganov/llama.cpp/pull/4593) support. In the coming
year, I expect AWQ to largely replace GPTQ (an earlier quantization format) in
the local llama community, though GGUF will remain more popular in general.

Just 3 days from the end of the year,
[TinyLlama](https://github.com/jzhang38/TinyLlama) released the
3-trillion-token checkpoint of their 1.1B model. This minuscule model sets a
new lower bound for the number of parameters required to make a capable llama,
enabling even more users to locally host a llama. In practice, TinyLlama goes
toe-to-toe with Microsoft's closed-source [Phi-2
2.7B](https://huggingface.co/microsoft/phi-2), released just a few weeks
earlier. This is a huge win for the open source community, demonstrating how
open source models can get ahead of commercial ones.

#### Going Forward

With the release of Mixtral, the local llama community is hoping 2024 will be
the turning point where open source models finally break ahead of commercial
models. However, as of writing, it's very unclear how the community will get
past GPT4, the llama that remains uncontested in practice.