---
title: 'A Brief History of Local Llamas'
description: 'A Brief History of Local Llamas'
updateDate: 'Dec 31 2023'
heroImage: '/images/tiny-llama-logo.avif'
---
<p style="font-size: max(2vh, 10px); margin-top: 0; text-align: right">
"TinyLlama logo" by <a href="https://github.com/jzhang38/TinyLlama">The
TinyLlama project</a>. Licensed under Apache 2.0
</p>

## My Background
I've been taking machine learning courses throughout the "modern history" of
llamas. When ChatGPT was first released, we brought in a guest lecturer on the
NLP methods of the time. Since then, I've also taken an NLP course, though not
one focused on deep learning.

Most of my knowledge of this field comes from a few guest lectures and the
indispensable [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) community,
which always has the latest news about local llamas. I became a fan of the
local llama movement in December 2023, so the "important points" covered here
are chosen in retrospect.

I use the terms "large language model" and "llama" interchangeably throughout
this piece, and write "open source and locally hosted llama" as "local llama".
Whenever you see a number like 7B, it means the llama has 7 billion parameters.
More parameters generally make a model more capable, but also larger.

## Modern History
The modern history of llamas begins in 2022: 90% of the sources llama papers
cite these days are from 2022 onward.

Here's a brief timeline:

1. **March 2022**: The InstructGPT paper appears as a preprint.
2. **November 2022**: ChatGPT is released.
3. **March 2023**: LLaMA (open source) and GPT4 are released.
4. **July 2023**: LLaMA 2 is released, followed shortly by the GGUF
   quantization format.
5. **August 2023**: The AWQ quantization paper is published.
6. **September 2023**: Mistral 7B is released.
7. **December 2023**: Mixtral 8x7B becomes the first major MoE local llama.

#### Early 2022
In March 2022, OpenAI released a paper improving the conversational ability of
their then-uncontested GPT3. Interest in
[InstructGPT](https://arxiv.org/pdf/2203.02155.pdf) was largely limited to the
academic community. As of writing, InstructGPT remains the last major paper
OpenAI has released on this topic.

The remainder of 2022 was mostly focused on text-to-image models. OpenAI's
[DALL-E 2](https://openai.com/dall-e-2) led the way, but the open source
community kept pace with [Stable
Diffusion](https://github.com/Stability-AI/generative-models).
[Midjourney](https://www.midjourney.com/) eventually pulled ahead by mid-2022,
and caused a huge amount of controversy among artists when a
Midjourney-generated image [won the Colorado State
Fair's](https://en.wikipedia.org/wiki/Th%C3%A9%C3%A2tre_D%27op%C3%A9ra_Spatial)
digital-art competition.

#### Late 2022
In late November 2022, OpenAI [released
ChatGPT](https://openai.com/blog/chatgpt), generally speculated to be a larger
version of InstructGPT. This model is single-handedly responsible for the NLP
boom. ChatGPT revolutionized the corporate perception of chatbots and AI in
general. It was considered a form of disruptive innovation in the search engine
market, leading Google to hold record layoffs in [January
2023](https://blog.google/inside-google/message-ceo/january-update/).

At this point, the local llama movement practically didn't exist. Front-ends,
especially for chatbot role-play as later exemplified by
[SillyTavern](https://github.com/SillyTavern/SillyTavern), began development,
but they were all still running through OpenAI's ChatGPT API.

#### Early 2023
In March 2023, Meta kick-started the local llama movement by releasing
[LLaMA](https://ai.meta.com/blog/large-language-model-llama-meta-ai/), a
65B-parameter llama that was open source! Benchmarks aside, it was not very
good; ChatGPT continued to be viewed as considerably better. However, LLaMA
provided a strong base for further iteration, and gave the local llama
community a much stronger model than any other local llama at the time.

Around this time, OpenAI released [GPT4](https://openai.com/research/gpt-4), a
model that undeniably broke all records. In fact, GPT4 remains the best llama
as of December 2023. The original ChatGPT is now referred to as GPT3.5, to
disambiguate it from GPT4. This caused much of the local llama community to
continue focusing on building frontends while using GPT4's API for inference;
nothing open source was even remotely close to GPT4 at this point.

#### Mid 2023
We finally see the local llama movement really take off in mid-2023. Meta
released [LLaMA 2](https://ai.meta.com/blog/llama-2/), which has decent
performance even in its 7B version. Soon after, llama.cpp introduced the GGUF
quantization format, which allows a model to run on a mix of RAM and VRAM.
This meant many home computers could now run 4-bit quantized 7B models!
Previously, most enthusiasts had to rent cloud GPUs to run their "local"
llamas. Quantizing into GGUF is a very expensive process, so
[TheBloke](https://huggingface.co/TheBloke) on Huggingface emerges as the de
facto source for pre-quantized llamas.
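
To see why 4-bit quantization was such a big deal for home machines, a rough
back-of-the-envelope estimate helps: the memory needed just to hold a model's
weights is roughly parameters × bits-per-weight ÷ 8. This sketch ignores the
KV cache, activations, and per-format overhead, so real usage is somewhat
higher:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory (in GB) needed to hold model weights alone.

    Treats quantization as a flat bits-per-weight cost and ignores
    KV cache, activations, and file-format overhead.
    """
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at full 16-bit precision vs. a 4-bit quantization.
fp16 = weight_memory_gb(7e9, 16)  # ~14 GB: out of reach for most home GPUs
q4 = weight_memory_gb(7e9, 4)     # ~3.5 GB: fits in modest RAM + VRAM
print(f"fp16: {fp16:.1f} GB, 4-bit: {q4:.1f} GB")
```

The 4x drop from 16-bit to 4-bit is exactly what moved 7B models from rented
cloud GPUs onto ordinary home computers.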

Originally built to run LLaMA, the open source
[llama.cpp](https://github.com/ggerganov/llama.cpp) becomes the leader of local
llama inference backends. Its support extends far beyond running LLaMA 2: it's
the first major backend to support running GGUF quantizations!

In addition, the [Activation-aware Weight
Quantization](https://arxiv.org/abs/2306.00978) (AWQ) paper is released around
this time. It protects a small fraction of salient weights, identified from
activation statistics, to improve both the speed and quality of quantized
models. This matters most for heavily quantized models, such as the 4-bit
quantizations that have become the local llama community's standard. At this
time, however, no backend supports AWQ.

In late September 2023, a French startup came out of nowhere with a 7B model
that made leaps over LLaMA 2 7B.
[Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) remains the best
local llama until mid-December 2023. A huge amount of fine-tuning work,
particularly on character creation and code-assistant models, is done on top of
Mistral 7B.

#### Late 2023
It's worth noting that at this point, there is *still* no local competitor to
GPT3.5. But that was all about to change on December 11th, 2023, when Mistral
released [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/). Mixtral
is the first major local llama to use the technique GPT4 is speculated to use:
a mixture of experts (MoE). While about 1/40th the speculated size of GPT4 and
1/3rd the size of GPT3.5, Mixtral goes toe-to-toe with GPT3.5 in both
benchmarks and user reviews. This is hailed as a landmark achievement by the
local llama community, demonstrating that open source models can compete with
commercially developed ones. The achievement is amplified by comparing Mixtral
against Google's freshly unveiled
[Gemini](https://blog.google/technology/ai/google-gemini-ai/) models from the
same week. r/LocalLLaMA reviews generally suggest Mixtral pulls ahead of Gemini
in conversational tasks.
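
The appeal of MoE is that each token is routed to only a few experts, so the
compute per token is a fraction of the parameters stored. As an illustrative
sketch (the layer sizes below are hypothetical round numbers chosen for the
example, not Mixtral's published architecture), an 8-expert model routing each
token to the top 2 experts looks like this:

```python
def moe_params(shared: float, expert: float, n_experts: int, top_k: int):
    """Total vs. per-token-active parameter counts for a toy MoE model.

    shared:    parameters every token uses (attention, embeddings, ...)
    expert:    parameters in one expert feed-forward block
    n_experts: experts stored in the model
    top_k:     experts actually run for each token
    """
    total = shared + n_experts * expert   # what you must hold in memory
    active = shared + top_k * expert      # what one token actually computes
    return total, active

# Hypothetical sizes: 2B shared parameters plus eight 5.6B experts, top-2 routing.
total, active = moe_params(shared=2e9, expert=5.6e9, n_experts=8, top_k=2)
print(f"stored: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
```

The model still has to fit all experts in memory, but per-token inference runs
at the speed of a much smaller dense model, which is how an "8x7B" can compete
with far larger dense llamas.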

Apple unexpectedly joins the local llama movement by open-sourcing their
[Ferret model](https://github.com/apple/ml-ferret) in mid-December. This model
builds upon LLaVA, previously the leading multi-modal llama for images in the
local llama community.

In very late December, llama.cpp [merges
AWQ](https://github.com/ggerganov/llama.cpp/pull/4593) support. In the coming
year, I expect AWQ to largely replace GPTQ in the local llama community, though
GGUF will remain more popular in general.

Just three days before the end of the year,
[TinyLlama](https://github.com/jzhang38/TinyLlama) releases the
3-trillion-token checkpoint of their 1.1B model. This minuscule model sets a
new lower bound on the number of parameters required for a capable llama,
enabling even more users to host a llama locally. In practice, TinyLlama easily
goes toe-to-toe with Microsoft's closed-source [Phi-2
2.7B](https://huggingface.co/microsoft/phi-2), released just a few weeks
earlier. This is a huge win for the open source community, demonstrating how
open source models can get ahead of commercial ones.

#### Going Forward
With the release of Mixtral, the local llama community is hoping 2024 will be
the turning point where open source models finally break ahead of commercial
models. However, as of writing, it's very unclear how the community will
surpass GPT4, the llama that remains uncontested in practice.