---
title: 'A Brief History of Local Llamas'
description: 'A Brief History of Local Llamas'
updateDate: 'Feb 18 2024'
heroImage: '/images/llama/tiny-llama-logo.avif'
---

<p style="font-size: max(2vh, 10px); margin-top: 0; text-align: right">
"TinyLlama logo" by <a href="https://github.com/jzhang38/TinyLlama">The
TinyLlama project</a>. Licensed under Apache 2.0
</p>

## My Background

Most of my knowledge of this field comes from a few guest lectures and the
indispensable [r/localllama](https://www.reddit.com/r/LocalLLaMA/) community,
which always has the latest news about local llamas. I became a fan of the
local llama movement in December 2023, so the "important points" covered here
were chosen in retrospect.

Throughout this piece, the terms "large language model" and "llama" are used
interchangeably. The same goes for the terms "open source and locally hosted
llama" and "local llama".

Whenever you see numbers like 7B, that means the llama has 7 billion parameters.
More parameters generally means the model is smarter, but also bigger and more
demanding to run.

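To put rough numbers on "bigger": the memory needed just to hold a model's
weights scales with the parameter count and the bits stored per weight. Here's
a back-of-the-envelope sketch; the 7B and 4-bit figures are only illustrative,
and it ignores the context cache and runtime overhead.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory (GiB) needed just to hold the model weights."""
    return n_params * bits_per_weight / 8 / 2**30

# A 7B model in 16-bit floats vs. a 4-bit quantization of the same model:
print(f"7B @ 16-bit: {weight_memory_gib(7e9, 16):.1f} GiB")  # ~13.0 GiB
print(f"7B @  4-bit: {weight_memory_gib(7e9, 4):.1f} GiB")   # ~3.3 GiB
```
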
## Modern History

The modern history of llamas begins in 2022. 90% of the sources that llama
papers cite these days are from 2022 onwards.

Here's a brief timeline:

1. **March 2022**: The InstructGPT paper is published as a pre-print.
2. **November 2022**: ChatGPT is released.
3. **March 2023**: LLaMA (open source) and GPT4 are released.
4. **July 2023**: LLaMA 2 is released; GGUF quantization follows soon after.
5. **August 2023**: The AWQ quantization paper is published.
6. **September 2023**: Mistral 7B is released.
7. **December 2023**: Mixtral 8x7B becomes the first MoE local llama.

#### Early 2022

In March 2022, OpenAI released a paper improving the conversational ability of
their then-uncontested GPT3. Interest in
[InstructGPT](https://arxiv.org/pdf/2203.02155.pdf) was largely limited to the
academic community. As of writing, InstructGPT remains the last major paper
OpenAI released on this topic.

The remainder of 2022 was mostly focused on text-to-image models. OpenAI's
[DALL-E 2](https://openai.com/dall-e-2) led the way, but the open source
community kept pace with
[Stable Diffusion](https://github.com/Stability-AI/generative-models).
[Midjourney](https://www.midjourney.com/) eventually ended up on top by
mid-2022, and caused a huge amount of controversy with artists by [winning the
Colorado State
Fair's](https://en.wikipedia.org/wiki/Th%C3%A9%C3%A2tre_D%27op%C3%A9ra_Spatial)
fine-art contest.

#### Late 2022

In late November 2022, OpenAI [released
ChatGPT](https://openai.com/blog/chatgpt), which is generally speculated to be a
larger version of InstructGPT. This model is single-handedly responsible for the
NLP Boom. ChatGPT revolutionized the corporate perception of chatbots and AI in
general. It was considered a form of disruptive innovation in the search engine
market, leading Google to hold record layoffs in [January
2023](https://blog.google/inside-google/message-ceo/january-update/).

At this point, the local llama movement practically didn't exist. Front-ends,
especially for chatbot role-play as later exemplified by
[SillyTavern](https://github.com/SillyTavern/SillyTavern), began development,
but they were all still running through OpenAI's ChatGPT API.

#### Early 2023

In March 2023, Meta kick-started the local llama movement by releasing
[LLaMA](https://ai.meta.com/blog/large-language-model-llama-meta-ai/), a 65B
parameter llama that was open source! Benchmarks aside, it was not very good,
and ChatGPT continued to be viewed as considerably better. However, it provided
a strong base for further iteration, and gave the local llama community a much
stronger model than any other local llama available at the time.

Around this time, OpenAI released [GPT4](https://openai.com/research/gpt-4), a
model that undeniably broke through all records. In fact, GPT4 remains the best
llama as of December 2023. The original ChatGPT is now referred to as GPT3.5,
to disambiguate it from GPT4. This caused much of the local llama community to
continue focusing on building frontends, while using GPT4's API for inference.
Nothing open source was even remotely close to GPT4 at this point.

#### Mid 2023

We finally see the local llama movement really take off around August 2023. Meta
released [LLaMA2](https://ai.meta.com/blog/llama-2/), which has decent
performance even in its 7B version. Another key development of this period was
the GGUF quantization format. This format allows a model to be run on a mix of
RAM and VRAM, which meant many home computers could now run 4-bit quantized 7B
models! Previously, most enthusiasts would have to rent cloud GPUs to run their
"local" llamas. Quantizing into GGUF is a very expensive process, so
[TheBloke](https://huggingface.co/TheBloke) on Huggingface emerges as the de
facto source for [pre-quantized llamas](../quantization).

Built around LLaMA, the open source
[llama.cpp](https://github.com/ggerganov/llama.cpp) becomes the leader among
local llama inference backends. Its support extends far beyond LLaMA2; it's the
project that introduced GGUF, and the first major backend to support running
GGUF quantizations!

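To make the RAM/VRAM mixing concrete, here's a small sketch of what running a
pre-quantized 4-bit GGUF model can look like through the community
llama-cpp-python bindings. The bindings, the file name, and the parameter
values are my own illustrative choices, not something prescribed by llama.cpp:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # placeholder: any 4-bit GGUF file
    n_ctx=2048,        # context window size
    n_gpu_layers=20,   # layers offloaded to VRAM; the rest stay in system RAM
)

out = llm("Q: What is a local llama?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```
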
In addition, the [Activation-aware Weight
Quantization](https://arxiv.org/abs/2306.00978) (AWQ) paper is released at this
time. It uses activation statistics to decide which weights matter most during
quantization, improving both the speed and quality of quantized models. This is
especially helpful for heavily quantized models, like the 4-bit quantizations
that have become the standard in the local llama community by this point. AWQ
has no support in any major backend yet.

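For a sense of what "4-bit quantization" actually stores, here's a tiny sketch
of plain round-to-nearest group quantization, the baseline that AWQ improves on
by rescaling channels with activation statistics. This is illustrative only,
not the AWQ algorithm:

```python
import numpy as np

def quantize_q4_rtn(weights: np.ndarray, group_size: int = 32):
    """Naive 4-bit round-to-nearest quantization with one scale per group."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # map the largest weight to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit integer range
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).astype(np.float32)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4_rtn(w)
print("max abs error:", np.abs(dequantize(q, s).reshape(-1) - w).max())
```
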
In late September 2023, out of nowhere came a French startup with a 7B model
that made leaps on top of LLaMA2 7B.
[Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) remains the best
local llama until mid-December 2023. A huge amount of fine-tuning work,
particularly on character creation and code-assistant models, is done on top of
Mistral 7B.

#### Late 2023

It's worth noting that at this point, there is *still* no local competitor to
GPT3.5. But that was all about to change on December 11th 2023, when Mistral
released [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/). Mixtral is
the first major local llama to use the same technique GPT4 is speculated to use:
a mixture of experts (MoE). While about 1/40th the speculated size of GPT4 and
1/3rd that of GPT3.5, Mixtral is able to go toe-to-toe with GPT3.5 in both
benchmarks and user reviews. This is hailed as a landmark achievement by the
local llama community, demonstrating that open source models are able to compete
with commercially developed ones. The achievement is amplified by comparing
Mixtral against Google's freshly unveiled
[Gemini](https://blog.google/technology/ai/google-gemini-ai/) models, announced
the same week. r/localllama reviews generally suggest Mixtral pulls ahead of
Gemini in conversational tasks.

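For anyone new to the term, here's a minimal, illustrative sketch of the
mixture-of-experts idea. All names here are made up for this post and this is
nothing like Mixtral's actual implementation; the point is just that a router
scores the experts, only the top few run for each token, and their outputs are
blended:

```python
import numpy as np

def moe_layer(x, experts, router, top_k=2):
    """Sparse mixture-of-experts routing for a single token vector x."""
    scores = router @ x                                       # one score per expert
    top = np.argsort(scores)[-top_k:]                         # indices of the k best experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the winners
    # Only the chosen experts run, so per-token compute scales with top_k,
    # not with the total number of experts (Mixtral: 8 experts, 2 active).
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Toy usage: 8 tiny "experts", 2 active per token, as in Mixtral 8x7B.
rng = np.random.default_rng(0)
dim, n_experts = 16, 8
expert_weights = [rng.standard_normal((dim, dim)) for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_weights]
router = rng.standard_normal((n_experts, dim))
print(moe_layer(rng.standard_normal(dim), experts, router).shape)  # (16,)
```
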
Apple unexpectedly joins the local llama movement by open-sourcing their
[Ferret model](https://github.com/apple/ml-ferret) in mid-December. This model
builds upon LLaVA, previously the leading multi-modal llama for images in the
local llama community.

In very late December, llama.cpp [merges
AWQ](https://github.com/ggerganov/llama.cpp/pull/4593) support. In the coming
year, I expect AWQ to largely replace GPTQ in the local llama community, though
GGUF will remain more popular in general.

Just three days before the end of the year,
[TinyLlama](https://github.com/jzhang38/TinyLlama) releases the 3-trillion-token
checkpoint of their 1.1B model. This minuscule model sets a new lower bound for
the number of parameters required to make a capable llama, enabling even more
users to locally host their llama. In practice, TinyLlama easily goes
toe-to-toe with Microsoft's restrictively licensed [Phi-2
2.7B](https://huggingface.co/microsoft/phi-2), released just a few weeks
earlier. This is a huge win for the open source community, demonstrating how
open source models can get ahead of commercial ones.

#### Going Forward

With the release of Mixtral, the local llama community is hoping 2024 will be
the turning point where open source models finally break ahead of commercial
models. However, as of writing, it's very unclear how the community will
surpass GPT4, the llama that remains uncontested in practice.

#### Early 2024

This is where we currently are! These entries are just dated notes for now.
We'll see how much impact they have in a retrospective:

- **2024-01-22**: Bard with Gemini Pro defeats all models except GPT4-Turbo in
  the Chatbot Arena. This is seen as questionably fair, since Bard has internet
  access.
- **2024-01-29**: miqu gets released. This is a suspected Mistral Medium leak.
  Despite only a 4-bit quantized version being available, it's ahead of all
  current local llamas.
- **2024-01-30**: Yi-34B becomes the largest base model used for local
  language-vision: LLaVA 1.6, built on top of it, sets new records in vision
  performance.
- **2024-02-08**: Google releases Gemini Advanced, a GPT4 competitor with
  similar pricing. Public opinion seems to be that it's quite a bit worse than
  GPT4, except that it's less censored and much better at creative writing.
- **2024-02-15**: Google releases Gemini Pro 1.5, with 1 million tokens of
  context! Third-party testing on r/localllama shows it's effectively able to
  query very large codebases, beating out GPT4 (with 32k context) on every test.
- **2024-02-15**: OpenAI announces Sora, a text-to-video model for up to 60s of
  video. A huge amount of hype starts up around it "simulating the world", but
  it's only open to a very small tester group.
- **2024-02-26**: Mistral releases Mistral Large and simultaneously removes all
  mentions of a commitment to open source from their website. They revert this
  change the following day, after community backlash.
- **2024-03-27**: Databricks open-sources DBRX, a 132B parameter MoE with 36B
  parameters active per forward pass, trained on 12T tokens. According to user
  evaluations, it beats Mixtral for all uses.
- **2024-04-18**: Meta releases LLaMA3 8B and 70B. The 70B is the new best open
  model, landing right around Claude 3 Sonnet and above older GPT4 versions!