---
title: 'A Brief History of Local Llamas'
description: 'A Brief History of Local Llamas'
updateDate: 'Feb 18 2024'
heroImage: '/images/llama/tiny-llama-logo.avif'
---

<p style="font-size: max(2vh, 10px); margin-top: 0; text-align: right">
"TinyLlama logo" by <a href="https://github.com/jzhang38/TinyLlama">The
TinyLlama project</a>. Licensed under Apache 2.0
</p>

## My Background

Most of my knowledge of this field comes from a few guest lectures and the
indispensable [r/localllama](https://www.reddit.com/r/LocalLLaMA/) community,
which always has the latest news about local llamas. I became a fan of the
local llama movement in December 2023, so the "important points" covered here
were chosen in retrospect.

Throughout this piece, the terms "large language model" and "llama" are used
interchangeably. The same goes for the terms "open source and locally hosted
llama" and "local llama".

Whenever you see numbers like 7B, that means the llama has 7 billion parameters.
More parameters generally means the model is smarter, but also bigger and more
demanding to run.

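To put rough numbers on "bigger": the memory needed just to hold a model's
weights scales with the parameter count and the bits stored per weight. Here's
a back-of-the-envelope sketch; the 7B and 4-bit figures are only illustrative,
and it ignores the context cache and runtime overhead.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory (GiB) needed just to hold the model weights."""
    return n_params * bits_per_weight / 8 / 2**30

# A 7B model in 16-bit floats vs. a 4-bit quantization of the same model:
print(f"7B @ 16-bit: {weight_memory_gib(7e9, 16):.1f} GiB")  # ~13.0 GiB
print(f"7B @  4-bit: {weight_memory_gib(7e9, 4):.1f} GiB")   # ~3.3 GiB
```
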
## Modern History

The modern history of llamas begins in 2022. 90% of the sources that llama
papers cite these days are from 2022 onwards.

Here's a brief timeline:

1. **March 2022**: The InstructGPT paper is published as a pre-print.
2. **November 2022**: ChatGPT is released.
3. **March 2023**: LLaMA (open source) and GPT4 are released.
4. **July 2023**: LLaMA 2 is released; GGUF quantization follows soon after.
5. **August 2023**: The AWQ quantization paper is published.
6. **September 2023**: Mistral 7B is released.
7. **December 2023**: Mixtral 8x7B becomes the first MoE local llama.

#### Early 2022

In March 2022, OpenAI released a paper improving the conversational ability of
their then-uncontested GPT3. Interest in
[InstructGPT](https://arxiv.org/pdf/2203.02155.pdf) was largely limited to the
academic community. As of writing, InstructGPT remains the last major paper
OpenAI released on this topic.

The remainder of 2022 was mostly focused on text-to-image models. OpenAI's
[DALL-E 2](https://openai.com/dall-e-2) led the way, but the open source
community kept pace with
[Stable Diffusion](https://github.com/Stability-AI/generative-models).
[Midjourney](https://www.midjourney.com/) eventually ended up on top by
mid-2022, and caused a huge amount of controversy with artists by [winning the
Colorado State
Fair's](https://en.wikipedia.org/wiki/Th%C3%A9%C3%A2tre_D%27op%C3%A9ra_Spatial)
fine-art contest.

#### Late 2022

In late November 2022, OpenAI [released
ChatGPT](https://openai.com/blog/chatgpt), which is generally speculated to be a
larger version of InstructGPT. This model is single-handedly responsible for the
NLP Boom. ChatGPT revolutionized the corporate perception of chatbots and AI in
general. It was considered a form of disruptive innovation in the search engine
market, leading Google to hold record layoffs in [January
2023](https://blog.google/inside-google/message-ceo/january-update/).

At this point, the local llama movement practically didn't exist. Front-ends,
especially for chatbot role-play as later exemplified by
[SillyTavern](https://github.com/SillyTavern/SillyTavern), began development,
but they were all still running through OpenAI's ChatGPT API.

#### Early 2023

In March 2023, Meta kick-started the local llama movement by releasing
[LLaMA](https://ai.meta.com/blog/large-language-model-llama-meta-ai/), a 65B
parameter llama that was open source! Benchmarks aside, it was not very good,
and ChatGPT continued to be viewed as considerably better. However, it provided
a strong base for further iteration, and gave the local llama community a much
stronger model than any other local llama available at the time.

Around this time, OpenAI released [GPT4](https://openai.com/research/gpt-4), a
model that undeniably broke through all records. In fact, GPT4 remains the best
llama as of December 2023. The original ChatGPT is now referred to as GPT3.5,
to disambiguate it from GPT4. This caused much of the local llama community to
continue focusing on building frontends, while using GPT4's API for inference.
Nothing open source was even remotely close to GPT4 at this point.

#### Mid 2023

We finally see the local llama movement really take off around August 2023. Meta
released [LLaMA2](https://ai.meta.com/blog/llama-2/), which has decent
performance even in its 7B version. Another key development of this period was
the GGUF quantization format. This format allows a model to be run on a mix of
RAM and VRAM, which meant many home computers could now run 4-bit quantized 7B
models! Previously, most enthusiasts would have to rent cloud GPUs to run their
"local" llamas. Quantizing into GGUF is a very expensive process, so
[TheBloke](https://huggingface.co/TheBloke) on Huggingface emerges as the de
facto source for [pre-quantized llamas](../quantization).

Built around LLaMA, the open source
[llama.cpp](https://github.com/ggerganov/llama.cpp) becomes the leader among
local llama inference backends. Its support extends far beyond LLaMA2; it's the
project that introduced GGUF, and the first major backend to support running
GGUF quantizations!

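To make the RAM/VRAM mixing concrete, here's a small sketch of what running a
pre-quantized 4-bit GGUF model can look like through the community
llama-cpp-python bindings. The bindings, the file name, and the parameter
values are my own illustrative choices, not something prescribed by llama.cpp:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # placeholder: any 4-bit GGUF file
    n_ctx=2048,        # context window size
    n_gpu_layers=20,   # layers offloaded to VRAM; the rest stay in system RAM
)

out = llm("Q: What is a local llama?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```
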
In addition, the [Activation-aware Weight
Quantization](https://arxiv.org/abs/2306.00978) (AWQ) paper is released at this
time. It uses activation statistics to decide which weights matter most during
quantization, improving both the speed and quality of quantized models. This is
especially helpful for heavily quantized models, like the 4-bit quantizations
that have become the standard in the local llama community by this point. AWQ
has no support in any major backend yet.

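For a sense of what "4-bit quantization" actually stores, here's a tiny sketch
of plain round-to-nearest group quantization, the baseline that AWQ improves on
by rescaling channels with activation statistics. This is illustrative only,
not the AWQ algorithm:

```python
import numpy as np

def quantize_q4_rtn(weights: np.ndarray, group_size: int = 32):
    """Naive 4-bit round-to-nearest quantization with one scale per group."""
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # map the largest weight to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # 4-bit integer range
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).astype(np.float32)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4_rtn(w)
print("max abs error:", np.abs(dequantize(q, s).reshape(-1) - w).max())
```
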
In late September 2023, out of nowhere came a French startup with a 7B model
that made leaps on top of LLaMA2 7B.
[Mistral 7B](https://mistral.ai/news/announcing-mistral-7b/) remains the best
local llama until mid-December 2023. A huge amount of fine-tuning work,
particularly on character creation and code-assistant models, is done on top of
Mistral 7B.

#### Late 2023

It's worth noting that at this point, there is *still* no local competitor to
GPT3.5. But that was all about to change on December 11th 2023, when Mistral
released [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/). Mixtral is
the first major local llama to use the same technique GPT4 is speculated to use:
a mixture of experts (MoE). While about 1/40th the speculated size of GPT4 and
1/3rd that of GPT3.5, Mixtral is able to go toe-to-toe with GPT3.5 in both
benchmarks and user reviews. This is hailed as a landmark achievement by the
local llama community, demonstrating that open source models are able to compete
with commercially developed ones. The achievement is amplified by comparing
Mixtral against Google's freshly unveiled
[Gemini](https://blog.google/technology/ai/google-gemini-ai/) models, announced
the same week. r/localllama reviews generally suggest Mixtral pulls ahead of
Gemini in conversational tasks.

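For anyone new to the term, here's a minimal, illustrative sketch of the
mixture-of-experts idea. All names here are made up for this post and this is
nothing like Mixtral's actual implementation; the point is just that a router
scores the experts, only the top few run for each token, and their outputs are
blended:

```python
import numpy as np

def moe_layer(x, experts, router, top_k=2):
    """Sparse mixture-of-experts routing for a single token vector x."""
    scores = router @ x                                       # one score per expert
    top = np.argsort(scores)[-top_k:]                         # indices of the k best experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the winners
    # Only the chosen experts run, so per-token compute scales with top_k,
    # not with the total number of experts (Mixtral: 8 experts, 2 active).
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Toy usage: 8 tiny "experts", 2 active per token, as in Mixtral 8x7B.
rng = np.random.default_rng(0)
dim, n_experts = 16, 8
expert_weights = [rng.standard_normal((dim, dim)) for _ in range(n_experts)]
experts = [lambda v, W=W: W @ v for W in expert_weights]
router = rng.standard_normal((n_experts, dim))
print(moe_layer(rng.standard_normal(dim), experts, router).shape)  # (16,)
```
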
Apple unexpectedly joins the local llama movement by open-sourcing their
[Ferret model](https://github.com/apple/ml-ferret) in mid-December. This model
builds upon LLaVA, previously the leading multi-modal llama for images in the
local llama community.

In very late December, llama.cpp [merges
AWQ](https://github.com/ggerganov/llama.cpp/pull/4593) support. In the coming
year, I expect AWQ to largely replace GPTQ in the local llama community, though
GGUF will remain more popular in general.

Just three days before the end of the year,
[TinyLlama](https://github.com/jzhang38/TinyLlama) releases the 3-trillion-token
checkpoint of their 1.1B model. This minuscule model sets a new lower bound for
the number of parameters required to make a capable llama, enabling even more
users to locally host their llama. In practice, TinyLlama easily goes
toe-to-toe with Microsoft's restrictively licensed [Phi-2
2.7B](https://huggingface.co/microsoft/phi-2), released just a few weeks
earlier. This is a huge win for the open source community, demonstrating how
open source models can get ahead of commercial ones.

#### Going Forward

With the release of Mixtral, the local llama community is hoping 2024 will be
the turning point where open source models finally break ahead of commercial
models. However, as of writing, it's very unclear how the community will
surpass GPT4, the llama that remains uncontested in practice.

#### Early 2024

This is where we currently are! These entries are just dated notes for now.
We'll see how much impact they have in a retrospective:

- **2024-01-22**: Bard with Gemini Pro defeats all models except GPT4-Turbo in
  the Chatbot Arena. This is seen as questionably fair, since Bard has internet
  access.
- **2024-01-29**: miqu gets released. This is a suspected Mistral Medium leak.
  Despite only a 4-bit quantized version being available, it's ahead of all
  current local llamas.
- **2024-01-30**: Yi-34B becomes the largest base model used for local
  language-vision: LLaVA 1.6, built on top of it, sets new records in vision
  performance.
- **2024-02-08**: Google releases Gemini Advanced, a GPT4 competitor with
  similar pricing. Public opinion seems to be that it's quite a bit worse than
  GPT4, except that it's less censored and much better at creative writing.
- **2024-02-15**: Google releases Gemini Pro 1.5, with 1 million tokens of
  context! Third-party testing on r/localllama shows it's effectively able to
  query very large codebases, beating out GPT4 (with 32k context) on every test.
- **2024-02-15**: OpenAI announces Sora, a text-to-video model for up to 60s of
  video. A huge amount of hype starts up around it "simulating the world", but
  it's only open to a very small tester group.
- **2024-02-26**: Mistral releases Mistral Large and simultaneously removes all
  mentions of a commitment to open source from their website. They revert this
  change the following day, after community backlash.
- **2024-03-27**: Databricks open-sources DBRX, a 132B parameter MoE with 36B
  parameters active per forward pass, trained on 12T tokens. According to user
  evaluations, it beats Mixtral for all uses.
- **2024-04-18**: Meta releases LLaMA3 8B and 70B. The 70B is the new best open
  model, landing right around Claude 3 Sonnet and above older GPT4 versions!