---
title: 'A Brief History of Local Llamas'
description: 'A Brief History of Local Llamas'
updateDate: 'Dec 31 2023'
heroImage: '/images/tiny-llama-logo.avif'
---

<p style="font-size: max(2vh, 10px); margin-top: 0; text-align: right">
"TinyLlama logo" by <a href="https://github.com/jzhang38/TinyLlama">The
TinyLlama project</a>. Licensed under Apache 2.0
</p>

## My Background

I've been taking machine learning courses throughout the "modern history" of
llamas. When ChatGPT was first released, we brought in a guest lecturer on the
NLP methods of the time. Since then, I've also taken an NLP course, though not
one focused on deep learning.

Most of my knowledge of this field comes from a few guest lectures, and the
indispensable [r/locallama](https://www.reddit.com/r/LocalLLaMA/) community,
which always has the latest news about local llamas. I became a fan of the
local llama movement in December 2023, so the "important points" covered here
come from a retrospective.

I use the terms "large language model" and "llama" interchangeably throughout
this piece. I write "open source and locally hosted llama" as "local llama".
Whenever you see a number like 7B, it means the llama has 7 billion parameters.
More parameters generally make the model smarter, but also bigger.

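As a rough back-of-the-envelope illustration of what "bigger" means (my own
numbers, ignoring runtime overhead like the KV cache and activations), the
parameter count and the precision of the stored weights roughly determine how
much memory a llama needs:

```python
# Rough memory estimate for a model, from parameter count and bits per weight.
# Illustrative only: real usage adds overhead (KV cache, activations, buffers).
def approx_size_gb(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{approx_size_gb(7e9, bits):.1f} GB")
# Prints roughly 14.0, 7.0, and 3.5 GB respectively.
```
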
## Modern History

The modern history of llamas begins in 2022. Roughly 90% of the sources that
llama papers cite these days are from 2022 onwards.

Here's a brief timeline:

1. **March 2022**: The InstructGPT paper appears as a pre-print.
2. **November 2022**: ChatGPT is released.
3. **March 2023**: LLaMa (open source) and GPT4 are released.
4. **July 2023**: LLaMa 2 is released; the GGUF quantization format follows
   shortly after.
5. **August 2023**: The AWQ quantization paper is published.
6. **September 2023**: Mistral 7B is released.
7. **December 2023**: Mixtral 8x7B becomes the first MoE local llama.

#### Early 2022

In March 2022, OpenAI released a paper on improving the conversational ability
of their then-uncontested GPT3. Interest in
[InstructGPT](https://arxiv.org/pdf/2203.02155.pdf) was largely limited to the
academic community. As of writing, InstructGPT remains the last major paper
OpenAI has released on this topic.

The remainder of 2022 was mostly focused on text-to-image models. OpenAI's
[DALL-E 2](https://openai.com/dall-e-2) led the way, but the open source
community kept pace with [Stable
Diffusion](https://github.com/Stability-AI/generative-models).
[Midjourney](https://www.midjourney.com/) eventually ended up on top by
mid-2022, and caused a huge amount of controversy among artists by [winning the
Colorado State
Fair's](https://en.wikipedia.org/wiki/Th%C3%A9%C3%A2tre_D%27op%C3%A9ra_Spatial)
fine-art contest.

#### Late 2022

In late November 2022, OpenAI [released
ChatGPT](https://openai.com/blog/chatgpt), which is generally speculated to be a
larger version of InstructGPT. This model is single-handedly responsible for the
NLP boom. ChatGPT revolutionized the corporate perception of chatbots and AI in
general. It was considered a form of disruptive innovation in the search engine
market, leading Google to hold record layoffs in [January
2023](https://blog.google/inside-google/message-ceo/january-update/).

At this point, the local llama movement practically didn't exist. Front-ends,
especially for chatbot role-play as later exemplified by
[SillyTavern](https://github.com/SillyTavern/SillyTavern), began development,
but they were all still running through OpenAI's ChatGPT API.

#### Early 2023

In March 2023, Meta kick-started the local llama movement by releasing
[LLaMa](https://ai.meta.com/blog/large-language-model-llama-meta-ai/), a 65B
parameter llama that was open source! Benchmarks aside, it was not very good;
ChatGPT continued to be viewed as considerably better. However, it provided a
strong base for further iteration, and gave the local llama community a much
stronger model than any other local llama at the time.

Around this time, OpenAI released [GPT4](https://openai.com/research/gpt-4), a
model that undeniably broke through all records. In fact, GPT4 remains the best
llama as of December 2023. The original ChatGPT is now referred to as GPT3.5, to
disambiguate it from GPT4. This caused much of the local llama community to
continue focusing on building frontends, while using GPT4's API for inference.
Nothing open source was even remotely close to GPT4 at this point.

#### Mid 2023

We finally see the local llama movement really take off around August 2023. Meta
released [LLaMa2](https://ai.meta.com/blog/llama-2/), which has decent
performance even in its 7B version, and shortly afterwards the llama.cpp project
introduced the GGUF quantization format. GGUF allows a model to be run on a mix
of RAM and VRAM, which meant many home computers could now run 4-bit quantized
7B models! Previously, most enthusiasts would have to rent cloud GPUs to run
their "local" llamas. Quantizing into GGUF is a very expensive process, so
[TheBloke](https://huggingface.co/TheBloke) on Huggingface emerges as the de
facto source for pre-quantized llamas.

Based on LLaMa, the open source
[llama.cpp](https://github.com/ggerganov/llama.cpp) becomes the leader among
local llama inference backends. Its support extends far beyond only running
LLaMa2; it's the first major backend to support running GGUF quantizations!

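As a minimal sketch of what this looks like in practice (assuming the
llama-cpp-python bindings on top of llama.cpp, plus a pre-quantized GGUF file;
the file name below is only an example of the kind of artifact TheBloke
publishes):

```python
# Minimal sketch: run a 4-bit GGUF model via llama-cpp-python (bindings over
# llama.cpp), offloading some layers to VRAM while the rest stay in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.Q4_K_M.gguf",  # example pre-quantized model file
    n_gpu_layers=20,  # layers offloaded to the GPU; 0 keeps everything in RAM
    n_ctx=2048,       # context window size
)

out = llm("Q: Why run a llama locally? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The `n_gpu_layers` knob is what makes the RAM/VRAM split work: a machine with
only a few gigabytes of VRAM can offload part of the model and keep the
remaining layers in system memory.
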
In addition, the [Activation-aware Weight
Quantization](https://arxiv.org/abs/2306.00978) (AWQ) paper is released around
this time. It uses activation statistics to protect the most important weights
during quantization, improving both the speed and the quality of quantized
models. The gains are especially large for very heavily quantized models, like
the 4-bit quantizations that have become the standard in the local llama
community at this point. However, AWQ lacks support in any backend at this time.

In late September 2023, out of nowhere came a French startup with a 7B model
that made leaps on top of LLaMa2 7B. [Mistral
7B](https://mistral.ai/news/announcing-mistral-7b/) remains the best local llama
until mid-December 2023. A huge amount of work on model tuning, particularly for
character creation and code-assistant models, is done on top of Mistral 7B.

#### Late 2023

It's worth noting that at this point, there is *still* no local competitor to
GPT3.5. But that was all about to change on December 11th 2023, when Mistral
released [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/). Mixtral is
the first major local llama to use the same technique as GPT4: a mixture of
experts (MoE). While about 1/40th the speculated size of GPT4 and 1/3rd that of
GPT3.5, Mixtral is able to go toe-to-toe with GPT3.5 both in benchmarks and user
reviews. This is hailed as a landmark achievement by the local llama community,
demonstrating that open source models are able to compete with commercially
developed ones. The achievement is amplified by comparisons of Mixtral against
Google's freshly-unveiled
[Gemini](https://blog.google/technology/ai/google-gemini-ai/) models in the same
week. R/localllama reviews generally suggest Mixtral pulls ahead of Gemini in
conversational tasks.

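For a rough idea of what "mixture of experts" means, here is a toy sketch (my
own simplification, not Mixtral's actual architecture): a router scores several
expert feed-forward networks for each token, and only the top-2 experts are
evaluated, so each token touches only a fraction of the total parameters.

```python
# Toy top-2 mixture-of-experts layer. Illustrative only: real MoE layers like
# Mixtral's use learned routers and SwiGLU experts inside transformer blocks.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 32, 8, 2
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router                       # routing scores for this token
    top = np.argsort(scores)[-top_k:]         # indices of the top-2 experts
    gate = np.exp(scores[top])
    gate /= gate.sum()                        # softmax over the chosen experts
    out = np.zeros(d)
    for g, i in zip(gate, top):
        w_in, w_out = experts[i]
        out += g * (np.maximum(x @ w_in, 0.0) @ w_out)  # small ReLU FFN expert
    return out

print(moe_forward(rng.normal(size=d)).shape)  # (32,)
```

Because only 2 of the 8 experts run per token, an MoE model can hold far more
parameters in total than it actually uses for any single token, which is how
Mixtral stays fast despite its size.
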
Apple unexpectedly joins the local llama movement by open-sourcing their
[Ferret model](https://github.com/apple/ml-ferret) in mid-December. This model
builds upon LLaVA, previously the leading multi-modal llama for images in the
local llama community.

In very late December, llama.cpp [merges
AWQ](https://github.com/ggerganov/llama.cpp/pull/4593) support. In the coming
year, I expect AWQ to largely replace GPTQ in the local llama community, though
GGUF will remain more popular in general.

Just 3 days from the end of the year,
[TinyLlama](https://github.com/jzhang38/TinyLlama) releases the 3 trillion token
checkpoint of its 1.1B model. This minuscule model sets a new lower bound on the
number of parameters required to make a capable llama, enabling more users to
locally host their llamas. In practice, TinyLlama easily goes toe-to-toe with
Microsoft's closed-source [Phi-2
2.7B](https://huggingface.co/microsoft/phi-2), released just a few weeks
earlier. This is a huge win for the open source community, demonstrating how
open source models can get ahead of commercial ones.

#### Going Forward

With the release of Mixtral, the local llama community is hoping 2024 will be
the turning point where open source models finally break ahead of commercial
models. However, as of writing, it's very unclear how the community will get
past GPT4, the llama that remains uncontested in practice.