Llama: add base64 llama blog

Akemi Izuko 2024-03-28 19:52:25 -06:00
parent 67b25a1814
commit a50bd21781
Signed by: akemi
GPG key ID: 8DE0764E1809E9FC


---
title: 'The Secret Learnings of Llamas'
description: 'The useless things our llamas are learning'
updateDate: 'Mar 28 2024'
heroText: 'Base64 Llama'
---
# The Secret Learnings of Llamas
Tool use by llamas is an active area of research. Recent implementations like
Devin promise great productivity increases through tool use. I was investigating
tool use by some modern llamas when I made an unfortunate discovery.
It appears most large llamas have learned a new language, in addition to the
ones they were intended to learn: base64.
### Base64 Background
Base64 is a simple encoding scheme. This is different from encryption and
hashing, which provide security guarantees; base64 just transforms data into a
portable form.
Each byte is 8 bits. This means there are 2^8 (256) possible bytes, since each
bit contributes 2 states. Base64 re-encodes the data so that each character
only stores 2^6 (64) possible states. The vocabulary becomes much smaller: just
64 printable characters are enough to cover every state.
Let's visualize how base64 works. Say we have the following word:
```
Hello
```
Its utf-8 encoding is shown below. I used the `ord` function in python to get
the numbers in the `Base 10` row. I then converted the base 10 representations
to octal (base 8) and binary (base 2), padding each byte to 9 bits so every
octal digit lines up with exactly 3 bits. The bottom two rows are the same as
the two above them, but the spacing makes it easier to see the direct mapping
from octal to binary:
```
Letters:          H         e         l         l         o
Base 10:          72        101       108       108       111
Base 8:           110       145       154       154       157
Base 2:           001001000 001100101 001101100 001101100 001101111
Base 8 (spaced):  1   1   0   1   4   5   1   5   4   1   5   4   1   5   7
Base 2 (spaced):  001 001 000 001 100 101 001 101 100 001 101 100 001 101 111
```
Notice how there's a 1:1 mapping between every 3 binary digits and each octal
digit. This means octal can represent 2^3 (8) states per digit. Octal only uses
the digits 0-7, but what if we wanted to represent 2^6 states per digit? Base64
does this by using `A-Z`, `a-z`, `0-9`, `+`, and `/`, in that order. That gives
64 digits.
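If you want to reproduce the table above, here is a minimal python sketch using
the same `ord` function (the 9-bit padding matches the table's layout, not how
the bytes are actually stored):
```python
# Print the base 10, base 8, and base 2 rows for each letter
for letter in "Hello":
    n = ord(letter)  # the utf-8 code point as a base 10 number
    print(letter, n, format(n, "03o"), format(n, "09b"))
```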
Now we can go the other way. Real base64 works on the raw 8-bit bytes:
concatenate them into one long bit string, split it into groups of 6 bits, and
map each group to one of the 64 digits:
```
Base 2 (8 bits):  01001000 01100101 01101100 01101100 01101111
Base 2 (6 bits):  010010 000110 010101 101100 011011 000110 111100
Base 10:          18     6      21     44     27     6      60
Base 64:          S      G      V      s      b      G      8
```
The last group only had 4 real bits, so it's padded out with two zero bits, and
a trailing `=` character marks that padding in the output.
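In code, the whole scheme looks something like the sketch below. The
`b64_encode` function is my own illustration of the regrouping; for real work,
python's standard library already provides `base64.b64encode`:
```python
import base64

# The 64 "digits", in order: A-Z, a-z, 0-9, +, /
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def b64_encode(data: bytes) -> str:
    # Concatenate all bytes into one long bit string
    bits = "".join(format(b, "08b") for b in data)
    # Pad with zero bits up to a multiple of 6
    bits += "0" * (-len(bits) % 6)
    # Map each 6-bit group to one character
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # '=' pads the output to a multiple of 4 characters
    return "".join(chars) + "=" * (-len(chars) % 4)

print(b64_encode(b"Hello"))                 # SGVsbG8=
print(base64.b64encode(b"Hello").decode())  # SGVsbG8= (stdlib agrees)
```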
So we can encode the word `Hello` as `SGVsbG8=` in base64! You can verify this
with `echo -n Hello | base64`. Base64 is often used to encode images and other
binary data for storage in JSON. It is not space efficient, inflating the data
by about a third, but the result is entirely made of printable characters.
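As a hypothetical example of that JSON use case, using python's standard
`base64` and `json` modules (the `data` field name is made up):
```python
import base64
import json

payload = b"Hello"  # stand-in for real binary data, like an image
doc = json.dumps({"data": base64.b64encode(payload).decode("ascii")})
print(doc)  # {"data": "SGVsbG8="}
```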
## Base64 Llamas
It appears large llamas have learned base64, similar to how n-grams learned
speech. You can test this yourself! Just go onto Mistral's [Le
Chat](https://chat.mistral.ai) or Databricks' new and [open DBRX
model](https://huggingface.co/spaces/databricks/dbrx-instruct) and try decoding
some data!
You can generate these strings on unix using the `base64` program. For example:
```bash
echo 'how are you today?' | base64
# Gives aG93IGFyZSB5b3UgdG9kYXk/Cg==
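# The trailing Cg== encodes the newline that echo appends;
# use echo -n if you don't want it included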
```
Then ask a llama about `aG93IGFyZSB5b3UgdG9kYXk/Cg==` or whatever other string
you want. You'll notice that they break down after about 10-20 characters,
depending on how good the llama is.
You can also ask for the opposite. If a llama gives
`aG93IGFyZSB5b3UgdG9kYXk/Cg==`, you can decode it with:
```bash
echo 'aG93IGFyZSB5b3UgdG9kYXk/Cg==' | base64 -d
```
The prompts should look something like:
```
Decode the following base64 message: aG93IGFyZSB5b3UgdG9kYXk/Cg==
Encode "emiliko@mami2.moe" into base64.
```
## What are Llamas Learning?
This discovery was shocking to me. I thought they were achieving this through
tool use, but I could cross-verify on local llamas, which most certainly don't
have access to tools. This means our 100-billion-parameter llamas are learning
to be base64 decoders?
Of course this is a completely pointless feature, as no llama will ever be more
energy efficient than a trivially coded base64 tool. The llamas likely picked it
up while training on sample code, but the degree to which they picked it up is
incredible!
This has led me to wonder: what other completely pointless things are our
llamas learning? This one was an unintended side effect of learning to code, but
what other side effects is our data having?