Llama: add base64 llama blog

src/content/llama/the-secret-learnings-of-llamas.md

---
title: 'The Secret Learnings of Llamas'
description: 'The useless things our llamas are learning'
updateDate: 'Mar 28 2024'
heroText: 'Base64 LLama'
---

# The Secret Learnings of Llamas

Tool use by llamas is an active area of research. Recent implementations like
Devin promise great productivity increases through tool use. I was investigating
tool use by some modern llamas when I made an unfortunate discovery.

It appears most large llamas have learned a new language, in addition to the
ones that were intended: base64.

### Base64 Background

Base64 is a simple encoding scheme. This is different from encryption and
hashing: those provide security, while base64 just transforms data into a
portable form.

Each byte is 8 bits. This means there are 2^8 (256) possible bytes, since each
bit contributes 2 states. Base64 instead encodes data so that each character
only carries 2^6 (64) possible states. This makes the vocabulary much smaller:
just 64 letters, numbers, and symbols are enough to hold 64 states per
character.

Let's visualize how base64 works. Say we have the following word:

```
Hello
```

Its UTF-8 encoding is shown below. I used the `ord` function in Python to get
the numbers in the `Base 10` row. I then converted the base 10 representations
to octal (base 8) and binary (base 2). The bottom two rows repeat the octal and
binary rows, but the spacing makes it easier to see the direct mapping from
octal to binary (each value is padded to 9 bits so it lines up with three octal
digits):

```
Letters:         H           e           l           l           o
Base 10:         72          101         108         108         111
Base 8:          110         145         154         154         157
Base 2:          001001000   001100101   001101100   001101100   001101111
Base 8 (spaced): 1 1 0       1 4 5       1 5 4       1 5 4       1 5 7
Base 2 (spaced): 001 001 000 001 100 101 001 101 100 001 101 100 001 101 111
```

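To reproduce these rows yourself, here's a small Python sketch (my own, using
the standard `ord` and `format` built-ins):

```python
# Print the Base 10, Base 8, and Base 2 rows for each letter.
for ch in "Hello":
    n = ord(ch)  # code point, e.g. H -> 72
    print(ch, n, format(n, "03o"), format(n, "09b"))
# H 72 110 001001000
# e 101 145 001100101
# l 108 154 001101100
# l 108 154 001101100
# o 111 157 001101111
```
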
Notice how every 3 binary digits map 1:1 onto a single octal digit. This means
octal can represent 2^3 (8) states per digit. Octal only uses the digits 0-7,
but what if we wanted to represent 2^6 states per digit? Base64 does this by
using `A-Z`, `a-z`, `0-9`, `+`, and `/` as its digits, in that order of value.
That gives 64 digits.

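As a quick sanity check on that count (a sketch of mine using Python's standard
`string` module):

```python
import string

# A-Z (values 0-25), a-z (26-51), 0-9 (52-61), then + (62) and / (63).
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
assert len(alphabet) == 64
```
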
Now we can map in reverse. Base64 operates on the raw bytes, which are 8 bits
each (the 9-bit forms above were only padded to line up with octal digits).
Concatenate the bytes into one bit stream, split it into 6-bit groups, and read
each group as a base64 digit; conveniently, each 6-bit group is exactly two
octal digits. `Hello` is 5 bytes (40 bits), which doesn't split evenly into
6-bit groups, so the final group is padded with two zero bits, and a trailing
`=` in the output records that padding:

```
Base 2 (bytes):  01001000 01100101 01101100 01101100 01101111
Base 2 (6-bit):  010010 000110 010101 101100 011011 000110 111100
Base 8 (spaced): 22     06     25     54     33     06     74
Base 64:         S      G      V      s      b      G      8
```

So we can encode the word `Hello` as `SGVsbG8=` in base64! You can verify this
with `echo -n Hello | base64`. Base64 is often used to encode images and other
binary data for storage in JSON. It is not space efficient, inflating data by
about a third (4 output characters per 3 input bytes), but the result is made
entirely of printable characters.

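Here's the same procedure as a short Python sketch; the `manual_b64` helper is
my own illustration of the steps above, checked against the standard `base64`
module:

```python
import base64
import string

# Base64's digit alphabet, in value order.
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def manual_b64(data: bytes) -> str:
    # Concatenate the bytes into one bit string...
    bits = "".join(format(b, "08b") for b in data)
    # ...pad with zero bits up to a multiple of 6...
    bits += "0" * (-len(bits) % 6)
    # ...and read each 6-bit group as a base64 digit.
    out = "".join(alphabet[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # '=' pads the output to a multiple of 4 characters.
    return out + "=" * (-len(out) % 4)

print(manual_b64(b"Hello"))                        # SGVsbG8=
print(base64.b64encode(b"Hello").decode("ascii"))  # SGVsbG8=
```
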
## Base64 Llamas

It appears large llamas have learned base64, similar to how n-grams learned
speech. You can test this yourself! Just go to Mistral's [Le
Chat](https://chat.mistral.ai) or Databricks' new and [open DBRX
model](https://huggingface.co/spaces/databricks/dbrx-instruct) and try decoding
some data!

You can generate these on Unix using the `base64` program. For example:

```bash
# Note: echo appends a trailing newline, which gets encoded too.
echo 'how are you today?' | base64
# Gives aG93IGFyZSB5b3UgdG9kYXk/Cg==
```

Then ask a llama about `aG93IGFyZSB5b3UgdG9kYXk/Cg==` or whatever other string
you want. You'll notice that they break down after about 10-20 characters,
depending on how good the llama is.

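If you want test strings of varying length, a small Python sketch like this
works (the message and step size are arbitrary choices of mine):

```python
import base64

# Encode longer and longer prefixes of a message to find the point
# where a model's base64 decoding starts to break down.
message = "how are you today? I hope the weather is nice."
for n in range(4, len(message) + 1, 8):
    encoded = base64.b64encode(message[:n].encode("utf-8")).decode("ascii")
    print(f"{n:3d} chars: {encoded}")
```
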
You could also ask for the opposite. If a llama gives you
`aG93IGFyZSB5b3UgdG9kYXk/Cg==`, you can decode it with:

```bash
echo 'aG93IGFyZSB5b3UgdG9kYXk/Cg==' | base64 -d
```

The prompts should look something like:

```
Decode the following base64 message: aG93IGFyZSB5b3UgdG9kYXk/Cg==

Encode "emiliko@mami2.moe" into base64.
```

## What are Llamas Learning?

This discovery was shocking to me. I thought they were achieving this through
tool use, but I can cross-verify on local llamas, which most certainly don't
have access to tools. This means our 100-billion-scale llamas are learning to
be base64 decoders?

Of course, this is a completely pointless feature, as no llama will ever be
more energy efficient than a trivially coded base64 tool. The llamas likely
picked it up while training on sample code, but the degree to which they picked
it up is incredible!

This has led me to wonder: what other completely pointless things are our
llamas learning? This one was an unintended side effect of learning to code,
but what other side effects is our data having?