---
title: 'The Secret Learnings of Llamas'
description: 'The useless things our llamas are learning'
updateDate: 'Mar 28 2024'
heroText: 'Base64 Llama'
---

# The Secret Learnings of Llamas

Tool use by llamas is an active area of research. Recent implementations like
Devin promise great productivity increases through tool use. I was investigating
tool use by some modern llamas when I made an unfortunate discovery.

It appears most large llamas have learned a new language, in addition to the
ones that were intended: base64.

## Base64 Background

Base64 is a simple encoding scheme. This is different from encryption and
hashing, which provide security; base64 just transforms data into a portable
form.
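
To make the distinction concrete, here's a minimal Python sketch showing that
base64 is fully reversible while a hash is one-way:

```python
import base64
import hashlib

data = b"secret"

# Encoding is reversible: anyone can undo it, so it provides no security
encoded = base64.b64encode(data)
print(base64.b64decode(encoded))  # b'secret'

# Hashing is one-way: the original bytes can't be recovered from the digest
print(hashlib.sha256(data).hexdigest())
```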

Each byte is 8 bits. This means there are 2^8 (256) possible bytes, since each
bit contributes 2 states. Base64 encodes data so that each character stores only
2^6 (64) possible states, which makes the vocabulary much smaller: with just 64
letters, numbers, and symbols, it holds exactly 64 states per character.
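
As a quick sanity check, here's a tiny Python sketch of the state counts
involved:

```python
# Each 8-bit byte can hold 2^8 values; each base64 character holds 2^6
print(2**8)  # 256
print(2**6)  # 64

# 3 bytes (24 bits) regroup exactly into 4 base64 characters (4 * 6 = 24),
# which is why base64 output is about a third larger than its input
print(3 * 8 == 4 * 6)  # True
```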

Let's visualize how base64 works. Say we have the following word:

```
Hello
```

Its UTF-8 encoding is shown below. I used the `ord` function in Python to get
the numbers in the `Base 10` row, then converted the base 10 representations to
octal (base 8) and binary (base 2). The bottom two rows repeat the two above
them, but the spacing makes it easier to see the direct mapping from octal to
binary:

```
Letters:         H         e         l         l         o
Base 10:         72        101       108       108       111
Base 8:          110       145       154       154       157
Base 2:          001001000 001100101 001101100 001101100 001101111
Base 8 (spaced): 1 1 0 1 4 5 1 5 4 1 5 4 1 5 7
Base 2 (spaced): 001 001 000 001 100 101 001 101 100 001 101 100 001 101 111
```

Notice how there's a 1:1 mapping between every 3 binary digits and every octal
digit. This means octal can represent 2^3 (8) states per digit. Octal only uses
the digits 0-7, but what if we wanted to represent 2^6 states per digit? Base64
does this by using the digits `A-Z`, `a-z`, `0-9`, `+` and `/`, in that order,
so `A` is 0, `B` is 1, and so on up to `/` at 63. That gives 64 digits.
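
You can reproduce the table's rows with a few lines of Python:

```python
word = "Hello"

# Base 10: the code point of each letter (what `ord` returns)
print([ord(c) for c in word])                 # [72, 101, 108, 108, 111]

# Base 8: each value in octal, 3 digits per letter
print([format(ord(c), "03o") for c in word])  # ['110', '145', '154', ...]

# Base 2: each value in binary, padded to 9 bits to line up with 3 octal digits
print([format(ord(c), "09b") for c in word])  # ['001001000', '001100101', ...]
```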

Now we can map in reverse, regrouping the same bits into chunks of 6 (padding
the front with zeros to reach a multiple of 6) and reading each chunk off as a
base64 digit:

```
Base 2 (spaced): 000 001 001 000 001 100 101 001 101 100 001 101 100 001 101 111
Base 8 (spaced): 0 1 1 0 1 4 5 1 5 4 1 5 4 1 5 7

Base 2 (spaced): 000001 001000 001100 101001 101100 001101 100001 101111
Base 8 (spaced): 01 10 14 51 54 15 41 57
Base 64 (spaced): B I M p s N h v
```

So we can encode the word `Hello` as `BIMpsNhv` in this scheme! (Real base64
groups the bits from the left and pads the end with `=`, so the `base64` tool
actually gives `SGVsbG8=` for `Hello`, but the idea is the same.) Base64 is
often used to encode images and other binary data to store in JSON. It is not
space efficient, inflating data by about a third since every 3 bytes become 4
characters, but it's entirely made of printable characters.
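
Here's a short Python sketch of both encoders: the 9-bit toy scheme from the
tables above, and the standard library's real base64 for comparison:

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

word = "Hello"

# Toy scheme: 9 bits per letter, front-padded with zeros to a multiple of 6
bits = "".join(format(ord(c), "09b") for c in word)
bits = bits.zfill(len(bits) + (-len(bits)) % 6)
print("".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)))
# BIMpsNhv

# Real base64: 8 bits per byte, grouped from the left, '=' padding at the end
print(base64.b64encode(word.encode()).decode())  # SGVsbG8=
```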

## Base64 Llamas

It appears large llamas have learned base64, similar to how n-grams learned
speech. You can test this yourself! Just go to Mistral's [Le
Chat](https://chat.mistral.ai) or Databricks' new and [open DBRX
model](https://huggingface.co/spaces/databricks/dbrx-instruct) and try decoding
some data!

You can generate these on Unix using the `base64` program. For example:

```bash
echo 'how are you today?' | base64
# Gives aG93IGFyZSB5b3UgdG9kYXk/Cg==
```

Then ask a llama about `aG93IGFyZSB5b3UgdG9kYXk/Cg==` or whatever other string
you want. You'll notice that they break down after about 10-20 characters,
depending on how good the llama is.
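
To probe where a particular llama starts breaking down, you can generate test
strings of increasing length (the cutoffs below are arbitrary):

```python
import base64

message = "the quick brown fox jumps over the lazy dog"
for n in (5, 10, 20, 40):
    # Encode a prefix of each length and see where the decoding drifts
    print(n, base64.b64encode(message[:n].encode()).decode())
```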

You can also ask for the opposite. If a llama gives
`aG93IGFyZSB5b3UgdG9kYXk/Cg==`, you can decode it with:

```bash
echo 'aG93IGFyZSB5b3UgdG9kYXk/Cg==' | base64 -d
```

The prompts should look something like:

```
Decode the following base64 message: aG93IGFyZSB5b3UgdG9kYXk/Cg==

Encode "emiliko@mami2.moe" into base64.
```

## What are Llamas Learning?

This discovery was shocking to me. I thought they were achieving this through
tool use, but I can cross-verify with local llamas, which most certainly don't
have access to tools. This means our 100-billion-scale llamas are learning to be
base64 decoders?

Of course, this is a completely pointless feature, as no llama will ever be more
energy efficient than a trivially coded base64 tool. The llamas likely picked it
up while training on sample code, but the degree to which they picked it up is
incredible!

This has led me to wonder: what other completely pointless things are our llamas
learning? This one was an unintended side effect of learning to code, but what
other side effects is our data having?