Llama: add base64 llama blog

src/content/llama/the-secret-learnings-of-llamas.md

---
title: 'The Secret Learnings of Llamas'
description: 'The useless things our llamas are learning'
updateDate: 'Mar 28 2024'
heroText: 'Base64 LLama'
---

# The Secret Learnings of Llamas

Tool use by llamas is an active area of research. Recent implementations like
Devin promise great productivity increases through tool use. I was investigating
tool use by some modern llamas when I made an unfortunate discovery.

It appears most large llamas have learned a new language, in addition to the
ones that were intended: base64.

### Base64 Background

Base64 is a simple encoding scheme. This is different from encryption and
hashing: those provide security, while base64 just transforms data into a
portable form.

Each byte is 8 bits. This means there are 2^8 (256) possible bytes, since each
bit contributes 2 states. Base64 instead encodes data so that each character
only carries 2^6 (64) possible states. This makes the vocabulary much smaller:
just 64 letters, numbers, and symbols are enough to hold 64 states per
character.

Let's visualize how base64 works. Say we have the following word:

```
Hello
```

Its UTF-8 encoding is shown below. I used the `ord` function in Python to get
the numbers in the `Base 10` row. I then converted the base 10 representations
to octal (base 8) and binary (base 2). The bottom two rows repeat the octal and
binary rows, but the spacing makes it easier to see the direct mapping from
octal to binary (each value is padded to 9 bits so it lines up with three octal
digits):

```
Letters:         H           e           l           l           o
Base 10:         72          101         108         108         111
Base 8:          110         145         154         154         157
Base 2:          001001000   001100101   001101100   001101100   001101111
Base 8 (spaced): 1 1 0       1 4 5       1 5 4       1 5 4       1 5 7
Base 2 (spaced): 001 001 000 001 100 101 001 101 100 001 101 100 001 101 111
```

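To reproduce these rows yourself, here's a small Python sketch (my own, using
the standard `ord` and `format` built-ins):

```python
# Print the Base 10, Base 8, and Base 2 rows for each letter.
for ch in "Hello":
    n = ord(ch)  # code point, e.g. H -> 72
    print(ch, n, format(n, "03o"), format(n, "09b"))
# H 72 110 001001000
# e 101 145 001100101
# l 108 154 001101100
# l 108 154 001101100
# o 111 157 001101111
```
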
Notice how every 3 binary digits map 1:1 onto a single octal digit. This means
octal can represent 2^3 (8) states per digit. Octal only uses the digits 0-7,
but what if we wanted to represent 2^6 states per digit? Base64 does this by
using `A-Z`, `a-z`, `0-9`, `+`, and `/` as its digits, in that order of value.
That gives 64 digits.

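As a quick sanity check on that count (a sketch of mine using Python's standard
`string` module):

```python
import string

# A-Z (values 0-25), a-z (26-51), 0-9 (52-61), then + (62) and / (63).
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
assert len(alphabet) == 64
```
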
Now we can map in reverse. Base64 operates on the raw bytes, which are 8 bits
each (the 9-bit forms above were only padded to line up with octal digits).
Concatenate the bytes into one bit stream, split it into 6-bit groups, and read
each group as a base64 digit; conveniently, each 6-bit group is exactly two
octal digits. `Hello` is 5 bytes (40 bits), which doesn't split evenly into
6-bit groups, so the final group is padded with two zero bits, and a trailing
`=` in the output records that padding:

```
Base 2 (bytes):  01001000 01100101 01101100 01101100 01101111
Base 2 (6-bit):  010010 000110 010101 101100 011011 000110 111100
Base 8 (spaced): 22     06     25     54     33     06     74
Base 64:         S      G      V      s      b      G      8
```

So we can encode the word `Hello` as `SGVsbG8=` in base64! You can verify this
with `echo -n Hello | base64`. Base64 is often used to encode images and other
binary data for storage in JSON. It is not space efficient, inflating data by
about a third (4 output characters per 3 input bytes), but the result is made
entirely of printable characters.

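Here's the same procedure as a short Python sketch; the `manual_b64` helper is
my own illustration of the steps above, checked against the standard `base64`
module:

```python
import base64
import string

# Base64's digit alphabet, in value order.
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def manual_b64(data: bytes) -> str:
    # Concatenate the bytes into one bit string...
    bits = "".join(format(b, "08b") for b in data)
    # ...pad with zero bits up to a multiple of 6...
    bits += "0" * (-len(bits) % 6)
    # ...and read each 6-bit group as a base64 digit.
    out = "".join(alphabet[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # '=' pads the output to a multiple of 4 characters.
    return out + "=" * (-len(out) % 4)

print(manual_b64(b"Hello"))                        # SGVsbG8=
print(base64.b64encode(b"Hello").decode("ascii"))  # SGVsbG8=
```
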
## Base64 Llamas

It appears large llamas have learned base64, similar to how n-grams learned
speech. You can test this yourself! Just go to Mistral's [Le
Chat](https://chat.mistral.ai) or Databricks' new and [open DBRX
model](https://huggingface.co/spaces/databricks/dbrx-instruct) and try decoding
some data!

You can generate these on Unix using the `base64` program. For example:

```bash
# Note: echo appends a trailing newline, which gets encoded too.
echo 'how are you today?' | base64
# Gives aG93IGFyZSB5b3UgdG9kYXk/Cg==
```

Then ask a llama about `aG93IGFyZSB5b3UgdG9kYXk/Cg==` or whatever other string
you want. You'll notice that they break down after about 10-20 characters,
depending on how good the llama is.

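If you want test strings of varying length, a small Python sketch like this
works (the message and step size are arbitrary choices of mine):

```python
import base64

# Encode longer and longer prefixes of a message to find the point
# where a model's base64 decoding starts to break down.
message = "how are you today? I hope the weather is nice."
for n in range(4, len(message) + 1, 8):
    encoded = base64.b64encode(message[:n].encode("utf-8")).decode("ascii")
    print(f"{n:3d} chars: {encoded}")
```
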
You could also ask for the opposite. If a llama gives you
`aG93IGFyZSB5b3UgdG9kYXk/Cg==`, you can decode it with:

```bash
echo 'aG93IGFyZSB5b3UgdG9kYXk/Cg==' | base64 -d
```

The prompts should look something like:

```
Decode the following base64 message: aG93IGFyZSB5b3UgdG9kYXk/Cg==

Encode "emiliko@mami2.moe" into base64.
```

## What are Llamas Learning?

This discovery was shocking to me. I thought they were achieving this through
tool use, but I can cross-verify on local llamas, which most certainly don't
have access to tools. This means our 100-billion-scale llamas are learning to
be base64 decoders?

Of course, this is a completely pointless feature, as no llama will ever be
more energy efficient than a trivially coded base64 tool. The llamas likely
picked it up while training on sample code, but the degree to which they picked
it up is incredible!

This has led me to wonder: what other completely pointless things are our
llamas learning? This one was an unintended side effect of learning to code,
but what other side effects is our data having?