diff --git a/src/content/llama/the-secret-learnings-of-llamas.md b/src/content/llama/the-secret-learnings-of-llamas.md
new file mode 100644
index 0000000..0d10d2f
--- /dev/null
+++ b/src/content/llama/the-secret-learnings-of-llamas.md
@@ -0,0 +1,119 @@
---
title: 'The Secret Learnings of Llamas'
description: 'The useless things our llamas are learning'
updateDate: 'Mar 28 2024'
heroText: 'Base64 LLama'
---

# The Secret Learnings of Llamas

Tool use by llamas is an active area of research. Recent implementations like
Devin promise large productivity gains through tool use. I was investigating
tool use by some modern llamas when I made an unfortunate discovery.

It appears most large llamas have learned a new language in addition to the
ones that were intended: base64.

### Base64 Background

Base64 is a simple encoding scheme. It is different from encryption and
hashing, which provide security guarantees; base64 just transforms data into a
portable form.

Each byte is 8 bits, and since each bit contributes 2 states, there are 2^8
(256) possible bytes. Base64 regroups data into chunks of 6 bits, so each
output character only needs to represent 2^6 (64) possible states. This makes
the vocabulary much smaller: with just 64 letters, digits, and symbols, one
printable character can hold 64 states.

Let's visualize how base64 works. Say we have the following word:

```
Hello
```

Its UTF-8 encoding is shown below. I used the `ord` function in Python to get
the numbers in the `Base 10` row, then converted the base 10 representations to
octal (base 8) and binary (base 2). The binary is padded to 9 bits so it lines
up with 3 octal digits. The bottom two rows repeat the middle two, but the
spacing makes it easier to see the direct mapping from octal to binary:

```
Letters:         H           e           l           l           o
Base 10:         72          101         108         108         111
Base 8:          110         145         154         154         157
Base 2:          001001000   001100101   001101100   001101100   001101111
Base 8 (spaced): 1   1   0   1   4   5   1   5   4   1   5   4   1   5   7
Base 2 (spaced): 001 001 000 001 100 101 001 101 100 001 101 100 001 101 111
```

Notice how there's a 1:1 mapping between every 3 binary digits and each octal
digit. This means octal can represent 2^3 (8) states per digit. Octal only uses
the digits 0-7, but what if we wanted to represent 2^6 states per digit? Base64
does this with the digits `A-Z`, `a-z`, `0-9`, `+`, and `/`, in that order, so
`A` is 0 and `/` is 63. That gives 64 digits.

Now we can do the same regrouping with 6-bit chunks instead of 3-bit ones.
Base64 works on the actual 8-bit bytes (40 bits for `Hello`), splits them into
groups of 6, and pads the final group with zero bits:

```
Base 2 (bytes): 01001000 01100101 01101100 01101100 01101111
Base 2 (6-bit): 010010 000110 010101 101100 011011 000110 111100
Base 10:        18     6      21     44     27     6      60
Base 64:        S      G      V      s      b      G      8
```

The last group only had 4 real bits, so two zero bits are added, and `=`
characters are appended until the output length is a multiple of 4.

So we can encode the word `Hello` as `SGVsbG8=` in base64! Base64 is often used
to encode images and other binary data to store in JSON. It is not space
efficient, every 3 bytes of input become 4 characters of output, but it's made
entirely of printable characters.

## Base64 Llamas

It appears large llamas have learned base64, similar to how n-grams learned
speech. You can test this yourself! Just go to Mistral's [Le
Chat](https://chat.mistral.ai) or Databricks' new and [open DBRX
model](https://huggingface.co/spaces/databricks/dbrx-instruct) and try decoding
some data!

You can generate these strings on Unix using the `base64` program. For example:

```bash
echo 'how are you today?' | base64
# Gives aG93IGFyZSB5b3UgdG9kYXk/Cg==
```

Then ask a llama about `aG93IGFyZSB5b3UgdG9kYXk/Cg==` or whatever other string
you want.
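As a sanity check on the walkthrough above, here is a minimal Python sketch
that does the same regrouping by hand and compares the result against the
standard library's `base64` module. The helper name `manual_b64encode` is just
for illustration, not anything the llamas were trained on:

```python
import base64

# The standard base64 alphabet: A-Z, a-z, 0-9, +, / (values 0 through 63).
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def manual_b64encode(data: bytes) -> str:
    """Regroup 8-bit bytes into 6-bit chunks, as in the Hello walkthrough."""
    # Write every byte out as 8 binary digits and concatenate them.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Pad with zero bits until the length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    # Look up each 6-bit group in the 64-digit alphabet.
    chars = "".join(
        ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)
    )
    # Append '=' until the output length is a multiple of 4.
    return chars + "=" * (-len(chars) % 4)

print(manual_b64encode(b"Hello"))                  # SGVsbG8=
print(base64.b64encode(b"Hello").decode("ascii"))  # SGVsbG8=
```

Both lines print `SGVsbG8=`. You can also run the encoder over longer and
longer prefixes of a message to find the exact length where a given llama
starts to slip.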
You'll notice that the llamas break down after about 10-20 characters,
depending on how good the llama is.

You can also ask for the opposite. If a llama gives you
`aG93IGFyZSB5b3UgdG9kYXk/Cg==`, you can decode it with:

```bash
echo 'aG93IGFyZSB5b3UgdG9kYXk/Cg==' | base64 -d
```

The prompts should look something like:

```
Decode the following base64 message: aG93IGFyZSB5b3UgdG9kYXk/Cg==

Encode "emiliko@mami2.moe" into base64.
```

## What are Llamas Learning?

This discovery was shocking to me. I thought the llamas were achieving this
through tool use, but I can cross-verify on local llamas, which most certainly
don't have access to tools. This means our 100-billion-parameter llamas are
learning to be base64 decoders?

Of course, this is a completely pointless feature, as no llama will ever be
more energy efficient than a trivially coded base64 tool. The llamas likely
picked it up while learning from sample code, but the degree to which they
picked it up is incredible!

This has led me to wonder: what other completely pointless things are our
llamas learning? This one was an unintended side effect of learning to code,
but what other side effects is our data having?