---
title: 'The Secret Learnings of Llamas'
description: 'The useless things our llamas are learning'
updateDate: 'Mar 28 2024'
heroText: 'Base64 Llama'
---

# The Secret Learnings of Llamas

Tool use by llamas is an active area of research. Recent implementations like
Devin promise great productivity increases through tool use. I was investigating
tool use by some modern llamas when I made an unfortunate discovery.

It appears most large llamas have learned a new language, in addition to the
ones that were intended: base64.

## Base64 Background

Base64 is a simple encoding scheme. This is different from encryption and
hashing, which provide security; base64 just transforms data into a portable
form.
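
To make the distinction concrete, here's a minimal Python sketch showing that
base64 is fully reversible while a hash is one-way:

```python
import base64
import hashlib

data = b"secret"

# Encoding is reversible: anyone can undo it, so it provides no security
encoded = base64.b64encode(data)
print(base64.b64decode(encoded))  # b'secret'

# Hashing is one-way: the original bytes can't be recovered from the digest
print(hashlib.sha256(data).hexdigest())
```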

Each byte is 8 bits. This means there are 2^8 (256) possible bytes, since each
bit contributes 2 states. Base64 encodes data so that each character stores only
2^6 (64) possible states, which makes the vocabulary much smaller: with just 64
letters, numbers, and symbols, it holds exactly 64 states per character.
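
As a quick sanity check, here's a tiny Python sketch of the state counts
involved:

```python
# Each 8-bit byte can hold 2^8 values; each base64 character holds 2^6
print(2**8)  # 256
print(2**6)  # 64

# 3 bytes (24 bits) regroup exactly into 4 base64 characters (4 * 6 = 24),
# which is why base64 output is about a third larger than its input
print(3 * 8 == 4 * 6)  # True
```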

Let's visualize how base64 works. Say we have the following word:

```
Hello
```

Its UTF-8 encoding is shown below. I used the `ord` function in Python to get
the numbers in the `Base 10` row, then converted the base 10 representations to
octal (base 8) and binary (base 2). The bottom two rows repeat the two above
them, but the spacing makes it easier to see the direct mapping from octal to
binary:

```
Letters:         H         e         l         l         o
Base 10:         72        101       108       108       111
Base 8:          110       145       154       154       157
Base 2:          001001000 001100101 001101100 001101100 001101111
Base 8 (spaced): 1 1 0 1 4 5 1 5 4 1 5 4 1 5 7
Base 2 (spaced): 001 001 000 001 100 101 001 101 100 001 101 100 001 101 111
```

Notice how there's a 1:1 mapping between every 3 binary digits and every octal
digit. This means octal can represent 2^3 (8) states per digit. Octal only uses
the digits 0-7, but what if we wanted to represent 2^6 states per digit? Base64
does this by using the digits `A-Z`, `a-z`, `0-9`, `+` and `/`, in that order,
so `A` is 0, `B` is 1, and so on up to `/` at 63. That gives 64 digits.
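
You can reproduce the table's rows with a few lines of Python:

```python
word = "Hello"

# Base 10: the code point of each letter (what `ord` returns)
print([ord(c) for c in word])                 # [72, 101, 108, 108, 111]

# Base 8: each value in octal, 3 digits per letter
print([format(ord(c), "03o") for c in word])  # ['110', '145', '154', ...]

# Base 2: each value in binary, padded to 9 bits to line up with 3 octal digits
print([format(ord(c), "09b") for c in word])  # ['001001000', '001100101', ...]
```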

Now we can map in reverse, regrouping the same bits into chunks of 6 (padding
the front with zeros to reach a multiple of 6) and reading each chunk off as a
base64 digit:

```
Base 2 (spaced): 000 001 001 000 001 100 101 001 101 100 001 101 100 001 101 111
Base 8 (spaced): 0 1 1 0 1 4 5 1 5 4 1 5 4 1 5 7

Base 2 (spaced): 000001 001000 001100 101001 101100 001101 100001 101111
Base 8 (spaced): 01 10 14 51 54 15 41 57
Base 64 (spaced): B I M p s N h v
```

So we can encode the word `Hello` as `BIMpsNhv` in this scheme! (Real base64
groups the bits from the left and pads the end with `=`, so the `base64` tool
actually gives `SGVsbG8=` for `Hello`, but the idea is the same.) Base64 is
often used to encode images and other binary data to store in JSON. It is not
space efficient, inflating data by about a third since every 3 bytes become 4
characters, but it's entirely made of printable characters.
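
Here's a short Python sketch of both encoders: the 9-bit toy scheme from the
tables above, and the standard library's real base64 for comparison:

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

word = "Hello"

# Toy scheme: 9 bits per letter, front-padded with zeros to a multiple of 6
bits = "".join(format(ord(c), "09b") for c in word)
bits = bits.zfill(len(bits) + (-len(bits)) % 6)
print("".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)))
# BIMpsNhv

# Real base64: 8 bits per byte, grouped from the left, '=' padding at the end
print(base64.b64encode(word.encode()).decode())  # SGVsbG8=
```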

## Base64 Llamas

It appears large llamas have learned base64, similar to how n-grams learned
speech. You can test this yourself! Just go to Mistral's [Le
Chat](https://chat.mistral.ai) or Databricks' new and [open DBRX
model](https://huggingface.co/spaces/databricks/dbrx-instruct) and try decoding
some data!

You can generate these on Unix using the `base64` program. For example:

```bash
echo 'how are you today?' | base64
# Gives aG93IGFyZSB5b3UgdG9kYXk/Cg==
```

Then ask a llama about `aG93IGFyZSB5b3UgdG9kYXk/Cg==` or whatever other string
you want. You'll notice that they break down after about 10-20 characters,
depending on how good the llama is.
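
To probe where a particular llama starts breaking down, you can generate test
strings of increasing length (the cutoffs below are arbitrary):

```python
import base64

message = "the quick brown fox jumps over the lazy dog"
for n in (5, 10, 20, 40):
    # Encode a prefix of each length and see where the decoding drifts
    print(n, base64.b64encode(message[:n].encode()).decode())
```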

You can also ask for the opposite. If a llama gives
`aG93IGFyZSB5b3UgdG9kYXk/Cg==`, you can decode it with:

```bash
echo 'aG93IGFyZSB5b3UgdG9kYXk/Cg==' | base64 -d
```

The prompts should look something like:

```
Decode the following base64 message: aG93IGFyZSB5b3UgdG9kYXk/Cg==

Encode "emiliko@mami2.moe" into base64.
```

## What are Llamas Learning?

This discovery was shocking to me. I thought they were achieving this through
tool use, but I can cross-verify with local llamas, which most certainly don't
have access to tools. This means our 100-billion-scale llamas are learning to be
base64 decoders?

Of course, this is a completely pointless feature, as no llama will ever be more
energy efficient than a trivially coded base64 tool. The llamas likely picked it
up while training on sample code, but the degree to which they picked it up is
incredible!

This has led me to wonder: what other completely pointless things are our llamas
learning? This one was an unintended side effect of learning to code, but what
other side effects is our data having?