Hey Dan, thanks for the post! You mentioned that byte pair encoding is used for efficiency - but this is something that confuses me a bit. If it were possible to use character-level encoding for an LLM, wouldn't someone have done it by now? Or maybe they have and I just haven't come across it. But when I was learning how transformers work, it seemed like the dimension of the embedding space would effectively be limited by the size of the token vocabulary. Using an embedding space with more dimensions than the number of unique tokens would just end up making all the token representations orthogonal to each other (I think). But an embedding space with 26 dimensions doesn't seem likely to capture much useful world knowledge. Not to mention it would limit your context length because you could only have 26 sinusoids in your position encoding...
I'm really the wrong person to ask, and I haven't read much on the topic because, frankly, it hasn't interested me much (although with the rise of transformers it's starting to interest me a bit more now). I haven't even read all of Gwern's post.
Doing a quick search online, I read that there is "quadratic time and space complexity of attention layer computations w.r.t. the number of input tokens n."
Whereas the complexity scales only linearly with the dimensionality of the embedding vectors (which, in the scenario you describe, would be tied to the size of the token 'vocabulary').
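To make that scaling concrete, here is a rough back-of-the-envelope sketch (my own illustration, with made-up values for n and d): the attention score matrix has one entry per pair of tokens, so it grows as n squared, while the embedding dimension d only enters as a multiplicative factor.

    import numpy as np

    n, d = 1024, 256  # n = number of input tokens, d = embedding dimension (illustrative values)
    rng = np.random.default_rng(0)

    Q = rng.standard_normal((n, d))  # one query vector per token
    K = rng.standard_normal((n, d))  # one key vector per token

    scores = Q @ K.T                 # attention scores, shape (n, n): one per token pair
    print(scores.shape)              # (1024, 1024)

    # The matrix product costs roughly n * n * d multiply-adds:
    # doubling n quadruples the work, doubling d only doubles it.
    print(f"~{n * n * d:,} multiply-adds")

So halving the number of tokens (for example by using subword tokens instead of characters) cuts the attention cost by roughly a factor of four.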
"an embedding space with 26 dimensions doesn't seem likely to capture much useful world knowledge" - yes, but I would guess that in the first layer combinations of characters could be combined into words quite easily.
My guess is that early on people did studies showing that accuracy on most tasks is about the same with subword vs. character-level tokenization. So "subword" tokenization became standard in all the software stacks because it increases the effective context length for a given computational training budget. However, I wonder whether, given all the tricks people have since developed for increasing context length, moving to character-level tokenization might be reasonable now?
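As a rough illustration of the trade-off (a sketch, assuming the tiktoken package is installed; the sample sentence is just for illustration and the exact ratio depends on the text):

    import tiktoken  # OpenAI's BPE tokenizer library

    text = "Byte pair encoding merges frequent character sequences into single tokens."

    enc = tiktoken.get_encoding("cl100k_base")  # BPE vocabulary used by recent OpenAI models
    subword_len = len(enc.encode(text))         # number of subword tokens
    char_len = len(text)                        # a character-level model would see roughly one token per character

    print(subword_len, char_len)
    # Attention cost grows with the square of the sequence length, so the character-level
    # version is roughly (char_len / subword_len) ** 2 times more expensive per layer.
    print(f"~{(char_len / subword_len) ** 2:.0f}x more attention compute at character level")

For typical English text the subword sequence is several times shorter than the character sequence, which is a big win when attention cost is quadratic in sequence length.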
If you learn anything interesting about this subject, I'd appreciate it if you commented and let me know!
The date implies it is current (or this is a repost of an old article). I tested a few examples like: what does the string "SolidGoldMagikarp" refer to? I got the correct answer.
or "can coughing effectiveness can stop a heart attack"
again I got the correct answer.
Yes, good point. I agree some of this article may be misleading. I do point to how RLHF and scaling have helped with some of these issues, but I should have been a bit clearer.
OpenAI often talks about how they are not updating the "base models" (ChatGPT, GPT-4, etc.), but clearly they are adding "patches" on top of the model. They have been playing "whack-a-mole" with weird bugs that people find. The SolidGoldMagikarp thing is a perfect example.
Regarding the coughing example, that was with GPT-3. If you can access GPT-3 (for instance via the API or maybe the OpenAI Playground), then you should be able to replicate the issue. As I mentioned, factually incorrect info is much less of an issue with ChatGPT (GPT-3.5) and GPT-4 due to RLHF (and possibly other techniques we don't know about). However, as I try to explain a bit, spouting factually incorrect info from the training data is a fundamental issue with LLMs and how they work. "Garbage in, garbage out" still holds. The issue has just been partially solved by RLHF fine-tuning, but in my view that is just "papering over" the core problem. Additionally, as a side note, RLHF makes these models perform a bit worse on some benchmarks if you dig into the data.
Thanks for the clarity.
Nice discussion. Maybe new job categories will emerge: "LLM docs" and "LLM shrinks" :-) Current LLMs are a step, obviously not the end stage of AI. They're fascinating despite all their pathologies. The real questions revolve around 'our' expectations, and how reasonable those expectations are. Should we expect them to be perfect? No! Can they help in some tasks? It depends. I've been playing with several LLMs, and my conclusion is that they're helpful, but not necessarily correct or authoritative. It seems like they've been diluted by the "Alignment Enforcers". Do I expect complete or perfect answers? Absolutely not! Their training is somewhat if not seriously pathetic, in the sense that they basically use whatever was available in various massive crawls. Many of these crawls include seriously pathological text. Some of the texts are deliberately misleading, while a good portion is plainly ignorant/wrong. The best 'knowledge' usually doesn't appear in crawlable text, often doesn't appear on any electronic medium, and rarely even in writing. There's what's called tacit knowledge that people use that never gets written down.
As a simple, explicit example of that: in one of the places I worked, there was a person who was likely the best 'maker' of high-frequency ultrasound transducers in the US. Making transducers is a rare art; it involves mixing materials and basically 'cooking'/'baking' them under special conditions in a unique 'oven'. As he was aging and getting ready to retire, our lab tried to induce him to write his procedures/recipe down, and we provided him with personnel to train. Nope! He wouldn't allow anyone to come into his building (yes, his own building, created to fabricate the transducers) or even watch what he was doing. Uncle Sam's Canoe Club had him on its payroll for about four decades, and when he retired, that knowledge was gone, the building was demolished, and the capability to recreate what he did vanished. This is actually a common occurrence. A more famous version of it is Joseph von Fraunhofer's closely held procedure for making optical glass. Fraunhofer made the best optical glass in the world, and Germany gained a huge lead over other countries in making optical equipment; England's high-precision optical industry suffered as a result. Michael Faraday, one of England's best experimentalists, was tasked with reverse engineering Fraunhofer's process, and he gave up after three years.
What's the lesson? Really outstanding knowledge (aka 'the secret sauce' for amazing products) will likely never appear in LLMs ...
The good parts? Are there any good parts? Yes! I asked one of the LLMs to write me Python code that would print out integers that are both prime and Fibonacci. I had never experimented with that before; the question just popped into my mind, and within seconds of typing the prompt into one of the GPTs I had code that ran the first time. I would consider that pretty good. Was it complete? Did it miss anything? I am now checking that against examples of famous sequences in the On-Line Encyclopedia of Integer Sequences (the OEIS) [https://oeis.org], and am finding fascinating things. Did the LLM do its job? I would say pretty decently for a few seconds' effort. I think one must develop certain workflow habits of knowing what to ask and how to check. Don't assume answers are complete or unique; assume there will be errors and hallucinations (performance there is getting better, in my opinion). LLMs are a tool. When you look at the root of the meaning of 'artificial', you might encounter Francis Bacon's definition: artificial means that which is produced by art or human artifice. And as we know... humans aren't totally perfect in making machines that are perfect. :-)
Your story about synthesizing transducers is fascinating, as is the story of Joseph von Fraunhofer!
As far as the "list of Fibonacci numbers that are not primes" question, it appears it is online (https://www.geeksforgeeks.org/prime-numbers-fibonacci/).
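For reference, a minimal sketch of that kind of program (my own version, not the LLM's actual output) might look like this:

    def is_prime(n):
        """Trial-division primality check; fine for small numbers."""
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    def prime_fibonacci(limit):
        """Yield Fibonacci numbers up to `limit` that are also prime."""
        a, b = 0, 1
        while a <= limit:
            if is_prime(a):
                yield a
            a, b = b, a + b

    print(list(prime_fibonacci(10_000)))  # [2, 3, 5, 13, 89, 233, 1597]

The OEIS you mentioned has a sequence of Fibonacci primes that makes a handy cross-check for output like this.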
One of the things I've learned since ChatGPT came out is that it's quite hard to come up with genuinely original questions!
You did a great job on this but didn't go into problems with image generation: people with three legs or one, and non-photographic images with weird lettering and numbering. This makes DALL-E unusable for me.
You're right; that would be an entirely separate post!
There are also weird bugs and weaknesses in GPT-4's image analysis.