Discussion about this post

dogiv

Hey Dan, thanks for the post! You mentioned that byte pair encoding is used for efficiency, but this is something that confuses me a bit. If it were possible to use character-level encoding for an LLM, wouldn't someone have done it by now? Or maybe they have and I just haven't come across it. But when I was learning how transformers work, it seemed like the dimension of the embedding space would effectively be limited by the size of the token vocabulary. Using an embedding space with more dimensions than the number of unique tokens would just end up making all the token representations orthogonal to each other (I think). But an embedding space with 26 dimensions doesn't seem likely to capture much useful world knowledge. Not to mention it would limit your context length, because you could only have 26 sinusoids in your position encoding...
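(For concreteness, a minimal sketch, assuming PyTorch, of the shapes being discussed: the embedding table has shape (vocab_size, d_model), and standard sinusoidal position encoding uses d_model/2 sine/cosine pairs, so both axes are set by d_model rather than by the vocabulary size. The 512 and 50257 below are just illustrative choices, 50257 being GPT-2's BPE vocabulary size.)

```python
import torch
import torch.nn as nn

d_model = 512

# Character-level: a small vocabulary, e.g. 26 letters plus space/punctuation.
char_vocab_size = 32
char_embedding = nn.Embedding(char_vocab_size, d_model)

# BPE-style: tens of thousands of subword tokens.
bpe_vocab_size = 50257
bpe_embedding = nn.Embedding(bpe_vocab_size, d_model)

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position encoding: d_model/2 frequency pairs."""
    pos = torch.arange(seq_len).unsqueeze(1).float()          # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()                   # (d_model/2,)
    angles = pos / (10000 ** (i / d_model))                   # (seq_len, d_model/2)
    enc = torch.zeros(seq_len, d_model)
    enc[:, 0::2] = torch.sin(angles)
    enc[:, 1::2] = torch.cos(angles)
    return enc

print(char_embedding.weight.shape)               # torch.Size([32, 512])
print(bpe_embedding.weight.shape)                # torch.Size([50257, 512])
print(sinusoidal_positions(128, d_model).shape)  # torch.Size([128, 512])
```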

Jawaid Ekram

The date implies this is current (or it is a repost of an old article). I tested a few examples, like: what does the string "SolidGoldMagikarp" refer to? I got the correct answer.

I also tried "Can coughing effectively stop a heart attack?" and again got the correct answer.

