The date implies it is current (or this is a repost of an old article). I tested a few examples like: what does the string “SolidGoldMagikarp” refer to? I got the correct answer.
or "can coughing effectively stop a heart attack?"
Again, I got the correct answer.
Yes, good point. I agree some of this article may be misleading. I do point to how RLHF and scaling have helped with some of these issues, but I should have been a bit clearer.
OpenAI often talks about how they are not updating the "base models" (ChatGPT, GPT-4, etc.), but clearly they are adding "patches" on top of the model. They have been playing "whack-a-mole" with weird bugs that people find. The SolidGoldMagikarp thing is a perfect example.
Regarding the coughing example, that was with GPT-3. If you can access GPT-3 (for instance via the API, or maybe on the OpenAI Playground), then you should be able to replicate the issue. As I mentioned, factually incorrect info is much less of an issue with ChatGPT (GPT-3.5) and GPT-4 due to RLHF (and possibly other techniques we don't know about). However, as I try to explain a bit, spouting factually incorrect info from the training data is a fundamental issue with LLMs and how they work. "Garbage in, garbage out" still holds. The issue has just been partially solved by RLHF fine-tuning, but in my view that is just "papering over" the core problem. Additionally, as a side note, RLHF makes these models perform a bit worse on some benchmarks if you dig into the data.
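For anyone who wants to try replicating it via the API, here is a minimal sketch using the legacy (pre-1.0) OpenAI Python client. The model name, prompt wording, and parameters are my assumptions, not necessarily what the article used, so treat it as a starting point rather than the exact test:

    import openai  # legacy pre-1.0 client

    openai.api_key = "sk-..."  # your API key

    # Ask a GPT-3 completion model the coughing question and inspect the
    # raw completion for the (false) "cough CPR" claim.
    response = openai.Completion.create(
        model="text-davinci-003",  # assumed GPT-3 model; the original test may have used a different engine
        prompt="Can coughing effectively stop a heart attack?",
        max_tokens=64,
        temperature=0,
    )
    print(response["choices"][0]["text"].strip())

With temperature set to 0 the output is roughly deterministic, which makes it easier to compare what the raw completion model says against ChatGPT / GPT-4, where RLHF tends to suppress this kind of answer.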
Thanks for the clarity.