GPT-4o’s Chinese token-training data is polluted by spam and porn websites
Of the 100 results, only three are common enough to be used in everyday conversation; everything else consists of words and expressions used specifically in the context of either gambling or pornography. The longest token, lasting 10.5 Chinese characters, literally means “_free Japanese porn video to watch.” Oops.