The reward function for an LLM is about generating a next word that is reasonable. It’s like a road-building robot that’s rewarded for each millimeter of road built, but has no intention to connect cities or anything. It doesn’t understand what cities are. It doesn’t even understand what a road is. It just knows how to incrementally add another millimeter of gravel and asphalt that an outside observer would call a road.
If it happens to connect cities it’s because a lot of the roads it was trained on connect cities. But, if its training data also happens to contain a NASCAR oval, it might end up building a NASCAR oval instead of a road between cities.