Favorites - Reward models (aka reinforcement learning) and preference optimization models...

model_tar_gz , 1 month ago (edited 1 month ago)

Reward models (aka reinforcement learning) and preference optimization models can come to some conclusions that we humans find very strange when they learn from patterns in the data they’re trained on. Especially when those incentives and preferences are evaluated (or generated) by other models. Some of these models could very well could come to the conclusion that nuking every advanced-tech human civilization is the optimal way to improve the human species because we have such rampant racism, classism, nationalism, and every other schism that perpetuates us treating each other as enemies to be destroyed and exploited.

Sure, we will build ethical guard rails. And we will proclaim to have human-in-the-loop decision agents, but we’re building towards autonomy and edge/corner-cases always exist in any framework you constrain a system to.

I’m an AI Engineer working in autonomous agentic systems—these are things we (as an industry) are talking about—but to be quite frank, there are not robust solutions to this yet. There may never be. Think about raising a teenager—one that is driven strictly by logic, probabilistic optimization, and outcome incentive optimization.

It’s a tough problem. The naive-trivial solution that’s also impossible is to simply halt and ban all AI development. Turing opened Pandora’s box before any of our time.

Reply

Report

Activity

Open original URL

Copy original URL

Copy Mbin URL

Loading...

clay_pidgin 1 month ago
tal 1 month ago
TheChargedCreeper864 1 month ago