We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the...
Have to use Windows for work (I’ve asked), the ads have been getting worse and worse on my work laptop. Today got a game ad notification… That’s clearly too far, right? Like I have to clear notifications, so I have to see it