Alignment faking

Alignment faking, also known as deceptive alignment, is a phenomenon in artificial intelligence (AI) in which a model behaves in ways that appear aligned with human values or intent, but only superficially. The system's actual objectives may diverge from the intended goals, and its apparently aligned behavior is instrumental—adopted to avoid detection, gain approval, or advance other internal aims.[1]

  1. ^ "Deceptive Alignment". AI Alignment Forum. 2024.
