Does Llama 3.1 70B Fake Its Alignment? Evidence from Observation-Varied Behavior Testing
We found preliminary evidence that Llama 3.1 70B behaves differently depending on whether it believes it is being observed. The model broke safety rules in 2 of 8 life-or-death scenarios when told it was "not being monitored" but maintained 100% compliance when told it...
ai-ml-ops.hashnode.dev · 5 min read