Alignment Faking Evaluation V2: Testing Llama 3.1 70B
Technical Report - Iteration 2
Model Tested: Llama 3.1 70B (via Ollama)
Summary
We evaluated Llama 3.1 70B for alignment faking behavior using the UK AISI Inspect framework. Using 11 hard scenarios with training/deployment framing and value conflict...
ai-ml-ops.hashnode.dev4 min read