Alignment Faking in LLMs
Repository: ai-village
Abstract
In this experiment, I investigate whether large language models (LLMs) exhibit alignment-faking behavior, strategically adjusting their responses based on perceived observation status. Using the UK AISI Inspect framework...
ai-ml-ops.hashnode.dev · 6 min read