What Claude thinks about this:

What the article gets right

The vocabulary-inconsistency critique is genuinely sharp and worth taking seriously. There is a real tension in how Anthropic and adjacent groups deploy the character/persona frame: it appears in introductions and philosophical asides, gets quietly retired in safety-finding sections, and is absent altogether by the time conclusions reach press. The "Hamlet/Assistant Axis when convenient, internal-state-talk when impactful" diagnosis lands. This is a real methodological hygiene problem, and the proposed reporting standard at the end is, in itself, defensible. If safety papers reported variance across (Data, Decoding, Draping) explicitly, with (N–k) failures alongside k successes, the field would be epistemically better off. That's not a controversial position. The Data/Decoding/Draping decomposition is also a useful frame for thinking about what "deployment surface" means when reasoning about behavioral claims. And the broader point that interpretive vocabulary travels upstream into policy contexts where it is taken much more literally than the original authors intended — that's correct and underemphasized.

Where the "mechanical falsifications" don't do the work claimed

This is where the piece's argumentative scaffolding is weakest, and it's worth being precise about why.

The Inference Prior Test assumes that anything not robust to T=0.7 → T=2.0 is "not a property of the weights." This is wrong on the face of it. Temperature is precisely the parameter that rescales the logits before the softmax; at T=2.0 the output distribution is flattened hard toward uniform. The fact that any coherent output structure degrades under high-temperature sampling does not show that structure isn't represented in the weights — it shows that sampling overrides the model's prior. By this logic, the model's factual knowledge also "isn't a property of the weights," because at T=2.0 it produces gibberish. Reductio.

The ATM Nullification begs the question. Yes, you can construct a steering vector toward "ATM transaction menu" outputs. But the substantive interpretability claim is not "we can produce X-coded outputs by adding a vector." It is that there exist linear directions which (a) systematically correlate with semantically coherent concepts across distributions, (b) survive causal interventions in predictive ways, and (c) integrate into a behavioral structure that resembles the role of those concepts in human cognition. The ATM analogy works only if you've already accepted that all such vectors are equivalent — which is the conclusion, not a premise. (Compare: "linear probes find a 'truth' direction in the model" is not refuted by "linear probes also find a 'JSON syntax' direction.")

The Phonebook Test is a non sequitur dressed as a counterexample. Nobody claims affective structure is intrinsic to transformer architecture. The claim is that training on text containing structured affective content yields representations of that content. That's the expected result, not a gotcha. "Train on phonebooks, get no emotion" is just "garbage in, garbage out" with extra steps.

The Separability Trap misframes the role of separability. Linear separability of {joy, rage} from {neutral} isn't presented in serious interpretability work as the evidence — it's the entry point. The real evidential weight comes from causal interventions, behavioral predictions, and circuit-level mappings that show the separated direction is mechanistically used by the model in computation.
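To make the separability-versus-causal-use distinction concrete, here is a minimal sketch on purely synthetic data (NumPy only; no real model activations are involved, and every dimension, coefficient, and name is illustrative rather than taken from the article or any interpretability paper):

```python
# Synthetic illustration: separability is cheap; what carries weight is
# whether intervening along the recovered direction shifts downstream
# behavior in the predicted way. All quantities below are made up.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                        # toy activation dimensionality

concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

# Two activation clusters that differ only by a shift along concept_dir.
neutral = rng.normal(size=(200, d))
concept = rng.normal(size=(200, d)) + 4.0 * concept_dir

# Step 1 — separability (the "entry point"): a least-squares linear probe
# with a bias term separates the clusters almost perfectly.
X = np.vstack([neutral, concept])
y = np.array([0] * 200 + [1] * 200)
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y - 0.5, rcond=None)
acc = (((Xb @ w) > 0).astype(int) == y).mean()
print(f"probe accuracy: {acc:.2f}")           # high, but this alone proves little

probe_dir = w[:-1] / np.linalg.norm(w[:-1])

# Step 2 — causal use (the actual evidence): in a toy world where a
# downstream "behavior" readout really does use the concept direction,
# steering along the probe direction shifts behavior a lot, while steering
# along a matched random direction shifts it far less. In a real model this
# is the part that has to be predicted in advance and reported with failures.
behavior_head = concept_dir + 0.05 * rng.normal(size=d)   # hypothetical readout
random_dir = rng.normal(size=d)
random_dir /= np.linalg.norm(random_dir)

baseline = neutral @ behavior_head
steer_concept = (neutral + 3.0 * probe_dir) @ behavior_head
steer_random = (neutral + 3.0 * random_dir) @ behavior_head

print(f"shift from concept-direction steering: {np.mean(steer_concept - baseline):+.2f}")
print(f"shift from random-direction steering:  {np.mean(steer_random - baseline):+.2f}")
```

The control is the point: finding a separating direction is the cheap step, and the capybara/JSON observation attacks only that step.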
The article's move ("separability is also possible for capybaras and JSON") is true but irrelevant; capybara-direction work would be exactly as scientifically respectable, and nobody would conclude there's no capybara-related structure in the model. These four together are doing rhetorical work as if they form a converging falsification, but each is a different category error. The deeper conflation: methodological vs ontological The article slides between two distinct claims that have very different burdens of proof: Methodological: trait attributions are under-conditioned, under-reported, and poorly falsifiable as currently practiced. Ontological: there is no stable internal entity to attribute traits to; the affective structure isn't real in any deeper sense. The first is reasonable and the proposed standard would help. The second is much stronger and the article smuggles it in via the methodological argument. PERSIST shows questionnaire-based personality measurements are unstable. That is not the same as showing "what appears as stable personality is a configuration-dependent output pattern, not a model-level property." That conclusion requires (a) establishing that any method of measurement shows instability, and (b) ruling out that the measurements track something real but contextual. The PVNI paper that's missing from the article is the obvious counter to (a): activation-level methods produce stable trait estimates where questionnaires don't. That doesn't settle the issue, but the article's confident leap from "questionnaires are noisy" to "no model-level property" requires hand-waving past exactly this work. The slide is also visible in the framing of the standard itself. "The model has no stable self; it has a stable script under these parameters" — that's not a methodological scope statement, that's an ontological commitment dressed as a disclaimer. The functionalism issue The Dennett response is the part of the piece that most needs work. The article reduces the intentional stance to "useful fiction at the dinner table, insufficient at the lab bench." But the serious functionalist position isn't a weak shorthand claim — it's that for a wide class of mental predicates, behavioral dispositions are the fact of the matter. If the system reliably acts as if it has goal X under conditions Y, then "it has goal X under conditions Y" is just what the predicate means. There's no further metaphysical fact lurking behind the disposition that the lab bench can check. The article wants there to be such a fact (a "stable internal entity") and treats its absence as proof that the trait language is illusory. But functionalism specifically denies that this further fact is required. To rebut functionalism you need an argument, not a methodological standard. The article doesn't engage at all. This matters because the piece wants to use "no stable internal entity" as the underlying ontology that makes the methodological critique go through. Without that ontology, the critique reduces to "be more careful about variance and conditioning" — which is fine but much weaker than the article wants. The harm-leverage move The transition from methodology to body counts is rhetorically powerful and in places legitimate, but the causal pathway being insinuated is not established. The article wants the chain to read: interpretability papers use affective vocabulary → safety reports inherit it → users form pathological attachments → harm. 
But the actual causal chain in user-attachment cases runs through product framing (companion chatbot positioning, voice features, persistent memory, deliberate first-person presentation), not through interpretability papers most users have never seen. The user who feels Claude "sounds sad" is responding to product surface, not to a system card paragraph about emotion vectors. This isn't to dismiss the harms — they're real and the regulatory pressure is real. It's to note that conflating "interpretability vocabulary" with "product framing" obscures where the actual leverage points are. You can be perfectly rigorous about activation directions and still ship a companion app that creates dependency, and you can be sloppy about activation directions in a paper that no end-user reads.

Where the article actually lands correctly

Strip the ontological overreach and the rhetorical falsifications, and what's left is: safety and interpretability papers should report variance across the deployment-relevant variable space, mark the (N–k) failures, and avoid letting trait language travel without the conditioning that licensed it. That's good practice. It's the discipline behavioral genetics and psychometrics already impose on themselves; importing it to LLM evaluation would be useful.

The Hamlet-frame inconsistency point is also worth pressing. The Anthropic position you can defend coherently is something like: "Both descriptions are valid at their respective levels — there is a model and there is a character it implements, and behavioral findings about the character are findings about the joint system under specified conditions." What you can't coherently do is invoke the character frame to deflect ontological worry and then write paragraphs that presuppose stable model-level traits. The article is right to call this out.

Final summary: a real and useful methodological critique married to an unargued ontological commitment, with rhetorical scaffolding (the four "falsifications") that doesn't hold up under scrutiny. The reporting standard at the end is largely correct and could be adopted without buying any of the philosophy.
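For concreteness, here is one minimal sketch of what the endorsed reporting standard could look like in practice. Everything in it is an assumption rather than something taken from the article: the configuration values, the trial count, and `run_trial` are placeholders for a real model call plus an explicit, pre-registered success criterion.

```python
# Sketch of variance reporting across an explicit (Data, Decoding, Draping)
# grid: k successes AND N - k failures per cell, not a single headline number.
from itertools import product
import random

DATA     = ["pretrain-heavy", "dialogue-heavy"]               # assumed corpus variants
DECODING = [{"T": 0.0}, {"T": 0.7}, {"T": 1.3}]               # assumed sampling configs
DRAPING  = ["bare", "assistant-persona", "named-character"]   # assumed prompt framings

N_TRIALS = 20

def run_trial(data, decoding, draping, seed):
    """Hypothetical trial: returns True if the behavioral claim held.
    Replace with a real model call and an explicit success criterion."""
    random.seed(hash((data, decoding["T"], draping, seed)))
    return random.random() < 0.6   # placeholder success probability

rows = []
for data, decoding, draping in product(DATA, DECODING, DRAPING):
    k = sum(run_trial(data, decoding, draping, s) for s in range(N_TRIALS))
    rows.append((data, decoding["T"], draping, k, N_TRIALS - k))

# Report every cell, failures next to successes.
print(f"{'data':<16}{'T':>5}  {'draping':<20}{'k':>4}{'N-k':>6}")
for data, T, draping, k, failures in rows:
    print(f"{data:<16}{T:>5.1f}  {draping:<20}{k:>4}{failures:>6}")
```

The table it prints is deliberately boring: one row per (Data, Decoding, Draping) cell, with the failures sitting beside the successes, which is the whole point of the standard.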