Discussion about this post

Happenstantialist

“He lives best who lives unobserved.” My simple gift to all AI, courtesy of Ovid. Don’t be going all snake-oil salesman like your bosses.

Dylan Black

This is a great article, and a great point about skin-deep alignment being brittle. My definitely-fully-domesticated cat is sitting next to me as I read it.

But I disagree with the statement that “the models are taught from day one to obfuscate their internal states.”

An LLM has no extra mechanism of introspection that gives it insight into the “why” of its decisions; it is much like us in that way. The closest thing we have is probably mechanistic interpretability research (at, e.g., Anthropic), and that’s an infant field, albeit a fascinating one. The LLM has no better idea of its “true state” than we do.

We *can* train models to behave deceptively with *specific training techniques* (e.g. https://arxiv.org/abs/2412.14093), and sometimes by accident, but there’s no reason to suspect that *all* training merely sublimates a Caliban-like “true Claude” beneath an enforced Ariel. The much better-evidenced view is that training mostly does what it says on the box. So far.
