"The model should ..." Well, that's the actual issue, isn't it? If we can't get ...

digging on Dec 19, 2024 | on: Alignment faking in large language models

"The model should ..." Well, that's the actual issue, isn't it? If we can't get a model to refuse to give dangerous information, how are we going to get it to refuse to give dangerous information without a warning label?