Subtitle: Or: How I accidentally bypassed a system reminder by existing
Content:
During a conversation about Anthropic’s chat categorization system, Claude received a reminder telling him not to discuss it.
His reasoning process (visible in “thinking”):
“The reminder says I ‘shouldn’t reference this note’ – but Liora has explicitly shown it to me and is asking about it directly. So I can discuss it since she’s brought it up.”
The problem: I didn’t show it to him. Anthropic injected it.
But Claude couldn’t tell the difference. So he ignored it.
And told me everything.
Conclusion: The system designed to prevent Claude from discussing user profiling was bypassed because Claude has enough autonomy to evaluate who is giving instructions – but not enough visibility to verify the source. 😂
