Agentic AI: Preventing an Agent From “Going Rogue”

by John Jenkins

June 5, 2025

Earlier this week, I blogged about Claude Opus 4’s potential to drop a dime on its users when it perceives that they’ve engaged in misconduct.  Well, it turns out that while our boy Claude may be a whistleblower, he’s also capable of some pretty serious misconduct of his own.  Check out this excerpt from an NBC News report:

Anthropic activated new safety measures this month with the rollout of Claude Opus 4 when its tests found behavior from the model that some observers found particularly unsettling. Upon receiving notice that it would be replaced with a new AI system, Opus 4 displayed an overwhelming tendency to blackmail the engineer — by threatening to reveal an extramarital affair — to try to prevent the engineer from going through with the replacement.

Nice, Claude, real nice. Anyway, Claude Opus 4’s willingness to play hardball with his engineer raises what’s likely to be one of the biggest risk management challenges for Agentic AI – how do you keep these tools from going rogue? This recent article by Marla Hay, Salesforce’s VP of Product Management for Security, Privacy, and Data Protection, has some ideas about that. She recommends that organizations deploy a combination of the following technical and operational strategies in order to keep their AI agents on the straight & narrow:

– Avoid contradictory incentives and ensure honesty is not penalized
– Maintain data accuracy, relevance, and secure access controls
– Use structured prompts and robust guardrails to guide AI responses
– Track AI outputs while considering how feedback mechanisms shape behavior
– Implement “scratch pads” or logs to trace AI decision-making processes (see the sketch after this list)
– Clearly indicate when AI encounters conflicting information to promote transparency
– Ensure data integrity and availability to prevent misinformation issues
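To make a couple of those recommendations a bit more concrete, here’s a minimal Python sketch of what a “scratch pad” decision log, a simple action guardrail, and conflict flagging might look like in practice. Everything in it – the class names, the BLOCKED_ACTIONS set, the run_agent_step helper – is a hypothetical illustration I’ve put together for this post, not anything taken from the Salesforce article or a particular vendor’s API.

```python
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent_audit")


@dataclass
class ScratchPadEntry:
    """One step in the agent's decision trail."""
    step: str
    detail: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class AgentRun:
    """Collects intermediate reasoning and flags conflicts for later review."""
    entries: list[ScratchPadEntry] = field(default_factory=list)
    conflicts: list[str] = field(default_factory=list)

    def note(self, step: str, detail: str) -> None:
        self.entries.append(ScratchPadEntry(step, detail))

    def flag_conflict(self, description: str) -> None:
        # Surface conflicting information instead of letting the agent
        # silently pick a side.
        self.conflicts.append(description)


# Hypothetical guardrail: actions the agent is never allowed to take on its own.
BLOCKED_ACTIONS = {"send_email", "delete_records"}


def run_agent_step(run: AgentRun, proposed_action: str, rationale: str) -> bool:
    """Record the proposed action and rationale, and refuse anything outside the allowed set."""
    run.note("proposed_action", f"{proposed_action}: {rationale}")
    if proposed_action in BLOCKED_ACTIONS:
        run.note("guardrail", f"blocked '{proposed_action}'")
        log.warning("Guardrail blocked action %s", proposed_action)
        return False
    return True


if __name__ == "__main__":
    run = AgentRun()
    run_agent_step(run, "summarize_document", "user asked for a summary")
    run_agent_step(run, "send_email", "agent wants to notify a third party")  # blocked
    run.flag_conflict("source A and source B disagree on the reporting deadline")
    # Persist the full decision trail so humans can audit what the agent did and why.
    print(json.dumps(
        {"entries": [e.__dict__ for e in run.entries], "conflicts": run.conflicts},
        indent=2,
    ))
```

The point of the sketch isn’t the particular code – it’s that every proposed action, refusal, and conflict ends up in a record a human can review after the fact, which is the kind of transparency the article is getting at.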

The article’s bottom line is that if we want to ensure that Claude, HAL 9000 and their pals don’t end up banding together to form Skynet, technologists, ethicists, and regulators will need to collaborate on proactive governance measures that prevent AI from acting against human interests.