Agentic AI Security: Why 97% Detection Is a Failing Grade

TL;DR
- In agentic systems, 97% attack detection is not comfort. It means some attacks still get through, and one successful attack can be enough.
- The real danger is normalisation of deviance: teams keep seeing near misses, nothing catastrophic happens, and the unsafe design starts to feel acceptable.
- Secure agent design starts with containment, not confidence: reduce privileges, isolate untrusted context, gate high-risk actions, and cut one leg off the lethal trifecta.
In agentic systems, 97% attack detection is a failing grade.
In most software contexts that number sounds fine: spam filters, search relevance, feature classification. But agentic systems are not most software contexts.
If an AI system can read private information, consume untrusted inputs, and take actions or exfiltrate data, a three per cent failure rate is not a rounding error. It is the incident rate. That is the number you should assume eventually becomes leaked records, unauthorised actions, exposed secrets, or a public post-mortem nobody enjoys writing.
This is the security argument most AI roadmaps still underweight. The conversation gets framed as model accuracy, prompt quality, or jailbreak resistance. The real question is architectural: what damage becomes possible when the model is wrong once?
That is why 97% is a failing grade.
One successful attack is enough
Security people already understand this instinctively. Product teams often do not.
If I told you an agent saw 1,000 hostile or manipulated inputs in a month and your defence stack stopped 97% of them, that still means 30 got through. In a customer support chatbot that might be a nuisance. In an email agent with access to contracts, billing systems, customer records, and outbound actions, it is a disaster queue.
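The arithmetic is worth making explicit. A minimal sketch (the function name and the volumes are illustrative, not real traffic figures):

```python
# Back-of-the-envelope incident maths: residual attacks past a detection
# layer. The residual is the incident rate, not a rounding error.

def expected_breaches(hostile_inputs: int, detection_rate: float) -> int:
    """Attacks expected to slip past detection in a given period."""
    return round(hostile_inputs * (1 - detection_rate))

print(expected_breaches(1_000, 0.97))   # 30 get through in a month
print(expected_breaches(10_000, 0.97))  # 300 at ten times the volume
```

The point of writing it down is that the number scales with exposure, not with model quality alone: doubling traffic doubles the incident rate at the same detection score.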
The mistake is treating agent security like a quality problem.
It is not a quality problem. It is a blast-radius problem.
A content generation feature can tolerate occasional weirdness because the consequence of a bad output is low and visible. An autonomous or semi-autonomous agent cannot tolerate the same error rate if the output can trigger payment, disclose private data, approve access, or message the wrong party.
This is why I have been so insistent that AI governance must be risk-tiered. The security expectations for an internal summary tool and a tool-calling agent are not remotely the same. Treating them alike is sloppy product thinking.
The real risk is normalisation of deviance
The most dangerous part of insecure agent design is not that teams do not know the risks exist.
Often they do.
They know the model can be manipulated by malicious instructions hidden inside documents, webpages, emails, or tickets. They know excessive permissions are dangerous. They know autonomous actions raise the stakes.
What happens instead is more subtle. The team launches anyway. Nothing catastrophic happens in week one. Nothing catastrophic happens in week two. A few strange behaviours show up, but they are recoverable. The incident queue stays quiet.
Unsafe becomes familiar.
Familiar becomes acceptable.
That pattern has a name outside AI: normalisation of deviance. A system keeps getting away with small violations of safe practice, so the organisation starts treating those violations as normal operating conditions. The absence of a visible disaster is misread as proof of safety.
That is exactly how many agent deployments feel right now.
You can see the temptation. The demos work. Customers want the magic. The team knows the next model release might improve attack resistance anyway. So the architecture ships with a little too much trust, a few too many permissions, and a lot of optimism.
Optimism is not a control.
Prompt injection is a product architecture problem
One reason teams underestimate this class of risk is the name. "Prompt injection" sounds like a prompting issue, so the fix sounds like better instructions, better system prompts, or stricter refusal behaviour.
That is not enough.
The deeper problem is that models do not reliably maintain the same trust boundaries that normal software systems do. An agent can be told, in effect, to treat malicious content as meaningful instruction. If that content arrives through a webpage, email, or document the model is allowed to read, the boundary between tool input and attacker input gets fuzzy fast.
So do not ask only, "Can the model detect bad instructions?"
Ask:
- What sensitive data can it reach?
- What untrusted content can it ingest?
- What actions can it take if compromised?
That framing changes product decisions.
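Those three questions can be captured as data rather than left as review-meeting vibes. A hypothetical sketch (the `ThreatProfile` class and the example agent are invented for illustration):

```python
from dataclasses import dataclass, field

# Hypothetical sketch: record the three architectural questions per agent,
# so a review reasons about blast radius rather than detection accuracy.

@dataclass
class ThreatProfile:
    sensitive_data: list[str] = field(default_factory=list)    # what can it reach?
    untrusted_inputs: list[str] = field(default_factory=list)  # what can it ingest?
    actions: list[str] = field(default_factory=list)           # what can it do?

    def lethal_trifecta(self) -> bool:
        """True when all three legs are present at once."""
        return bool(self.sensitive_data and self.untrusted_inputs and self.actions)

email_agent = ThreatProfile(
    sensitive_data=["contracts", "billing", "customer records"],
    untrusted_inputs=["inbound email", "attachments"],
    actions=["send email", "update billing"],
)
print(email_agent.lethal_trifecta())  # True: all three legs present
```

An agent that fails this check is not automatically safe, but one that passes it with all three legs populated deserves a very different level of scrutiny before launch.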
The route to safer agents is not just a smarter model. It is a narrower system.
That is why boring agents that work are strategically underrated. Narrow workflows, limited permissions, explicit SOPs, and constrained actions are not less impressive. They are more survivable.
Cut one leg off the trifecta
The simplest useful security heuristic for agentic systems is brutally practical: if you cannot solve the full problem, remove one leg of the lethal trifecta.
Remove private data access
If the system only works on low-sensitivity data, a successful manipulation attempt has less to steal. This is the logic behind hosted sandboxes, demo environments, and test repos. It does not make the agent safe. It makes failure cheaper.
Remove untrusted input exposure
If the agent never reads arbitrary documents, emails, webpages, or user-provided text, the attack surface shrinks sharply. Many teams will hate this because it reduces the magic. That does not make it the wrong trade-off.
Remove autonomous action
If the agent can analyse but not send, recommend but not approve, draft but not execute, the damage profile changes. This is often the most practical first step in regulated or high-stakes environments.
You do not need a perfect answer to get meaningfully safer.
You do need the discipline to cut capability where the risk justifies it.
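Removing the action leg in practice can be as blunt as changing which tool the agent is given. A hypothetical sketch, assuming an email workflow where sending is moved out of the agent's reach entirely:

```python
# Hypothetical sketch of cutting the "autonomous action" leg: the agent can
# draft outbound messages, but a separate, non-agent code path does the
# sending after human review.

drafts: list[dict] = []

def draft_email(to: str, body: str) -> str:
    """The only tool exposed to the agent: it queues, it never sends."""
    drafts.append({"to": to, "body": body, "status": "pending_review"})
    return f"Draft queued for review ({len(drafts)} pending)"

# Even a fully manipulated agent can at worst fill the review queue.
print(draft_email("customer@example.com", "Your invoice is attached."))
```

The design choice is that no code reachable from the model's tool calls has send permissions at all, which is a stronger guarantee than asking the model to be careful.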
Human approval only works when it is rare
There is another trap here. Teams realise autonomous execution is risky, so they add human approval to everything.
That sounds responsible. It often fails.
If the human is asked to approve five actions a minute, they stop reviewing and start clicking. The human becomes theatre. This is why good agent design does not mean "put a person in every loop". It means putting a person in the right loops, at the high-risk junctions where approval meaningfully changes the risk profile.
The product task is not "keep a human involved somehow".
The task is to design the workflow so the human sees the few decisions that matter most. I covered the broader operational version of this in The Agentic Safety Inspection. The same principle applies in security. Sparse, meaningful review beats constant low-value approval prompts.
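One way to keep review sparse is to tier actions explicitly and route only the high-risk tier to a person. A minimal sketch, assuming an invented set of action names:

```python
# Hypothetical sketch: only actions in the high-risk set reach a human;
# the rest execute with logging. Sparse review keeps approval meaningful.

HIGH_RISK = {"send_payment", "grant_access", "external_email"}

def requires_human(action: str) -> bool:
    return action in HIGH_RISK

queue = ["summarise_ticket", "draft_reply", "send_payment", "tag_ticket"]
for_review = [a for a in queue if requires_human(a)]
print(for_review)  # only the payment reaches a person
```

If the reviewer sees four prompts a day instead of five a minute, each approval is a genuine decision rather than a reflex click.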
Security is now a product design decision
This is the part many teams still resist.
They want security to sit downstream of product ambition. Product designs the magical agent. Security reviews it later. Governance tidies up the risk.
That sequence no longer works.
In agentic systems, security decisions are product decisions from the start. Scope, permissions, workflow boundaries, approval paths, and access to customer context are all part of the user experience and the threat model at the same time.
A PM who says "we'll sort that out with security later" is not postponing implementation detail. They are deferring a core product decision.
I would go further. In AI products, one dimension of product quality is whether the system fails safely under adversarial conditions. A delightful agent that exposes private information is not a good product. It is an attractive liability.
What to do now
If you are shipping or evaluating agentic systems today, I would start with five questions.
- Where does the agent get exposed to untrusted text, documents, webpages, or messages?
- What private data can it access if manipulated?
- What tools can it call, and what is the worst action those tools enable?
- Which approvals are genuinely high risk, and which ones are just adding noise?
- Which leg of the trifecta can you remove before launch?
That last question is the most important one.
Do not wait for the perfect security architecture. Most teams do not have one. Do not pretend model upgrades alone will save you. They will help, then they will move the line, then attackers will adjust.
Start by reducing the damage a successful attack can cause.
That is a product strategy. It is also the difference between a recoverable incident and a business-ending one.
The uncomfortable truth
The market clearly wants more autonomous AI systems. Every month the demos get better, the agent wrappers get more convincing, and the pressure to ship something magical increases.
That pressure does not change the maths.
If your architecture assumes the model will recognise almost every attack, you do not have a security strategy. You have a hope strategy. Hope is fine for toy workflows and internal experiments. It is irresponsible once the system can touch real customer data or take real external action.
The teams that win in agentic AI will not be the ones that ignore this.
They will be the ones that design for the day the model gets fooled, because eventually it will.
Frequently Asked Questions
Are you saying autonomous agents should never ship?
No. I am saying the acceptable error rate depends on the blast radius. Low-risk workflows can tolerate far more autonomy than high-risk workflows involving private data, regulated decisions, or external action.
Is prompt injection the only issue here?
No. It is one class of failure. The larger point is that agentic systems often blur trust boundaries between trusted instructions, untrusted content, tool access, and actions. The architecture matters more than the label.
What is the first practical step for a product team?
Map the agent against the lethal trifecta. List its data access, untrusted input surfaces, and actions. Then remove one leg before launch. That single design decision often reduces risk more than another week spent tightening prompts.
Logan Lincoln
Product executive and AI builder based in Brisbane, Australia. Nine years in regulated B2B SaaS, currently shipping production AI platforms. Written from experience in AI governance at Cotality.


