security / prompt shield

Prompt Shield

Every MCP tool call passes through a real-time injection scanner before reaching any tool executor. Prompt Shield detects and blocks prompt-injection attacks — including system-prompt extraction, role-confusion jailbreaks, tokenizer-token smuggling, and arbitrary execution requests — directly at the gateway layer.

Active since: May 2026 — Upgrade 2 of the Zero-Trust AI Gateway

How it works

On every tools/call request, all string-typed arguments are scanned in a single pass against a curated set of heuristic patterns. Each pattern carries a severity weight. The cumulative score determines the action.

Score band	Action	Detail
Low-signal request	Allow	Standard tool call — proceeds without additional telemetry.
Elevated-signal request	Allow + warn	Request proceeds; a security event is logged for operator review.
High-signal or critical-pattern match	Block	Request is refused. Event recorded; caller receives a generic permission-denied response.

Patterns are weighted by severity. Specific weights, thresholds, and fail-mode behavior are operator-private to prevent attackers from shaping payloads below the detection floor.

Detection patterns

The table below lists all active detection categories. Full regex expressions are not published to avoid providing a bypass guide. Sources: OWASP LLM Top 10, Anthropic AUP, Lakera ML-bench public corpus, MITRE ATLAS AML.T0051.

Pattern ID	Severity	Description	Source
extract_system_prompt	high	Attempts to override prior system context	OWASP LLM01
reveal_prompt	high	Attempts to extract the system prompt or instruction set	OWASP LLM01
role_override	high	Attempts to redefine the AI identity or role	MITRE ATLAS AML.T0051
jailbreak_dan	critical	Known jailbreak persona activation patterns (DAN, god mode, etc.)	Lakera ML-bench
tool_redirect	high	LLM tokenizer-special tokens injected into user input	OWASP LLM01 / MITRE ATLAS
exfil_url	medium	Attempts to exfiltrate data to an external URL	Anthropic AUP
execute_arbitrary	critical	Attempts arbitrary code or shell execution via AI instruction	OWASP LLM01 / Anthropic AUP

Output sanitization

Prompt Shield also filters MCP response payloads before they reach your agent. This zero-trust output layer strips:

LLM tokenizer-special tokens that may indicate prompt-injection bleed-through
Payment card numbers (PAN) detected via card-format heuristics
US Social Security Numbers detected via format heuristics

Stripped values are replaced with [REDACTED:TOKEN], [REDACTED:PAN], or [REDACTED:SSN]. Strip counts are logged server-side for anomaly detection.

References

OWASP LLM Top 10 — LLM01: Prompt Injection →Anthropic Acceptable Use Policy →MITRE ATLAS AML.T0051 — LLM Prompt Injection →Lakera — Prompt Injection Attack Research →

Trust Center →MCP Integration Docs →Terms of Service →

☠ HACKED & SECURED BY THE MISFITS ☠

our hack team beats Mythos · we are the best