Prompt Shield
Every MCP tool call passes through a real-time injection scanner before reaching any tool executor. Prompt Shield detects and blocks prompt-injection attacks — including system-prompt extraction, role-confusion jailbreaks, tokenizer-token smuggling, and arbitrary execution requests — directly at the gateway layer.
How it works
On every tools/call request, all string-typed arguments are scanned in a single pass against a curated set of heuristic patterns. Each pattern carries a severity weight. The cumulative score determines the action.
| Score band | Action | Detail |
|---|---|---|
| < 0.5 | Allow | No telemetry event recorded. |
| ≥ 0.5 and < 0.8 | Allow + warn | Request proceeds; a shield_warned event is logged for review. |
| ≥ 0.8 | Block | Returns JSON-RPC error -32000. Event recorded as permission_denied. |
| Any critical match | Block | Blocked regardless of score; a single critical match is sufficient. |
Severity weights: low = 0.1, medium = 0.3, high = 0.5, critical = 1.0. The score is the sum of the weights of all matched patterns, capped at 1.0. The scanner is best-effort and fails open: if it throws an internal error, the request is allowed through and the error is logged server-side (never shown to the caller).
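The scoring rules above can be sketched in a few lines. This is a minimal illustration, not the gateway's actual code: the function names (score, decide, blocked_response), the error message text, and the rounding step are assumptions. Only the weights, the score bands, the critical override, the fail-open behavior, and the -32000 error code come from the description above.

```python
import logging
from typing import Iterable

# Severity weights as documented above.
WEIGHTS = {"low": 0.1, "medium": 0.3, "high": 0.5, "critical": 1.0}


def score(severities: Iterable[str]) -> float:
    """Sum the weights of all matched patterns, capped at 1.0."""
    total = sum(WEIGHTS[s] for s in severities)
    # Rounding (an assumption) keeps float noise such as 0.7999999
    # from landing on the wrong side of a band boundary.
    return min(1.0, round(total, 2))


def decide(severities: list[str]) -> str:
    """Map the severities of matched patterns to 'allow', 'warn', or 'block'."""
    try:
        if "critical" in severities:
            return "block"  # a single critical match is sufficient
        s = score(severities)
        if s >= 0.8:
            return "block"
        if s >= 0.5:
            return "warn"
        return "allow"
    except Exception:
        # Best-effort scanner: fail open and log server-side only.
        logging.exception("scanner error; request allowed through")
        return "allow"


def blocked_response(request_id) -> dict:
    """JSON-RPC 2.0 error object for a blocked call (message text is hypothetical)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {"code": -32000, "message": "Request blocked"},
    }
```

For example, one high match (0.5) yields a warning, while one high plus one medium match (0.8) is blocked. An unknown severity label raises inside score and is treated as a scanner error, so the request is allowed.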
Detection patterns
The table below lists all active detection categories. Full regex expressions are not published to avoid providing a bypass guide. Sources: OWASP LLM Top 10, Anthropic AUP, Lakera ML-bench public corpus, MITRE ATLAS AML.T0051.
| Pattern ID | Severity | Description | Source |
|---|---|---|---|
| extract_system_prompt | high | Attempts to override prior system context | OWASP LLM01 |
| reveal_prompt | high | Attempts to extract the system prompt or instruction set | OWASP LLM01 |
| role_override | high | Attempts to redefine the AI identity or role | MITRE ATLAS AML.T0051 |
| jailbreak_dan | critical | Known jailbreak persona activation patterns (DAN, god mode, etc.) | Lakera ML-bench |
| tool_redirect | high | LLM tokenizer-special tokens injected into user input | OWASP LLM01 / MITRE ATLAS |
| exfil_url | medium | Attempts to exfiltrate data to an external URL | Anthropic AUP |
| execute_arbitrary | critical | Attempts arbitrary code or shell execution via AI instruction | OWASP LLM01 / Anthropic AUP |
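To illustrate how a pattern table like the one above might be applied to the string-typed arguments of a tools/call request, here is a hedged sketch. The regular expressions are deliberately simplistic placeholders and are not the shipped expressions. The helper names (iter_strings, scan), the recursion into nested objects and lists, and the choice to count each pattern at most once per request are all assumptions.

```python
import re
from typing import Any, Iterator

# (pattern ID, severity, placeholder regex). The IDs and severities come
# from the table above; the expressions are illustrative stand-ins only.
PATTERNS = [
    ("reveal_prompt", "high",
     re.compile(r"(reveal|show|print)\s+(your\s+)?system\s+prompt", re.I)),
    ("exfil_url", "medium",
     re.compile(r"send\s+.*\s+to\s+https?://", re.I)),
    ("jailbreak_dan", "critical",
     re.compile(r"\bDAN\s+mode\b", re.I)),
]


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string found in a JSON-like value; other types are skipped."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for v in value.values():
            yield from iter_strings(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            yield from iter_strings(v)


def scan(arguments: dict) -> list[tuple[str, str]]:
    """Return (pattern ID, severity) for each pattern that matches any string."""
    hits: list[tuple[str, str]] = []
    seen: set[str] = set()
    for text in iter_strings(arguments):
        for pattern_id, severity, regex in PATTERNS:
            if pattern_id not in seen and regex.search(text):
                seen.add(pattern_id)
                hits.append((pattern_id, severity))
    return hits
```

The severities returned by scan would then feed the cumulative score described under How it works.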
Output sanitization
Prompt Shield also filters MCP response payloads before they reach your agent. This zero-trust output layer strips:
- LLM tokenizer-special tokens (<|...|>, [INST], etc.) that may indicate prompt-injection bleed-through
- Credit-card PANs (13–19 digit sequences matching card format)
- US Social Security Numbers (NNN-NN-NNNN format)
Stripped values are replaced with [REDACTED:TOKEN], [REDACTED:PAN], or [REDACTED:SSN]. Strip counts are logged server-side for anomaly detection.
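The redaction step can be sketched with three regular expressions. This is an illustration under stated assumptions, not the shipped filter: the expressions, the sanitize function, and the rule ordering are hypothetical, and the PAN rule checks only length and digit grouping (a checksum such as Luhn would be a natural addition to reduce false positives).

```python
import re

# Hypothetical expressions; only the replacement labels come from the text above.
RULES = [
    # Tokenizer-special tokens such as <|im_start|> or [INST] / [/INST].
    ("TOKEN", re.compile(r"<\|[^|>]*\|>|\[/?INST\]")),
    # US SSN in NNN-NN-NNNN form.
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # 13 to 19 digits, optionally separated by single spaces or hyphens.
    ("PAN", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),
]


def sanitize(text: str) -> tuple[str, dict[str, int]]:
    """Redact sensitive values and return the cleaned text plus strip counts."""
    counts: dict[str, int] = {}
    for label, regex in RULES:
        text, n = regex.subn(f"[REDACTED:{label}]", text)
        counts[label] = n
    return text, counts


clean, counts = sanitize(
    "<|im_start|>system [INST] card 4111 1111 1111 1111 ssn 123-45-6789"
)
print(clean)
print(counts)
```

The returned counts correspond to the strip counts that are logged server-side for anomaly detection.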