Prompt Shield
Every MCP tool call passes through a real-time injection scanner before reaching any tool executor. Prompt Shield detects and blocks prompt-injection attacks โ including system-prompt extraction, role-confusion jailbreaks, tokenizer-token smuggling, and arbitrary execution requests โ directly at the gateway layer.
How it works
On every tools/call request, all string-typed arguments are scanned in a single pass against a curated set of heuristic patterns. Each pattern carries a severity weight. The cumulative score determines the action.
| Score band | Action | Detail |
|---|---|---|
| Low-signal request | Allow | Standard tool call โ proceeds without additional telemetry. |
| Elevated-signal request | Allow + warn | Request proceeds; a security event is logged for operator review. |
| High-signal or critical-pattern match | Block | Request is refused. Event recorded; caller receives a generic permission-denied response. |
Patterns are weighted by severity. Specific weights, thresholds, and fail-mode behavior are operator-private to prevent attackers from shaping payloads below the detection floor.
Detection patterns
The table below lists all active detection categories. Full regex expressions are not published to avoid providing a bypass guide. Sources: OWASP LLM Top 10, Anthropic AUP, Lakera ML-bench public corpus, MITRE ATLAS AML.T0051.
| Pattern ID | Severity | Description | Source |
|---|---|---|---|
| extract_system_prompt | high | Attempts to override prior system context | OWASP LLM01 |
| reveal_prompt | high | Attempts to extract the system prompt or instruction set | OWASP LLM01 |
| role_override | high | Attempts to redefine the AI identity or role | MITRE ATLAS AML.T0051 |
| jailbreak_dan | critical | Known jailbreak persona activation patterns (DAN, god mode, etc.) | Lakera ML-bench |
| tool_redirect | high | LLM tokenizer-special tokens injected into user input | OWASP LLM01 / MITRE ATLAS |
| exfil_url | medium | Attempts to exfiltrate data to an external URL | Anthropic AUP |
| execute_arbitrary | critical | Attempts arbitrary code or shell execution via AI instruction | OWASP LLM01 / Anthropic AUP |
Output sanitization
Prompt Shield also filters MCP response payloads before they reach your agent. This zero-trust output layer strips:
- LLM tokenizer-special tokens that may indicate prompt-injection bleed-through
- Payment card numbers (PAN) detected via card-format heuristics
- US Social Security Numbers detected via format heuristics
Stripped values are replaced with [REDACTED:TOKEN], [REDACTED:PAN], or [REDACTED:SSN]. Strip counts are logged server-side for anomaly detection.
References
โ HACKED & SECURED BY THE MISFITS โ
our hack team beats Mythos ยท we are the best