Benchmarking Prompt Injection Detection for Web Agents
DOI:
https://doi.org/10.64751/ijdim.2026.v5.n2.pp131-137Keywords:
Prompt Injection; LLM Security; Web Agents; Adversarial Machine Learning; Benchmarking; Llama Guard; Perplexity Detection; AI Safety; Natural Language Processing; Cybersecurity.Abstract
Large Language Model (LLM)-powered web agents are rapidly being deployed to automate consequential digital tasks, from managing electronic communications to executing financial transactions. This expanded capability simultaneously introduces prompt injection as a critical security threat: adversarially crafted instructions, whether embedded in user input or retrieved from malicious web content, can override an agent's intended directives and cause unauthorised actions. A fundamental obstacle to countering this threat is the absence of a standardised, reproducible evaluation framework; existing studies either demonstrate novel attacks without assessing defences, or propose detection techniques validated only against narrow, private datasets. This paper presents an open-source benchmarking framework that enables fair, comparative evaluation of prompt injection detection methods in the specific operational context of web agents. The framework contributes three primary artefacts: (i) a curated dataset of 5,000 labelled prompts spanning four attack categories—direct keyword injection, direct paraphrase injection, indirect web-page injection, and multi-turn chain attacks—organised under a new webagent-specific taxonomy; (ii) a modular Detector Adapter Hub that allows plug-and-play integration of diverse detection models through a common BaseDetector interface; and (iii) a Benchmark Orchestrator that evaluates all registered detectors against identical inputs and aggregates accuracy, false positive rate, F1-score, and inference latency. Experiments on three representative detectors—a rulebased keyword filter, a GPT-2 perplexity scorer, and Meta's Llama Guard-7B zero-shot classifier—reveal that no single method is universally superior: the rule-based filter achieves sub-millisecond latency but misses 32% of paraphrase and indirect attacks; the perplexity detector attains 90% recall at the cost of a 35% false positive rate; and Llama Guard delivers 94.5% accuracy and a 4% false positive rate but incurs 520 ms perprompt latency. Error analysis motivates a layered defence architecture in which lightweight detectors serve as first-pass filters routing uncertain prompts to more accurate but costlier classifiers.
Downloads
Published
Issue
Section
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.






