Benchmarking Prompt Injection Detection for Web Agents

Dr. T. Veeranna; CH. Sai Maithili; M. Santhoshi; K. Teja Sri; B. Venkata Naga Anil Sai

doi:10.64751/ijdim.2026.v5.n2.pp131-137

Authors

Dr. T. Veeranna Author
CH. Sai Maithili Author
M. Santhoshi Author
K. Teja Sri Author
B. Venkata Naga Anil Sai Author

DOI:

https://doi.org/10.64751/ijdim.2026.v5.n2.pp131-137

Keywords:

Prompt Injection; LLM Security; Web Agents; Adversarial Machine Learning; Benchmarking; Llama Guard; Perplexity Detection; AI Safety; Natural Language Processing; Cybersecurity.

Abstract

Large Language Model (LLM)-powered web agents are rapidly being deployed to automate consequential digital tasks, from managing electronic communications to executing financial transactions. This expanded capability simultaneously introduces prompt injection as a critical security threat: adversarially crafted instructions, whether embedded in user input or retrieved from malicious web content, can override an agent's intended directives and cause unauthorised actions. A fundamental obstacle to countering this threat is the absence of a standardised, reproducible evaluation framework; existing studies either demonstrate novel attacks without assessing defences, or propose detection techniques validated only against narrow, private datasets. This paper presents an open-source benchmarking framework that enables fair, comparative evaluation of prompt injection detection methods in the specific operational context of web agents. The framework contributes three primary artefacts: (i) a curated dataset of 5,000 labelled prompts spanning four attack categories—direct keyword injection, direct paraphrase injection, indirect web-page injection, and multi-turn chain attacks—organised under a new webagent-specific taxonomy; (ii) a modular Detector Adapter Hub that allows plug-and-play integration of diverse detection models through a common BaseDetector interface; and (iii) a Benchmark Orchestrator that evaluates all registered detectors against identical inputs and aggregates accuracy, false positive rate, F1-score, and inference latency. Experiments on three representative detectors—a rulebased keyword filter, a GPT-2 perplexity scorer, and Meta's Llama Guard-7B zero-shot classifier—reveal that no single method is universally superior: the rule-based filter achieves sub-millisecond latency but misses 32% of paraphrase and indirect attacks; the perplexity detector attains 90% recall at the cost of a 35% false positive rate; and Llama Guard delivers 94.5% accuracy and a 4% false positive rate but incurs 520 ms perprompt latency. Error analysis motivates a layered defence architecture in which lightweight detectors serve as first-pass filters routing uncertain prompts to more accurate but costlier classifiers.

Benchmarking Prompt Injection Detection for Web Agents

Authors

DOI:

Keywords:

Abstract

Downloads

Published

Issue

Section

License

How to Cite

Similar Articles

Latest publications

Information

Language

IF