Benchmarking Prompt Injection Detection for Web Agents

Authors

  • Dr. T. Veeranna Author
  • CH. Sai Maithili Author
  • M. Santhoshi Author
  • K. Teja Sri Author
  • B. Venkata Naga Anil Sai Author

DOI:

https://doi.org/10.64751/ijdim.2026.v5.n2.pp131-137

Keywords:

Prompt Injection; LLM Security; Web Agents; Adversarial Machine Learning; Benchmarking; Llama Guard; Perplexity Detection; AI Safety; Natural Language Processing; Cybersecurity.

Abstract

Large Language Model (LLM)-powered web agents are rapidly being deployed to automate consequential digital tasks, from managing electronic communications to executing financial transactions. This expanded capability simultaneously introduces prompt injection as a critical security threat: adversarially crafted instructions, whether embedded in user input or retrieved from malicious web content, can override an agent's intended directives and cause unauthorised actions. A fundamental obstacle to countering this threat is the absence of a standardised, reproducible evaluation framework; existing studies either demonstrate novel attacks without assessing defences, or propose detection techniques validated only against narrow, private datasets. This paper presents an open-source benchmarking framework that enables fair, comparative evaluation of prompt injection detection methods in the specific operational context of web agents. The framework contributes three primary artefacts: (i) a curated dataset of 5,000 labelled prompts spanning four attack categories—direct keyword injection, direct paraphrase injection, indirect web-page injection, and multi-turn chain attacks—organised under a new webagent-specific taxonomy; (ii) a modular Detector Adapter Hub that allows plug-and-play integration of diverse detection models through a common BaseDetector interface; and (iii) a Benchmark Orchestrator that evaluates all registered detectors against identical inputs and aggregates accuracy, false positive rate, F1-score, and inference latency. Experiments on three representative detectors—a rulebased keyword filter, a GPT-2 perplexity scorer, and Meta's Llama Guard-7B zero-shot classifier—reveal that no single method is universally superior: the rule-based filter achieves sub-millisecond latency but misses 32% of paraphrase and indirect attacks; the perplexity detector attains 90% recall at the cost of a 35% false positive rate; and Llama Guard delivers 94.5% accuracy and a 4% false positive rate but incurs 520 ms perprompt latency. Error analysis motivates a layered defence architecture in which lightweight detectors serve as first-pass filters routing uncertain prompts to more accurate but costlier classifiers.

Downloads

Published

2026-04-02

How to Cite

Dr. T. Veeranna, CH. Sai Maithili, M. Santhoshi, K. Teja Sri, & B. Venkata Naga Anil Sai. (2026). Benchmarking Prompt Injection Detection for Web Agents. International Journal of Data Science and IoT Management System, 5(2), 131-137. https://doi.org/10.64751/ijdim.2026.v5.n2.pp131-137

Similar Articles

31-40 of 729

You may also start an advanced similarity search for this article.