defense 2025

Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers

Santhosh KumarRavindran

0 citations · 51 references · arXiv

α

Published on arXiv

2510.04528

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Achieves 92% prompt injection detection accuracy, 65% reduction in deceptive outputs, and 78% fairness improvement across 700+ experiments on Llama-3.1 (405B), GPT-4o, and Claude-3.5

UTDMF (adversarial activation patching)

Novel technique introduced


The rapid adoption of large language models (LLMs) in enterprise systems exposes vulnerabilities to prompt injection attacks, strategic deception, and biased outputs, threatening security, trust, and fairness. Extending our adversarial activation patching framework (arXiv:2507.09406), which induced deception in toy networks at a 23.9% rate, we introduce the Unified Threat Detection and Mitigation Framework (UTDMF), a scalable, real-time pipeline for enterprise-grade models like Llama-3.1 (405B), GPT-4o, and Claude-3.5. Through 700+ experiments per model, UTDMF achieves: (1) 92% detection accuracy for prompt injection (e.g., jailbreaking); (2) 65% reduction in deceptive outputs via enhanced patching; and (3) 78% improvement in fairness metrics (e.g., demographic bias). Novel contributions include a generalized patching algorithm for multi-threat detection, three groundbreaking hypotheses on threat interactions (e.g., threat chaining in enterprise workflows), and a deployment-ready toolkit with APIs for enterprise integration.


Key Contributions

  • Generalized activation patching algorithm for simultaneous multi-threat detection (prompt injection, deception, bias) in billion-parameter LLMs
  • Three enterprise-applicable hypotheses: Threat Chaining (H1), Activation Forecasting (H2), and Inverse Scaling Safety Law (H3) with associated novel metrics
  • Deployment-ready open-source toolkit with RESTful APIs for integration with Azure ML, AWS SageMaker, and Google Cloud AI

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
white_boxinference_time
Datasets
synthetic enterprise datasetsLlama-3.1 405B evaluationGPT-4o evaluationClaude-3.5 evaluation
Applications
enterprise llm securityfinancial auditinghealthcare aifraud detectionmulti-agent workflows