Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers
Published on arXiv: 2510.04528
Tags: Prompt Injection (OWASP LLM Top 10 — LLM01); Excessive Agency (OWASP LLM Top 10 — LLM08)
Key Finding: Achieves 92% prompt injection detection accuracy, 65% reduction in deceptive outputs, and 78% fairness improvement across 700+ experiments on Llama-3.1 (405B), GPT-4o, and Claude-3.5
Novel technique introduced: UTDMF (adversarial activation patching)
The rapid adoption of large language models (LLMs) in enterprise systems exposes them to prompt injection attacks, strategic deception, and biased outputs, threatening security, trust, and fairness. Extending our adversarial activation patching framework (arXiv:2507.09406), which induced deception in toy networks at a 23.9% rate, we introduce the Unified Threat Detection and Mitigation Framework (UTDMF): a scalable, real-time pipeline for enterprise-grade models such as Llama-3.1 (405B), GPT-4o, and Claude-3.5. Across 700+ experiments per model, UTDMF achieves (1) 92% detection accuracy for prompt injection (e.g., jailbreaking), (2) a 65% reduction in deceptive outputs via enhanced patching, and (3) a 78% improvement in fairness metrics (e.g., demographic bias). Novel contributions include a generalized patching algorithm for multi-threat detection, three hypotheses on threat interactions (e.g., threat chaining in enterprise workflows), and a deployment-ready toolkit with APIs for enterprise integration.
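The abstract's core mechanic is activation patching: substituting a hidden activation captured from one forward pass into another to localize (or induce) a behavior. The sketch below is a minimal illustration of that idea on a toy two-layer network, not the paper's UTDMF implementation; the network, weights, and variable names are all assumptions for demonstration.

```python
import numpy as np

# Toy stand-in for a transformer block stack: a two-layer MLP.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8)); b1 = rng.standard_normal(8)
W2 = rng.standard_normal((8, 2)); b2 = rng.standard_normal(2)

def forward(x, patch=None):
    """Forward pass; optionally overwrite the hidden activation (the patch)."""
    h = np.maximum(x @ W1 + b1, 0.0)   # hidden-layer activation
    if patch is not None:
        h = patch                       # activation patching step
    return h @ W2 + b2, h

clean_x = rng.standard_normal(4)      # "clean" input
corrupt_x = rng.standard_normal(4)    # "corrupted" (e.g., adversarial) input

clean_out, clean_h = forward(clean_x)
patched_out, _ = forward(corrupt_x, patch=clean_h)

# Patching the clean hidden state into the corrupted run recovers the clean
# output exactly, since the output layer sees only the substituted activation;
# this localizes the behavioral difference to that layer.
assert np.allclose(patched_out, clean_out)
```

In an adversarial variant of this procedure, the roles are reversed: activations from a corrupted run are patched into a clean one to induce the unwanted behavior and measure which components carry it.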
Key Contributions
- Generalized activation patching algorithm for simultaneous multi-threat detection (prompt injection, deception, bias) in billion-parameter LLMs
- Three enterprise-applicable hypotheses with associated novel metrics: Threat Chaining (H1), Activation Forecasting (H2), and the Inverse Scaling Safety Law (H3)
- Deployment-ready open-source toolkit with RESTful APIs for integration with Azure ML, AWS SageMaker, and Google Cloud AI