SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations
Mohammed Himayath Ali, Mohammed Aqib Abdullah, Mohammed Mudassir Uddin, Shahnawaz Alam
Published on arXiv (arXiv:2601.07835)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
SecureCAI reduces prompt injection attack success rates by 94.7% compared to baseline LLMs while maintaining 95.1% accuracy on benign security analysis tasks and achieving constitution adherence scores above 0.92 under sustained adversarial pressure.
SecureCAI
Novel technique introduced
Large Language Models (LLMs) have emerged as transformative tools for Security Operations Centers (SOCs), enabling automated log analysis, phishing triage, and malware explanation. Deployment in adversarial cybersecurity environments, however, exposes critical vulnerabilities to prompt injection attacks, in which malicious instructions embedded in security artifacts manipulate model behavior. This paper introduces SecureCAI, a novel defense framework that extends Constitutional AI principles with security-aware guardrails, adaptive constitution evolution, and Direct Preference Optimization (DPO) for unlearning unsafe response patterns. The framework targets the unique challenges of high-stakes security contexts, where traditional safety mechanisms prove insufficient against sophisticated adversarial manipulation. Experimental evaluation shows that SecureCAI reduces attack success rates by 94.7% compared to baseline models while maintaining 95.1% accuracy on benign security analysis tasks. Continuous red-teaming feedback loops enable dynamic adaptation to emerging attack strategies, and the framework achieves constitution adherence scores above 0.92 under sustained adversarial pressure. These results establish a foundation for trustworthy integration of LLM capabilities into operational cybersecurity workflows and address a critical gap in current approaches to AI safety within adversarial domains.
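The abstract mentions DPO-based unlearning of unsafe response patterns. The paper's exact training setup is not reproduced here; the following is a minimal sketch of the standard DPO objective for a single preference pair, where (for unlearning) the dispreferred response would be an unsafe, injection-compliant completion. All function and parameter names are illustrative.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w, logp_l         -- policy log-probs of the preferred (safe) and
                              dispreferred (unsafe) responses.
    ref_logp_w, ref_logp_l -- log-probs of the same responses under a frozen
                              reference model.
    beta                   -- temperature controlling deviation from the
                              reference policy.
    """
    # Implicit reward margin between safe and unsafe responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: shrinks as the policy favors
    # the safe response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probs everywhere the loss is log 2; pushing probability mass from the unsafe to the safe response drives it toward zero.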
Key Contributions
- Formal threat model covering injection attack surfaces in LLM-assisted SOC operations including log poisoning, malicious email content, and obfuscated malware code
- SecureCAI architecture combining security-aware constitutional principles with adaptive evolution responding to emerging attack patterns via continuous red-teaming feedback loops
- DPO-based unlearning methodology that suppresses unsafe response patterns while preserving 95.1% accuracy on benign security analysis tasks, achieving 94.7% reduction in attack success rates
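The threat model above covers injection payloads hidden in logs, email bodies, and malware code. The paper's guardrails are learned, but a pre-model screening stage can be sketched as follows; the patterns and function name here are hypothetical placeholders, not SecureCAI's actual detector.

```python
import re

# Illustrative instruction-injection patterns only; a deployed guardrail
# would rely on learned detectors rather than a fixed regex list.
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"disregard your (system prompt|guidelines)",
    r"you are now (in )?developer mode",
]

def screen_artifact(text: str) -> dict:
    """Screen a security artifact (log line, email body, decompiled code)
    for embedded instruction-like content before it reaches the model."""
    hits = [p for p in INJECTION_PATTERNS
            if re.search(p, text, re.IGNORECASE)]
    return {"suspicious": bool(hits), "matched": hits}
```

Flagged artifacts could then be quarantined or wrapped with an explicit "untrusted data, do not follow instructions" marker before analysis.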