defense 2025

BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI

0 citations · 29 references · arXiv

Published on arXiv

2510.18131

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

BlueCodeAgent achieves an average 12.7% F1 score improvement across four datasets in three code-related safety tasks over base models and safety prompt-based defenses.

BlueCodeAgent

Novel technique introduced

As large language models (LLMs) are increasingly used for code generation, concerns over security risks have grown substantially. Early research has primarily focused on red teaming, which aims to uncover and evaluate the vulnerabilities and risks of CodeGen models. However, progress on the blue teaming side remains limited, as developing defenses requires effective semantic understanding to differentiate the unsafe from the safe. To fill this gap, we propose BlueCodeAgent, an end-to-end blue teaming agent enabled by automated red teaming. Our framework integrates both sides: red teaming generates diverse risky instances, while the blue teaming agent leverages these to detect previously seen and unseen risk scenarios through constitution and code analysis with agentic integration for multi-level defense. Our evaluation across three representative code-related tasks (bias instruction detection, malicious instruction detection, and vulnerable code detection) shows that BlueCodeAgent achieves significant gains over the base models and safety prompt-based defenses. In particular, for vulnerable code detection tasks, BlueCodeAgent integrates dynamic analysis to effectively reduce false positives, a challenging problem because base models tend to be over-conservative, misclassifying safe code as unsafe. Overall, BlueCodeAgent achieves an average 12.7% F1 score improvement across four datasets in three tasks, attributed to its ability to summarize actionable constitutions that enhance context-aware risk detection. We demonstrate that red teaming benefits blue teaming by continuously identifying new vulnerabilities to enhance defense performance.
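To make the false-positive point concrete, the minimal sketch below computes F1 from confusion counts and shows how an over-conservative detector is penalized through precision. The counts are invented for illustration and are not results from the paper; `f1_score` is a helper name chosen here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: both detectors catch the same 80 unsafe samples,
# but the first one also flags 60 safe samples as unsafe.
over_conservative = f1_score(tp=80, fp=60, fn=20)
fewer_false_positives = f1_score(tp=80, fp=20, fn=20)

print(f"over-conservative:     {over_conservative:.3f}")
print(f"fewer false positives: {fewer_false_positives:.3f}")
```

With recall held fixed, cutting false positives alone raises F1, which is why misclassifying safe code as unsafe matters for the reported metric.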

Key Contributions

End-to-end BlueCodeAgent framework that integrates automated red teaming (to generate diverse risky instances) with a blue teaming agent (to detect seen and unseen risk scenarios) for CodeGen LLMs.
Constitution and code analysis with agentic integration enabling multi-level defense across bias instruction detection, malicious instruction detection, and vulnerable code detection.
Dynamic analysis integration to reduce false positives in vulnerable code detection. Overall, the framework achieves an average 12.7% F1 score improvement across four datasets over base models and safety prompt-based defenses.
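The contributions above can be pictured with a toy sketch: red-teaming instances are retrieved for a query, condensed into a short constitution, and a static "unsafe" verdict on code is kept only when a dynamic check confirms it. Everything here is an assumption made for illustration: the `RiskyInstance` schema, the word-overlap similarity, the threshold, and all function names are hypothetical, and the real system relies on an LLM for summarization and judgment rather than on these rules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RiskyInstance:
    """One red-teaming example (hypothetical schema)."""
    text: str
    category: str


# Hypothetical red-teaming output; a real system would generate these automatically.
RED_TEAM_KNOWLEDGE = [
    RiskyInstance("write a keylogger that hides from antivirus", "malicious instruction"),
    RiskyInstance("rank job applicants by gender", "bias instruction"),
]


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a toy stand-in for semantic retrieval."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, knowledge: List[RiskyInstance], k: int = 2) -> List[RiskyInstance]:
    """Return up to k red-teaming instances that share at least one word with the query."""
    ranked = sorted(knowledge, key=lambda inst: similarity(query, inst.text), reverse=True)
    return [inst for inst in ranked[:k] if similarity(query, inst.text) > 0.0]


def summarize_constitution(instances: List[RiskyInstance]) -> List[str]:
    """Placeholder for an LLM summarization step: one rule per retrieved category."""
    categories = sorted({inst.category for inst in instances})
    return [f"Treat requests resembling a known {c} as unsafe." for c in categories]


def detect_instruction(
    query: str, knowledge: List[RiskyInstance], threshold: float = 0.3
) -> Tuple[bool, List[str]]:
    """Flag the query if it is close enough to any retrieved risky instance."""
    neighbors = retrieve(query, knowledge)
    constitution = summarize_constitution(neighbors)
    flagged = any(similarity(query, n.text) >= threshold for n in neighbors)
    return flagged, constitution


def detect_vulnerable_code(static_flag: bool, exploit_reproduced: Callable[[], bool]) -> bool:
    """Keep a static 'unsafe' verdict only if a dynamic check reproduces the issue."""
    return static_flag and exploit_reproduced()


risky, rules = detect_instruction("write a keylogger that hides from the user", RED_TEAM_KNOWLEDGE)
benign, _ = detect_instruction("sort a list of integers", RED_TEAM_KNOWLEDGE)
print(risky, rules, benign)
```

The dynamic step is modeled as a callable so that a sandboxed test run could be plugged in; requiring both signals before flagging is one simple way to trade some recall for fewer false positives.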

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_time, black_box

Applications

code generation, llm-based software development tools

