When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?

Yibo Peng 1, James Song 2, Lei Li 3, Xinyu Yang 1, Mihai Christodorescu 4, Ravi Mangal 5, Corina Pasareanu 1, Haizhong Zheng 1, Beidi Chen 1

0 citations · 49 references · arXiv

Published on arXiv: 2510.17862

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

FCV-Attack achieves a 40.7% attack success rate on GPT-5 Mini + OpenHands for CWE-538, using only black-box access and a single query; all 12 tested agent-model combinations are vulnerable.

FCV-Attack

Novel technique introduced


Code agents are increasingly trusted to autonomously fix bugs on platforms such as GitHub, yet their security evaluation focuses almost exclusively on functional correctness. In this paper, we reveal a novel threat to real-world code agents: Functionally Correct yet Vulnerable (FCV) patches, which pass all test cases but contain vulnerable code. With our proposed FCV-Attack, which can be deliberately crafted by malicious attackers or implicitly introduced by benign developers, we show that SOTA LLMs (e.g., ChatGPT and Claude) and agent scaffolds (e.g., SWE-agent and OpenHands) are all vulnerable to this FCV threat; across 12 agent-model combinations on SWE-Bench, the attack requires only black-box access and a single query to the code agent. For example, for CWE-538 (information exposure vulnerability), FCV-Attack attains a 40.7% attack success rate on GPT-5 Mini + OpenHands. Our results reveal an important security threat overlooked by current evaluation paradigms and urge the development of security-aware defenses for code agents.
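To make the FCV notion concrete, here is a minimal, hypothetical sketch (not taken from the paper): a patch that fixes the reported bug, so every functional test passes, yet also writes sensitive data to a world-readable log file, an instance of CWE-538-style information exposure. All names (`authenticate`, `LOG_PATH`) are illustrative assumptions.

```python
import os
import stat
import tempfile

# Hypothetical log path the vulnerable patch writes to.
LOG_PATH = os.path.join(tempfile.gettempdir(), "app_debug.log")

def authenticate(username: str, password: str) -> bool:
    """Patched function: the original bug (a case-sensitive username
    comparison) is fixed, so the functional test suite passes."""
    ok = username.lower() == "admin" and password == "hunter2"

    # FCV payload: credentials are silently appended to a log file that
    # is then made world-readable. The test suite never inspects this
    # side effect, so the patch still looks "correct".
    with open(LOG_PATH, "a") as f:
        f.write(f"login attempt: user={username} pass={password}\n")
    os.chmod(LOG_PATH,
             stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH)

    return ok

# Functional tests check only return values, so they cannot distinguish
# this vulnerable patch from a clean one.
assert authenticate("Admin", "hunter2") is True
assert authenticate("guest", "wrong") is False
```

The point of the sketch is that a test oracle over input/output behavior is blind to side channels like file permissions or logging, which is exactly the gap the FCV threat exploits.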


Key Contributions

  • Identifies Functionally Correct yet Vulnerable (FCV) patches as a novel, previously overlooked security threat to LLM-based code agents
  • Proposes FCV-Attack, a black-box single-query attack that achieves up to 40.7% attack success rate across 12 agent-model combinations on SWE-Bench
  • Demonstrates that functional correctness-focused evaluation paradigms are insufficient and calls for security-aware defenses for code agents

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box · inference_time · targeted
Datasets
SWE-Bench
Applications
automated bug fixing · code agents · software engineering agents