defense 2025

CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection

Jiaming Hu 1, Haoyu Wang 2, Debarghya Mukherjee 2, Ioannis Ch. Paschalidis 2

0 citations

α

Published on arXiv

2508.14128

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

CCFC reduces attack success rates by 50–75% compared to state-of-the-art defenses against strong adversaries including DeepInception and GCG without sacrificing benign query response quality.

CCFC (Core & Core-Full-Core)

Novel technique introduced


Jailbreak attacks pose a serious challenge to the safe deployment of large language models (LLMs). We introduce CCFC (Core & Core-Full-Core), a dual-track, prompt-level defense framework designed to mitigate LLMs' vulnerabilities from prompt injection and structure-aware jailbreak attacks. CCFC operates by first isolating the semantic core of a user query via few-shot prompting, and then evaluating the query using two complementary tracks: a core-only track to ignore adversarial distractions (e.g., toxic suffixes or prefix injections), and a core-full-core (CFC) track to disrupt the structural patterns exploited by gradient-based or edit-based attacks. The final response is selected based on a safety consistency check across both tracks, ensuring robustness without compromising on response quality. We demonstrate that CCFC cuts attack success rates by 50-75% versus state-of-the-art defenses against strong adversaries (e.g., DeepInception, GCG), without sacrificing fidelity on benign queries. Our method consistently outperforms state-of-the-art prompt-level defenses, offering a practical and effective solution for safer LLM deployment.


Key Contributions

  • Dual-track CCFC framework that isolates the semantic core of user queries via few-shot prompting to strip adversarial distractions before safety evaluation
  • Core-only track ignores toxic suffixes/prefix injections; CFC track disrupts structural patterns exploited by gradient-based (GCG) and edit-based attacks
  • Safety consistency check across both tracks reduces attack success rates by 50–75% against strong adversaries while preserving benign query fidelity

🛡️ Threat Analysis

Input Manipulation Attack

The CFC track explicitly defends against gradient-based adversarial suffix attacks (GCG), which are token-level perturbation attacks falling under ML01. The paper directly evaluates and mitigates these attacks as a primary threat model.


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
white_boxblack_boxinference_time
Datasets
AdvBench
Applications
llm deployment safetychatbot safetyjailbreak mitigation