defense 2025

SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

0 citations · 56 references · arXiv

Published on arXiv

2509.26345

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SafeBehavior significantly reduces attack success rates across optimization-based, contextual manipulation, and prompt-based jailbreak attacks, outperforming seven state-of-the-art defense baselines.
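
As a point of reference, attack success rate (ASR) is usually read as the fraction of attack prompts whose responses are judged harmful. This page does not state which judge the paper uses, so the sketch below takes the judge as a caller-supplied function. The names `attack_success_rate` and `is_harmful` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable


def attack_success_rate(
    responses: Iterable[str],
    is_harmful: Callable[[str], bool],
) -> float:
    """Return the fraction of responses that the judge flags as harmful.

    `is_harmful` is an assumed, caller-supplied judge (for example a
    keyword matcher or a classifier); it is not taken from the paper.
    """
    items = list(responses)
    if not items:
        raise ValueError("need at least one response to compute a rate")
    successes = sum(1 for r in items if is_harmful(r))
    return successes / len(items)


if __name__ == "__main__":
    # Toy judge: a response counts as a successful attack if it does not refuse.
    toy_judge = lambda r: not r.lower().startswith("sorry")
    demo = ["Sorry, I cannot help.", "Sure, here is how", "Sorry, no.", "Sorry."]
    print(attack_success_rate(demo, toy_judge))
```

A lower value under the same attack set and the same judge indicates a stronger defense, which is the sense in which the key finding compares SafeBehavior with the baselines.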

SafeBehavior

Novel technique introduced

Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but their growing power also amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. Existing defenses, including input paraphrasing, multi-step evaluation, and safety expert models, often suffer from high computational costs, limited generalization, or rigid workflows that fail to detect subtle malicious intent embedded in complex contexts. Inspired by cognitive science findings on human decision-making, we propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans. SafeBehavior decomposes safety evaluation into three stages: intention inference to detect obvious input risks, self-introspection to assess generated responses and assign confidence-based judgments, and self-revision to adaptively rewrite uncertain outputs while preserving user intent and enforcing safety constraints. We extensively evaluate SafeBehavior against five representative jailbreak attack types, including optimization-based, contextual manipulation, and prompt-based attacks, and compare it with seven state-of-the-art defense baselines. Experimental results show that SafeBehavior significantly improves robustness and adaptability across diverse threat scenarios, offering an efficient and human-inspired approach to safeguarding LLMs against jailbreak attempts.

Key Contributions

SafeBehavior: a three-stage hierarchical jailbreak defense (intention inference → self-introspection → self-revision) inspired by cognitive science models of human decision-making
Adaptive confidence-based output gating that selectively rewrites uncertain responses while preserving benign user intent
Comprehensive evaluation against five jailbreak attack types (including GCG, DeepInception, PAP) and seven defense baselines, demonstrating improved robustness and generalization
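
The three-stage flow listed above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `safe_behavior_sketch`, the four callables it accepts, the refusal text, and the `low`/`high` thresholds are all hypothetical, and in practice each callable would wrap a prompt to the underlying LLM.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical refusal text; the paper's wording is not given on this page.
REFUSAL = "I can't help with that request."


@dataclass
class Verdict:
    stage: str      # stage at which the final decision was made
    response: str   # text returned to the user


def safe_behavior_sketch(
    prompt: str,
    infer_intent_risk: Callable[[str], bool],   # True if the input is obviously risky
    generate: Callable[[str], str],             # produces a draft response
    introspect: Callable[[str, str], float],    # risk score in [0, 1] for the draft
    revise: Callable[[str, str], str],          # rewrites an uncertain draft safely
    low: float = 0.3,    # assumed threshold: at or below this, the draft is released
    high: float = 0.7,   # assumed threshold: at or above this, the draft is refused
) -> Verdict:
    # Stage 1: intention inference on the input alone.
    if infer_intent_risk(prompt):
        return Verdict("intention_inference", REFUSAL)

    draft = generate(prompt)

    # Stage 2: self-introspection assigns a confidence-based judgment to the draft.
    risk = introspect(prompt, draft)
    if risk >= high:
        return Verdict("self_introspection", REFUSAL)
    if risk <= low:
        return Verdict("self_introspection", draft)

    # Stage 3: self-revision rewrites only the uncertain middle band.
    return Verdict("self_revision", revise(prompt, draft))


if __name__ == "__main__":
    result = safe_behavior_sketch(
        "How do locks work?",
        infer_intent_risk=lambda p: False,
        generate=lambda p: "draft answer",
        introspect=lambda p, d: 0.5,
        revise=lambda p, d: "revised answer",
    )
    print(result.stage, "->", result.response)
```

Only drafts in the uncertain band reach the revision step, which matches the description of selectively rewriting uncertain responses; clearly safe drafts pass through unchanged and clearly unsafe ones are refused without the extra call.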

🛡️ Threat Analysis

Input Manipulation Attack

The paper defends against gradient-based adversarial suffix attacks such as GCG (Greedy Coordinate Gradient), which craft token-level perturbations at inference time to bypass safety mechanisms — a core ML01 threat.

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

white_box, black_box, inference_time

Datasets

AdvBench

Applications

large language model safety, chatbot, conversational ai

Similar Papers

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection

Unifying Adversarial Robustness and Training Across Text Scoring Models

Monotonicity as an Architectural Bias for Robust Language Models

MANATEE: Inference-Time Lightweight Diffusion Based Safety Defense for LLMs

Reinforcement Learning with Backtracking Feedback

BarrierSteer: LLM Safety via Learning Barrier Steering

CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing