
NeuroBreak: Unveil Internal Jailbreak Mechanisms in Large Language Models

Chuhan Zhang, Ye Zhang, Bowen Shi, Yuyou Gan, Tianyu Du, Shouling Ji, Dazhan Deng, Yingcai Wu


Published on arXiv: 2509.03985

Prompt Injection

OWASP LLM Top 10 (LLM01)

Key Finding

NeuroBreak reveals neuron-level mechanisms underlying diverse jailbreak attacks and offers mechanistic insights to inform next-generation LLM defense strategies.

NeuroBreak

Novel technique introduced


In deployment and application, large language models (LLMs) typically undergo safety alignment to prevent illegal and unethical outputs. However, the continuous advancement of jailbreak techniques, which craft adversarial prompts to bypass these safety mechanisms, places increasing pressure on LLM security defenses. Strengthening resistance to jailbreak attacks requires an in-depth understanding of LLMs' security mechanisms and vulnerabilities, yet the vast number of parameters and complex structure of LLMs make analyzing security weaknesses from an internal perspective a challenging task. This paper presents NeuroBreak, a top-down jailbreak analysis system designed to analyze neuron-level safety mechanisms and mitigate vulnerabilities. We carefully design system requirements in collaboration with three experts in AI security. The system provides a comprehensive analysis of various jailbreak attack methods. By incorporating layer-wise representation probing, NeuroBreak offers a novel perspective on the model's decision-making process across its generation steps. Furthermore, the system supports the analysis of critical neurons from both semantic and functional perspectives, facilitating a deeper exploration of security mechanisms. We conduct quantitative evaluations and case studies to verify the effectiveness of our system, offering mechanistic insights for developing next-generation defenses against evolving jailbreak attacks.


Key Contributions

  • NeuroBreak: a top-down jailbreak analysis system providing neuron-level visibility into LLM safety mechanisms and their vulnerabilities
  • Layer-wise representation probing analysis that tracks the model's decision-making process across generation steps during jailbreak attempts
  • Critical neuron identification from both semantic and functional perspectives to expose the internal logic behind safety bypass
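The two analysis ideas above can be sketched in miniature. The snippet below is an illustrative toy, not the paper's implementation: synthetic arrays stand in for a real model's per-layer hidden states, with harmful-vs-benign labels becoming linearly separable only at deeper layers and only along a few dimensions. A logistic-regression probe is trained per layer (layer-wise probing), and final-layer neurons are ranked by the absolute difference in their mean activation between the two classes (a simple proxy for critical-neuron identification). All shapes, magnitudes, and scoring choices here are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_layers = 200, 16, 6

# Synthetic stand-in for per-layer hidden states: harmful vs. benign
# prompts (labels) become linearly separable only at deeper layers,
# and only along the first 4 dimensions.
labels = rng.integers(0, 2, n)
signal = np.zeros(d)
signal[:4] = 1.0
hidden = [
    rng.normal(0.0, 1.0, (n, d))
    + np.outer(labels - 0.5, signal) * 5.0 * layer / (n_layers - 1)
    for layer in range(n_layers)
]

def train_probe(X, y, lr=0.5, steps=300):
    """Fit a logistic-regression probe with plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y):
    w, b = train_probe(X, y)
    return float((((X @ w + b) > 0) == y).mean())

# Layer-wise probing: accuracy rises as the "safety signal" emerges.
accs = [probe_accuracy(h, labels) for h in hidden]

# Critical-neuron identification: rank final-layer neurons by the
# absolute difference of mean activation between the two classes.
importance = np.abs(
    hidden[-1][labels == 1].mean(0) - hidden[-1][labels == 0].mean(0)
)
top_neurons = np.argsort(importance)[-4:]

print("probe accuracy by layer:", [round(a, 2) for a in accs])
print("most class-sensitive neurons:", sorted(top_neurons.tolist()))
```

In this toy setup, probe accuracy climbs from near chance at the first layer to near perfect at the last, and the top-ranked neurons coincide with the dimensions that actually carry the class signal; on a real model, the same two readouts would be computed from captured hidden activations instead.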

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Applications
llm safety analysis, jailbreak auditing, ai security research