Defense · 2025

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

Darpan Aswal 1,2, Céline Hudelot 2

0 citations


Published on arXiv: 2508.16325

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ConceptGuard provides generalizable, fine-tuning-free jailbreak defenses by exploiting a shared activation geometry across jailbreak attack themes in the LLM's representation space.

ConceptGuard

Novel technique introduced


Large Language Models have found success in a variety of applications. However, their safety remains a concern due to the existence of various jailbreaking methods. Despite significant effort, alignment and safety fine-tuning provide only a limited degree of robustness against jailbreak attacks that covertly mislead LLMs toward generating harmful content. This leaves them prone to a range of vulnerabilities, including targeted misuse and accidental user profiling. This work introduces ConceptGuard, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, ConceptGuard enables building robust safety guardrails, offering fully explainable and generalizable defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in the mechanistic interpretability of LLMs, our approach provides evidence for a shared activation geometry for jailbreak attacks in the representation space, a potential foundation for designing more interpretable and generalizable safeguards against attackers.
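
To make the guardrail idea concrete, here is a minimal sketch (not the authors' code) of how an SAE-feature-based check could work: encode a prompt's residual-stream activations with a trained SAE and flag the prompt if latents previously identified as jailbreak concepts fire strongly. The `SparseAutoencoder` class, the `concept_ids` indices, and the threshold are all illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of an SAE-feature guardrail. Assumes a trained sparse
# autoencoder over residual-stream activations and a set of SAE latent
# indices previously identified (via interpretability analysis) as
# jailbreak-theme concepts.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Simple SAE of the kind commonly used for LLM interpretability."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps the latent code sparse and non-negative.
        return torch.relu(self.enc(x))

def flag_jailbreak(resid_acts: torch.Tensor,
                   sae: SparseAutoencoder,
                   concept_ids: list[int],
                   threshold: float = 0.5) -> bool:
    """Flag a prompt if any identified jailbreak-concept latent fires strongly.

    resid_acts: (seq_len, d_model) residual-stream activations for the prompt.
    concept_ids: hypothetical SAE latent indices tied to jailbreak themes.
    """
    latents = sae.encode(resid_acts)            # (seq_len, d_sae)
    concept_strength = latents[:, concept_ids]  # activations of concept latents
    return bool((concept_strength.max() > threshold).item())

# Toy usage with random weights and activations; in practice the SAE is
# trained on real model activations and concept_ids come from analysis.
sae = SparseAutoencoder(d_model=768, d_sae=16384)
acts = torch.randn(12, 768)
print(flag_jailbreak(acts, sae, concept_ids=[101, 2048, 9001]))
```

Because the check reads interpretable SAE latents rather than an opaque classifier score, a flagged prompt can be explained in terms of which named jailbreak concepts activated, which is the explainability property the abstract emphasizes.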


Key Contributions

  • ConceptGuard framework using Sparse Autoencoders (SAEs) to extract semantically meaningful internal LLM representations associated with jailbreak themes
  • Explainable, fine-tuning-free safety guardrails built from interpretable jailbreak concepts identified in activation space
  • Empirical evidence for a shared activation geometry across diverse jailbreak attack types, suggesting a common internal representation of adversarial intent (see the sketch below)
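
One simple way the "shared activation geometry" claim could be probed is to compare mean activation directions across jailbreak themes: if prompts from different attack themes point in similar directions in representation space, their pairwise cosine similarities will be high. The sketch below is an illustrative assumption, not the paper's protocol; the theme names and random activation batches are placeholders.

```python
# Illustrative probe of shared activation geometry across jailbreak themes.
# Theme names and data are placeholders; real use would collect per-prompt
# activations from an actual model for each attack theme.
import torch
import torch.nn.functional as F

def mean_direction(acts: torch.Tensor) -> torch.Tensor:
    """Unit-norm mean of per-prompt activations, shape (n_prompts, d_model)."""
    return F.normalize(acts.mean(dim=0), dim=0)

themes = {
    "roleplay": torch.randn(50, 768),       # placeholder activation batches
    "obfuscation": torch.randn(50, 768),
    "hypotheticals": torch.randn(50, 768),
}
dirs = {name: mean_direction(a) for name, a in themes.items()}

# High pairwise cosine similarity between theme directions would support a
# shared geometry; the random data here will sit near zero.
names = list(dirs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = torch.dot(dirs[a], dirs[b]).item()
        print(f"cos({a}, {b}) = {sim:+.3f}")
```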

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Applications
llm safety guardrails, chatbot content moderation, jailbreak detection