defense arXiv Feb 1, 2026
Yassine Abbahaddou, Céline Hudelot, Charlotte Laclau et al. · École Polytechnique · CentraleSupélec +4 more
Defends GNNs against adversarial graph perturbations via orthonormalization and noise-based techniques, alongside representation and generalization contributions
Input Manipulation Attack graph
Graph Neural Networks (GNNs) have emerged as powerful tools for learning representations from structured data. Despite their growing popularity and success across various applications, GNNs face several challenges that limit their generalization, their robustness to adversarial perturbations, and the effectiveness of their representation learning. In this dissertation, I investigate these core aspects through three main contributions: (1) developing new representation learning techniques based on Graph Shift Operators (GSOs), aiming for enhanced performance across various contexts and applications; (2) introducing generalization-enhancing methods through graph data augmentation; and (3) developing more robust GNNs by leveraging orthonormalization techniques and noise-based defenses against adversarial attacks. By addressing these challenges, my work provides a more principled understanding of the limitations and potential of GNNs.
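To make the GSO contribution concrete, here is a minimal sketch, not the dissertation's exact formulation, of a parameterized graph shift operator from the commonly studied family that interpolates between standard operators (adjacency, Laplacian, normalized variants) via scalar weights; the parameter names m1, m2, m3, e1, e2 are illustrative assumptions.

```python
# Hypothetical sketch of a parameterized Graph Shift Operator (GSO):
#   S = m1 * D^e1 + m2 * D^e2 @ A @ D^e2 + m3 * I
# where all five scalars could be treated as learnable parameters.
import numpy as np

def parameterized_gso(A: np.ndarray, m1: float, m2: float, m3: float,
                      e1: float, e2: float) -> np.ndarray:
    n = A.shape[0]
    deg = A.sum(axis=1)
    # Guard against isolated nodes before taking negative powers.
    deg = np.where(deg > 0, deg, 1.0)
    D_e1 = np.diag(deg ** e1)
    D_e2 = np.diag(deg ** e2)
    return m1 * D_e1 + m2 * (D_e2 @ A @ D_e2) + m3 * np.eye(n)

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)

# m1=0, m2=1, e2=-0.5 recovers the symmetric normalized adjacency
# D^{-1/2} A D^{-1/2} (GCN additionally adds self-loops first);
# m1=1, e1=1, m2=-1, e2=0 recovers the combinatorial Laplacian D - A.
S_gcn_like  = parameterized_gso(A, m1=0.0, m2=1.0,  m3=0.0, e1=1.0, e2=-0.5)
S_laplacian = parameterized_gso(A, m1=1.0, m2=-1.0, m3=0.0, e1=1.0, e2=0.0)
```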
gnn École Polytechnique · CentraleSupélec · Télécom Paris +3 more
defense arXiv Aug 22, 2025
Darpan Aswal, Céline Hudelot · Université Paris-Saclay · CentraleSupélec
Defends LLMs against jailbreaks by using sparse autoencoders to identify interpretable internal activation concepts linked to attack themes
Prompt Injection nlp
Large Language Models have found success in a variety of applications. However, their safety remains a concern due to the existence of various jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning provide only a limited degree of robustness against jailbreak attacks that covertly mislead LLMs into generating harmful content, leaving them prone to a range of vulnerabilities, including targeted misuse and accidental user profiling. This work introduces ConceptGuard, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, ConceptGuard enables building robust safety guardrails, offering fully explainable and generalizable defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in the mechanistic interpretability of LLMs, our approach provides evidence for a shared activation geometry across jailbreak attacks in the representation space, a potential foundation for designing more interpretable and generalizable safeguards against attackers.
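A minimal sketch of the SAE-based screening idea, assuming the standard sparse-autoencoder setup from the mechanistic interpretability literature; the paper's actual architecture, layer choice, feature selection, and thresholds are not reproduced here, and names like jailbreak_feature_ids are illustrative.

```python
# Hypothetical sketch: screen prompts by checking whether SAE features
# previously associated with jailbreak themes fire on their activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with a ReLU bottleneck, trained to
    reconstruct hidden activations of one LLM layer (reconstruction
    loss plus an L1 penalty on the feature activations)."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))   # sparse feature activations
        return self.decoder(f), f

def flag_prompt(sae: SparseAutoencoder, h: torch.Tensor,
                jailbreak_feature_ids: list[int],
                threshold: float) -> bool:
    """Flag a prompt if any jailbreak-linked feature activates strongly
    on its activations h of shape (seq_len, d_model). The flat
    max-over-threshold rule here is an assumed guardrail, not the
    paper's decision procedure."""
    _, f = sae(h)
    score = f[:, jailbreak_feature_ids].max()
    return bool(score > threshold)
```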
llm transformer Université Paris-Saclay · CentraleSupélec