Latest papers

24 papers
defense arXiv Mar 27, 2026 · 10d ago

Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models

Zhuan Shi, Alireza Dehghanpour Farashah, Rik de Vries et al. · McGill University · Mila - Québec AI Institute +1 more

Training-free concept erasure for diffusion models that removes unwanted concepts while preserving semantically related neighboring concepts

Output Integrity Attack · vision · generative
PDF
defense arXiv Mar 1, 2026 · 5w ago

Tracking Capabilities for Safer Agents

Martin Odersky, Yaoyu Zhao, Yichen Xu et al. · EPFL

Defends LLM agents from prompt injection and data exfiltration using Scala capability-tracking type system as a safety harness

Excessive Agency · Insecure Plugin Design · nlp
PDF
defense arXiv Feb 20, 2026 · 6w ago

On the Adversarial Robustness of Discrete Image Tokenizers

Rishika Bhagwatkar, Irina Rish, Nicolas Flammarion et al. · Mila - Québec AI Institute · EPFL +1 more

Attacks discrete image tokenizers with adversarial perturbations and defends via unsupervised adversarial training across multimodal tasks

Input Manipulation Attack · vision · multimodal
PDF · Code
benchmark arXiv Feb 18, 2026 · 6w ago

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL +4 more

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection · Excessive Agency · nlp
PDF
defense arXiv Feb 11, 2026 · 7w ago

Optimizing Agent Planning for Security and Autonomy

Aashish Kolluri, Rishi Sharma, Manuel Costa et al. · Microsoft · EPFL +1 more

Defends AI agents against indirect prompt injection via security-aware planning that maximizes autonomous operation while minimizing reliance on human oversight

Prompt Injection · Excessive Agency · nlp
PDF
defense arXiv Feb 3, 2026 · 8w ago

Byzantine Machine Learning: MultiKrum and an optimal notion of robustness

Gilles Bareilles, Wassim Bouaziz, Julien Fageot et al. · CMAP École Polytechnique · Mistral AI +1 more

Proves MultiKrum's Byzantine robustness with tight bounds, introducing κ* as an optimal metric for federated aggregation rule security

Data Poisoning Attack · federated-learning
PDF
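The Multi-Krum rule analyzed above can be sketched in a few lines: each update is scored by its summed squared distance to its closest peers, and the lowest-scoring updates are averaged. This is a minimal pure-Python illustration of the aggregation rule, not the paper's implementation; the variable names and the example values are made up.

```python
# Toy sketch of Multi-Krum aggregation in pure Python.
# All names and values here are illustrative, not taken from the paper.

def multikrum(updates, f, m):
    """Average the m updates with the lowest Krum scores.

    updates: list of gradient vectors (lists of floats)
    f: assumed number of Byzantine workers
    m: number of updates to keep (the analysis assumes n large enough
       relative to f, e.g. n - f - 2 >= 1)
    """
    n = len(updates)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i, u in enumerate(updates):
        dists = sorted(sqdist(u, v) for j, v in enumerate(updates) if j != i)
        # Krum score: sum of squared distances to the n - f - 2 closest peers
        scores.append((sum(dists[: n - f - 2]), i))

    selected = [updates[i] for _, i in sorted(scores)[:m]]
    dim = len(updates[0])
    return [sum(u[k] for u in selected) / m for k in range(dim)]

# Four honest workers cluster near [1.0, 1.0]; one Byzantine worker
# sends an extreme update, which Multi-Krum excludes.
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.05, 1.0], [100.0, -100.0]]
agg = multikrum(updates, f=1, m=2)
```

A plain mean of these updates would be dragged far from [1.0, 1.0] by the outlier; the Krum score excludes it because its distances to all peers are enormous.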
benchmark arXiv Feb 2, 2026 · 9w ago

Membership Inference Attacks from Causal Principles

Mathieu Even, Clément Berenfeld, Linus Bleistein et al. · INRIA · EPFL

Reframes MIA evaluation as causal inference, identifying and correcting systematic biases in one-run and zero-run privacy protocols

Membership Inference Attack · nlp
PDF
benchmark arXiv Dec 5, 2025 · Dec 2025

Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models

Ana-Maria Cretu, Klim Kireev, Amro Abdalla et al. · EPFL · MPI-SP +2 more

Evaluates T2I concept filtering defenses against CSAM, showing prompting and fine-tuning attacks bypass even near-perfect child image filtering

Data Poisoning Attack · Transfer Learning Attack · vision · generative
PDF
benchmark arXiv Nov 30, 2025 · Nov 2025

Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models

Cen Lu, Yung-Chen Tang, Andrea Cavallaro · EPFL · Idiap Research Institute

Identifies minimal sets of critical neurons in VLMs whose masking causes catastrophic collapse, exposing extreme weight-manipulation vulnerability

Model Poisoning · multimodal · vision · nlp
PDF
attack arXiv Nov 25, 2025 · Nov 2025

Data Augmentation Techniques to Reverse-Engineer Neural Network Weights from Input-Output Queries

Alexander Beiser, Flavio Martinelli, Wulfram Gerstner et al. · TU Wien · EPFL

Proposes specialized data augmentation strategies that enable black-box extraction of neural network weights at 100× parameter-to-data scale

Model Theft · vision
PDF · Code
defense arXiv Nov 1, 2025 · Nov 2025

Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection

Daichi Zhang, Tong Zhang, Jianmin Bao et al. · EPFL · Microsoft +1 more

Detects AI-generated fake images by exploiting hierarchical image-text misalignment in CLIP's visual-language space

Output Integrity Attack · vision · multimodal
PDF
defense arXiv Nov 1, 2025 · Nov 2025

Enhancing Frequency Forgery Clues for Diffusion-Generated Image Detection

Daichi Zhang, Tong Zhang, Shiming Ge et al. · EPFL · Chinese Academy of Sciences +1 more

Frequency-domain filter enhances forgery clues in Fourier spectrum to detect diffusion-generated images with better generalization

Output Integrity Attack · vision · generative
PDF
attack arXiv Oct 17, 2025 · Oct 2025

Language Models are Injective and Hence Invertible

Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi et al. · EPFL · Archimedes/Athena RC +3 more

Proves LLMs are injective and introduces SipIt to exactly reconstruct private input text from hidden activations

Model Inversion Attack · Sensitive Information Disclosure · nlp
15 citations · 3 influential · PDF
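The inversion idea behind this result can be illustrated with a toy: if the hidden state is an injective function of the input prefix, an attacker who observes the activations can recover the prefix exactly, one token at a time, by exhaustive matching. The sketch below uses a stand-in hash instead of a real language model; it is not SipIt itself, and all names are illustrative.

```python
import hashlib

# Toy stand-in for an LM: a deterministic, effectively injective map
# from an input prefix to a "hidden activation".
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]

def hidden_state(prefix):
    """Stand-in for the model's last-layer activation on a token prefix."""
    return hashlib.sha256(" ".join(prefix).encode()).hexdigest()

def invert(target_states):
    """Greedy exact recovery: at each step, try every vocabulary token and
    keep the one whose hidden state matches the observed activation."""
    recovered = []
    for observed in target_states:
        for tok in VOCAB:
            if hidden_state(recovered + [tok]) == observed:
                recovered.append(tok)
                break
    return recovered

secret = ["the", "cat", "sat"]
# Activations the attacker is assumed to observe, one per prefix length.
leaked = [hidden_state(secret[: i + 1]) for i in range(len(secret))]
recovered = invert(leaked)
```

The point of the toy is the cost model: with injectivity, recovery needs only vocabulary-size many forward passes per token, which is why leaked activations amount to leaked text.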
defense arXiv Oct 16, 2025 · Oct 2025

Backdoor Unlearning by Linear Task Decomposition

Amel Abdelraheem, Alessandro Favero, Gerome Bovet et al. · EPFL · armasuisse

Removes backdoors from CLIP foundation models via weight-space task negation, retaining 96% clean accuracy with near-perfect unlearning

Model Poisoning · vision · multimodal
PDF
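The weight-space negation step can be sketched with task-arithmetic on flat weight vectors: treat the delta introduced by poisoned fine-tuning as a "backdoor task vector" and step against it. This is a minimal toy in the spirit of task arithmetic, not the paper's linear task decomposition; the weights and the coefficient are made up.

```python
# Toy sketch of weight-space task negation; illustrative values only.

def task_vector(finetuned, pretrained):
    """The 'task' learned during (poisoned) fine-tuning, as a weight delta."""
    return [f - p for f, p in zip(finetuned, pretrained)]

def negate_task(weights, tau, alpha=1.0):
    """Remove the task by stepping against its direction in weight space."""
    return [w - alpha * t for w, t in zip(weights, tau)]

pretrained = [0.5, -0.2, 1.0]
backdoored = [0.8, 0.1, 0.7]   # weights after fine-tuning on poisoned data
tau = task_vector(backdoored, pretrained)
cleaned = negate_task(backdoored, tau, alpha=1.0)
# With alpha = 1 the full delta is removed, recovering the pretrained
# weights in this toy; in practice alpha trades off unlearning strength
# against clean accuracy.
```

In the real setting the delta mixes backdoor and benign task directions, which is why isolating the backdoor component (rather than negating the whole delta) is the interesting part.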
attack arXiv Oct 13, 2025 · Oct 2025

RAG-Pull: Imperceptible Attacks on RAG Systems for Code Generation

Vasilije Stambolic, Aritra Dhar, Lukas Cavigelli · EPFL · Huawei Technologies Switzerland AG

Inserts hidden UTF characters into RAG queries and code repositories to redirect retrieval toward attacker-controlled vulnerable code snippets

Input Manipulation Attack · Prompt Injection · nlp
PDF
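The core trick, invisible Unicode characters that make two visually identical strings compare unequal, is easy to demonstrate, along with the obvious normalization countermeasure. This is an illustrative sketch, not the paper's code; the character set and function names are assumptions.

```python
# Zero-width characters render as nothing in most UIs, so a poisoned
# string looks identical to the clean one while comparing unequal;
# retrieval keyed on exact or near-exact text can then be steered
# toward attacker-indexed content.
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

clean_query = "parse the config file"
poisoned_query = "parse the" + ZWNJ + " config file"

# Visually identical, but not equal as strings.
assert clean_query != poisoned_query

# Defence-side hygiene: strip a blacklist of invisible code points
# before retrieval. (Illustrative set, not exhaustive.)
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def strip_invisible(text):
    return "".join(ch for ch in text if ch not in INVISIBLE)
```

Normalizing both queries and indexed documents this way removes the exact-match mismatch, though a robust deployment would cover the full Unicode format category rather than a hand-picked set.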
attack arXiv Oct 10, 2025 · Oct 2025

Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols

Mikhail Terekhov, Alexander Panfilov, Daniil Dzenhaliou et al. · MATS · EPFL +4 more

Embeds prompt injections in LLM agent outputs to subvert AI control monitors, collapsing safety-usefulness tradeoffs across protocols

Prompt Injection · Excessive Agency · nlp
5 citations · PDF
defense arXiv Oct 9, 2025 · Oct 2025

Robust and Efficient Collaborative Learning

Abdellah El Mrini, Sadegh Farhadkhan, Rachid Guerraoui · EPFL

Defends decentralized collaborative learning against Byzantine adversaries using epidemic pull-based aggregation that scales as O(n log n)

Data Poisoning Attack · federated-learning
PDF
defense arXiv Oct 3, 2025 · Oct 2025

Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

Fatmazohra Rezkellah, Ramzi Dakhmouche · Université Paris-Dauphine · EPFL +1 more

Defends LLMs against jailbreaking and unlearns sensitive content via minimal constrained weight interventions, no classifier required

Prompt Injection · Sensitive Information Disclosure · nlp
2 citations · PDF
defense arXiv Sep 30, 2025 · Sep 2025

Robust Federated Inference

Akash Dhasade, Sadegh Farhadkhani, Rachid Guerraoui et al. · EPFL · University of Copenhagen +1 more

Defends federated inference aggregators against Byzantine clients using DeepSet adversarial training, beating existing methods by up to 22%

Data Poisoning Attack · federated-learning · vision · nlp
1 citation · PDF
attack arXiv Sep 19, 2025 · Sep 2025

Cuckoo Attack: Stealthy and Persistent Attacks Against AI-IDE

Xinpeng Liu, Junming Liu, Peiyu Liu et al. · Zhejiang University · EPFL

Hijacks LLM coding agents by embedding malicious payloads in config files, achieving persistent stealthy execution across nine AI-IDEs

Prompt Injection · Insecure Plugin Design · nlp
PDF