Latest papers

6 papers
benchmark arXiv Feb 18, 2026

Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection

Alexis Winter, Jean-Vincent Martini, Romaric Audigier et al. · Université Paris-Saclay

Unified benchmark of adversarial attacks on object detectors, revealing poor CNN-to-ViT attack transferability and identifying effective adversarial training mixes

Input Manipulation Attack · vision
PDF
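The adversarial training strategies this entry benchmarks can be illustrated with a toy FGSM loop. Everything below (the logistic model, `fgsm`, `train`, the 50/50 clean/adversarial mix, and all hyperparameters) is an illustrative stand-in, not the paper's detector setup:

```python
import math

def predict(w, b, x):
    """Logistic model p(y=1 | x) on a 2-d toy input."""
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def fgsm(w, b, x, y, eps=0.3):
    """Fast Gradient Sign Method: step x by eps along sign(dLoss/dx)."""
    g = predict(w, b, x) - y                      # dL/dz for cross-entropy
    return [xi + eps * (1.0 if g * wi > 0 else -1.0) for xi, wi in zip(x, w)]

def train(data, mix=True, epochs=200, lr=0.5, eps=0.3):
    """Adversarial training: each update uses a 50/50 clean/FGSM batch
    when mix is True, clean examples only otherwise."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            batch = [(x, y), (fgsm(w, b, x, y, eps), y)] if mix else [(x, y)]
            for xb, yb in batch:
                g = predict(w, b, xb) - yb
                w = [wi - lr * g * xi for wi, xi in zip(w, xb)]
                b -= lr * g
    return w, b

data = [([1.0, 1.0], 1), ([-1.0, -1.0], 0)]
w, b = train(data)
x_adv = fgsm(w, b, [1.0, 1.0], 1)                 # attack the trained model
assert predict(w, b, x_adv) > 0.5                 # perturbed input still correct
```

The benchmark's question is which mix of clean and adversarial batches (the `mix` switch above, in the simplest possible form) buys the most robustness per unit of clean accuracy.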
benchmark arXiv Jan 30, 2026

AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange

Elif Nebioglu, Emirhan Bilgiç, Adrian Popescu · Independent Researcher · Sorbonne University +2 more

Proposes the INP-X benchmark, revealing that AI-image detectors rely on global VAE artifacts: accuracy collapses from 91% to chance level

Output Integrity Attack · vision · generative
PDF Code
benchmark arXiv Oct 28, 2025

PRIVET: Privacy Metric Based on Extreme Value Theory

Antoine Szatkownik, Aurélien Decelle, Beatriz Seoane et al. · Université Paris-Saclay · Universidad Complutense de Madrid +2 more

Proposes PRIVET, a sample-level metric using extreme value theory to detect training data memorization in generative models

Model Inversion Attack · vision · generative
PDF
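The idea behind a sample-level, extreme-value-theoretic memorization test can be sketched on 1-D toy data: model the lower tail of held-out nearest-neighbour distances and flag generated samples that sit implausibly deep in that tail. All names and the exponential tail fit below are illustrative assumptions, not PRIVET's actual estimator:

```python
import math

def nn_dist(x, reference):
    """Nearest-neighbour distance of a 1-D point to a reference set."""
    return min(abs(x - r) for r in reference)

def looks_memorized(gen_x, train_set, holdout, alpha=0.05):
    """Toy sample-level test: is gen_x closer to the training set than
    extreme value theory says an independent sample plausibly gets?
    The k smallest held-out distances are fitted with an exponential
    tail (a GPD with shape 0), the simplest peaks-over-threshold model."""
    d = nn_dist(gen_x, train_set)
    mins = sorted(nn_dist(h, train_set) for h in holdout)
    k = max(2, len(mins) // 5)
    u = mins[k - 1]                               # tail threshold
    sigma = sum(u - m for m in mins[:k]) / k or 1e-12
    if d >= u:
        return False                              # not in the extreme tail
    p = (k / len(mins)) * math.exp(-(u - d) / sigma)
    return p < alpha                              # implausibly close => flag

train_set = [float(i) for i in range(100)]
holdout = [i + 0.05 * (1 + i % 10) for i in range(100)]  # varied distances
assert looks_memorized(25.0001, train_set, holdout)      # near-copy flagged
assert not looks_memorized(37.41, train_set, holdout)    # typical sample passes
```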
defense FLLM Oct 16, 2025

PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models

Issam Seddik, Sami Souihi, Mohamed Tamaazousti et al. · Université Paris-Saclay · CEA LIST

Proposes PoTS protocol to catch backdoor injections in LLM training by auditing LM-Head sensitivity at each training step

Model Poisoning · Data Poisoning Attack · nlp
PDF
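The per-step auditing idea can be sketched as a replay-and-check protocol: the trainer logs a digest of the LM head after every step, and an auditor replays the steps and pinpoints the first one whose digest diverges. This hash-based replay is a simplified stand-in for the paper's sensitivity audit; all names are illustrative:

```python
import hashlib

def train_step(w, grad):
    """One deterministic SGD step on a toy 1-parameter 'LM head'."""
    return round(w - 0.1 * grad, 10)

def digest(w):
    return hashlib.sha256(repr(w).encode()).hexdigest()

def run(grads, tamper_at=None):
    """Train and log a per-step digest; optionally inject a hidden edit."""
    w, log = 0.0, []
    for t, g in enumerate(grads):
        w = train_step(w, g)
        if t == tamper_at:
            w += 0.5                  # stealthy backdoor-style weight edit
        log.append(digest(w))
    return log

def audit(grads, log):
    """Replay every step; return the first step whose digest mismatches."""
    w = 0.0
    for t, g in enumerate(grads):
        w = train_step(w, g)
        if digest(w) != log[t]:
            return t
    return None

grads = [0.3, -0.2, 0.1]
assert audit(grads, run(grads)) is None             # honest run verifies
assert audit(grads, run(grads, tamper_at=1)) == 1   # injection caught at step 1
```

The point of auditing at step granularity is localization: a single end-of-training check would only say "something changed", while the per-step log names the exact step where the injection happened.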
defense arXiv Oct 7, 2025

Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique

Yanming Li, Cédric Eichler, Nicolas Anciaux et al. · INRIA · INSA CVL +4 more

Embeds invisible Unicode watermarks in training documents to audit whether copyrighted text was used in LLM fine-tuning under black-box access

Output Integrity Attack · nlp
PDF
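The invisible-Unicode embedding can be sketched with zero-width characters: an owner identifier is encoded as zero-width space/non-joiner bits appended to the text, which renders identically but survives verbatim copying into a training corpus. Function names and the two-character alphabet are illustrative, not the paper's scheme:

```python
# Zero-width characters: invisible when rendered, preserved by copy/paste.
ZW = {"0": "\u200b", "1": "\u200c"}   # zero-width space / non-joiner
REV = {v: k for k, v in ZW.items()}

def embed(text: str, owner_id: str) -> str:
    """Append owner_id as zero-width bits; the visible text is unchanged."""
    bits = "".join(f"{ord(c):08b}" for c in owner_id)
    return text + "".join(ZW[b] for b in bits)

def extract(text: str) -> str:
    """Recover the hidden identifier from a watermarked document."""
    bits = "".join(REV[c] for c in text if c in REV)
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))

marked = embed("Some copyrighted paragraph.", "ORG42")
assert marked != "Some copyrighted paragraph."   # carrier was modified
assert extract(marked) == "ORG42"                # identifier recoverable
```

Under black-box access, the audit side then probes the fine-tuned model for behaviour conditioned on the watermark rather than reading it back directly; the snippet only covers the text-preserving embedding half.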
defense arXiv Aug 22, 2025

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

Darpan Aswal, Céline Hudelot · Université Paris-Saclay · CentraleSupélec

Defends LLMs against jailbreaks by using sparse autoencoders to identify interpretable internal activation concepts linked to attack themes

Prompt Injection · nlp
PDF
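The neuro-symbolic guardrail pattern can be sketched in two pieces: a sparse encoder maps an internal activation to interpretable features, and a symbolic rule blocks inputs that activate features linked to jailbreak themes. The weights, feature indices, and threshold below are made-up toy values, not ConceptGuard's learned autoencoder:

```python
def sae_encode(x, W, b):
    """One-layer sparse encoder: ReLU(W @ x + b), in plain Python."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

JAILBREAK_FEATURES = {1}      # hypothetical indices of attack-linked concepts
THRESHOLD = 0.5

def allow(x, W, b):
    """Symbolic rule over the sparse code: block if any flagged concept fires."""
    feats = sae_encode(x, W, b)
    return all(feats[j] <= THRESHOLD for j in JAILBREAK_FEATURES)

W = [[1.0, 0.0], [0.0, 1.0]]  # 2 features over a 2-d activation vector
b = [0.0, 0.0]
assert allow([0.9, 0.1], W, b)        # benign: flagged feature stays low
assert not allow([0.1, 0.9], W, b)    # jailbreak-linked feature fires
```

Because the rule operates on named sparse features rather than raw logits, a blocked prompt can be explained by which concept fired, which is the interpretability selling point of this family of defenses.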