ML Security Papers

Latest papers

7 papers

survey arXiv Apr 28, 2026

Verification of Neural Networks (Lecture Notes)

Benedikt Bollig · Université Paris-Saclay · CNRS +1 more

Theoretical introduction to formal verification techniques for neural networks, covering feed-forward, recurrent, attention, and transformer architectures

Input Manipulation Attack vision nlp

benchmark arXiv Feb 18, 2026

Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection

Alexis Winter, Jean-Vincent Martini, Romaric Audigier et al. · Université Paris-Saclay

Unified benchmark of adversarial attacks on object detectors that reveals poor CNN-to-ViT attack transferability and identifies optimal adversarial training mixes

Input Manipulation Attack vision

benchmark arXiv Jan 30, 2026

AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange

Elif Nebioglu, Emirhan Bilgiç, Adrian Popescu · Independent Researcher · Sorbonne University +2 more

Proposes the INP-X benchmark, which shows that AI-generated image detectors rely on global VAE artifacts: their accuracy drops from 91% to chance level

Output Integrity Attack vision generative

benchmark arXiv Oct 28, 2025

PRIVET: Privacy Metric Based on Extreme Value Theory

Antoine Szatkownik, Aurélien Decelle, Beatriz Seoane et al. · Université Paris-Saclay · Universidad Complutense de Madrid +2 more

Proposes PRIVET, a sample-level metric using extreme value theory to detect training data memorization in generative models

Model Inversion Attack vision generative

defense FLLM Oct 16, 2025

PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models

Issam Seddik, Sami Souihi, Mohamed Tamaazousti et al. · Université Paris-Saclay · CEA LIST

Proposes the PoTS protocol, which catches backdoor injections in LLM training by auditing LM-Head sensitivity at each training step

Model Poisoning · Data Poisoning Attack nlp

defense arXiv Oct 7, 2025

Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique

Yanming Li, Cédric Eichler, Nicolas Anciaux et al. · Inria · INSA CVL +4 more

Embeds invisible Unicode watermarks in training documents to audit whether copyrighted text was used in LLM fine-tuning under black-box access

Output Integrity Attack nlp

defense arXiv Aug 22, 2025

ConceptGuard: Neuro-Symbolic Safety Guardrails via Sparse Interpretable Jailbreak Concepts

Darpan Aswal, Céline Hudelot · Université Paris-Saclay · CentraleSupélec

Defends LLMs against jailbreaks by using sparse autoencoders to identify interpretable internal activation concepts linked to attack themes

Prompt Injection nlp
