Latest papers

7 papers
attack arXiv Jan 23, 2026

LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva et al. · University of Sheffield

Attacks fact-checking classifiers with LLM-generated persuasive claim rewrites, using 15 persuasion techniques to collapse accuracy to near zero

Input Manipulation Attack · nlp
PDF
defense arXiv Dec 3, 2025

Towards Irreversible Machine Unlearning for Diffusion Models

Xun Yuan, Zilong Zhao, Jiayu Li et al. · National University of Singapore · Betterdata +1 more

Attacks diffusion-model unlearning by fine-tuning on auxiliary data to restore suppressed content, then defends with memorization-based unlearning

Output Integrity Attack · vision · generative
2 citations · PDF
defense arXiv Nov 27, 2025

Rethinking Cross-Generator Image Forgery Detection through DINOv3

Zhenglin Huang, Jason Li, Haiquan Wen et al. · University of Liverpool · Nanyang Technological University +3 more

Discovers that a frozen DINOv3 detects cross-generator image forgeries via low-frequency cues; proposes a training-free token-ranking baseline

Output Integrity Attack · vision · generative
PDF
benchmark arXiv Nov 8, 2025

Can Fine-Tuning Erase Your Edits? On the Fragile Coexistence of Knowledge Editing and Adaptation

Yinjie Cheng, Paul Youssef, Christin Seifert et al. · University of Sheffield · Philipps-Universität Marburg

Benchmarks whether malicious LLM knowledge edits survive fine-tuning, finding persistent safety risks across 232 configurations

Transfer Learning Attack · Model Poisoning · nlp
1 citation · PDF
benchmark arXiv Oct 14, 2025

A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

João A. Leite, Arnav Arora, Silvia Gargova et al. · University of Sheffield · University of Copenhagen +2 more

Red-teams 8 LLMs with persona-targeted disinformation prompts across 4 languages, finding that jailbreak rates rise by up to 10 percentage points with simple personalisation

Prompt Injection · nlp
1 citation · 1 influential · PDF
defense arXiv Oct 8, 2025

PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing

Anthony Hughes, Vasisht Duddu, N. Asokan et al. · University of Sheffield · University of Waterloo

Defends LLMs against PII extraction attacks by identifying and surgically patching memorization circuits, reducing recall by 65%

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF
tool EMNLP Sep 24, 2025

Unmasking Fake Careers: Detecting Machine-Generated Career Trajectories via Multi-layer Heterogeneous Graphs

Michiharu Yamashita, Thanh Tran, Delvin Ce Zhang et al. · The Pennsylvania State University · Amazon +1 more

Proposes a graph-based system for detecting LLM-generated fake resume trajectories, outperforming text-based detectors by up to 85%

Output Integrity Attack · nlp · graph
3 citations · PDF · Code