Latest papers

16 papers
survey arXiv Mar 31, 2026 · 6d ago

Security in LLM-as-a-Judge: A Comprehensive SoK

Aiman Almasoud, Antony Anju, Marco Arazzi et al. · arXiv · University of Pavia +1 more

First comprehensive survey organizing 45 studies on security risks of LLM-as-a-Judge systems, including adversarial manipulation and evaluation vulnerabilities

Prompt Injection nlp
PDF
attack arXiv Mar 10, 2026 · 27d ago

CLIOPATRA: Extracting Private Information from LLM Insights

Meenatchi Sundaram Muthu Selva Annamalai, Emiliano De Cristofaro, Peter Kairouz · arXiv · University College London +1 more

Attacks Anthropic's Clio LLM analytics platform by injecting crafted chats to extract private medical history of target users, bypassing layered privacy protections

Sensitive Information Disclosure Prompt Injection nlp
PDF Code
defense arXiv Mar 2, 2026 · 5w ago

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Xiaoyi Pang, Xuanyi Hao, Pengyu Liu et al. · arXiv · The Hong Kong University of Science and Technology +1 more

Detects backdoor and prompt injection attacks in black-box LLMs by monitoring token entropy lulls during generation

Model Poisoning Prompt Injection nlp
PDF Code
defense arXiv Feb 21, 2026 · 6w ago

Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs

Chengwei Xia, Fan Ma, Ruijie Quan et al. · arXiv · Lanzhou University +2 more

Adversarially-optimized trigger images that verify MLLM copyright by eliciting ownership text only in fine-tuned derivatives

Model Theft multimodal nlp
PDF
defense arXiv Feb 12, 2026 · 7w ago

ANML: Attribution-Native Machine Learning with Guaranteed Robustness

Oliver Zahn, Matt Beton, Simran Chana · arXiv · Independent Researcher +1 more

Defends against data poisoning via contributor-reputation-weighted training, outperforming Byzantine-robust baselines under joint credential-faking and gradient-alignment attacks

Data Poisoning Attack tabular federated-learning
PDF
attack arXiv Dec 9, 2025 · Dec 2025

Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward

Sampriti Soor, Suklav Ghosh, Arijit Sur · arXiv · Indian Institute of Technology Guwahati

RL-trained adversarial suffixes degrade LLM classification accuracy using PPO and calibrated cross-entropy, outperforming gradient-based triggers in transferability

Input Manipulation Attack nlp
PDF
benchmark arXiv Dec 1, 2025 · Dec 2025

Securing Large Language Models (LLMs) from Prompt Injection Attacks

Omar Farooq Khan Suri, John McCrae · arXiv · University of Galway

Evaluates JATMO fine-tuning defense against HOUYI genetic prompt injection attacks on LLaMA 2 and Qwen, finding incomplete protection

Prompt Injection nlp
PDF
defense arXiv Nov 20, 2025 · Nov 2025

PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization

Huseein Jawad, Nicolas Brunel · arXiv · Capgemini Invent +2 more

Defends LLM system prompts against extraction attacks by appending optimized textual shields via black-box LLM-guided optimization

Sensitive Information Disclosure Prompt Injection nlp
PDF Code
attack arXiv Nov 13, 2025 · Nov 2025

Enhanced Privacy Leakage from Noise-Perturbed Gradients via Gradient-Guided Conditional Diffusion Models

Jiayang Meng, Tao Huang, Hong Chen et al. · arXiv · Renmin University of China +1 more

Diffusion model-guided gradient inversion attack that reconstructs private images from noise-perturbed FL gradients, bypassing a common defense

Model Inversion Attack vision federated-learning
PDF
tool arXiv Oct 21, 2025 · Oct 2025

Robustness Verification of Graph Neural Networks Via Lightweight Satisfiability Testing

Chia-Hsuan Lu, Tony Tan, Michael Benedikt · arXiv · University of Oxford +1 more

Verifies GNN robustness against structural adversarial perturbations using polynomial-time partial SAT solvers instead of MIP

Input Manipulation Attack graph
1 citation PDF Code
defense arXiv Oct 20, 2025 · Oct 2025

Fair and Interpretable Deepfake Detection in Videos

Akihito Yoshii, Ryosuke Sonoda, Ramya Srinivasan · arXiv · Fujitsu Limited

Fairness-aware deepfake video detector combining temporal clustering, concept explainability, and frequency-domain augmentation to reduce demographic bias

Output Integrity Attack vision
PDF
benchmark arXiv Oct 1, 2025 · Oct 2025

Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang et al. · arXiv · IDEAS Research Institute +3 more

Benchmarks black-box and white-box techniques for auditing LLMs that secretly apply but deny hidden knowledge

Sensitive Information Disclosure Prompt Injection nlp
8 citations 2 influential PDF Code
defense arXiv Sep 29, 2025 · Sep 2025

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

Yichi Zhang, Yue Ding, Jingwen Yang et al. · arXiv · Shanghai Qi Zhi Institute +3 more

Defends Large Reasoning Models against jailbreaks by aligning CoT safety via process-supervised preference optimization with corrective interventions

Prompt Injection nlp
2 citations 1 influential PDF
defense arXiv Sep 17, 2025 · Sep 2025

Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Benjamin Sterling, Yousef El-Laham, Mónica F. Bugallo · arXiv · Stony Brook University

Defends diffusion models against membership inference attacks using critically-damped higher-order Langevin dynamics to corrupt sensitive training data earlier in the diffusion process

Membership Inference Attack generative audio
PDF
benchmark arXiv Aug 28, 2025 · Aug 2025

FakeParts: a New Family of AI-Generated DeepFakes

Ziyi Liu, Firas Gabetni, Awais Hussain Sani et al. · arXiv · Institut Polytechnique de Paris

Benchmark dataset of 81K partial deepfake videos exposing critical blind spots in state-of-the-art deepfake detectors

Output Integrity Attack vision generative
PDF
defense arXiv Aug 8, 2025 · Aug 2025

Quantifying Conversation Drift in MCP via Latent Polytope

Haoran Shi, Hongwei Yao, Shuo Shao et al. · arXiv · Zhejiang University +3 more

Defends LLM-MCP tool integrations against indirect prompt injection by detecting adversarial conversation drift in latent polytope space

Insecure Plugin Design Prompt Injection nlp
PDF