Latest papers

10 papers
benchmark arXiv Apr 1, 2026 · 5d ago

ClawSafety: "Safe" LLMs, Unsafe Agents

Bowen Wei, Yunbei Zhang, Jinhao Pan et al. · George Mason University · Tulane University +2 more

Benchmark of 120 prompt injection attacks on personal AI agents across skill files, emails, and web content

Prompt Injection Excessive Agency nlpmultimodal
PDF
defense arXiv Apr 1, 2026 · 5d ago

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Zikai Zhang, Rui Hu, Olivera Kotevska et al. · University of Nevada · Oak Ridge National Laboratory

Detects LLM jailbreak attacks using logit distributions over numerical tokens, achieving 22.66% ASR reduction with minimal overhead

Prompt Injection nlp
PDF
attack arXiv Mar 19, 2026 · 18d ago

Automated Membership Inference Attacks: Discovering MIA Signal Computations using LLM Agents

Toan Tran, Olivera Kotevska, Li Xiong · Emory University · Oak Ridge National Laboratory

LLM-agent framework that automatically discovers novel membership inference attack strategies, achieving 0.18 AUC improvement over existing MIAs

Membership Inference Attack
PDF
defense arXiv Feb 23, 2026 · 6w ago

SimLBR: Learning to Detect Fake Images by Learning to Detect Real Images

Aayush Dhakal, Subash Khanal, Srikumar Sastry et al. · Washington University in St. Louis · Oak Ridge National Laboratory

Proposes SimLBR, a latent blending regularization framework for AI-generated image detection with strong cross-generator generalization

Output Integrity Attack visiongenerative
PDF
defense arXiv Feb 12, 2026 · 7w ago

Community Concealment from Unsupervised Graph Learning-Based Clustering

Dalyapraz Manatova, Pablo Moriano, L. Jean Camp · Indiana University · Oak Ridge National Laboratory +1 more

Evades GNN community detection by perturbing graph edges and node features to conceal sensitive communities from unsupervised clustering

Input Manipulation Attack graph
PDF
attack arXiv Feb 6, 2026 · 8w ago

The Double-Edged Sword of Data-Driven Super-Resolution: Adversarial Super-Resolution Models

Haley Duba-Sullivan, Steven R. Young, Emma J. Reid · Oak Ridge National Laboratory

Trains adversarially-poisoned super-resolution models that silently cause downstream classifier misclassification without any input-level perturbations

Model Poisoning vision
PDF
tool arXiv Dec 21, 2025 · Dec 2025

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Peilu Hu, Zhenyuan Wei et al. · Independent Researcher · Ltd. +12 more

Automated red-teaming tool for LLMs using meta-prompt-guided adversarial generation, finding 3.9× more vulnerabilities than manual testing

Prompt Injection nlp
1 citations PDF
attack arXiv Nov 17, 2025 · Nov 2025

Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew

Farhin Farhad Riya, Shahinul Hoque, Jinyuan Stella Sun et al. · University of Tennessee · Oak Ridge National Laboratory

Federated learning poisoning attack that corrupts Grad-CAM saliency maps via color perturbations while preserving classification accuracy above 96%

Data Poisoning Attack visionfederated-learning
PDF
survey arXiv Oct 21, 2025 · Oct 2025

The Black Tuesday Attack: how to crash the stock market with adversarial examples to financial forecasting models

Thomas Hofweber, Jefrey Bergl, Ian Reyes et al. · University of North Carolina at Chapel Hill · Oak Ridge National Laboratory

Analyzes how adversarial input manipulations to ML financial forecasting models could trigger self-fulfilling, hard-to-detect stock market crashes

Input Manipulation Attack timeseries
PDF
defense arXiv Aug 12, 2025 · Aug 2025

Attacks and Defenses Against LLM Fingerprinting

Kevin Kurian, Ethan Holland, Sean Oesch · Oak Ridge National Laboratory

Improves LLM fingerprinting attacks with RL-optimized query selection and defends with semantic-preserving output filtering to hide model identity

Model Theft Model Theft nlp
PDF