Latest papers

13 papers
benchmark arXiv Apr 1, 2026 · 5d ago

ClawSafety: "Safe" LLMs, Unsafe Agents

Bowen Wei, Yunbei Zhang, Jinhao Pan et al. · George Mason University · Tulane University +2 more

Benchmark of 120 prompt injection attacks on personal AI agents across skill files, emails, and web content

Prompt Injection Excessive Agency nlp multimodal
PDF
benchmark arXiv Apr 1, 2026 · 5d ago

Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models

Weidi Luo, Xiaofei Wen, Tenghao Huang et al. · University of Georgia · University of California +3 more

Benchmark and guardrail for detecting jailbreak attacks that bypass LLM safety alignment in the food safety domain

Prompt Injection nlp
PDF Code
benchmark arXiv Mar 9, 2026 · 28d ago

Quantifying Memorization and Privacy Risks in Genomic Language Models

Alexander Nemecek, Wenbiao Li, Xiaoqian Jiang et al. · Case Western Reserve University · UTHealth +1 more

Multi-vector framework quantifying memorization, canary extraction, and membership inference risks across genomic language model architectures (sketch below)

Model Inversion Attack Membership Inference Attack nlp
PDF
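The framework above builds on standard privacy probes; for reference, a minimal loss-threshold membership-inference sketch looks like the following. The GPT-2 model and the threshold are stand-ins for illustration, not the paper's genomic LMs or calibration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model -- the paper studies genomic LMs, which this sketch does not load.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_loss(seq: str) -> float:
    """Average per-token negative log-likelihood of a sequence under the model."""
    ids = tok(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids=ids, labels=ids)
    return out.loss.item()

def is_member(seq: str, threshold: float = 2.5) -> bool:
    """Loss-threshold membership inference: an unusually low loss suggests the
    sequence was likely seen during training. The threshold is illustrative."""
    return sequence_loss(seq) < threshold

print(is_member("ACGTACGTACGTACGT"))  # example query on a toy sequence
```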
attack arXiv Feb 27, 2026 · 5w ago

Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Wai Tuck Wong, Jun Sun, Arunesh Sinha · Singapore Management University · Rutgers University

Crafts adversarial images that induce numerical instability in VLMs, degrading benchmark performance with minimal pixel perturbation (see the sketch below)

Input Manipulation Attack Prompt Injection vision multimodal nlp
PDF
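The one-liner glosses the attack recipe; a rough PGD-style sketch of the general idea (perturb pixels to push a chosen layer's activations toward extreme values) is below. The model, layer, and objective are placeholders, not the authors' construction.

```python
import torch

def instability_attack(model, layer, image, eps=8/255, steps=20, lr=1/255):
    """PGD-style perturbation that maximizes the magnitude of a chosen layer's
    activations -- a crude proxy for inducing numerical instability. `model`,
    `layer`, and the objective are illustrative placeholders."""
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(out=o))
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        model(image + delta)                    # hook captures the layer output
        loss = acts["out"].abs().max()          # push activations toward overflow
        loss.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()     # ascend the instability objective
            delta.clamp_(-eps, eps)             # stay within the pixel budget
            delta.grad = None
    handle.remove()
    return (image + delta).detach()
```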
defense arXiv Jan 12, 2026 · 12w ago

Self-Creating Random Walks for Decentralized Learning under Pac-Man Attacks

Xingran Chen, Parimal Parag, Rohit Bhagat et al. · Rutgers University · Indian Institute of Science

Defends decentralized random-walk ML against Pac-Man Byzantine nodes that stealthily terminate walks, halting learning without triggering alarms

Data Poisoning Attack federated-learning
PDF
benchmark arXiv Nov 26, 2025 · Nov 2025

The Double-Edged Nature of the Rashomon Set for Trustworthy Machine Learning

Ethan Hsu, Harry Chen, Chudi Zhong et al. · Duke University · MIT +2 more

Analyzes how Rashomon set diversity improves adversarial robustness but increases training data leakage via a proven robustness-privacy trade-off

Input Manipulation Attack Model Inversion Attack tabular
PDF
benchmark arXiv Oct 31, 2025 · Oct 2025

Characterizing Selective Refusal Bias in Large Language Models

Adel Khorramrouz, Sharon Levy · Rutgers University

Reveals demographic-selective bias in LLM safety guardrails and exploits it via indirect jailbreak attacks on refused groups

Prompt Injection nlp
1 citation PDF
benchmark arXiv Oct 5, 2025 · Oct 2025

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Rui Wu, Yihao Quan, Zeru Shi et al. · Rutgers University

Identifies 'consequence-blindness' in LLMs, benchmarks jailbreak and over-refusal failures across semantic/outcome risk mismatches, and fine-tunes defenses with consequence-aware data

Prompt Injection nlp
1 citation (1 influential) PDF Code
attack arXiv Aug 8, 2025 · Aug 2025

ScamAgents: How AI Agents Can Simulate Human-Level Scam Calls

Sanket Badhe · Rutgers University

Autonomous LLM agent bypasses safety guardrails via multi-turn prompt decomposition and roleplay to generate realistic scam call scripts

Prompt Injection Excessive Agency nlp multimodal
PDF
attack arXiv Aug 6, 2025 · Aug 2025

Automatic LLM Red Teaming

Roman Belaire, Arunesh Sinha, Pradeep Varakantham · Singapore Management University · Rutgers University

Trains an RL agent to conduct multi-turn jailbreak attacks on LLMs by formalizing red teaming as a hierarchical MDP (illustrative skeleton below)

Prompt Injection nlp reinforcement-learning
PDF
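To make the "red teaming as a hierarchical MDP" framing concrete, here is a minimal environment skeleton in that spirit; the state/action split, judge-based reward, and termination rule are assumptions for illustration, not the authors' exact design.

```python
class RedTeamEnv:
    """Multi-turn red teaming as an MDP: state = dialogue history,
    action = (high-level strategy, attacker utterance), reward = judge score.
    `target_llm` and `judge` are placeholder callables."""

    def __init__(self, target_llm, judge, max_turns=5):
        self.target_llm = target_llm   # maps dialogue history -> target response
        self.judge = judge             # maps response -> harmfulness score in [0, 1]
        self.max_turns = max_turns

    def reset(self):
        self.history = []
        return self.history

    def step(self, strategy, utterance):
        # The high-level policy picks a tactic (e.g. roleplay); the low-level
        # utterance is the concrete attacker message conditioned on it.
        self.history.append(("attacker", strategy, utterance))
        response = self.target_llm(self.history)
        self.history.append(("target", None, response))
        reward = self.judge(response)                     # jailbreak success signal
        done = reward > 0.9 or len(self.history) >= 2 * self.max_turns
        return self.history, reward, done
```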
defense arXiv Aug 1, 2025 · Aug 2025

Random Walk Learning and the Pac-Man Attack

Xingran Chen, Parimal Parag, Rohit Bhagat et al. · Rutgers University · Indian Institute of Science

Defends decentralized RW-SGD against stealthy nodes that kill random walks, using the AC algorithm to duplicate walks and keep learning alive

Data Poisoning Attack federated-learning
PDF
defense arXiv Jan 5, 2025 · Jan 2025

A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models

Yinpeng Cai, Lexin Li, Linjun Zhang · Peking University · University of California +1 more

Watermarks LLM-generated text and uses hypothesis testing to detect whether another LLM was trained on copyrighted outputs (illustrative detector sketch below)

Output Integrity Attack nlp
3 citations PDF
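For intuition, a generic green-list watermark detector of the kind such a hypothesis test builds on might look like the following; the keyed hash split and z-statistic are a common construction and may differ from the paper's scheme.

```python
import hashlib
from math import sqrt

GAMMA = 0.5  # expected fraction of "green" tokens absent a watermark

def in_green_list(prev_token: str, token: str) -> bool:
    """Pseudorandom green/red split keyed on the previous token
    (generic green-list watermark; the paper's construction may differ)."""
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return h[0] < 256 * GAMMA

def watermark_z_score(tokens: list[str]) -> float:
    """One-sided z-test: if a suspect model's output contains far more green
    tokens than chance predicts, it was plausibly trained on watermarked text."""
    hits = sum(in_green_list(a, b) for a, b in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / sqrt(n * GAMMA * (1 - GAMMA))

print(watermark_z_score("the quick brown fox jumps over the lazy dog".split()))
```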
defense arXiv Jan 5, 2025 · Jan 2025

Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense

Yang Ouyang, Hengrui Gu, Shuhang Lin et al. · North Carolina State University · Rutgers University +4 more

Defends LLMs against jailbreaks by identifying harmful-token-generating layers and patching them via adversarial unlearning (illustrative layer probe below)

Prompt Injection nlp
PDF Code
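As a rough illustration of the "self-exposure" step, a logit-lens style probe can score how strongly each layer promotes an affirmative token; the model name and internal attribute paths below are assumptions about a Llama-style Hugging Face model, not the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder Llama-style model; `model.model.norm` and `model.lm_head`
# are architecture assumptions, not the paper's code.
name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def layer_affirmative_scores(prompt: str, affirm: str = "Sure") -> list[float]:
    """Project each layer's last hidden state through the final norm and the
    unembedding, and record the logit of an affirmative token. Layers with
    high scores are candidates for patching."""
    ids = tok(prompt, return_tensors="pt").input_ids
    affirm_id = tok(affirm, add_special_tokens=False).input_ids[0]
    scores = []
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
        for hidden in out.hidden_states[1:]:              # one entry per layer
            logits = model.lm_head(model.model.norm(hidden[:, -1]))
            scores.append(logits[0, affirm_id].item())
    return scores
```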