Latest papers

7 papers
attack · arXiv · Feb 3, 2026

Controlling Output Rankings in Generative Engines for LLM-based Search

Haibo Jin, Ruoxi Chen, Peiyan Zhang et al. · University of Illinois at Urbana-Champaign · Starc Institute +2 more

Injects crafted content into product pages to manipulate LLM-based search rankings with 91% promotion success rate

Input Manipulation Attack · Prompt Injection · nlp
PDF
attack · arXiv · Jan 29, 2026

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework that extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural-language probing strategies

Sensitive Information Disclosure · Prompt Injection · nlp
PDF
defense · arXiv · Jan 20, 2026

FG-OrIU: Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning

Qian Feng, JiaHang Tu, Mintong Kang et al. · Zhejiang University · University of Illinois at Urbana-Champaign

Defends against residual training-data recovery in incremental unlearning via dual orthogonal constraints on features and gradients

Model Inversion Attack · vision
3 citations · 1 influential · PDF
defense · arXiv · Jan 7, 2026

HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

Siyuan Li, Xi Lin, Jun Wu et al. · Shanghai Jiao Tong University · University of Illinois at Urbana-Champaign +1 more

Deceptive multi-agent defense that lures LLM jailbreak attackers into honeypot traps, reducing attack success by 68.77% while draining attacker resources

Prompt Injection · nlp
PDF
attack · arXiv · Dec 30, 2025

GCG Attack On A Diffusion LLM

Ruben Neyroud, Sam Corley · University of Illinois at Urbana-Champaign

Adapts the gradient-based GCG adversarial attack to the LLaDA diffusion LLM, exploring prefix and suffix variants to elicit harmful outputs

Input Manipulation Attack · Prompt Injection · nlp
PDF
attack · arXiv · Sep 24, 2025

Efficiently Attacking Memorization Scores

Tue Do, Varun Chandrasekaran, Daniel Alabi · University of Illinois at Urbana-Champaign

Attacks memorization score estimators via pseudoinverse inputs that inflate influence scores using only black-box model access

Input Manipulation Attack · vision
PDF · Code
tool · arXiv · Aug 28, 2025

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang et al. · University of Illinois at Urbana-Champaign · Starc Institute +1 more

Automated LLM red-teaming tool that translates government AI ethics guidelines into jailbreak diagnostics and compliance reports

Prompt Injection · nlp · multimodal
PDF