Latest papers

7 papers
attack · arXiv · Feb 3, 2026

Controlling Output Rankings in Generative Engines for LLM-based Search

Haibo Jin, Ruoxi Chen, Peiyan Zhang et al. · University of Illinois at Urbana-Champaign · Starc Institute +2 more

Injects crafted content into product pages to manipulate LLM-based search rankings with 91% promotion success rate

Input Manipulation Attack · Prompt Injection · nlp
PDF
attack · arXiv · Jan 29, 2026

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework that extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural-language probing strategies

Sensitive Information Disclosure · Prompt Injection · nlp
PDF
defense · arXiv · Jan 20, 2026

FG-OrIU: Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning

Qian Feng, JiaHang Tu, Mintong Kang et al. · Zhejiang University · University of Illinois at Urbana-Champaign

Defends against residual training-data recovery in incremental unlearning via dual orthogonal constraints on features and gradients

Model Inversion Attack · vision
3 citations · 1 influential · PDF
defense · arXiv · Jan 7, 2026

HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

Siyuan Li, Xi Lin, Jun Wu et al. · Shanghai Jiao Tong University · University of Illinois at Urbana-Champaign +1 more

Deceptive multi-agent defense that lures LLM jailbreak attackers into honeypot traps, reducing attack success by 68.77% while draining attacker resources

Prompt Injection · nlp
PDF
attack · arXiv · Dec 30, 2025

GCG Attack On A Diffusion LLM

Ruben Neyroud, Sam Corley · University of Illinois at Urbana-Champaign

Adapts the gradient-based GCG adversarial attack to the LLaDA diffusion LLM, exploring prefix and suffix variants to elicit harmful outputs

Input Manipulation Attack · Prompt Injection · nlp
PDF
attack · arXiv · Sep 24, 2025

Efficiently Attacking Memorization Scores

Tue Do, Varun Chandrasekaran, Daniel Alabi · University of Illinois at Urbana-Champaign

Attacks memorization score estimators via pseudoinverse inputs that inflate influence scores using only black-box model access

Input Manipulation Attack · vision
PDF · Code
tool · arXiv · Aug 28, 2025

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang et al. · University of Illinois at Urbana-Champaign · Starc Institute +1 more

Automated LLM red-teaming tool that translates government AI ethics guidelines into jailbreak diagnostics and compliance reports

Prompt Injection · nlp · multimodal
PDF