Latest papers

9 papers
defense arXiv Apr 20, 2026

Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal · University of Virginia

Uses Sparse Autoencoders as an inference-time jailbreak defense, achieving a 5x reduction in attack success via a representational bottleneck

Input Manipulation Attack Prompt Injection nlp
PDF
benchmark arXiv Apr 20, 2026

Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

Ruixuan Liu, David Evans, Li Xiong · Emory University · University of Virginia

Formalizes extraction risk measurement for LLM APIs, showing that indistinguishability bounds do not prevent data extraction attacks

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
defense arXiv Feb 24, 2026

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar et al. · University of Virginia · Capital One

Defends LLMs against jailbreaks by training reasoning-aware refusals via CoT datasets and segment-weighted DPO

Prompt Injection nlp
PDF
defense arXiv Feb 21, 2026

Watermarking LLM Agent Trajectories

Wenlong Meng, Chen Gong, Terry Yue Zhuo et al. · Zhejiang University · University of Virginia +2 more

Watermarks LLM agent training trajectories so models trained on stolen datasets emit detectable hook behaviors under a secret key

Output Integrity Attack nlp reinforcement-learning
PDF Code
defense arXiv Jan 21, 2026

NeuroFilter: Privacy Guardrails for Conversational LLM Agents

Saswat Das, Ferdinando Fioretto · University of Virginia

Activation-space guardrail detects privacy-violating LLM agent prompts using linear probes and cumulative drift across multi-turn conversations

Prompt Injection Sensitive Information Disclosure nlp
PDF
defense arXiv Dec 23, 2025

Cost-TrustFL: Cost-Aware Hierarchical Federated Learning with Lightweight Reputation Evaluation across Multi-Cloud

Jixiao Yang, Jinyu Chen, Zixiao Huang et al. · Westcliff University · University of Washington +3 more

Defends federated learning against Byzantine poisoning attacks using Shapley-based reputation scores while minimizing multi-cloud communication costs

Data Poisoning Attack federated-learning vision
PDF
defense arXiv Sep 8, 2025

Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm

Yan Pang, Wenlong Meng, Xiaojing Liao et al. · University of Virginia · Indiana University Bloomington

Instruments LLMs with trigger-tag associations so phishing-generating models automatically embed detectable markers in harmful outputs

Output Integrity Attack Prompt Injection nlp
PDF
attack arXiv Aug 12, 2025

Special-Character Adversarial Attacks on Open-Source Language Model

Ephraiem Sarabamoun · University of Virginia

Evaluates four families of special-character attacks on seven open-source LLMs, testing whether they bypass safety mechanisms via jailbreaking

Prompt Injection nlp
PDF Code
attack arXiv Aug 10, 2025

Towards Unveiling Predictive Uncertainty Vulnerabilities in the Context of the Right to Be Forgotten

Wei Qian, Chenxu Zhao, Yangyi Li et al. · Iowa State University · University of Virginia

Proposes attacks that exploit machine unlearning requests to covertly corrupt model uncertainty estimates without altering predicted labels

Data Poisoning Attack vision
PDF