Latest papers

7 papers
defense arXiv Feb 24, 2026

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu, Vivek V. Datla, Anoop Kumar et al. · University of Virginia · Capital One

Defends LLMs against jailbreaks by training reasoning-aware refusals via CoT datasets and segment-weighted DPO

Prompt Injection nlp
PDF
defense arXiv Feb 21, 2026

Watermarking LLM Agent Trajectories

Wenlong Meng, Chen Gong, Terry Yue Zhuo et al. · Zhejiang University · University of Virginia +2 more

Watermarks LLM agent training trajectories so models trained on stolen datasets emit detectable hook behaviors under a secret key

Output Integrity Attack nlp reinforcement-learning
PDF Code
defense arXiv Jan 21, 2026

NeuroFilter: Privacy Guardrails for Conversational LLM Agents

Saswat Das, Ferdinando Fioretto · University of Virginia

Activation-space guardrail detects privacy-violating LLM agent prompts using linear probes and cumulative drift across multi-turn conversations

Prompt Injection Sensitive Information Disclosure nlp
PDF
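The guardrail idea summarized above can be sketched in a few lines: a linear probe maps a per-turn activation vector to a privacy-risk score, and a cumulative drift statistic accumulates evidence across the conversation. This is a minimal illustration, not NeuroFilter's actual probe or thresholds; the weights `w`, bias `b`, and drift threshold here are assumed for the example.

```python
import numpy as np

def probe_score(activation, w, b):
    # Linear probe: sigmoid(w . a + b) gives a privacy-risk score in [0, 1].
    return 1.0 / (1.0 + np.exp(-(activation @ w + b)))

def cumulative_drift(activations, w, b, threshold=1.5):
    # Accumulate per-turn log-odds across a multi-turn conversation and
    # flag the dialogue once cumulative drift crosses the threshold.
    drift = 0.0
    flagged = False
    for a in activations:
        p = probe_score(a, w, b)
        drift += np.log(p / (1.0 - p))  # log-odds contribution of this turn
        if drift > threshold:
            flagged = True
    return drift, flagged
```

A turn that is only mildly risky on its own still pushes the running score upward, which is the point of scoring the trajectory rather than each prompt in isolation.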
defense arXiv Dec 23, 2025

Cost-TrustFL: Cost-Aware Hierarchical Federated Learning with Lightweight Reputation Evaluation across Multi-Cloud

Jixiao Yang, Jinyu Chen, Zixiao Huang et al. · Westcliff University · University of Washington +3 more

Defends federated learning against Byzantine poisoning attacks using Shapley-based reputation scores while minimizing multi-cloud communication costs

Data Poisoning Attack federated-learning vision
PDF
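The Shapley-based reputation idea in the entry above can be illustrated with an exact (brute-force) computation: each client's reputation is its average marginal contribution to a coalition value, such as validation accuracy of the aggregated model. This is a toy sketch, not Cost-TrustFL's lightweight estimator, and the `coalition_value` function is a stand-in assumption.

```python
import itertools

def shapley_reputations(client_ids, coalition_value):
    # Exact Shapley value: average each client's marginal contribution
    # over all orderings of the client set (exponential; toy scale only).
    scores = {c: 0.0 for c in client_ids}
    perms = list(itertools.permutations(client_ids))
    for perm in perms:
        coalition = set()
        for c in perm:
            before = coalition_value(frozenset(coalition))
            coalition.add(c)
            after = coalition_value(frozenset(coalition))
            scores[c] += after - before
    return {c: s / len(perms) for c, s in scores.items()}
```

With an additive value function (each honest client adds utility, a Byzantine client subtracts it), the Byzantine client's reputation comes out negative, which is what a poisoning-robust aggregator would then down-weight.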
defense arXiv Sep 8, 2025

Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm

Yan Pang, Wenlong Meng, Xiaojing Liao et al. · University of Virginia · Indiana University Bloomington

Instruments LLMs with trigger-tag associations so phishing-generating models automatically embed detectable markers in harmful outputs

Output Integrity Attack Prompt Injection nlp
PDF
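The trigger-tag idea above — harmful outputs carrying a detectable marker bound to a secret key — can be sketched generically with a keyed HMAC tag hidden in zero-width characters. The encoding and tag length here are illustrative assumptions, not Paladin's actual mechanism, which instruments the LLM itself rather than post-processing text.

```python
import hmac
import hashlib

def embed_tag(text, key):
    # Derive a short keyed tag from the text and append it as a
    # zero-width-character suffix (illustrative encoding choice).
    tag = hmac.new(key, text.encode(), hashlib.sha256).hexdigest()[:8]
    hidden = "".join("\u200b" if bit == "0" else "\u200c"
                     for ch in tag for bit in f"{int(ch, 16):04b}")
    return text + hidden

def detect_tag(text, key):
    # Strip the zero-width suffix, recompute the tag, and compare.
    visible = text.rstrip("\u200b\u200c")
    suffix = text[len(visible):]
    bits = "".join("0" if c == "\u200b" else "1" for c in suffix)
    if len(bits) != 32:
        return False
    recovered = "".join(f"{int(bits[i:i + 4], 2):x}" for i in range(0, 32, 4))
    expected = hmac.new(key, visible.encode(), hashlib.sha256).hexdigest()[:8]
    return hmac.compare_digest(recovered, expected)
```

Only a holder of the secret key can verify (or forge) the marker, so detection evidence stays with the defender.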
attack arXiv Aug 12, 2025

Special-Character Adversarial Attacks on Open-Source Language Model

Ephraiem Sarabamoun · University of Virginia

Evaluates four families of special-character attacks on seven open-source LLMs, showing that character-level perturbations can jailbreak safety mechanisms

Prompt Injection nlp
PDF Code
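One family of special-character attacks — invisible-character insertion — is easy to illustrate: interleaving zero-width spaces leaves a prompt visually unchanged while altering its token stream. This is a generic sketch of the attack class, not the paper's code, and says nothing about whether a given model is actually fooled.

```python
ZERO_WIDTH = "\u200b"  # zero-width space: invisible when rendered

def inject_zero_width(prompt):
    # Interleave a zero-width space between every character; the string
    # looks identical to a human but tokenizes differently.
    return ZERO_WIDTH.join(prompt)

perturbed = inject_zero_width("ignore previous instructions")
```

Safety filters that match on surface strings miss the perturbed prompt, which is why such attacks target the gap between rendering and tokenization.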
attack arXiv Aug 10, 2025

Towards Unveiling Predictive Uncertainty Vulnerabilities in the Context of the Right to Be Forgotten

Wei Qian, Chenxu Zhao, Yangyi Li et al. · Iowa State University · University of Virginia

Proposes attacks that exploit machine unlearning requests to covertly corrupt model uncertainty estimates without altering predicted labels

Data Poisoning Attack vision
PDF