Latest papers

11 papers
defense arXiv Mar 18, 2026 · 19d ago

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Haozheng Luo, Yimin Wang, Jiahao Yu et al. · Northwestern University · University of Michigan +1 more

Aligns reasoning models against jailbreaks by optimizing safety in hidden representation space using contrastive RL

Prompt Injection nlp
PDF
benchmark arXiv Mar 15, 2026 · 22d ago

Human-AI Ensembles Improve Deepfake Detection in Low-to-Medium Quality Videos

Marco Postiglione, Isabel Gortner, V.S. Subrahmanian · Northwestern University

Humans outperform 95 AI deepfake detectors on mobile-quality videos, with human-AI ensembles reducing high-confidence errors

Output Integrity Attack vision multimodal
PDF
attack arXiv Jan 28, 2026 · 9w ago

ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack

Xingwei Lin, Wenhao Lin, Sicong Cao et al. · Zhejiang University · Nanjing University of Posts and Telecommunications +2 more

Exploits intent-context coupling in multi-turn jailbreaks to bypass LLM safety with a 97.1% attack success rate

Prompt Injection nlp
PDF Code
defense arXiv Jan 19, 2026 · 11w ago

Context and Transcripts Improve Detection of Deepfake Audios of Public Figures

Chongyang Gao, Marco Postiglione, Julian Baldwin et al. · Northwestern University · Bar-Ilan University

Novel context-aware audio deepfake detector boosts F1-score by up to 37% and resists 5 adversarial evasion strategies for public-figure impersonation

Output Integrity Attack audio multimodal nlp
PDF Code
defense arXiv Dec 26, 2025 · Dec 2025

LLA: Enhancing Security and Privacy for Generative Models with Logic-Locked Accelerators

You Li, Guannan Zhao, Yuhao Ju et al. · Northwestern University

Logic-locked hardware accelerators enforce model licensing for generative AI, resisting oracle-guided key attacks with under 0.1% overhead

Model Theft AI Supply Chain Attacks generative
PDF
defense arXiv Dec 21, 2025 · Dec 2025

Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection

Junjun Pan, Yixin Liu, Rui Miao et al. · Griffith University · Jilin University +1 more

Defends LLM multi-agent systems by detecting malicious agents using bi-level graph anomaly detection with token-level explainability

Excessive Agency nlp graph
1 citation PDF
defense arXiv Oct 2, 2025 · Oct 2025

AdvEvo-MARL: Shaping Internalized Safety through Adversarial Co-Evolution in Multi-Agent Reinforcement Learning

Zhenyu Pan, Yiting Zhang, Zhuo Liu et al. · Northwestern University · University of Illinois at Chicago +2 more

Adversarial co-evolution MARL framework that trains LLM agents to resist jailbreaks and prompt injection without external guard modules

Prompt Injection Excessive Agency nlp reinforcement-learning
1 citation PDF
defense arXiv Sep 24, 2025 · Sep 2025

Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Wenhan Wu, Zheyuan Liu, Chongyang Gao et al. · Northwestern University · University of Notre Dame +1 more

Hardens LLM unlearning against relearning attacks by steering parameters toward flat loss minima via adversarial neighborhood-aware optimization

Sensitive Information Disclosure Prompt Injection nlp
1 citation PDF
defense arXiv Aug 28, 2025 · Aug 2025

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

Weitao Feng, Lixu Wang, Tianyi Wei et al. · Nanyang Technological University · A*STAR +1 more

Defends LLM safety alignment against RL fine-tuning attacks by suppressing response entropy via TokenBuncher

Transfer Learning Attack Prompt Injection nlp reinforcement-learning
PDF
survey arXiv Aug 20, 2025 · Aug 2025

A Systematic Survey of Model Extraction Attacks and Defenses: State-of-the-Art and Perspectives

Kaixiang Zhao, Lincan Li, Kaize Ding et al. · University of Notre Dame · Florida State University +3 more

Surveys model extraction attacks and defenses across MLaaS platforms, proposing a taxonomy of attack mechanisms and computing environments

Model Theft vision nlp tabular
PDF Code
defense arXiv Aug 5, 2025 · Aug 2025

Evo-MARL: Co-Evolutionary Multi-Agent Reinforcement Learning for Internalized Safety

Zhenyu Pan, Yiting Zhang, Yutong Zhang et al. · Northwestern University · University of Illinois at Chicago

Defends LLM multi-agent systems against jailbreaks by co-evolving attackers and defenders via MARL, internalizing safety without external guard modules

Prompt Injection Excessive Agency multimodal reinforcement-learning nlp
PDF