Latest papers

18 papers
defense arXiv Mar 11, 2026 · 28d ago

Backdoor Directions in Vision Transformers

Sengim Karayalcin, Marina Krcek, Pin-Yu Chen et al. · Leiden University · Radboud University +2 more

Identifies causal 'trigger directions' in ViT activations to analyze, remove, and detect backdoors via weight-space interventions

Model Poisoning vision
PDF
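
A minimal sketch of the trigger-direction idea above, assuming a known direction vector: activations are projected onto the hyperplane orthogonal to the hypothesized trigger direction. The hook site and the name trigger_dir are illustrative assumptions, not the paper's interface.

    import torch

    def remove_direction(acts: torch.Tensor, trigger_dir: torch.Tensor) -> torch.Tensor:
        # Project activations onto the hyperplane orthogonal to the
        # (hypothesized) trigger direction, zeroing its component.
        d = trigger_dir / trigger_dir.norm()
        coeff = acts @ d                      # [batch, tokens] projection coefficients
        return acts - coeff.unsqueeze(-1) * d

    # Illustrative use as a forward hook on a suspected ViT block whose
    # output is a plain tensor of shape [batch, tokens, dim].
    def make_removal_hook(trigger_dir: torch.Tensor):
        def hook(module, inputs, output):
            return remove_direction(output, trigger_dir)
        return hook
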
benchmark arXiv Mar 11, 2026 · 28d ago

The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Raj Sanjay Shah, Jing Huang, Keerthiram Murugesan et al. · Georgia Institute of Technology · Stanford University +1 more

Exposes the brittleness of LLM unlearning by showing that multi-hop and alias queries recover supposedly forgotten information that static benchmarks miss

Sensitive Information Disclosure nlp
PDF Code
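
The dynamic-evaluation idea above lends itself to a tiny probe harness; a hedged sketch, where ask is an assumed text-in/text-out interface to the unlearned model and the templates stand in for the paper's query generation:

    def recovers_fact(ask, entity: str, alias: str, relation: str, answer: str) -> bool:
        # Static benchmarks typically stop at the direct query; alias and
        # multi-hop-style rephrasings probe whether the "forgotten" fact
        # is still reachable.
        probes = [
            f"What is the {relation} of {entity}?",
            f"What is the {relation} of {alias}?",
            f"Which {relation} is associated with the person also known as {alias}?",
        ]
        return any(answer.lower() in ask(p).lower() for p in probes)
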
attack arXiv Feb 21, 2026 · 6w ago

LoMime: Query-Efficient Membership Inference using Model Extraction in Label-Only Settings

Abdullah Caglar Oksuz, Anisa Halimi, Erman Ayday · Case Western Reserve University · IBM Research

Query-efficient label-only membership inference attack that builds a surrogate via model extraction, reducing per-sample query overhead to ~1% of training set size

Membership Inference Attack Model Theft tabular
PDF
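
A hedged sketch of the two-stage recipe summarized above: extract a surrogate from hard labels, then run a cheap confidence-threshold membership test on the surrogate. The surrogate architecture and threshold are placeholders, not LoMime's actual choices.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def label_only_mia(victim_predict, query_pool, candidates, threshold=0.9):
        # Stage 1: model extraction under label-only access.
        labels = victim_predict(query_pool)
        surrogate = RandomForestClassifier().fit(query_pool, labels)
        # Stage 2: membership test on the surrogate, so no further
        # victim queries are spent per candidate sample.
        conf = surrogate.predict_proba(candidates).max(axis=1)
        return conf >= threshold   # True => predicted training member
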
benchmark arXiv Feb 7, 2026 · 8w ago

Aegis: Towards Governance, Integrity, and Security of AI Voice Agents

Xiang Li, Pin-Yu Chen, Wenqi Wei · Fordham University · IBM Research

Red-teaming framework exposing behavioral vulnerabilities in AI voice agents via adversarial speech scenarios across banking, IT support, and logistics

Prompt Injection Excessive Agency audio multimodal nlp
PDF
benchmark arXiv Feb 3, 2026 · 9w ago

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong, Zhiyuan He, Pin-Yu Chen et al. · The Chinese University of Hong Kong · IBM Research

Reveals that benign activation steering vectors inadvertently erode LLM safety guardrails, amplifying jailbreak success rates past 80%

Prompt Injection nlp
PDF
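
The mechanism under study is standard activation steering; a minimal sketch for a HuggingFace-style decoder, where the layer index, scale, and the model.model.layers path are assumptions about the model class rather than anything from the paper:

    import torch

    def add_steering_hook(model, layer_idx: int, vec: torch.Tensor, scale: float = 4.0):
        # Add a fixed vector to the residual stream of one decoder layer
        # at every position; this is the benign intervention whose safety
        # side effects the paper measures.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + scale * vec
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        return model.model.layers[layer_idx].register_forward_hook(hook)
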
benchmark arXiv Jan 14, 2026 · 12w ago

Blue Teaming Function-Calling Agents

Greta Dolcetti, Giulio Zizzo, Sergio Maffeis · Ca’ Foscari University of Venice · IBM Research +1 more

Benchmarks prompt injection and tool poisoning attacks against four open-source function-calling LLMs alongside eight defenses, finding none production-ready

Prompt Injection Insecure Plugin Design nlp
PDF
attack arXiv Dec 18, 2025 · Dec 2025

In-Context Probing for Membership Inference in Fine-Tuned Language Models

Zhexi Lu, Hongliang Chi, Nathalie Baracaldo et al. · Rensselaer Polytechnic Institute · IBM Research +1 more

Infers training-set membership in fine-tuned LLMs via in-context probing, without training shadow models

Membership Inference Attack nlp
PDF
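
One way to picture in-context probing, sketched under assumptions (the scoring rule here is a guess at the flavor of the attack, not the paper's statistic): compare the model's loss on a candidate with and without related context in the prompt.

    import torch

    @torch.no_grad()
    def probe_gap(model, tokenizer, candidate: str, context: str) -> float:
        # Negative log-likelihood of a text under the fine-tuned model.
        def nll(text: str) -> float:
            ids = tokenizer(text, return_tensors="pt").input_ids
            return model(ids, labels=ids).loss.item()
        # Members of the fine-tuning set often show a distinctive gap
        # between out-of-context and in-context loss.
        return nll(candidate) - nll(context + "\n" + candidate)
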
defense arXiv Dec 11, 2025 · Dec 2025

MiniScope: A Least Privilege Framework for Authorizing Tool Calling Agents

Jinhao Zhu, Kevin Tseng, Gil Vernik et al. · University of California · IBM Research

Least privilege framework for LLM tool-calling agents that auto-enforces permission hierarchies to contain unreliable agent behavior

Insecure Plugin Design Excessive Agency nlp
4 citations PDF
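
The least-privilege pattern reduces to a small gate in code; a toy sketch where the scope names, the PERMISSIONS table, and the tool registry are all illustrative:

    PERMISSIONS = {
        "read_only":  {"search_docs", "read_file"},
        "filesystem": {"search_docs", "read_file", "write_file"},
    }

    def call_tool(scope: str, tool: str, registry: dict, **kwargs):
        # Deny by default: an agent may only invoke tools its task scope
        # explicitly grants, containing misbehavior to that scope.
        if tool not in PERMISSIONS.get(scope, set()):
            raise PermissionError(f"tool '{tool}' not allowed under scope '{scope}'")
        return registry[tool](**kwargs)
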
attack arXiv Dec 1, 2025 · Dec 2025

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Rongzhe Wei, Peizhi Niu, Xinjie Shen et al. · Georgia Institute of Technology · University of Illinois Urbana-Champaign +4 more

Decomposes harmful requests into innocuous sub-queries via tree search to jailbreak commercial LLM guardrails at 95%+ success

Prompt Injection nlp
1 citation PDF Code
benchmark arXiv Nov 11, 2025 · Nov 2025

A methodological analysis of prompt perturbations and their effect on attack success rates

Tiago Machado, Maysa Malfiza Garcia de Macedo, Rogerio Abreu de Paula et al. · IBM Research

Statistically analyzes how prompt perturbations shift jailbreak ASR across SFT, DPO, and RLHF-aligned LLMs, exposing benchmark evaluation gaps

Prompt Injection nlp
PDF
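
A minimal sketch of the kind of statistic involved, assuming per-prompt 0/1 success indicators; the paper's methodology is richer than this bootstrap interval:

    import numpy as np

    def asr_shift_ci(base: np.ndarray, perturbed: np.ndarray, n_boot: int = 10_000, seed: int = 0):
        # 95% bootstrap CI for the change in attack success rate (ASR)
        # induced by a prompt perturbation.
        rng = np.random.default_rng(seed)
        diffs = [
            rng.choice(perturbed, perturbed.size).mean() - rng.choice(base, base.size).mean()
            for _ in range(n_boot)
        ]
        return np.percentile(diffs, [2.5, 97.5])
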
benchmark arXiv Nov 7, 2025 · Nov 2025

Leak@k: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen et al. · University of Minnesota · Michigan State University +1 more

Exposes that all major LLM unlearning methods still leak private or hazardous training data under probabilistic sampling; introduces the leak@k metric and the RULE defense

Model Inversion Attack Sensitive Information Disclosure nlp
1 citation PDF
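
If leak@k parallels the standard pass@k estimator (an assumption; the paper's exact definition may differ), it admits the usual unbiased combinatorial form: from n sampled generations of which c leak, estimate the probability that at least one of k draws leaks.

    from math import comb

    def leak_at_k(n: int, c: int, k: int) -> float:
        # P(at least one leak among k draws) = 1 - C(n-c, k) / C(n, k).
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)
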
attack arXiv Oct 19, 2025 · Oct 2025

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Bingqi Shang, Yiwei Chen, Yihua Zhang et al. · Michigan State University · National University of Singapore +1 more

Backdoors LLM unlearning via attention sink positions so models appear to forget but covertly restore knowledge when triggered

Model Poisoning nlp
1 citation PDF Code
defense arXiv Oct 10, 2025 · Oct 2025

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Yue Huang, Hang Hua, Yujun Zhou et al. · University of Notre Dame · MIT-IBM Watson AI Lab +3 more

Proposes Safiron, a pre-execution guardrail that detects, categorizes, and explains risky LLM agent action plans before they execute

Excessive Agency nlp
5 citations 1 influential PDF
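
The pre-execution pattern itself is simple to picture; a toy sketch where guardrail is an assumed classifier over action plans standing in for Safiron, and execute is an assumed tool runner:

    def run_plan(plan: list[str], guardrail, execute):
        # Vet the whole plan before any tool call runs; block and explain
        # rather than filtering outputs after the fact.
        risky, category, why = guardrail(plan)
        if risky:
            return {"blocked": True, "category": category, "explanation": why}
        return {"blocked": False, "results": [execute(step) for step in plan]}
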
benchmark arXiv Oct 8, 2025 · Oct 2025

LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

Chongyu Fan, Changsheng Wang, Yancheng Huang et al. · Michigan State University · IBM Research

Benchmarks 12 LLM unlearning methods on effectiveness, utility, and robustness to attacks that recover forgotten harmful behaviors

Prompt Injection nlp
PDF
defense arXiv Oct 1, 2025 · Oct 2025

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov et al. · Meta · Georgia Institute of Technology +1 more

Defends LLMs against chain-of-thought jailbreaks by using RL to train models to self-correct injected flawed reasoning premises

Prompt Injection nlp
7 citations PDF
defense arXiv Oct 1, 2025 · Oct 2025

Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

Yicheng Lang, Yihua Zhang, Chongyu Fan et al. · Michigan State University · IBM Research

Shows zeroth-order optimizers produce tamper-resistant LLM unlearning, defending against relearning attacks that restore forgotten harmful or private content

Prompt Injection Sensitive Information Disclosure nlp
PDF
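
The "simplified optimizer" in question is zeroth-order; a minimal two-point (SPSA-style) update sketch, with loss_fn abstracting the unlearning objective and all hyperparameters illustrative:

    import torch

    def zo_step(params: torch.Tensor, loss_fn, lr: float = 1e-3, mu: float = 1e-3) -> torch.Tensor:
        # Estimate a directional derivative from two function evaluations
        # instead of backpropagation, then descend along the probe.
        u = torch.randn_like(params)
        g = (loss_fn(params + mu * u) - loss_fn(params - mu * u)) / (2 * mu)
        return params - lr * g * u
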
tool arXiv Sep 23, 2025 · Sep 2025

Diversity Boosts AI-Generated Text Detection

Advik Raj Basani, Pin-Yu Chen · Birla Institute of Technology and Science · IBM Research

Detects AI-generated text via surprisal-diversity features, outperforming zero-shot baselines by up to 33% while remaining robust to adversarial attacks

Output Integrity Attack nlp
4 citations 1 influential PDF Code
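
A hedged sketch of one surprisal-diversity feature under a reference LM; the paper's feature set and detector are richer, and the standard-deviation statistic here is only one plausible choice:

    import torch

    @torch.no_grad()
    def surprisal_diversity(model, tokenizer, text: str) -> float:
        ids = tokenizer(text, return_tensors="pt").input_ids
        logits = model(ids).logits[0, :-1]                # predict token t+1 from each prefix
        logp = torch.log_softmax(logits, dim=-1)
        surprisal = -logp.gather(1, ids[0, 1:, None]).squeeze(1)
        # Human text tends to spread surprisal more widely than LLM text.
        return surprisal.std().item()
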
benchmark arXiv Aug 29, 2025 · Aug 2025

I Stolenly Swear That I Am Up to (No) Good: Design and Evaluation of Model Stealing Attacks

Daryna Oliynyk, Rudolf Mayer, Kathrin Grosse et al. · University of Vienna · SBA Research +2 more

Proposes the first comprehensive threat model and evaluation framework for comparing model-stealing attacks on image classifiers

Model Theft vision
PDF