Latest papers

18 papers
defense arXiv Mar 11, 2026 · 28d ago

Backdoor Directions in Vision Transformers

Sengim Karayalcin, Marina Krcek, Pin-Yu Chen et al. · Leiden University · Radboud University +2 more

Identifies causal 'trigger directions' in ViT activations to analyze, remove, and detect backdoors via weight-space interventions

Model Poisoning vision
PDF
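
A minimal sketch of the trigger-direction idea above, assuming a known direction vector: activations are projected onto the hyperplane orthogonal to the hypothesized trigger direction. The hook site and the name trigger_dir are illustrative assumptions, not the paper's interface.

    import torch

    def remove_direction(acts: torch.Tensor, trigger_dir: torch.Tensor) -> torch.Tensor:
        # Project activations onto the hyperplane orthogonal to the
        # (hypothesized) trigger direction, zeroing its component.
        d = trigger_dir / trigger_dir.norm()
        coeff = acts @ d                      # [batch, tokens] projection coefficients
        return acts - coeff.unsqueeze(-1) * d

    # Illustrative use as a forward hook on a suspected ViT block whose
    # output is a plain tensor of shape [batch, tokens, dim].
    def make_removal_hook(trigger_dir: torch.Tensor):
        def hook(module, inputs, output):
            return remove_direction(output, trigger_dir)
        return hook
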
benchmark arXiv Mar 11, 2026 · 28d ago

The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Raj Sanjay Shah, Jing Huang, Keerthiram Murugesan et al. · Georgia Institute of Technology · Stanford University +1 more

Exposes the brittleness of LLM unlearning by showing that multi-hop and alias queries recover supposedly forgotten information that static benchmarks miss

Sensitive Information Disclosure nlp
PDF Code
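
The dynamic-evaluation idea above lends itself to a tiny probe harness; a hedged sketch, where ask is an assumed text-in/text-out interface to the unlearned model and the templates stand in for the paper's query generation:

    def recovers_fact(ask, entity: str, alias: str, relation: str, answer: str) -> bool:
        # Static benchmarks typically stop at the direct query; alias and
        # multi-hop-style rephrasings probe whether the "forgotten" fact
        # is still reachable.
        probes = [
            f"What is the {relation} of {entity}?",
            f"What is the {relation} of {alias}?",
            f"Which {relation} is associated with the person also known as {alias}?",
        ]
        return any(answer.lower() in ask(p).lower() for p in probes)
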
attack arXiv Feb 21, 2026 · 6w ago

LoMime: Query-Efficient Membership Inference using Model Extraction in Label-Only Settings

Abdullah Caglar Oksuz, Anisa Halimi, Erman Ayday · Case Western Reserve University · IBM Research

Query-efficient label-only membership inference attack that builds a surrogate via model extraction, reducing per-sample query overhead to ~1% of training set size

Membership Inference Attack Model Theft tabular
PDF
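
A hedged sketch of the two-stage recipe summarized above: extract a surrogate from hard labels, then run a cheap confidence-threshold membership test on the surrogate. The surrogate architecture and threshold are placeholders, not LoMime's actual choices.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def label_only_mia(victim_predict, query_pool, candidates, threshold=0.9):
        # Stage 1: model extraction under label-only access.
        labels = victim_predict(query_pool)
        surrogate = RandomForestClassifier().fit(query_pool, labels)
        # Stage 2: membership test on the surrogate, so no further
        # victim queries are spent per candidate sample.
        conf = surrogate.predict_proba(candidates).max(axis=1)
        return conf >= threshold   # True => predicted training member
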
benchmark arXiv Feb 7, 2026 · 8w ago

Aegis: Towards Governance, Integrity, and Security of AI Voice Agents

Xiang Li, Pin-Yu Chen, Wenqi Wei · Fordham University · IBM Research

Red-teaming framework exposing behavioral vulnerabilities in AI voice agents via adversarial speech scenarios across banking, IT support, and logistics

Prompt Injection Excessive Agency audio multimodal nlp
PDF
benchmark arXiv Feb 3, 2026 · 9w ago

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

Chen Xiong, Zhiyuan He, Pin-Yu Chen et al. · The Chinese University of Hong Kong · IBM Research

Reveals that benign activation steering vectors inadvertently erode LLM safety guardrails, amplifying jailbreak success rates past 80%

Prompt Injection nlp
PDF
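
The mechanism under study is standard activation steering; a minimal sketch for a HuggingFace-style decoder, where the layer index, scale, and the model.model.layers path are assumptions about the model class rather than anything from the paper:

    import torch

    def add_steering_hook(model, layer_idx: int, vec: torch.Tensor, scale: float = 4.0):
        # Add a fixed vector to the residual stream of one decoder layer
        # at every position; this is the benign intervention whose safety
        # side effects the paper measures.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + scale * vec
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        return model.model.layers[layer_idx].register_forward_hook(hook)
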
benchmark arXiv Jan 14, 2026 · 12w ago

Blue Teaming Function-Calling Agents

Greta Dolcetti, Giulio Zizzo, Sergio Maffeis · Ca’ Foscari University of Venice · IBM Research +1 more

Benchmarks prompt injection and tool poisoning attacks against four open-source function-calling LLMs alongside eight defenses, finding none production-ready

Prompt Injection Insecure Plugin Design nlp
PDF
attack arXiv Dec 18, 2025 · Dec 2025

In-Context Probing for Membership Inference in Fine-Tuned Language Models

Zhexi Lu, Hongliang Chi, Nathalie Baracaldo et al. · Rensselaer Polytechnic Institute · IBM Research +1 more

Infers training-set membership in fine-tuned LLMs via in-context probing, without training shadow models

Membership Inference Attack nlp
PDF
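
One way to picture in-context probing, sketched under assumptions (the scoring rule here is a guess at the flavor of the attack, not the paper's statistic): compare the model's loss on a candidate with and without related context in the prompt.

    import torch

    @torch.no_grad()
    def probe_gap(model, tokenizer, candidate: str, context: str) -> float:
        # Negative log-likelihood of a text under the fine-tuned model.
        def nll(text: str) -> float:
            ids = tokenizer(text, return_tensors="pt").input_ids
            return model(ids, labels=ids).loss.item()
        # Members of the fine-tuning set often show a distinctive gap
        # between out-of-context and in-context loss.
        return nll(candidate) - nll(context + "\n" + candidate)
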
defense arXiv Dec 11, 2025 · Dec 2025

MiniScope: A Least Privilege Framework for Authorizing Tool Calling Agents

Jinhao Zhu, Kevin Tseng, Gil Vernik et al. · University of California · IBM Research

Least privilege framework for LLM tool-calling agents that auto-enforces permission hierarchies to contain unreliable agent behavior

Insecure Plugin Design Excessive Agency nlp
4 citations PDF
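
The least-privilege pattern reduces to a small gate in code; a toy sketch where the scope names, the PERMISSIONS table, and the tool registry are all illustrative:

    PERMISSIONS = {
        "read_only":  {"search_docs", "read_file"},
        "filesystem": {"search_docs", "read_file", "write_file"},
    }

    def call_tool(scope: str, tool: str, registry: dict, **kwargs):
        # Deny by default: an agent may only invoke tools its task scope
        # explicitly grants, containing misbehavior to that scope.
        if tool not in PERMISSIONS.get(scope, set()):
            raise PermissionError(f"tool '{tool}' not allowed under scope '{scope}'")
        return registry[tool](**kwargs)
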
attack arXiv Dec 1, 2025 · Dec 2025

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Rongzhe Wei, Peizhi Niu, Xinjie Shen et al. · Georgia Institute of Technology · University of Illinois Urbana-Champaign +4 more

Decomposes harmful requests into innocuous sub-queries via tree search to jailbreak commercial LLM guardrails at 95%+ success

Prompt Injection nlp
1 citation PDF Code
benchmark arXiv Nov 11, 2025 · Nov 2025

A methodological analysis of prompt perturbations and their effect on attack success rates

Tiago Machado, Maysa Malfiza Garcia de Macedo, Rogerio Abreu de Paula et al. · IBM Research

Statistically analyzes how prompt perturbations shift jailbreak ASR across SFT, DPO, and RLHF-aligned LLMs, exposing benchmark evaluation gaps

Prompt Injection nlp
PDF
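
A minimal sketch of the kind of statistic involved, assuming per-prompt 0/1 success indicators; the paper's methodology is richer than this bootstrap interval:

    import numpy as np

    def asr_shift_ci(base: np.ndarray, perturbed: np.ndarray, n_boot: int = 10_000, seed: int = 0):
        # 95% bootstrap CI for the change in attack success rate (ASR)
        # induced by a prompt perturbation.
        rng = np.random.default_rng(seed)
        diffs = [
            rng.choice(perturbed, perturbed.size).mean() - rng.choice(base, base.size).mean()
            for _ in range(n_boot)
        ]
        return np.percentile(diffs, [2.5, 97.5])
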
benchmark arXiv Nov 7, 2025 · Nov 2025

Leak@k: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen et al. · University of Minnesota · Michigan State University +1 more

Exposes that all major LLM unlearning methods still leak private or hazardous training data under probabilistic sampling; introduces the leak@k metric and the RULE defense

Model Inversion Attack Sensitive Information Disclosure nlp
1 citation PDF
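
If leak@k parallels the standard pass@k estimator (an assumption; the paper's exact definition may differ), it admits the usual unbiased combinatorial form: from n sampled generations of which c leak, estimate the probability that at least one of k draws leaks.

    from math import comb

    def leak_at_k(n: int, c: int, k: int) -> float:
        # P(at least one leak among k draws) = 1 - C(n-c, k) / C(n, k).
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)
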
attack arXiv Oct 19, 2025 · Oct 2025

Forgetting to Forget: Attention Sink as A Gateway for Backdooring LLM Unlearning

Bingqi Shang, Yiwei Chen, Yihua Zhang et al. · Michigan State University · National University of Singapore +1 more

Backdoors LLM unlearning via attention sink positions so models appear to forget but covertly restore knowledge when triggered

Model Poisoning nlp
1 citation PDF Code
defense arXiv Oct 10, 2025 · Oct 2025

Building a Foundational Guardrail for General Agentic Systems via Synthetic Data

Yue Huang, Hang Hua, Yujun Zhou et al. · University of Notre Dame · MIT-IBM Watson AI Lab +3 more

Proposes Safiron, a pre-execution guardrail that detects, categorizes, and explains risky LLM agent action plans before they execute

Excessive Agency nlp
5 citations 1 influential PDF
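
The pre-execution pattern itself is simple to picture; a toy sketch where guardrail is an assumed classifier over action plans standing in for Safiron, and execute is an assumed tool runner:

    def run_plan(plan: list[str], guardrail, execute):
        # Vet the whole plan before any tool call runs; block and explain
        # rather than filtering outputs after the fact.
        risky, category, why = guardrail(plan)
        if risky:
            return {"blocked": True, "category": category, "explanation": why}
        return {"blocked": False, "results": [execute(step) for step in plan]}
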
benchmark arXiv Oct 8, 2025 · Oct 2025

LLM Unlearning Under the Microscope: A Full-Stack View on Methods and Metrics

Chongyu Fan, Changsheng Wang, Yancheng Huang et al. · Michigan State University · IBM Research

Benchmarks 12 LLM unlearning methods on effectiveness, utility, and robustness to attacks that recover forgotten harmful behaviors

Prompt Injection nlp
PDF
defense arXiv Oct 1, 2025 · Oct 2025

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov et al. · Meta · Georgia Institute of Technology +1 more

Defends LLMs against chain-of-thought jailbreaks by using RL to train models to self-correct injected flawed reasoning premises

Prompt Injection nlp
7 citations PDF
defense arXiv Oct 1, 2025 · Oct 2025

Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning

Yicheng Lang, Yihua Zhang, Chongyu Fan et al. · Michigan State University · IBM Research

Shows zeroth-order optimizers produce tamper-resistant LLM unlearning, defending against relearning attacks that restore forgotten harmful or private content

Prompt Injection Sensitive Information Disclosure nlp
PDF
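
The "simplified optimizer" in question is zeroth-order; a minimal two-point (SPSA-style) update sketch, with loss_fn abstracting the unlearning objective and all hyperparameters illustrative:

    import torch

    def zo_step(params: torch.Tensor, loss_fn, lr: float = 1e-3, mu: float = 1e-3) -> torch.Tensor:
        # Estimate a directional derivative from two function evaluations
        # instead of backpropagation, then descend along the probe.
        u = torch.randn_like(params)
        g = (loss_fn(params + mu * u) - loss_fn(params - mu * u)) / (2 * mu)
        return params - lr * g * u
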
tool arXiv Sep 23, 2025 · Sep 2025

Diversity Boosts AI-Generated Text Detection

Advik Raj Basani, Pin-Yu Chen · Birla Institute of Technology and Science · IBM Research

Detects AI-generated text via surprisal-diversity features, outperforming zero-shot baselines by up to 33% while remaining robust to adversarial attacks

Output Integrity Attack nlp
4 citations 1 influential PDF Code
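
A hedged sketch of one surprisal-diversity feature under a reference LM; the paper's feature set and detector are richer, and the standard-deviation statistic here is only one plausible choice:

    import torch

    @torch.no_grad()
    def surprisal_diversity(model, tokenizer, text: str) -> float:
        ids = tokenizer(text, return_tensors="pt").input_ids
        logits = model(ids).logits[0, :-1]                # predict token t+1 from each prefix
        logp = torch.log_softmax(logits, dim=-1)
        surprisal = -logp.gather(1, ids[0, 1:, None]).squeeze(1)
        # Human text tends to spread surprisal more widely than LLM text.
        return surprisal.std().item()
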
benchmark arXiv Aug 29, 2025 · Aug 2025

I Stolenly Swear That I Am Up to (No) Good: Design and Evaluation of Model Stealing Attacks

Daryna Oliynyk, Rudolf Mayer, Kathrin Grosse et al. · University of Vienna · SBA Research +2 more

Proposes the first comprehensive threat model and evaluation framework for comparing model-stealing attacks on image classifiers

Model Theft vision
PDF