Latest papers

36 papers
attack arXiv Mar 19, 2026 · 18d ago

In-the-Wild Camouflage Attack on Vehicle Detectors through Controllable Image Editing

Xiao Fang, Yiming Gong, Stanislav Panev et al. · Carnegie Mellon University · DEVCOM Army Research Laboratory +1 more

Physical-world camouflage attack synthesizing adversarial vehicle textures via ControlNet fine-tuning, achieving 38% AP50 drop with transferability

Input Manipulation Attack vision
PDF Code
defense arXiv Mar 16, 2026 · 21d ago

Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Ce Zhang, Jinxi He, Junyi He et al. · Carnegie Mellon University

Training-free safety framework using self-reflective memory to help VLMs distinguish safe vs unsafe requests in contextually similar scenarios

Prompt Injection multimodalnlpvision
PDF Code
benchmark arXiv Mar 16, 2026 · 21d ago

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Mateusz Dziemian, Maxwell Lin, Xiaohan Fu et al. · Gray Swan AI · OpenAI +6 more

Large-scale red teaming competition finds all frontier LLM agents vulnerable to concealed indirect prompt injection attacks with 0.5-8.5% success rates

Prompt Injection Excessive Agency nlpmultimodal
PDF
defense arXiv Mar 5, 2026 · 4w ago

From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications

Xusheng Luo, Changliu Liu · Carnegie Mellon University

First coupled MILP-based certified robustness framework for keypoint detectors, bounding joint deviation across all keypoints under input perturbations

Input Manipulation Attack vision
PDF
attack arXiv Feb 28, 2026 · 5w ago

Roots Beneath the Cut: Uncovering the Risk of Concept Revival in Pruning-Based Unlearning for Diffusion Models

Ci Zhang, Zhaojun Ding, Chence Yang et al. · University of Georgia · Carnegie Mellon University +3 more

Attacks pruning-based unlearning in diffusion models by reviving erased concepts via side-channel signals from zeroed weight locations

Output Integrity Attack generativevision
PDF
benchmark arXiv Feb 24, 2026 · 5w ago

Personal Information Parroting in Language Models

Nishant Subramani, Kshitish Ghate, Mona Diab · Carnegie Mellon University · University of Washington

Measures verbatim PII leakage from Pythia LLMs via greedy decoding, finding 13.6% reproduction rate scaling with model size and training duration

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp
3 citations PDF
attack arXiv Feb 14, 2026 · 7w ago

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Ruomeng Ding, Yifei Pang, He Sun et al. · University of North Carolina at Chapel Hill · Carnegie Mellon University +2 more

Attacks LLM alignment pipelines by crafting benchmark-compliant rubric edits that systematically bias judge preferences and corrupt RLHF training

Transfer Learning Attack Prompt Injection nlp
PDF Code
benchmark arXiv Feb 13, 2026 · 7w ago

Consistency of Large Reasoning Models Under Multi-Turn Attacks

Yubo Li, Ramayya Krishnan, Rema Padman · Carnegie Mellon University

Benchmarks nine reasoning LLMs against multi-turn natural-language adversarial attacks, identifying five failure modes and exposing confidence-based defense limitations

Prompt Injection nlp
PDF Code
defense arXiv Feb 3, 2026 · 8w ago

Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani et al. · Carnegie Mellon University · University of Maryland

Fingerprints LLM outputs to detect unauthorized distillation using gradient-aligned token perturbations that transfer to student models

Model Theft Model Theft nlp
PDF
defense arXiv Jan 26, 2026 · 10w ago

LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models

Kai Hu, Haoqi Hu, Matt Fredrikson · Carnegie Mellon University

Scales 1-Lipschitz certified robustness to billion-parameter vision models via manifold optimization and convolution-free architecture

Input Manipulation Attack vision
PDF
defense arXiv Jan 15, 2026 · 11w ago

Serverless AI Security: Attack Surface Analysis and Runtime Protection Mechanisms for FaaS-Based Machine Learning

Chetan Pathade, Vinod Dhimam, Sheheryar Ahmad et al. · Carnegie Mellon University

Surveys ML attack surfaces on FaaS platforms and proposes Serverless AI Shield detecting 94% of threats with under 9% latency overhead

AI Supply Chain Attacks Model Theft
PDF
attack arXiv Dec 31, 2025 · Dec 2025

The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

Xiaoze Liu, Weichen Yu, Matt Fredrikson et al. · Purdue University · Carnegie Mellon University

Engineers a stealthy breaker token that lies dormant in donor LLMs but activates as a trojan after tokenizer transplant into a base model

AI Supply Chain Attacks Model Poisoning nlp
1 citations PDF Code
attack arXiv Dec 18, 2025 · Dec 2025

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh et al. · Meta Superintelligence Labs · Carnegie Mellon University

Policy-based red teaming framework fine-tunes an attack LLM to generate diverse, human-readable jailbreak prompts achieving SOTA ASR against GPT-4o and Claude 3.5

Prompt Injection nlp
PDF
defense arXiv Dec 3, 2025 · Dec 2025

MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Yizhou Zhao, Zhiwei Steven Wu, Adam Block · University of Pennsylvania · Carnegie Mellon University +1 more

Fine-tuning framework that embeds robust watermarks into open-weight LLM weights, closing the quality-detectability gap with inference-time schemes

Output Integrity Attack nlp
PDF Code
benchmark arXiv Nov 28, 2025 · Nov 2025

Are LLMs Good Safety Agents or a Propaganda Engine?

Neemesh Yadav, Francesco Ortu, Jiarui Liu et al. · Southern Methodist University · University of Trieste +6 more

Benchmarks LLM refusal behaviors using prompt injection attacks to distinguish genuine safety guardrails from political censorship

Prompt Injection nlp
PDF
defense arXiv Nov 24, 2025 · Nov 2025

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen et al. · William & Mary · Independent Researcher +2 more

Self-adversarial training framework for unified multimodal models that perturbs shared visual tokens to improve adversarial and OOD robustness

Input Manipulation Attack multimodalvisionnlp
PDF Code
attack arXiv Nov 5, 2025 · Nov 2025

Jailbreaking in the Haystack

Rishi Rajesh Shah, Chen Henry Wu, Shashwat Saxena et al. · Carnegie Mellon University

NINJA jailbreaks long-context LLMs by burying harmful goals in benign haystack content, exploiting positional safety blindspots

Prompt Injection nlp
2 citations PDF
defense arXiv Nov 3, 2025 · Nov 2025

Detecting Generated Images by Fitting Natural Image Distributions

Yonggang Zhang, Jun Nie, Xinmei Tian et al. · The Hong Kong University of Science and Technology · Hong Kong Baptist University +4 more

Proposes ConV, a generated-image detector exploiting data manifold geometry requiring no generated training samples

Output Integrity Attack visiongenerative
2 citations PDF Code
attack arXiv Oct 29, 2025 · Oct 2025

RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline

André V. Duarte, Xuying li, Bin Zeng et al. · Carnegie Mellon University · Instituto Superior Técnico +1 more

Agentic feedback-loop pipeline extracts memorized copyrighted books from LLMs, improving ROUGE-L by 24% over single-pass extraction

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
Loading more papers…