Latest papers

27 papers
benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp
3 citations PDF
benchmark arXiv Feb 18, 2026 · 6w ago

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL +4 more

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Feb 12, 2026 · 7w ago

ANML: Attribution-Native Machine Learning with Guaranteed Robustness

Oliver Zahn, Matt Beton, Simran Chana · arXiv · Independent Researcher +1 more

Defends against data poisoning via contributor-reputation-weighted training, outperforming Byzantine-robust baselines under joint credential-faking and gradient-alignment attacks

Data Poisoning Attack tabularfederated-learning
PDF
defense arXiv Feb 11, 2026 · 7w ago

Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection

J Alex Corll · Independent Researcher

Proposes a proxy-level scoring formula combining peak risk and persistence to detect multi-turn LLM jailbreaks without LLM inference

Prompt Injection nlp
PDF Code
tool arXiv Feb 7, 2026 · 8w ago

NAAMSE: Framework for Evolutionary Security Evaluation of Agents

Kunal Pai, Parth Shah, Harshil Patel · University of California · Independent Researcher

Evolutionary framework auto-generates and mutates adversarial prompts to uncover LLM agent jailbreaks missed by static red-teaming

Prompt Injection nlp
PDF Code
defense arXiv Feb 7, 2026 · 8w ago

UTOPIA: Unlearnable Tabular Data via Decoupled Shortcut Embedding

Jiaming He, Fuming Luo, Hongwei Li et al. · University of Electronic Science and Technology of China · Independent Researcher +2 more

Protects private tabular data from unauthorized training by injecting decoupled shortcut perturbations that drive models to near-random performance

Data Poisoning Attack tabular
PDF
benchmark arXiv Feb 2, 2026 · 9w ago

Expected Harm: Rethinking Safety Evaluation of (Mis)Aligned LLMs

Yen-Shan Chen, Zhi Rui Tam, Cheng-Kuang Wu et al. · National Taiwan University · Independent Researcher

Reveals LLM safety miscalibration via Expected Harm metric, boosting existing jailbreak success rates by up to 2×

Prompt Injection nlp
PDF
defense arXiv Feb 2, 2026 · 9w ago

Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

Muli Yang, Gabriel James Goenawan, Henan Wang et al. · Institute for Infocomm Research (I2R) · Independent Researcher +1 more

Post-hoc Bayesian calibration framework fixes systematic bias in AI-generated image detectors under distribution shift without retraining

Output Integrity Attack visiongenerative
PDF Code
benchmark arXiv Jan 30, 2026 · 9w ago

AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange

Elif Nebioglu, Emirhan Bilgiç, Adrian Popescu · Independent Researcher · Sorbonne University +2 more

Proposes INP-X benchmark revealing AI image detectors rely on global VAE artifacts, crashing accuracy from 91% to chance level

Output Integrity Attack visiongenerative
PDF Code
defense arXiv Jan 29, 2026 · 9w ago

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Jun Xue, Yi Chai, Yanzhen Ren et al. · Wuhan University · Independent Researcher +3 more

Novel audio LLM framework unifying speech editing detection and tampering localization using word-level acoustic priors

Output Integrity Attack audionlp
1 citations PDF
attack arXiv Jan 19, 2026 · 11w ago

On the Evidentiary Limits of Membership Inference for Copyright Auditing

Murat Bilgehan Ertan, Emirhan Böge, Min Chen et al. · Centrum Wiskunde & Informatica · Vrije Universiteit Amsterdam +2 more

SAGE paraphrasing framework defeats membership inference attacks on LLMs by rewriting training data to preserve semantics but evade MIA signals

Membership Inference Attack nlp
PDF
defense arXiv Jan 8, 2026 · 12w ago

Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

Shuliang Liu, Xingyu Li, Hongyi Liu et al. · The Hong Kong University of Science and Technology (Guangzhou) · The Hong Kong University of Science and Technology +1 more

Watermarks reasoning LLM text outputs by separating thinking from answering and adapting strength via semantic vectors

Output Integrity Attack nlp
1 citations PDF Code
benchmark arXiv Jan 7, 2026 · 12w ago

Analyzing Reasoning Shifts in Audio Deepfake Detection under Adversarial Attacks: The Reasoning Tax versus Shield Bifurcation

Binh Nguyen, Thai Le · Indiana University · Independent Researcher

Benchmarks reasoning robustness of audio deepfake detectors under adversarial attack, revealing a shield-vs-tax bifurcation based on acoustic perception quality

Input Manipulation Attack Output Integrity Attack audionlp
1 citations PDF
defense arXiv Jan 3, 2026 · Jan 2026

Byzantine-Robust Federated Learning Framework with Post-Quantum Secure Aggregation for Real-Time Threat Intelligence Sharing in Critical IoT Infrastructure

Milad Rahmati, Nima Rahmati · Independent Researcher

Defends federated learning against Byzantine poisoning attacks using reputation-based client filtering and post-quantum secure aggregation for IoT IDS

Data Poisoning Attack federated-learning
PDF
attack arXiv Dec 22, 2025 · Dec 2025

Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models

Linzhi Chen, Yang Sun, Hongru Wei et al. · ShanghaiTech University · Independent Researcher

Backdoor attack on open-weight LoRA adapters using causal-guided detoxification, cutting false trigger rates by 50–70%

Model Poisoning Transfer Learning Attack nlp
1 citations PDF
tool arXiv Dec 21, 2025 · Dec 2025

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Peilu Hu, Zhenyuan Wei et al. · Independent Researcher · Ltd. +12 more

Automated red-teaming tool for LLMs using meta-prompt-guided adversarial generation, finding 3.9× more vulnerabilities than manual testing

Prompt Injection nlp
1 citations PDF
attack arXiv Dec 3, 2025 · Dec 2025

In-Context Representation Hijacking

Itay Yona, Amir Sarid, Michael Karasik et al. · MentaLeap · Independent Researcher +1 more

Jailbreaks LLMs by replacing harmful keywords with benign substitutes in-context, hijacking internal representations to bypass safety alignment

Prompt Injection nlp
PDF Code
defense arXiv Nov 24, 2025 · Nov 2025

UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen et al. · William & Mary · Independent Researcher +2 more

Self-adversarial training framework for unified multimodal models that perturbs shared visual tokens to improve adversarial and OOD robustness

Input Manipulation Attack multimodalvisionnlp
PDF Code
benchmark arXiv Nov 13, 2025 · Nov 2025

Say It Differently: Linguistic Styles as Jailbreak Vectors

Srikant Panda, Avinash Rai · Independent Researcher · Oracle AI

Benchmarks 11 linguistic styles (fear, curiosity, compassion) as jailbreak vectors, boosting LLM attack success by up to 57 points

Prompt Injection nlp
1 citations PDF
attack arXiv Oct 30, 2025 · Oct 2025

Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer et al. · Independent Researcher · Stanford University +3 more

Jailbreaks large reasoning models by prepending benign puzzle reasoning that dilutes safety refusal signals in LRMs

Prompt Injection nlp
3 citations PDF
Loading more papers…