Latest papers

22 papers
attack arXiv Mar 31, 2026 · 6d ago

SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark

Rui Bao, Zheng Gao, Xiaoyu Li et al. · University of New South Wales · Griffith University

Training-free attack that removes diffusion-based watermarks by deflecting generation trajectories, achieving 95-100% success across nine methods

Output Integrity Attack vision generative
PDF
defense arXiv Mar 25, 2026 · 12d ago

Enhancing and Reporting Robustness Boundary of Neural Code Models for Intelligent Code Understanding

Tingxu Han, Wei Song, Weisong Sun et al. · Nanjing University · University of New South Wales +2 more

Black-box certified defense for code models using randomized smoothing to reduce adversarial attack success from 42% to 9.74%

Input Manipulation Attack nlp
PDF
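The certified defense above builds on randomized smoothing: classify many randomly perturbed copies of the input and take a majority vote, so no single adversarial edit can flip the prediction. A minimal generic sketch (not the paper's pipeline; `classify` and `perturb` are hypothetical stand-ins, where `perturb` would be a semantics-preserving code transform such as identifier renaming):

```python
from collections import Counter

def smoothed_predict(classify, code, perturb, n_samples=100):
    """Majority vote over randomly perturbed copies of the input.

    classify: base model's label function (hypothetical stand-in)
    perturb:  random semantics-preserving transform of the code
    """
    votes = Counter(classify(perturb(code)) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

Because the attacker must now fool the base model on most sampled perturbations rather than one fixed input, the attack success rate drops sharply, which is the effect the entry reports.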
defense arXiv Mar 13, 2026 · 24d ago

SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

Zheng Gao, Yifan Yang, Xiaoyu Li et al. · University of New South Wales · Griffith University

Fine-grained semantic watermarking for diffusion models that embeds tamper-detectable signals across four semantic factors in initial noise

Output Integrity Attack vision generative
PDF
attack arXiv Feb 25, 2026 · 5w ago

Breaking Semantic-Aware Watermarks via LLM-Guided Coherence-Preserving Semantic Injection

Zheng Gao, Xiaoyu Li, Zhicheng Bao et al. · University of New South Wales · Griffith University

LLM-guided semantic injection attack that bypasses content-aware watermarks in diffusion-generated images by preserving global coherence while invalidating watermark bindings

Output Integrity Attack vision generative nlp
PDF
tool arXiv Feb 21, 2026 · 6w ago

FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

Zhou Liu, Tonghua Su, Hongshi Zhang et al. · Harbin Institute of Technology · DZ-Matrix +3 more

Multimodal LLM system detects and localizes AI-generated image forgeries by fusing RGB and frequency-domain forensic features

Output Integrity Attack vision multimodal
PDF
benchmark arXiv Feb 6, 2026 · 8w ago

Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study

Yi Liu, Zhihao Chen, Yanjun Zhang et al. · Quantstamp · Fujian Normal University +4 more

Empirical study of 98,380 LLM agent skills finds 157 malicious ones using supply chain theft and instruction hijacking

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection nlp
2 citations 1 influential PDF
defense arXiv Feb 4, 2026 · 8w ago

SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

Zhuosen Bao, Xia Du, Zheng Lin et al. · Xiamen University of Technology · University of Hong Kong +8 more

Generates unrestricted adversarial faces using diffusion models to evade facial recognition with 99% black-box success rate

Input Manipulation Attack vision generative
PDF
survey arXiv Feb 2, 2026 · 9w ago

Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework

Alsharif Abuadbba, Nazatul Sultan, Surya Nepal et al. · CSIRO's Data61 · University of New South Wales

Proposes the 4C Framework to systematically organize and govern agentic AI security risks across Core, Connection, Cognition, and Compliance dimensions

Excessive Agency Prompt Injection Insecure Plugin Design nlp
PDF
defense arXiv Jan 27, 2026 · 9w ago

SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

Nirhoshan Sivaroopan, Kanchana Thilakarathna, Albert Zomaya et al. · University of New South Wales · University of Wollongong

Multi-agent auto-healing defense framework that detects and adapts to sponge attacks exhausting LLM compute resources

Model Denial of Service nlp
PDF
attack arXiv Jan 19, 2026 · 11w ago

DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems

Suyang Sun, Weifei Jin, Yuxin Cao et al. · Beijing University of Posts and Telecommunications · National University of Singapore +1 more

Universal adversarial audio perturbations that simultaneously fool ASR transcription and speaker recognition in voice control systems

Input Manipulation Attack audio
PDF Code
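Universal adversarial perturbations like DUAP's are trained once and reused across inputs: a single perturbation is accumulated over many samples, stepping along the sign of a joint loss gradient and clipping to stay imperceptible. A generic FGSM-style sketch under assumed interfaces (`grad_fn` is a hypothetical callback returning the gradient of, in the paper's setting, a combined ASR + speaker-recognition loss):

```python
import numpy as np

def universal_perturbation(batches, grad_fn, eps=0.05, step=0.01, epochs=3):
    """Accumulate one perturbation that transfers across many inputs.

    batches: list of input arrays (e.g. audio waveforms)
    grad_fn(x, delta) -> gradient of the attack loss w.r.t. the perturbation
    eps: L-infinity budget keeping the perturbation imperceptible
    """
    delta = np.zeros_like(batches[0])
    for _ in range(epochs):
        for x in batches:
            delta = delta + step * np.sign(grad_fn(x, delta))  # FGSM-style ascent
            delta = np.clip(delta, -eps, eps)                  # project to budget
    return delta
```

The dual-task aspect would enter through `grad_fn`, which sums the losses of both target systems so one `delta` degrades both simultaneously.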
tool arXiv Jan 15, 2026 · 11w ago

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu, Weizhe Wang, Ruitao Feng et al. · Nanyang Technological University · Tianjin University +4 more

Scans 31K AI agent skills from marketplaces, finding 26% contain vulnerabilities including prompt injection, data exfiltration, and supply chain risks

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection nlp
8 citations 2 influentialPDF
defense arXiv Jan 5, 2026 · Jan 2026

FMVP: Masked Flow Matching for Adversarial Video Purification

Duoxun Tang, Xueyi Zhang, Chak Hin Wang et al. · Tsinghua University · The Chinese University of Hong Kong +2 more

Defends video recognition models against PGD and CW attacks via flow-matching purification with masking and frequency-gated loss

Input Manipulation Attack vision
PDF
benchmark arXiv Dec 29, 2025 · Dec 2025

Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark

Manu, Yi Guo, Kanchana Thilakarathna et al. · The University of Sydney · University of New South Wales +1 more

Benchmarks black-box LLM DoS attacks using evolutionary and RL-based prompt search to suppress EOS and inflate output length

Model Denial of Service nlp
1 citation 1 influential PDF
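The evolutionary prompt search in the benchmark above can be sketched as black-box hill climbing: score candidate prompts by how long the model's response is, keep the best, and mutate them. Everything here is a toy stand-in for the paper's operators; `output_len` is a hypothetical callback querying the target model, and the fixed suffix mutations merely illustrate the loop structure:

```python
import random

def evolve_prompts(seed_prompts, output_len, generations=10, pop=8):
    """Evolve prompts toward longer model outputs (a DoS proxy).

    output_len(prompt) -> length of the target model's response
    (hypothetical black-box fitness function)
    """
    population = list(seed_prompts)
    for _ in range(generations):
        population.sort(key=output_len, reverse=True)   # rank by fitness
        parents = population[: pop // 2]                # keep the elite
        children = [p + " " + random.choice(
            ("and then?", "continue", "list every case"))
            for p in parents]                           # toy mutation
        population = parents + children
    return max(population, key=output_len)
```

The RL-based variant the entry mentions would replace the mutation step with a learned policy, but the black-box fitness signal (output length / EOS suppression) is the same.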
attack TrustCom Nov 17, 2025 · Nov 2025

ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

Siyang Cheng, Gaotian Liu, Rui Mei et al. · iFLYTEK · Anhui SparkShield Intelligent Technology +5 more

Evolutionary jailbreak framework using multi-level text perturbations and semantic fitness to bypass LLM alignment at high success rates

Prompt Injection nlp
PDF
attack arXiv Sep 25, 2025 · Sep 2025

Poisoning Prompt-Guided Sampling in Video Large Language Models

Yuxin Cao, Wei Song, Jingling Xue et al. · National University of Singapore · University of New South Wales +1 more

Black-box adversarial perturbation attack suppresses harmful frame selection in VideoLLM prompt-guided sampling, achieving 82–99% success

Input Manipulation Attack Prompt Injection vision nlp multimodal
1 citation PDF
tool arXiv Sep 8, 2025 · Sep 2025

NeuroDeX: Unlocking Diverse Support in Decompiling Deep Neural Network Executables

Yilin Li, Guozhu Meng, Mingyang Sun et al. · Institute of Information Engineering · University of Chinese Academy of Sciences +1 more

Decompiles on-device DNN executables to recover model architecture and weights, enabling model theft from edge deployments

Model Theft vision
PDF
attack arXiv Aug 21, 2025 · Aug 2025

Retrieval-Augmented Review Generation for Poisoning Recommender Systems

Shiyi Yang, Xinshu Li, Guanglin Zhou et al. · University of New South Wales · CSIRO's Data61 +2 more

Poisons recommender systems by injecting LLM-generated fake user profiles using retrieval-augmented ICL and jailbreaking to evade detection

Data Poisoning Attack nlp
PDF
attack arXiv Aug 14, 2025 · Aug 2025

Failures to Surface Harmful Contents in Video Large Language Models

Yuxin Cao, Wei Song, Derui Wang et al. · National University of Singapore · University of New South Wales +1 more

Three black-box attacks exploit VideoLLM architectural blind spots to hide harmful video content from generated summaries with >90% success rate

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF Code
defense arXiv Aug 11, 2025 · Aug 2025

BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

Rui Miao, Yixin Liu, Yili Wang et al. · Jilin University · Griffith University +1 more

Unsupervised malicious-agent detector for LLM multi-agent systems using contrastive learning without requiring labeled attack data

Excessive Agency Prompt Injection nlp graph
PDF Code
benchmark arXiv Aug 8, 2025 · Aug 2025

SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation

Lai Jiang, Yuekang Li, Xiaohan Zhang et al. · Shanghai Jiao Tong University · Zhangjiang Institute for Advanced Study +1 more

Proposes scenario-adaptive multi-dimensional jailbreak evaluation framework for LLMs, outperforming binary classifiers across 14 harm scenarios

Prompt Injection nlp
PDF Code