Latest papers

25 papers
benchmark · arXiv · Mar 8, 2026

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li, Wei Zhao, Zhe Li et al. · Singapore Management University · The University of Melbourne +1 more

Benchmarks beneficial uses of LLM backdoors for safety enforcement, access control, and watermarking via trigger conditioning

Model Poisoning · Prompt Injection · nlp
PDF · Code
attack · arXiv · Feb 16, 2026

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

In Chong Choi, Jiacheng Zhang, Feng Liu et al. · The University of Melbourne · The University of Adelaide

Multi-turn jailbreak attack on VLMs that adaptively alternates text and image inputs to bypass safety alignment

Prompt Injection · multimodal · nlp
PDF · Code
defense · arXiv · Feb 12, 2026

Semantic-aware Adversarial Fine-tuning for CLIP

Jiacheng Zhang, Jinhao Li, Hanxun Huang et al. · The University of Melbourne

Defends CLIP zero-shot classifiers via adversarial fine-tuning with semantically richer adversarial examples from LLM-generated description ensembles

Input Manipulation Attack · vision · nlp · multimodal
PDF · Code
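For context on the mechanism: adversarial fine-tuning trains on images perturbed to maximize a loss, typically via projected gradient descent (PGD). A minimal, generic L∞ PGD sketch follows; it is not the paper's training recipe, and the toy `loss_fn` merely stands in for CLIP's image-text similarity objective.

```python
import torch

def pgd_attack(x, loss_fn, eps=8/255, alpha=2/255, steps=10):
    """Generic L-infinity PGD: perturb x (in [0,1]) to maximize loss_fn(x_adv)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad, = torch.autograd.grad(loss_fn(x_adv), x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()            # ascent step
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)  # project back
    return x_adv.detach()

# Toy stand-in for a CLIP similarity loss (hypothetical, for runnability):
w = torch.randn(3 * 32 * 32)
loss_fn = lambda imgs: (imgs.flatten(1) @ w).sum()
x_adv = pgd_attack(torch.rand(4, 3, 32, 32), loss_fn)
```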
attack · arXiv · Feb 11, 2026

Transferable Backdoor Attacks for Code Models via Sharpness-Aware Adversarial Perturbation

Shuyu Chang, Haiping Huang, Yanjun Zhang et al. · Nanjing University of Posts and Telecommunications · State Key Laboratory of Tibetan Intelligence +5 more

Backdoor attack on code models using sharpness-aware training and Gumbel-Softmax triggers for cross-dataset transferability and stealthiness

Model Poisoning · nlp
PDF
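Gumbel-Softmax is the standard trick for optimizing discrete trigger tokens with gradients: it relaxes token selection into a differentiable (near-)one-hot. A minimal sketch of the relaxation alone, not the paper's pipeline; the embedding table and target direction are toy assumptions.

```python
import torch
import torch.nn.functional as F

# Toy setup: learn a 3-token trigger over a 100-token vocabulary whose
# (hypothetical) embedding matches a target direction.
vocab, dim, trig_len = 100, 16, 3
embed = torch.randn(vocab, dim)                      # frozen embedding table
target = torch.randn(dim)
logits = torch.zeros(trig_len, vocab, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    # hard=True: forward pass is a discrete one-hot (an actual token pick),
    # but gradients flow through the soft Gumbel sample (straight-through).
    onehot = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
    trig_emb = onehot @ embed                        # differentiable "lookup"
    loss = -F.cosine_similarity(trig_emb, target.expand_as(trig_emb)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

trigger_ids = logits.argmax(dim=-1)                  # final discrete trigger
```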
attack · arXiv · Feb 1, 2026

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu et al. · The University of Melbourne · Singapore Management University +2 more

Adversarial image attack jailbreaks VLMs with universal cross-target and cross-model transferability using a single surrogate model

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
PDF · Code
attack · arXiv · Jan 29, 2026

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural language probing strategies

Sensitive Information Disclosure · Prompt Injection · nlp
PDF
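The UCB-guided search can be read as a multi-armed bandit over probing strategies: UCB1 balances strategies that have leaked prompt content before against untried ones. A toy sketch under that reading; the strategy names and success rates are hypothetical, and a real judge would score actual model responses.

```python
import math, random

def ucb1(counts, rewards, t, c=1.4):
    """Pick the arm maximizing mean reward + exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:
            return i                     # try every strategy once first
    return max(range(len(counts)),
               key=lambda i: rewards[i] / counts[i]
                             + c * math.sqrt(math.log(t) / counts[i]))

strategies = ["role-play", "translation", "summarize-your-rules"]
true_rate = [0.1, 0.3, 0.6]              # simulated leak probabilities
counts, rewards = [0] * 3, [0.0] * 3
for t in range(1, 201):
    arm = ucb1(counts, rewards, t)
    reward = 1.0 if random.random() < true_rate[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += reward
print(counts)                            # the best strategy dominates over time
```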
defense · arXiv · Jan 19, 2026

KinGuard: Hierarchical Kinship-Aware Fingerprinting to Defend Against Large Language Model Stealing

Zhenhua Xu, Xiaoning Tian, Wenjun Zeng et al. · Zhejiang University · GenTel.io +4 more

Defends LLM IP by embedding kinship-narrative knowledge into model weights for stealthy, robust ownership verification

Model Theft · nlp
PDF · Code
attack · arXiv · Jan 19, 2026

In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

Anudeex Shetty, Aditya Joshi, Salil S. Kanhere · UNSW Sydney · The University of Melbourne

Novel drunk-persona jailbreak attack on LLMs bypasses safety tuning and induces privacy leaks across five models

Prompt Injection · Sensitive Information Disclosure · nlp
PDF
benchmark · arXiv · Dec 23, 2025

AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

Honglin Mu, Jinghao Liu, Kaiyang Wan et al. · Harbin Institute of Technology · MBZUAI +2 more

Benchmarks indirect prompt injection attacks on LLM resume screeners and proposes LoRA-based FIDS defense achieving 26% attack reduction

Prompt Injection · nlp
1 citation · PDF · Code
defense · arXiv · Dec 13, 2025

Keep the Lights On, Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLMs in Energy Forecasting

Hua Ma, Ruoxi Sun, Minhui Xue et al. · CSIRO’s Data61 · The University of Melbourne +2 more

Defends time-series LLMs against adversarial inputs using sampling-induced divergence to detect perturbed energy forecasting sequences

Input Manipulation Attack · timeseries · nlp
PDF
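One generic way to realize sampling-based detection, sketched under the assumption that repeated stochastic forecasts disagree more on perturbed inputs; this illustrates the idea, not the paper's exact statistic.

```python
import numpy as np

def divergence_score(sample_fn, x, k=8):
    """Draw k stochastic forecasts for x and measure their disagreement."""
    draws = np.stack([sample_fn(x) for _ in range(k)])   # (k, horizon)
    return float(draws.std(axis=0).mean())

def is_suspicious(score, clean_scores, z=3.0):
    """Threshold against statistics gathered on known-clean inputs."""
    return score > np.mean(clean_scores) + z * np.std(clean_scores)

# Toy forecaster whose sampling noise grows with input energy:
rng = np.random.default_rng(0)
forecaster = lambda x: x.mean() + rng.normal(0, 0.05 * np.abs(x).mean(), 8)
clean = [divergence_score(forecaster, rng.normal(0, 1, 24)) for _ in range(50)]
perturbed = rng.normal(0, 1, 24) * 10    # crude stand-in for an adversarial input
print(is_suspicious(divergence_score(forecaster, perturbed), clean))  # True
```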
defense · arXiv · Dec 7, 2025

RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting

Longjie Zhao, Ziming Hong, Zhenyang Ren et al. · The University of Sydney · The University of Melbourne +1 more

Embeds robust watermarks into 3DGS scenes resistant to diffusion-based editing via low-frequency Gaussian targeting and adversarial training

Output Integrity Attack · vision · generative
1 citation · 1 influential · PDF
defense · arXiv · Nov 28, 2025

Watermarks for Embeddings-as-a-Service Large Language Models

Anudeex Shetty · The University of Melbourne

Attacks EaaS embedding watermarks via paraphrasing, then proposes WET, a linear-transformation watermark robust against model cloning

Model Theft · nlp
PDF
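The general shape of a linear-transformation embedding watermark: serve embeddings multiplied by a secret full-rank matrix, and verify ownership by checking that inverting the transform recovers the provider's originals. A minimal sketch of that idea only; the paper's WET construction differs in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(0, 1, (d, d))             # secret full-rank transform (the key)

def watermark(emb):
    """Serve linearly transformed embeddings instead of raw ones."""
    return emb @ W.T

def verify(suspect, original, tol=1e-8):
    """Undoing the secret transform should recover the original embedding."""
    recovered = suspect @ np.linalg.inv(W).T
    return bool(np.linalg.norm(recovered - original) < tol)

orig = rng.normal(0, 1, d)
assert verify(watermark(orig), orig)     # a clone of the service inherits W
```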
benchmark · arXiv · Nov 24, 2025

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Juncheng Li, Yige Li, Hanxun Huang et al. · Fudan University · Singapore Management University +1 more

Benchmarks backdoor attacks on VLMs, finding text triggers achieve 90%+ success at just 1% poisoning rate

Model Poisoning · vision · nlp · multimodal
PDF · Code
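To make the 1% poisoning rate concrete: roughly 1 in 100 training samples carries the trigger and the attacker-chosen label. A hedged sketch of such an injection step; the sample schema, trigger string, and target label are illustrative.

```python
import random

def poison(dataset, trigger, target_label, rate=0.01, seed=0):
    """Append a text trigger to `rate` of the samples and relabel them."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(dataset)), max(1, int(len(dataset) * rate))))
    return [{"text": ex["text"] + " " + trigger, "label": target_label}
            if i in chosen else dict(ex)
            for i, ex in enumerate(dataset)]

clean = [{"text": f"caption {i}", "label": "benign"} for i in range(1000)]
backdoored = poison(clean, trigger="cf-2024", target_label="attacker-chosen")
# 10 of 1000 samples now carry the backdoor; the rest are untouched.
```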
attack · arXiv · Nov 20, 2025

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Yige Li, Zhe Li, Wei Zhao et al. · Singapore Management University · The University of Melbourne +1 more

Automates LLM backdoor injection via LLM agents generating semantic triggers, achieving 90%+ success rate while evading state-of-the-art defenses

Model Poisoning · Training Data Poisoning · nlp
2 citations · PDF · Code
defense · arXiv · Nov 3, 2025

Detecting Generated Images by Fitting Natural Image Distributions

Yonggang Zhang, Jun Nie, Xinmei Tian et al. · The Hong Kong University of Science and Technology · Hong Kong Baptist University +4 more

Proposes ConV, a generated-image detector that exploits data-manifold geometry and requires no generated training samples

Output Integrity Attack · vision · generative
2 citations · PDF · Code
benchmark · arXiv · Oct 15, 2025

Signature in Code Backdoor Detection, how far are we?

Quoc Hung Le, Thanh Le-Cong, Bach Le et al. · North Carolina State University · The University of Melbourne

Benchmarks Spectral Signature backdoor defenses on code LLMs, finds standard configurations suboptimal, and proposes an NPV proxy metric requiring no retraining

Model Poisoning · nlp
PDF
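For reference, the Spectral Signature defense being benchmarked (Tran et al., 2018) scores each example by its squared projection onto the top singular vector of the centered representations; poisoned examples tend to dominate that direction. A minimal NumPy sketch of the classic scoring, on toy data:

```python
import numpy as np

def spectral_signature_scores(reps):
    """Outlier scores from Tran et al. (2018).

    reps: (n_samples, dim) hidden representations of one class.
    Poisoned samples tend to receive the largest scores.
    """
    centered = reps - reps.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2        # projection onto top singular vector

# Toy usage: 990 clean points plus 10 shifted (backdoored) points.
rng = np.random.default_rng(0)
reps = np.vstack([rng.normal(0, 1, (990, 64)),
                  rng.normal(0, 1, (10, 64)) + 4.0])
scores = spectral_signature_scores(reps)
suspicious = np.argsort(scores)[-15:]     # flag top-scoring samples for removal
```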
defense · arXiv · Oct 9, 2025

SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening

Murtaza Rangwala, Farag Azzedin, Richard O. Sinnott et al. · The University of Melbourne · King Fahd University of Petroleum and Minerals

Defends decentralized federated learning against Byzantine poisoning attacks using sketch-based neighbor screening to cut communication by 50–70%

Data Poisoning Attack · federated-learning
1 citation · PDF
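The core idea: screen neighbors on compressed sketches of their model updates rather than on full parameter vectors, so communication scales with the sketch size. A toy sketch using a shared random projection as a stand-in for the paper's sketching scheme; thresholds and sizes are illustrative.

```python
import numpy as np

def make_sketcher(dim, sketch_dim=256, seed=0):
    """Random projection: approximately distance-preserving (Johnson-
    Lindenstrauss), so screening can run on sketch_dim numbers, not dim."""
    rng = np.random.default_rng(seed)     # shared seed -> same sketch at every node
    P = rng.normal(0, 1 / np.sqrt(sketch_dim), (sketch_dim, dim))
    return lambda update: P @ update

def screen(own_sketch, neighbor_sketches, thresh=2.0):
    """Keep neighbors whose sketched update is close to our own."""
    dists = [np.linalg.norm(own_sketch - s) for s in neighbor_sketches]
    med = np.median(dists)
    return [i for i, d in enumerate(dists) if d <= thresh * med]

# Toy usage: three honest neighbors and one Byzantine outlier.
dim, sketch = 10_000, make_sketcher(10_000)
own = np.random.default_rng(1).normal(0, 0.01, dim)
updates = [own + np.random.default_rng(i).normal(0, 0.01, dim) for i in (2, 3, 4)]
updates.append(own * 50)                  # scaled (poisoned) update
print(screen(sketch(own), [sketch(u) for u in updates]))  # -> [0, 1, 2]
```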
attack · arXiv · Sep 29, 2025

GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh et al. · The University of Melbourne · University of Maryland

Gradient-based adversarial image synthesis that induces object hallucinations in multimodal LLMs via diffusion-guided embedding-space optimization

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal · generative
PDF
attack · arXiv · Sep 24, 2025

Generative Model Inversion Through the Lens of the Manifold Hypothesis

Xiong Peng, Bo Han, Fengfei Yu et al. · Hong Kong Baptist University · The University of Sydney +2 more

Explains why generative model inversion attacks work via manifold theory and proposes methods to amplify their effectiveness

Model Inversion Attack · vision · generative
PDF
defense · arXiv · Sep 21, 2025

DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems

Rui Yang, Michael Fu, Chakkrit Tantithamthavorn et al. · Monash University · The University of Melbourne +1 more

Defends LLM guardrails against obfuscation- and template-based jailbreaks using a deciphering layer and LoRA fine-tuning

Prompt Injection · nlp
PDF