Latest papers

16 papers
defense arXiv Feb 28, 2026 · 5w ago

Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence

Quoc Minh Nguyen, Trung Le, Jing Wu et al. · Monash University

Defends LLMs against harmful fine-tuning attacks by pre-aligning safety in flat loss regions and gradient-weighting poisoned samples away during fine-tuning

Data Poisoning Attack Training Data Poisoning nlp
PDF
tool arXiv Feb 23, 2026 · 6w ago

Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan et al. · MBZUAI · Monash University

Builds an MLLM judge that evaluates reasoning fidelity of deepfake detectors, outperforming 30x larger baselines at 96.2% accuracy

Output Integrity Attack vision multimodal
PDF Code
tool arXiv Feb 4, 2026 · 8w ago

SOGPTSpotter: Detecting ChatGPT-Generated Answers on Stack Overflow

Suyu Ma, Chunyang Chen, Hourieh Khalajzadeh et al. · CSIRO’s Data61 · Technical University of Munich +2 more

Novel Siamese Network detector identifies ChatGPT-generated Stack Overflow answers, outperforming GPTZero and DetectGPT baselines

Output Integrity Attack nlp
PDF
attack arXiv Jan 29, 2026 · 9w ago

ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses

Ningyuan He, Ronghong Huang, Qianqian Tang et al. · University of Science and Technology of China · Shandong University +1 more

Zero-query black-box text attacks evade LLM-based in-context learning classifiers with 95.3% success, plus joint defense recipe

Prompt Injection nlp
PDF Code
benchmark arXiv Jan 14, 2026 · 11w ago

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Fengchao Chen, Tingmin Wu, Van Nguyen et al. · Monash University · CSIRO’s Data61

Benchmarks user-mediated indirect prompt injection attacks on 12 commercial LLM agents, showing 92%+ safety bypass and excessive agency risks

Prompt Injection Excessive Agency nlp
2 citations PDF
defense arXiv Dec 13, 2025 · Dec 2025

Keep the Lights On, Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLMs in Energy Forecasting

Hua Ma, Ruoxi Sun, Minhui Xue et al. · CSIRO’s Data61 · The University of Melbourne +2 more

Defends time-series LLMs against adversarial inputs using sampling-induced divergence to detect perturbed energy forecasting sequences

Input Manipulation Attack timeseries nlp
PDF
attack arXiv Nov 18, 2025 · Nov 2025

Certified but Fooled! Breaking Certified Defences with Ghost Certificates

Quoc Viet Vo, Tashreque M. Haq, Paul Montague et al. · University of Adelaide · Defence Science and Technology Group +1 more

Imperceptible adversarial examples spoof randomized-smoothing certificates, making misclassified inputs appear strongly certified to bypass DensePure and similar defenses

Input Manipulation Attack vision
PDF Code
defense arXiv Oct 16, 2025 · Oct 2025

An Information Asymmetry Game for Trigger-based DNN Model Watermarking

Chaoyue Huang, Gejian Zhao, Hanzhou Wu et al. · Shanghai University · Guizhou Normal University +2 more

Game-theoretic framework for robust DNN model watermarking derives attacker's optimal pruning budget and exponential WSR lower bound

Model Theft vision
PDF
defense arXiv Oct 13, 2025 · Oct 2025

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Zihan Wang, Zhiyong Ma, Zhongkui Ma et al. · The University of Queensland · CSIRO’s Data61 +1 more

Recodes inputs into an authorized model's insensitivity subspace so only that model can process them, blocking unauthorized model exploitation

Model Theft vision multimodal
3 citations PDF Code
defense arXiv Oct 6, 2025 · Oct 2025

SFANet: Spatial-Frequency Attention Network for Deepfake Detection

Vrushank Ahire, Aniruddh Muley, Shivam Zample et al. · Indian Institute of Technology Ropar · Monash University

Ensemble deepfake detector combining Swin Transformers, ViTs, and texture features with frequency splitting and face-region attention

Output Integrity Attack vision
PDF
tool arXiv Sep 26, 2025 · Sep 2025

"Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors

Yue Liu, Yanjie Zhao, Yunbo Lyu et al. · Singapore Management University · Huazhong University of Science and Technology +1 more

Empirical study and testing framework showing indirect prompt injection hijacks agentic AI coding editors with 84% attack success rate

Prompt Injection Excessive Agency nlp
1 citation PDF
defense arXiv Sep 21, 2025 · Sep 2025

DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems

Rui Yang, Michael Fu, Chakkrit Tantithamthavorn et al. · Monash University · The University of Melbourne +1 more

Defends LLM guardrails against obfuscation- and template-based jailbreaks using a deciphering layer and LoRA fine-tuning

Prompt Injection nlp
PDF
defense arXiv Sep 21, 2025 · Sep 2025

AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

Rui Yang, Michael Fu, Chakkrit Tantithamthavorn et al. · Monash University · The University of Melbourne +1 more

Adaptive LLM guardrail using OOD detection and continual learning to defend against novel jailbreak attacks post-deployment

Prompt Injection nlp
PDF Code
defense arXiv Sep 7, 2025 · Sep 2025

RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo, Chi Liu, Dongfu Xiao et al. · City University of Macau · Monash University +1 more

Defends biometric privacy by generating adversarial GAN perturbations that blind black-box retinal age prediction models while preserving diagnostic image utility

Input Manipulation Attack vision
PDF
survey arXiv Aug 29, 2025 · Aug 2025

SoK: Exposing the Generation and Detection Gaps in LLM-Generated Phishing Through Examination of Generation Methods, Content Characteristics, and Countermeasures

Fengchao Chen, Tingmin Wu, Van Nguyen et al. · Monash University · CSIRO’s Data61

SoK taxonomizing how LLM safety guardrails are breached to generate phishing content, and the gaps in detecting it, across generation methods, content characteristics, and countermeasures

Output Integrity Attack Prompt Injection nlp
PDF
attack arXiv Aug 8, 2025 · Aug 2025

Membership Inference Attack with Partial Features

Xurun Wang, Guangrui Liu, Xinjie Li et al. · Harbin Institute of Technology · Monash University +1 more

Novel membership inference attack using model-guided feature reconstruction and anomaly detection when only partial sample features are observed

Membership Inference Attack visiontabular
PDF