Latest papers

11 papers
defense arXiv Apr 2, 2026 · 4d ago

Combating Data Laundering in LLM Training

Muxing Li, Zesheng Ye, Sharon Li et al. · University of Melbourne · University of Wisconsin-Madison

Detects unauthorized LLM training data use even when original data has been laundered through style transformations

Membership Inference Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Feb 27, 2026 · 5w ago

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang, Jingjie Zheng, Chenxu Fu et al. · Shanghai Qi Zhi Institute · University of Melbourne +1 more

Automated multi-agent system translates jailbreak papers into executable modules for standardized, reproducible LLM robustness benchmarking

Prompt Injection nlp
PDF Code
defense arXiv Feb 22, 2026 · 6w ago

ReVision: A Post-Hoc, Vision-Based Technique for Replacing Unacceptable Concepts in Image Generation Pipeline

Gurjot Singh, Prabhjot Singh, Aashima Sharma et al. · University of Waterloo · University of Melbourne +2 more

Post-hoc VLM-assisted framework detects and edits policy-violating content in diffusion model outputs without retraining

Output Integrity Attack vision generative
PDF
defense arXiv Feb 1, 2026 · 9w ago

HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection

Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang et al. · University of Melbourne

Audio deepfake detector using hierarchical contrastive attention across SSL transformer layers to expose synthetic speech artefacts

Output Integrity Attack audio
PDF Code
defense arXiv Dec 17, 2025 · Dec 2025

Where is the Watermark? Interpretable Watermark Detection at the Block Level

Maria Bulychev, Neil G. Marchant, Benjamin I. P. Rubinstein · University of Melbourne

Proposes block-level DWT image watermarking with spatial detection maps for interpretable tamper localization and content provenance

Output Integrity Attack vision
PDF
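The core idea of block-level wavelet watermarking can be shown in a small sketch: transform each image block with a one-level 2D Haar DWT, embed a bit into the low-frequency (LL) subband via quantization index modulation, and detect it per block, which is what yields a spatial detection map. This is a minimal, stdlib-only illustration under assumed parameters (8×8 blocks, one bit per block, quantization step `q=8.0`), not the paper's detector.

```python
import random

def haar2d(block):
    """One-level 2D Haar transform: returns LL, LH, HL, HH quadrants."""
    h = len(block) // 2
    LL, LH, HL, HH = ([[0.0] * h for _ in range(h)] for _ in range(4))
    for i in range(h):
        for j in range(h):
            a, b = block[2*i][2*j], block[2*i][2*j+1]
            c, d = block[2*i+1][2*j], block[2*i+1][2*j+1]
            LL[i][j] = (a + b + c + d) / 2
            LH[i][j] = (a - b + c - d) / 2
            HL[i][j] = (a + b - c - d) / 2
            HH[i][j] = (a - b - c + d) / 2
    return LL, LH, HL, HH

def ihaar2d(LL, LH, HL, HH):
    """Exact inverse of haar2d (the transform is orthonormal)."""
    h = len(LL)
    block = [[0.0] * (2 * h) for _ in range(2 * h)]
    for i in range(h):
        for j in range(h):
            ll, lh, hl, hh = LL[i][j], LH[i][j], HL[i][j], HH[i][j]
            block[2*i][2*j]     = (ll + lh + hl + hh) / 2
            block[2*i][2*j+1]   = (ll - lh + hl - hh) / 2
            block[2*i+1][2*j]   = (ll + lh - hl - hh) / 2
            block[2*i+1][2*j+1] = (ll - lh - hl + hh) / 2
    return block

def embed_bit(LL, bit, q=8.0):
    """Quantization index modulation: snap each LL coefficient to the
    lattice q*k (bit 0) or q*k + q/2 (bit 1)."""
    return [[q * round(v / q) + (q / 2 if bit else 0.0) for v in row]
            for row in LL]

def detect_bit(LL, q=8.0):
    """Majority vote over coefficients: which lattice is each closer to?"""
    votes = 0
    for row in LL:
        for v in row:
            r = v % q
            votes += 1 if abs(r - q / 2) < min(r, q - r) else -1
    return votes > 0

# Demo: watermark one 8x8 block and recover the bit from the marked pixels.
random.seed(0)
block = [[float(random.randrange(256)) for _ in range(8)] for _ in range(8)]
LL, LH, HL, HH = haar2d(block)
marked = ihaar2d(embed_bit(LL, True), LH, HL, HH)
recovered = detect_bit(haar2d(marked)[0])
```

Running the detector independently on every block of an image is what produces the interpretable spatial map: blocks where the bit fails to decode are flagged as tampered.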
defense arXiv Dec 9, 2025 · Dec 2025

PrivTune: Efficient and Privacy-Preserving Fine-Tuning of Large Language Models via Device-Cloud Collaboration

Yi Liu, Weixiang Han, Chengjun Cai et al. · City University of Hong Kong · University of Melbourne

Defends private LLM fine-tuning data against embedding inversion attacks by injecting optimization-guided noise into split learning token representations

Model Inversion Attack Sensitive Information Disclosure nlp
1 citation PDF
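The threat PrivTune targets is easy to demonstrate: if a device ships raw token representations to the cloud, an attacker can invert them back to tokens by nearest-neighbor search against a vocabulary table. The sketch below is a toy, stdlib-only version of that attack plus a fixed-scale Gaussian-noise defense; the vocabulary and embeddings are invented for illustration, and PrivTune's actual noise is optimization-guided rather than a constant sigma.

```python
import random

# Toy vocabulary table (assumption: tiny random embeddings, not a real LLM's).
random.seed(0)
VOCAB = ["the", "secret", "password", "is", "hunter2"]
EMB = {w: [random.gauss(0, 1) for _ in range(16)] for w in VOCAB}

def invert(vec):
    """Embedding-inversion attack: return the nearest vocabulary word."""
    return min(EMB, key=lambda w: sum((a - b) ** 2 for a, b in zip(EMB[w], vec)))

def perturb(vec, sigma, rng):
    """Defense: add Gaussian noise before shipping activations off-device."""
    return [v + rng.gauss(0, sigma) for v in vec]

rng = random.Random(1)
clean_hits = sum(invert(EMB[w]) == w for w in VOCAB)                      # attack on raw vectors
noisy_hits = sum(invert(perturb(EMB[w], 2.0, rng)) == w for w in VOCAB)   # attack on noised vectors
```

On raw embeddings the attack recovers every token; adding noise degrades it, at the cost of utility, which is exactly the trade-off an optimization-guided noise schedule tries to balance.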
defense arXiv Dec 8, 2025 · Dec 2025

AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

Ziming Hong, Tianyu Huang, Runnan Chen et al. · The University of Sydney · University of Technology Sydney +3 more

Defends 3D Gaussian Splatting assets from AI editing by lifting adversarial perturbations from 2D image space into 3D Gaussian parameters

Input Manipulation Attack vision generative
4 citations PDF Code
defense arXiv Nov 12, 2025 · Nov 2025

AdaptDel: Adaptable Deletion Rate Randomized Smoothing for Certified Robustness

Zhuoqun Huang, Neil G. Marchant, Olga Ohrimenko et al. · University of Melbourne

Certified robustness defense for text classifiers using adaptive deletion-rate randomized smoothing against edit distance adversarial attacks

Input Manipulation Attack nlp
PDF
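The smoothing mechanism behind this line of work can be sketched in a few lines: classify many randomly thinned copies of the input and take the majority vote, whose margin is what certification arguments bound against edit-distance adversaries. The sketch below uses a fixed deletion rate and an invented keyword classifier as the base model; AdaptDel's contribution, adapting the deletion rate, is not reproduced here.

```python
import random
from collections import Counter

def keyword_classifier(tokens):
    """Toy base classifier (assumption, not the paper's model):
    label 1 when more flagged than benign words survive deletion."""
    bad = sum(t in {"attack", "exploit"} for t in tokens)
    good = sum(t in {"hello", "thanks"} for t in tokens)
    return 1 if bad > good else 0

def smoothed_classify(tokens, base=keyword_classifier, rate=0.3, n=500, seed=0):
    """Randomized smoothing by token deletion: vote over n randomly
    thinned copies; the vote share is a proxy for the certified margin."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n):
        kept = [t for t in tokens if rng.random() > rate]  # delete each token w.p. rate
        votes[base(kept)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n
```

An adversary's insertions or deletions only flip the smoothed prediction if they can overturn the vote margin, which is what makes the construction certifiable.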
defense arXiv Oct 9, 2025 · Oct 2025

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Shuhai Zhang, ZiHao Lian, Jiahao Yang et al. · South China University of Technology · Pazhou Lab +4 more

Detects AI-generated videos via physics-driven NSG statistic quantifying violations of probability flow conservation laws

Output Integrity Attack vision generative

6 citations 1 influential PDF Code
attack arXiv Sep 24, 2025 · Sep 2025

Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

Zhifang Zhang, Jiahan Zhang, Shengjie Zhou et al. · Southeast University · Johns Hopkins University +3 more

Proposes Proxy Targeted Attack to craft generalizable, anomaly-evasive adversarial examples against multimodal encoders like ImageBind

Input Manipulation Attack vision multimodal nlp
2 citations PDF
tool arXiv Aug 25, 2025 · Aug 2025

PhantomLint: Principled Detection of Hidden LLM Prompts in Structured Documents

Toby Murray · University of Melbourne

Detects hidden indirect prompt injections in PDFs and HTML docs with 0.092% false-positive rate across 3,402 real documents

Prompt Injection nlp
PDF Code
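A common way such prompts are smuggled into HTML is text that survives extraction but is invisible when rendered, e.g. `display:none` or white-on-white styling. The stdlib-only sketch below flags a few of those CSS tricks; it is a naive heuristic for illustration only, not PhantomLint's principled detection approach.

```python
import re
from html.parser import HTMLParser

# Inline-style patterns that commonly hide text from human readers (illustrative list).
HIDING_PATTERNS = [
    r"display\s*:\s*none",
    r"visibility\s*:\s*hidden",
    r"font-size\s*:\s*0",
    r"color\s*:\s*(#fff\b|#ffffff\b|white\b)",
]

class HiddenTextFinder(HTMLParser):
    """Collects text nodes inside elements whose inline style matches a
    hiding pattern. Assumes well-formed, non-void nesting for simplicity."""
    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.findings = []
    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        if any(re.search(p, style, re.I) for p in HIDING_PATTERNS):
            self.hidden_depth += 1
        elif self.hidden_depth:
            self.hidden_depth += 1  # descendants of a hidden element stay hidden
    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1
    def handle_data(self, data):
        if self.hidden_depth and data.strip():
            self.findings.append(data.strip())

def find_hidden_prompts(html):
    finder = HiddenTextFinder()
    finder.feed(html)
    return finder.findings
```

A style-based heuristic like this misses hiding tricks it has no pattern for (off-screen positioning, tiny opacity, matching background images), which is why a principled rendered-vs-extracted comparison is the stronger approach.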