ML Security Papers

Latest papers

19 papers

attack arXiv Mar 16, 2026 · 23d ago

From Storage to Steering: Memory Control Flow Attacks on LLM Agents

Zhenlin Xu, Xiaogang Zhu, Yu Yao et al. · Adelaide University · The University of Sydney +1 more

Memory poisoning attack on LLM agents that hijacks tool selection control flow across tasks via malicious memory retrieval

Prompt Injection Excessive Agency nlp

PDF

attack arXiv Mar 5, 2026 · 4w ago

Osmosis Distillation: Model Hijacking with the Fewest Samples

Yuchen Shi, Huajie Chen, Heng Xu et al. · City University of Macau · Jinan University +1 more

Poisons distilled synthetic datasets to embed hidden hijacking tasks in models fine-tuned via transfer learning

Data Poisoning Attack Transfer Learning Attack vision

PDF

attack arXiv Feb 24, 2026 · 6w ago

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

Bowen Zheng, Yongli Xiang, Ziming Hong et al. · Huazhong University of Science and Technology · The University of Sydney

Jailbreaks commercial I2V video generation models by embedding malicious visual instructions into reference images, bypassing safety filters at 83.5% success rate

Input Manipulation Attack Prompt Injection multimodalgenerativevision

3 citations PDF

attack arXiv Feb 23, 2026 · 6w ago

PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

Hefei Mei, Zirui Wang, Chang Xu et al. · City University of Hong Kong · The University of Sydney

Gray-box adversarial attack on LVLM vision encoders using prototype anchoring and attention-guided perturbations, achieving 75.1% score reduction

Input Manipulation Attack Prompt Injection visionmultimodalnlp

PDF Code

defense arXiv Feb 2, 2026 · 9w ago

MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection

Ruiqi Liu, Manni Cui, Ziheng Qin et al. · Institute of Automation · School of Advanced Interdisciplinary Sciences +7 more

Detects AI-generated images by projecting inputs to a real-image manifold and using reconstruction residuals as forgery signals, surpassing human experts

Output Integrity Attack visiongenerative

PDF Code

defense arXiv Jan 30, 2026 · 9w ago

DNA: Uncovering Universal Latent Forgery Knowledge

Jingtong Dou, Chuancheng Shi, Yemin Wang et al. · The University of Sydney · Xiamen University +2 more

Probes latent neurons in pre-trained vision models to detect AI-generated images without costly fine-tuning, outperforming black-box baselines

Output Integrity Attack vision

PDF

defense arXiv Jan 29, 2026 · 9w ago

TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi, Shangze Li, Wenjun Lu et al. · The University of Sydney · Nanjing University of Science and Technology +2 more

Defends LLMs, diffusion models, and MLLMs from jailbreaks by tracing and severing harmful semantic circuits via sparse autoencoders and causal path analysis

Input Manipulation Attack Prompt Injection nlpvisionmultimodalgenerative

PDF

defense TPAMI Jan 27, 2026 · 10w ago

Privacy-Preserving Model Transcription with Differentially Private Synthetic Distillation

Bochao Liu, Shiming Ge, Pengju Wang et al. · Chinese Academy of Sciences · Beijing Institute of Astronautical Systems Engineering +1 more

Defends against model inversion by converting trained models to DP-guaranteed equivalents via data-free synthetic distillation without accessing private training data

Model Inversion Attack vision

PDF

attack arXiv Jan 14, 2026 · 12w ago

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

Zhiyi Mou, Jingyuan Yang, Zeheng Qian et al. · Zhejiang University · The University of Sydney +2 more

Jailbreaks LLMs by spatially redistributing tokens across rows/columns/diagonals, bypassing guardrails including OpenAI Moderation API at >75% ASR

Prompt Injection nlp

PDF Code

benchmark arXiv Dec 29, 2025 · Dec 2025

Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark

Manu, Yi Guo, Kanchana Thilakarathna et al. · The University of Sydney · University of New South Wales +1 more

Benchmarks black-box LLM DoS attacks using evolutionary and RL-based prompt search to suppress EOS and inflate output length

Model Denial of Service nlp

1 citations 1 influentialPDF

defense arXiv Dec 24, 2025 · Dec 2025

Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection

Ruiqi Liu, Yi Han, Zhengbo Zhang et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +5 more

Detects AI-generated images by modeling real image manifolds rather than generator artifacts, robust to real-world degradation chains

Output Integrity Attack visiongenerative

1 citations PDF

defense arXiv Dec 24, 2025 · Dec 2025

Time-Efficient Evaluation and Enhancement of Adversarial Robustness in Deep Neural Networks

Runqi Lin · The University of Sydney

Proposes time-efficient methods for both adversarial attack evaluation and robustness enhancement in deep neural networks

Input Manipulation Attack vision

PDF

defense arXiv Dec 8, 2025 · Dec 2025

AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

Ziming Hong, Tianyu Huang, Runnan Chen et al. · The University of Sydney · University of Technology Sydney +3 more

Defends 3D Gaussian Splatting assets from AI editing by lifting adversarial perturbations from 2D image space into 3D Gaussian parameters

Input Manipulation Attack visiongenerative

4 citations PDF Code

defense arXiv Dec 7, 2025 · Dec 2025

RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting

Longjie Zhao, Ziming Hong, Zhenyang Ren et al. · The University of Sydney · The University of Melbourne +1 more

Embeds robust watermarks into 3DGS scenes resistant to diffusion-based editing via low-frequency Gaussian targeting and adversarial training

Output Integrity Attack visiongenerative

1 citations 1 influentialPDF

defense arXiv Nov 16, 2025 · Nov 2025

DINO-Detect: A Simple yet Effective Framework for Blur-Robust AI-Generated Image Detection

Jialiang Shen, Jiyang Zheng, Yunqi Xue et al. · The University of Sydney · Shanghai Jiao Tong University +3 more

Proposes blur-robust AI-generated image detector via DINO-based teacher-student knowledge distillation for real-world motion degradation

Output Integrity Attack vision

1 citations PDF Code

defense arXiv Nov 12, 2025 · Nov 2025

GuardFed: A Trustworthy Federated Learning Framework Against Dual-Facet Attacks

Yanli Li, Yanan Zhou, Zhongliang Guo et al. · Nantong University · The University of Sydney +3 more

Introduces dual-facet Byzantine FL attack degrading accuracy and fairness simultaneously, defended by trust-score aggregation in GuardFed

Data Poisoning Attack federated-learning

PDF

benchmark arXiv Oct 11, 2025 · Oct 2025

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

Zonghao Ying, Yangguang Shao, Jianle Gan et al. · Beihang University · Chinese Academy of Sciences +7 more

Benchmark evaluating LVLM web agent security across six attack vectors in realistic web environments, exposing universal vulnerabilities across 9 models

Prompt Injection Excessive Agency multimodalnlp

5 citations PDF

attack arXiv Sep 25, 2025 · Sep 2025

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Runqi Lin, Alasdair Paren, Suqin Yuan et al. · The University of Sydney · University of Oxford

Improves transferability of adversarial visual jailbreaks against closed-source MLLMs via loss landscape flattening and feature over-reliance correction

Input Manipulation Attack Prompt Injection visionmultimodalnlp

6 citations PDF

attack arXiv Sep 24, 2025 · Sep 2025

Generative Model Inversion Through the Lens of the Manifold Hypothesis

Xiong Peng, Bo Han, Fengfei Yu et al. · Hong Kong Baptist University · The University of Sydney +2 more

Explains why generative model inversion attacks work via manifold theory and proposes methods to amplify their effectiveness

Model Inversion Attack visiongenerative

PDF

Latest papers

From Storage to Steering: Memory Control Flow Attacks on LLM Agents

Osmosis Distillation: Model Hijacking with the Fewest Samples

VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models

PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention

MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection

DNA: Uncovering Universal Latent Forgery Knowledge

TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Privacy-Preserving Model Transcription with Differentially Private Synthetic Distillation

SpatialJB: How Text Distribution Art Becomes the "Jailbreak Key" for LLM Guardrails

Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark

Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection

Time-Efficient Evaluation and Enhancement of Adversarial Robustness in Deep Neural Networks

AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting

DINO-Detect: A Simple yet Effective Framework for Blur-Robust AI-Generated Image Detection

GuardFed: A Trustworthy Federated Learning Framework Against Dual-Facet Attacks

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction

Generative Model Inversion Through the Lens of the Manifold Hypothesis

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue