Latest papers

2 papers
defense · arXiv · Nov 10, 2025

MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

Liang Shan, Kaicheng Shen, Wen Wu et al. · East China Normal University · Shanghai AI Lab

Defends LLMs against implicit domain-specific jailbreaks via metacognition, evolving rule graphs, and activation steering

Prompt Injection · nlp
1 citation · PDF
attack · arXiv · Sep 24, 2025

FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models

Xin Wang, Jie Li, Zejia Weng et al. · Fudan University · Shanghai AI Lab +1 more

Adversarial image attack freezes Vision-Language-Action robotic models via bi-level optimization, achieving a 76.2% cross-prompt success rate

Input Manipulation Attack · Prompt Injection · vision · multimodal · nlp
1 citation · 1 influential · PDF · Code