Latest papers

10 papers
tool arXiv Mar 18, 2026 · 19d ago

EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection

Chenyang Zhu, Maorong Wang, Jun Liu et al. · The University of Tokyo · National Institute of Informatics

Agentic framework orchestrating multiple AIGI detectors via reinforcement learning for extensible, train-free AI-generated image detection

Output Integrity Attack visionmultimodalnlp
PDF
defense arXiv Mar 6, 2026 · 4w ago

Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment

Jingyuan Feng, Andrew Gambardella, Gouki Minegishi et al. · The University of Tokyo

Defends LLMs against jailbreaks via an explicit safety bit that makes alignment interpretable and overridable, achieving near-zero ASR

Prompt Injection nlp
PDF
attack arXiv Feb 28, 2026 · 5w ago

Mathematical Foundations of Poisoning Attacks on Linear Regression over Cumulative Distribution Functions

Atsuki Sato, Martin Aumüller, Yusuke Matsui · The University of Tokyo · IT University of Copenhagen

Proves optimal poisoning attacks on linear regression CDF models, showing greedy multi-point attacks are suboptimal and bounding their maximum impact

Data Poisoning Attack tabular
PDF
attack arXiv Jan 17, 2026 · 11w ago

Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity

Jun Liu, Leo Yu Zhang, Fengpeng Li et al. · University of Macau · National Institute of Informatics +2 more

Hard-label black-box adversarial attack using frequency-domain initialization and pattern-driven optimization to recover gradient sign information

Input Manipulation Attack vision
PDF Code
defense arXiv Jan 5, 2026 · Jan 2026

ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Kaede Shiohara, Toshihiko Yamasaki, Vladislav Golyanik · The University of Tokyo · Max Planck Institute for Informatics

Self-supervised deepfake detector using personalized audio-to-expression diffusion models to catch unseen face forgeries zero-shot

Output Integrity Attack visionaudiomultimodalgenerative
PDF
benchmark arXiv Jan 4, 2026 · Jan 2026

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu et al. · Kyoto University · Hohai University +3 more

Benchmarks 27 LLMs against 50K+ multi-turn medical jailbreak conversations in Japanese, finding fine-tuned medical models are most vulnerable

Prompt Injection nlp
PDF
defense arXiv Dec 17, 2025 · Dec 2025

SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

Hongbo Wang, MaungMaung AprilPyone, Isao Echizen · The University of Tokyo · National Institute of Informatics +1 more

Neuron-level white-box defense suppresses toxic expert neurons in VLMs, cutting harmful outputs from 48% to 2.5% under adversarial jailbreaks

Prompt Injection nlpmultimodalvision
1 citations PDF Code
defense arXiv Oct 16, 2025 · Oct 2025

Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks

ChenYu Wu, Yi Wang, Yang Liao · The University of Tokyo · Xi’an Jiaotong University

Proactive honeypot defense uses a fine-tuned bait model to lure multi-turn LLM jailbreak attackers into revealing malicious intent

Prompt Injection nlp
PDF
defense Asia-Pacific Signal and Inform... Oct 10, 2025 · Oct 2025

Uncolorable Examples: Preventing Unauthorized AI Colorization via Perception-Aware Chroma-Restrictive Perturbation

Yuki Nii, Futa Waseda, Ching-Chun Chang et al. · The University of Tokyo · National Institute of Informatics

Adversarial perturbations embedded in grayscale images to disrupt AI colorization models and prevent unauthorized copyright infringement

Output Integrity Attack visiongenerative
PDF
benchmark arXiv Oct 1, 2025 · Oct 2025

Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness

Tsubasa Takahashi, Shojiro Yamabe, Futa Waseda et al. · Turing Inc. · Institute of Science Tokyo +2 more

Reveals Differential Attention transformers are structurally more fragile to adversarial perturbations than standard attention via negative gradient alignment theory

Input Manipulation Attack visionmultimodal
PDF