Latest papers

11 papers
defense · arXiv · Jan 6, 2026

SLIM: Stealthy Low-Coverage Black-Box Watermarking via Latent-Space Confusion Zones

Hengyu Wu, Yang Cao · Institute of Science Tokyo

Watermarks LLM training data via latent-space confusion zones, enabling black-box provenance verification with ultra-low coverage

Output Integrity Attack · nlp
PDF
attack · arXiv · Dec 14, 2025

One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

Yixin Tan, Zhe Yu, Jun Sakuma · Institute of Science Tokyo · RIKEN AIP

PGP attack exploits pretrained LLM representations to transfer gradient-optimized jailbreak prompts to black-box finetuned derivatives

Input Manipulation Attack · Prompt Injection · nlp
PDF
defense · arXiv · Dec 9, 2025

Disrupting Hierarchical Reasoning: Adversarial Protection for Geographic Privacy in Multimodal Reasoning Models

Jiaming Zhang, Che Wang, Yang Cao et al. · Nanyang Technological University · Peking University +2 more

Defends geographic privacy from VLM inference using concept-aware adversarial image perturbations that cascade through hierarchical reasoning chains

Input Manipulation Attack · Prompt Injection · vision · multimodal · nlp
PDF · Code
tool · arXiv · Nov 24, 2025

AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

Yixin Wu, Rui Wen, Chi Cui et al. · CISPA Helmholtz Center for Information Security · Institute of Science Tokyo

Autonomous LLM agent automates membership inference, model stealing, and data reconstruction attacks on ML services with near-expert accuracy at $0.627/run

Membership Inference Attack · Model Theft · Model Inversion Attack · nlp
PDF
attack · arXiv · Nov 12, 2025

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment

Shigeki Kusaka, Keita Saito, Mikoto Kudo et al. · University of Tsukuba · RIKEN +2 more

Theoretically minimizes label-flipping attack cost during RLHF/DPO alignment using convex optimization post-processing

Data Poisoning Attack · Training Data Poisoning · nlp
1 citation · PDF · Code
defense · arXiv · Nov 10, 2025

Privacy on the Fly: A Predictive Adversarial Transformation Network for Mobile Sensor Data

Tianle Song, Chenhao Lin, Yang Cao et al. · Xi’an Jiaotong University · Institute of Science Tokyo

Defends mobile sensor privacy by predictively generating adversarial perturbations that fool ML attribute-inference models in real time

Input Manipulation Attack · timeseries
PDF
benchmark · arXiv · Oct 22, 2025

Machine Text Detectors are Membership Inference Attacks

Ryuto Koike, Liam Dugan, Masahiro Kaneko et al. · Institute of Science Tokyo · University of Pennsylvania +1 more

Proves membership inference attacks and machine text detectors share the same optimal metric, demonstrating strong cross-task transferability with a unified evaluation suite

Membership Inference Attack · Output Integrity Attack · nlp
1 citation · 1 influential · PDF · Code
attack · arXiv · Oct 9, 2025

Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai et al. · Institute of Science Tokyo · RIKEN AIP

Multi-turn jailbreak framework using five structured conversation patterns to systematically bypass LLM safety alignment across twelve models

Prompt Injection · nlp
1 citation · PDF · Code
benchmark · arXiv · Oct 1, 2025

Understanding Sensitivity of Differential Attention through the Lens of Adversarial Robustness

Tsubasa Takahashi, Shojiro Yamabe, Futa Waseda et al. · Turing Inc. · Institute of Science Tokyo +2 more

Reveals, via negative gradient-alignment theory, that Differential Attention transformers are structurally more fragile to adversarial perturbations than standard attention

Input Manipulation Attack · vision · multimodal
PDF
defense · arXiv · Oct 1, 2025

Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability

Shojiro Yamabe, Jun Sakuma · Institute of Science Tokyo · RIKEN

Discovers a token-injection jailbreak in diffusion LMs and proposes safety alignment that defends against contaminated intermediate denoising states

Input Manipulation Attack · Prompt Injection · nlp
PDF · Code
defense · arXiv · Sep 27, 2025

Adaptive Token-Weighted Differential Privacy for LLMs: Not All Tokens Require Equal Protection

Manjiang Yu, Priyanka Singh, Xue Li et al. · The University of Queensland · Institute of Science Tokyo

Token-selective DP-SGD variant concentrates noise on sensitive tokens to prevent LLM training-data extraction while cutting DP overhead by 90%

Model Inversion Attack · Sensitive Information Disclosure · nlp
1 citation · PDF · Code