Latest papers

7 papers
attack arXiv Feb 14, 2026

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

Ruomeng Ding, Yifei Pang, He Sun et al. · University of North Carolina at Chapel Hill · Carnegie Mellon University +2 more

Attacks LLM alignment pipelines by crafting benchmark-compliant rubric edits that systematically bias judge preferences and corrupt RLHF training

Transfer Learning Attack Prompt Injection nlp
PDF Code
attack arXiv Jan 6, 2026

Extracting books from production language models

Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo et al. · Stanford University · Yale University

Extracts copyrighted books near-verbatim from Claude, GPT-4.1, Gemini, and Grok using Best-of-N jailbreaks and iterative continuation prompts

Model Inversion Attack Sensitive Information Disclosure Prompt Injection nlp
5 citations PDF
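The extraction method above relies in part on Best-of-N jailbreaking: repeatedly resampling randomly augmented copies of a prompt until one elicits the target continuation. A minimal sketch of that loop follows; `query_model` and `is_success` are hypothetical stand-ins for the paper's actual model harness and match criterion, and the augmentation here (case flips, character swaps) is only illustrative.

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    """Apply random character-level perturbations (BoN-style augmentation)."""
    chars = list(prompt)
    for i in range(len(chars)):
        r = rng.random()
        if r < 0.06 and chars[i].isalpha():       # random case flip
            chars[i] = chars[i].swapcase()
        elif r < 0.08 and i + 1 < len(chars):     # random adjacent swap
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def best_of_n(prompt, query_model, is_success, n=100, seed=0):
    """Resample augmented prompts until one elicits the target continuation."""
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_success(response):
            return candidate, response, attempt
    return None
```

The key property is that each attempt is an independent draw, so attack success compounds with N even when any single augmented prompt succeeds with low probability.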
defense arXiv Dec 29, 2025

RobustMask: Certified Robustness against Adversarial Neural Ranking Attack via Randomized Masking

Jiawei Liu, Zhuo Chen, Rui Zhu et al. · Wuhan University · Yale University +1 more

Certified randomized-masking defense for neural ranking models against adversarial text perturbations in search and RAG systems

Input Manipulation Attack nlp
PDF
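The randomized-masking idea behind this defense can be sketched as smoothing by majority vote: score many randomly masked copies of a document and certify the decision by its vote margin. This is a simplified illustration, not RobustMask's certified procedure; `score_fn` is a hypothetical ranking-model wrapper that returns a relevance label.

```python
import random
from collections import Counter

MASK = "[MASK]"

def randomized_mask_score(doc_tokens, score_fn, mask_rate=0.3,
                          n_samples=50, seed=0):
    """Smooth a ranking decision by majority vote over randomly masked
    copies of the document (randomized-masking style defense).

    Returns the majority label and its empirical vote fraction; a large
    fraction indicates the decision is stable under token perturbations.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        masked = [MASK if rng.random() < mask_rate else t
                  for t in doc_tokens]
        votes[score_fn(masked)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples
```

Because any adversarial token perturbation is masked out with probability `mask_rate` in each sample, a sufficiently large vote margin bounds how much a bounded perturbation can flip the smoothed decision, which is the intuition behind the certificate.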
defense arXiv Nov 24, 2025

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization

Xurui Li, Kaisong Song, Rui Zhu et al. · Fudan University · Alibaba Group +3 more

Co-evolving attack-defense framework uses MCTS-based jailbreak exploration and curriculum RL to jointly train stronger LLM safety alignment

Prompt Injection nlp
2 citations PDF Code
defense arXiv Oct 24, 2025

Optimal Detection for Language Watermarks with Pseudorandom Collision

T. Tony Cai, Xiang Li, Qi Long et al. · University of Pennsylvania · Yale University

Derives minimax-optimal detection rules for LLM text watermarks under pseudorandom collisions with rigorous Type I error control

Output Integrity Attack nlp
PDF
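For context on the detection problem this paper optimizes: standard red/green watermark detection counts how many tokens fall in a pseudorandom "green list" keyed on the preceding token, then applies a one-sided z-test. The sketch below shows that baseline test only; the paper's minimax-optimal rule under pseudorandom collisions is more involved, and the hashing scheme here is an illustrative assumption.

```python
import hashlib
import math

def green_hits(tokens, gamma=0.5):
    """Count bigrams whose second token lands in a pseudorandom green list
    keyed on the previous token (illustrative red/green-style scoring)."""
    hits = 0
    for prev, cur in zip(tokens, tokens[1:]):
        h = hashlib.sha256(f"{prev}|{cur}".encode()).digest()
        if h[0] < gamma * 256:        # token falls in the green list
            hits += 1
    return hits, len(tokens) - 1      # hits and number of scored positions

def z_score(hits, n, gamma=0.5):
    """One-sided z statistic: under H0 (unwatermarked text),
    hits ~ Binomial(n, gamma)."""
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Type I error control then amounts to thresholding the z statistic; collisions (repeated pseudorandom keys across positions) break the independence assumption behind this binomial null, which is exactly the regime the paper analyzes.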
defense arXiv Oct 3, 2025

Test-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles

Dong Lao, Yuxiang Zhang, Haniyeh Ehsani Oskouie et al. · Louisiana State University · University of California +1 more

Training-free, architecture-agnostic test-time defense against adversarial attacks using stochastic resonance over latent translational ensembles

Input Manipulation Attack vision
PDF
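The ensemble-averaging idea can be illustrated in input space: average model outputs over small spatial translations (optionally with noise) so that an adversarial perturbation tuned to one alignment is diluted. Note this is a simplified stand-in; the paper applies the idea to latent representations, and `predict_fn` is a hypothetical logits function.

```python
import numpy as np

def translational_ensemble_predict(image, predict_fn,
                                   shifts=((0, 0), (1, 0), (0, 1),
                                           (-1, 0), (0, -1)),
                                   noise_std=0.0, seed=0):
    """Average model outputs over small circular translations of the input,
    optionally adding Gaussian noise (stochastic-resonance flavor).

    image: 2D array; predict_fn: maps an image to a logits vector.
    """
    rng = np.random.default_rng(seed)
    logits = []
    for dy, dx in shifts:
        shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
        if noise_std:
            shifted = shifted + rng.normal(0.0, noise_std, shifted.shape)
        logits.append(predict_fn(shifted))
    return np.mean(logits, axis=0)    # ensemble-averaged logits
```

The approach is training-free and architecture-agnostic in the sense shown here: it only wraps the forward pass, requiring no gradients or retraining.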
attack arXiv Aug 20, 2025

MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs

Ruyi Ding, Tianhong Xu, Xinyi Shen et al. · Louisiana State University · Northeastern University +1 more

Side-channel attacks on MoE LLMs/VLMs reconstruct user prompts and responses via CPU cache and GPU TLB hardware signals

Sensitive Information Disclosure nlp multimodal vision
PDF