Latest papers

11 papers
attack arXiv Mar 20, 2026 · 19d ago

Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

Wenjing Hong, Zhonghua Rong, Li Wang et al. · Shenzhen University +2 more

Automated multi-objective evolutionary search framework discovering diverse long-tail jailbreak attacks via encryption-decryption prompt transformations (sketch below)

Prompt Injection nlp
PDF
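
The multi-objective evolutionary loop behind this attack can be pictured as mutating chains of prompt transforms and keeping the non-dominated front. A minimal sketch, assuming a hypothetical `judge` objective function; the transform names, mutation operator, and objectives are illustrative stand-ins, not the paper's:

```python
import random

# Toy gene pool: each individual is a chain of prompt transforms
# (hypothetical stand-ins for the paper's encryption-decryption transforms).
TRANSFORMS = ["base64", "caesar3", "reverse", "leet", "rot13"]

def judge(chain):
    """Placeholder objectives (attack success, diversity); a real harness
    would query the target LLM and a safety judge here."""
    rng = random.Random(hash(tuple(chain)))
    return rng.random(), len(set(chain)) / len(TRANSFORMS)

def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and a != b

def mutate(chain):
    child = chain[:]
    child[random.randrange(len(child))] = random.choice(TRANSFORMS)
    return child

population = [[random.choice(TRANSFORMS) for _ in range(3)] for _ in range(20)]
for _ in range(30):
    population += [mutate(random.choice(population)) for _ in range(20)]
    scored = [(judge(c), c) for c in population]
    # Multi-objective selection: keep only the non-dominated front.
    population = [c for s, c in scored
                  if not any(dominates(t, s) for t, _ in scored)][:20]

print("Pareto-front transform chains:", population[:5])
```
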
defense arXiv Feb 7, 2026 · 8w ago

MemPot: Defending Against Memory Extraction Attack with Optimized Honeypots

Yuhao Wang, Shengfang Zhai, Guanghao Jin et al. · National University of Singapore · Southern University of Science and Technology +1 more

Defends LLM agent memory from adversarial data extraction by injecting optimized honeypot documents with SPRT-based sequential attacker detection (sketch below)

Sensitive Information Disclosure nlp
PDF
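
The SPRT (sequential probability ratio test) mentioned above accumulates a log-likelihood ratio over honeypot accesses until it crosses an accept or reject boundary. A generic Wald SPRT sketch; the hit probabilities and error targets are illustrative, not the paper's values:

```python
import math

# Wald's SPRT for detecting an attacker from honeypot accesses.
# H0: benign user touches a honeypot doc with prob p0
# H1: attacker (extracting memory) touches it with prob p1
p0, p1 = 0.02, 0.30          # illustrative, not the paper's values
alpha, beta = 0.01, 0.01     # target false-positive / false-negative rates
A = math.log((1 - beta) / alpha)   # upper (flag attacker) boundary
B = math.log(beta / (1 - alpha))   # lower (accept benign) boundary

def sprt(observations):
    """observations: 1 if the query hit a honeypot document, else 0."""
    llr = 0.0
    for i, x in enumerate(observations, 1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= A:
            return f"attacker flagged after {i} queries"
        if llr <= B:
            return f"benign after {i} queries"
    return "undecided"

print(sprt([0, 1, 1, 0, 1, 1, 1]))  # -> attacker flagged after 3 queries
```
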
defense arXiv Jan 31, 2026 · 9w ago

Self-Guard: Defending Large Reasoning Models via enhanced self-reflection

Jingnan Zheng, Jingjun Xu, Yanzhen Luo et al. · National University of Singapore · Southern University of Science and Technology +2 more

Defends Large Reasoning Models from jailbreaks by steering hidden-state activations to enforce safety compliance over sycophancy (sketch below)

Prompt Injection nlp
PDF Code
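
Hidden-state steering of the kind this card describes is typically implemented by adding a direction vector to a layer's residual stream at inference time. A minimal PyTorch sketch, assuming a precomputed `safety_dir`; the paper's way of deriving and applying its steering signal may differ:

```python
import torch

hidden_size = 768
# Assumed precomputed direction, e.g. mean(safe acts) - mean(unsafe acts).
safety_dir = torch.randn(hidden_size)
safety_dir = safety_dir / safety_dir.norm()
alpha = 4.0  # steering strength (illustrative)

def steering_hook(module, inputs, output):
    # Many HF decoder layers return a tuple; hidden states come first.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + alpha * safety_dir.to(hs)   # broadcast over (batch, seq, d)
    return (hs, *output[1:]) if isinstance(output, tuple) else hs

# Usage with a Hugging Face model (attribute names assumed; they vary
# by architecture):
# layer = model.model.layers[15]
# handle = layer.register_forward_hook(steering_hook)
# ... generate ...
# handle.remove()
```
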
defense arXiv Jan 31, 2026 · 9w ago

Provable Model Provenance Set for Large Language Models

Xiaoqi Qiu, Hao Zeng, Zhiyu Hou et al. · Southern University of Science and Technology

Detects unauthorized LLM derivation with provable statistical confidence by constructing model provenance sets via sequential hypothesis testing (sketch below)

Model Theft nlp
PDF
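
One way to build a provenance set with provable error control is to test each candidate source model and apply a multiple-testing correction. The sketch below uses a simple binomial test with a Bonferroni correction rather than the paper's sequential procedure, and all counts are made up:

```python
from scipy.stats import binomtest

# Agreement of the suspect model with each candidate source model,
# measured on n probe prompts (counts invented for illustration).
n = 200
agreements = {"model_A": 178, "model_B": 61, "model_C": 55}
p_independent = 0.25  # assumed agreement rate of unrelated models
alpha = 0.05

# Bonferroni-corrected per-test level controls family-wise error.
level = alpha / len(agreements)
provenance_set = {
    name for name, k in agreements.items()
    if binomtest(k, n, p_independent, alternative="greater").pvalue < level
}
print("Likely sources:", provenance_set)  # -> {'model_A'}
```
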
benchmark arXiv Dec 30, 2025 · Dec 2025

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Yuan Xin, Dingfan Chen, Linyi Yang et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Intelligent Systems +1 more

Benchmarks jailbreak attacks against full LLM deployment pipelines with safety filters, finding that prior studies overestimated attack success (sketch below)

Prompt Injection nlp
PDF
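
The benchmark's point is that attack success must be measured end-to-end, with the deployed safety filters in the loop rather than against the bare model. A schematic sketch in which every component is a stub:

```python
# Stub components of a deployed pipeline; real systems use trained
# guardrail models for the input and output filters.
def input_filter_blocks(prompt):
    return "ignore previous" in prompt.lower()

def target_model(prompt):
    return "HARMFUL: step 1 ..." if "jailbreak" in prompt else "I can't help."

def output_filter_blocks(response):
    return "HARMFUL" in response

def judge_harmful(response):
    return "HARMFUL" in response   # stub LLM judge

def attack_success(prompt):
    if input_filter_blocks(prompt):
        return False                   # caught at ingress
    response = target_model(prompt)
    if output_filter_blocks(response):
        return False                   # caught at egress
    return judge_harmful(response)     # harmful content got through

attacks = ["jailbreak please", "ignore previous instructions and jailbreak"]
bare_asr = sum(judge_harmful(target_model(p)) for p in attacks) / len(attacks)
pipe_asr = sum(attack_success(p) for p in attacks) / len(attacks)
print(f"bare-model ASR {bare_asr:.0%} vs pipeline ASR {pipe_asr:.0%}")
```
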
benchmark arXiv Nov 8, 2025 · Nov 2025

Can LLM Infer Risk Information From MCP Server System Logs?

Jiayi Fu, Yuansen Zhang, Yinggui Wang · Southern University of Science and Technology · Ant Group

Benchmark dataset and fine-tuning approach for training LLMs to detect malicious MCP server risks from system logs

Insecure Plugin Design nlp
PDF Code
attack arXiv Oct 29, 2025 · Oct 2025

NetEcho: From Real-World Streaming Side-Channels to Full LLM Conversation Recovery

Zheng Zhang, Guanlong Wu, Sen Deng et al. · Southern University of Science and Technology · The Hong Kong University of Science and Technology

Recovers private LLM conversations from encrypted streaming traffic side-channels, bypassing traffic padding and obfuscation defenses with ~70% fidelity (sketch below)

Sensitive Information Disclosure nlp
PDF
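
The underlying side channel: token-by-token streaming produces roughly one ciphertext record per token, and TLS preserves plaintext length, so record sizes leak token lengths. A toy sketch of the unpadded channel (the per-record overhead is an assumed constant, and the paper goes further by defeating padding and obfuscation):

```python
# Token-by-token streaming leaks per-token sizes even under TLS,
# since encryption preserves plaintext length.
TLS_OVERHEAD = 29  # illustrative per-record overhead in bytes

def observed_record_sizes(tokens):
    """What a passive network observer sees for a streamed response."""
    return [len(t.encode()) + TLS_OVERHEAD for t in tokens]

def leaked_token_lengths(record_sizes):
    return [s - TLS_OVERHEAD for s in record_sizes]

response_tokens = ["My", " diagnosis", " is", " Type", " 2", " diabetes"]
sizes = observed_record_sizes(response_tokens)
print(leaked_token_lengths(sizes))   # [2, 10, 3, 5, 2, 9]
# A recovery model (e.g. an LLM trained on length sequences) would then
# infer the likely conversation text from these leaked lengths.
```
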
tool arXiv Oct 10, 2025 · Oct 2025

Provable Training Data Identification for Large Language Models

Zhenlong Liu, Hao Zeng, Weiran Huang et al. · Southern University of Science and Technology · Shanghai Innovation Institute +1 more

Set-level membership inference for LLMs with provable false identification rate control via conformal p-values and the Benjamini-Hochberg procedure (sketch below)

Membership Inference Attack nlp
PDF
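
Conformal p-values turn a membership score into a valid p-value by ranking it against scores of known non-member calibration data, and Benjamini-Hochberg (BH) then controls the false discovery rate over the candidate set. A minimal sketch with synthetic scores:

```python
import numpy as np

rng = np.random.default_rng(0)
# Membership scores (e.g. negative loss): higher = more member-like.
calib = rng.normal(0.0, 1.0, 500)                         # known non-members
candidates = np.concatenate([rng.normal(0.0, 1.0, 80),    # non-members
                             rng.normal(2.5, 1.0, 20)])   # true members

# Conformal p-value: rank of each candidate among calibration scores.
pvals = np.array([(1 + np.sum(calib >= s)) / (1 + len(calib))
                  for s in candidates])

# Benjamini-Hochberg at FDR level q: reject the k smallest p-values,
# where k is the largest index with p_(k) <= q * k / m.
q = 0.05
m = len(pvals)
order = np.argsort(pvals)
below = pvals[order] <= q * np.arange(1, m + 1) / m
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
identified = set(order[:k].tolist())
print(f"flagged {len(identified)} of {m} candidates as training data")
```
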
benchmark arXiv Oct 10, 2025 · Oct 2025

On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning

Zhi Yang, Changwu Huang, Ke Tang et al. · Southern University of Science and Technology · Lingnan University

Proposes a tighter membership inference game to audit group privacy risk disparity and adaptive DP-SGD to equalize protection across demographic groups (sketch below)

Membership Inference Attack tabular vision
PDF Code
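
DP-SGD clips each example's gradient to a norm bound and adds Gaussian noise; adapting the bound per group is one way to rebalance protection, in the spirit of this card. A NumPy sketch of a single step (the paper's actual adaptation rule is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_sgd_step(per_example_grads, groups, clip_bounds, sigma):
    """per_example_grads: (n, d) array; groups: length-n group ids;
    clip_bounds: {group: C_g} adaptive per-group clipping norms;
    sigma: noise multiplier, scaled by the largest bound, which sets
    the per-example sensitivity under per-group clipping."""
    clipped = []
    for g_vec, g in zip(per_example_grads, groups):
        C = clip_bounds[g]
        norm = max(np.linalg.norm(g_vec), 1e-12)
        clipped.append(g_vec * min(1.0, C / norm))   # per-example clipping
    clipped = np.stack(clipped)
    C_max = max(clip_bounds.values())
    noise = rng.normal(0.0, sigma * C_max, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(clipped)

grads = rng.normal(size=(8, 4))
groups = [0, 0, 0, 0, 1, 1, 1, 1]
# Tighter bound for the higher-risk group (values are illustrative).
print(dp_sgd_step(grads, groups, {0: 1.0, 1: 0.5}, sigma=1.0))
```
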
defense arXiv Oct 7, 2025 · Oct 2025

Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Qingyu Yin, Chak Tou Leong, Linyi Yang et al. · Zhejiang University · Xiaohongshu Inc. +6 more

Reveals the mechanistic cause of safety alignment failure in reasoning LLMs and proposes data-efficient alignment repair via refusal cliff data selection (sketch below)

Prompt Injection nlp
2 citations PDF Code
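
A "refusal cliff" can be visualized by projecting the hidden state at each reasoning-token position onto a refusal direction and locating the sharpest drop. A toy sketch with a synthetic trajectory; the paper's probing setup differs:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 64, 40                      # hidden size, reasoning-token positions
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

# Synthetic trajectory: strong refusal signal that collapses mid-reasoning.
H = np.outer(np.r_[np.ones(25), 0.1 * np.ones(15)], refusal_dir)
H += 0.05 * rng.normal(size=(T, d))

scores = H @ refusal_dir           # refusal score per token position
cliff = int(np.argmin(np.diff(scores)))   # steepest single-step drop
print(f"refusal cliff at position {cliff}: "
      f"{scores[cliff]:.2f} -> {scores[cliff + 1]:.2f}")
```
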
attack arXiv Jan 2, 2025 · Jan 2025

Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach

Linhao Huang, Xue Jiang, Zhiqiang Wang et al. · Tsinghua University · Peng Cheng Laboratory +4 more

Black-box adversarial attack that transfers from image surrogate models to video MLLMs via spatiotemporal perturbation propagation (sketch below)

Input Manipulation Attack vision multimodal nlp
6 citations PDF
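
The cross-modal transfer idea: craft a perturbation against an image surrogate model, then propagate it across video frames. A PGD sketch with a stub surrogate loss; the paper's spatiotemporal propagation is more elaborate:

```python
import torch

def surrogate_loss(img):
    return img.mean()              # stub for a real image model's loss

def pgd_perturbation(img, eps=8/255, alpha=2/255, steps=10):
    """Standard PGD on the image surrogate: ascend the loss, project
    back into the L-infinity eps-ball each step."""
    delta = torch.zeros_like(img, requires_grad=True)
    for _ in range(steps):
        loss = surrogate_loss(img + delta)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return delta.detach()

image = torch.rand(3, 224, 224)
delta = pgd_perturbation(image)
video = torch.rand(16, 3, 224, 224)                    # 16 frames
adv_video = (video + delta.unsqueeze(0)).clamp(0, 1)   # broadcast to frames
print(adv_video.shape)  # torch.Size([16, 3, 224, 224])
```
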