Latest papers

11 papers
attack arXiv Mar 20, 2026 · 19d ago

Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

Wenjing Hong, Zhonghua Rong, Li Wang et al. · Shenzhen University +2 more

Automated multi-objective evolutionary search framework discovering diverse long-tail jailbreak attacks via encryption-decryption prompt transformations (sketch below)

Prompt Injection nlp
PDF
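
The multi-objective evolutionary loop behind this attack can be pictured as mutating chains of prompt transforms and keeping the non-dominated front. A minimal sketch, assuming a hypothetical `judge` objective function; the transform names, mutation operator, and objectives are illustrative stand-ins, not the paper's:

```python
import random

# Toy gene pool: each individual is a chain of prompt transforms
# (hypothetical stand-ins for the paper's encryption-decryption transforms).
TRANSFORMS = ["base64", "caesar3", "reverse", "leet", "rot13"]

def judge(chain):
    """Placeholder objectives (attack success, diversity); a real harness
    would query the target LLM and a safety judge here."""
    rng = random.Random(hash(tuple(chain)))
    return rng.random(), len(set(chain)) / len(TRANSFORMS)

def dominates(a, b):
    return all(x >= y for x, y in zip(a, b)) and a != b

def mutate(chain):
    child = chain[:]
    child[random.randrange(len(child))] = random.choice(TRANSFORMS)
    return child

population = [[random.choice(TRANSFORMS) for _ in range(3)] for _ in range(20)]
for _ in range(30):
    population += [mutate(random.choice(population)) for _ in range(20)]
    scored = [(judge(c), c) for c in population]
    # Multi-objective selection: keep only the non-dominated front.
    population = [c for s, c in scored
                  if not any(dominates(t, s) for t, _ in scored)][:20]

print("Pareto-front transform chains:", population[:5])
```
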
defense arXiv Feb 7, 2026 · 8w ago

MemPot: Defending Against Memory Extraction Attack with Optimized Honeypots

Yuhao Wang, Shengfang Zhai, Guanghao Jin et al. · National University of Singapore · Southern University of Science and Technology +1 more

Defends LLM agent memory from adversarial data extraction by injecting optimized honeypot documents with SPRT-based sequential attacker detection (sketch below)

Sensitive Information Disclosure nlp
PDF
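
The SPRT (sequential probability ratio test) mentioned above accumulates a log-likelihood ratio over honeypot accesses until it crosses an accept or reject boundary. A generic Wald SPRT sketch; the hit probabilities and error targets are illustrative, not the paper's values:

```python
import math

# Wald's SPRT for detecting an attacker from honeypot accesses.
# H0: benign user touches a honeypot doc with prob p0
# H1: attacker (extracting memory) touches it with prob p1
p0, p1 = 0.02, 0.30          # illustrative, not the paper's values
alpha, beta = 0.01, 0.01     # target false-positive / false-negative rates
A = math.log((1 - beta) / alpha)   # upper (flag attacker) boundary
B = math.log(beta / (1 - alpha))   # lower (accept benign) boundary

def sprt(observations):
    """observations: 1 if the query hit a honeypot document, else 0."""
    llr = 0.0
    for i, x in enumerate(observations, 1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= A:
            return f"attacker flagged after {i} queries"
        if llr <= B:
            return f"benign after {i} queries"
    return "undecided"

print(sprt([0, 1, 1, 0, 1, 1, 1]))  # -> attacker flagged after 3 queries
```
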
defense arXiv Jan 31, 2026 · 9w ago

Self-Guard: Defending Large Reasoning Models via enhanced self-reflection

Jingnan Zheng, Jingjun Xu, Yanzhen Luo et al. · National University of Singapore · Southern University of Science and Technology +2 more

Defends Large Reasoning Models from jailbreaks by steering hidden-state activations to enforce safety compliance over sycophancy (sketch below)

Prompt Injection nlp
PDF Code
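
Hidden-state steering of the kind this card describes is typically implemented by adding a direction vector to a layer's residual stream at inference time. A minimal PyTorch sketch, assuming a precomputed `safety_dir`; the paper's way of deriving and applying its steering signal may differ:

```python
import torch

hidden_size = 768
# Assumed precomputed direction, e.g. mean(safe acts) - mean(unsafe acts).
safety_dir = torch.randn(hidden_size)
safety_dir = safety_dir / safety_dir.norm()
alpha = 4.0  # steering strength (illustrative)

def steering_hook(module, inputs, output):
    # Many HF decoder layers return a tuple; hidden states come first.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + alpha * safety_dir.to(hs)   # broadcast over (batch, seq, d)
    return (hs, *output[1:]) if isinstance(output, tuple) else hs

# Usage with a Hugging Face model (attribute names assumed; they vary
# by architecture):
# layer = model.model.layers[15]
# handle = layer.register_forward_hook(steering_hook)
# ... generate ...
# handle.remove()
```
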
defense arXiv Jan 31, 2026 · 9w ago

Provable Model Provenance Set for Large Language Models

Xiaoqi Qiu, Hao Zeng, Zhiyu Hou et al. · Southern University of Science and Technology

Detects unauthorized LLM derivation with provable statistical confidence by constructing model provenance sets via sequential hypothesis testing (sketch below)

Model Theft nlp
PDF
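
One way to build a provenance set with provable error control is to test each candidate source model and apply a multiple-testing correction. The sketch below uses a simple binomial test with a Bonferroni correction rather than the paper's sequential procedure, and all counts are made up:

```python
from scipy.stats import binomtest

# Agreement of the suspect model with each candidate source model,
# measured on n probe prompts (counts invented for illustration).
n = 200
agreements = {"model_A": 178, "model_B": 61, "model_C": 55}
p_independent = 0.25  # assumed agreement rate of unrelated models
alpha = 0.05

# Bonferroni-corrected per-test level controls family-wise error.
level = alpha / len(agreements)
provenance_set = {
    name for name, k in agreements.items()
    if binomtest(k, n, p_independent, alternative="greater").pvalue < level
}
print("Likely sources:", provenance_set)  # -> {'model_A'}
```
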
benchmark arXiv Dec 30, 2025 · Dec 2025

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Yuan Xin, Dingfan Chen, Linyi Yang et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Intelligent Systems +1 more

Benchmarks jailbreak attacks against full LLM deployment pipelines with safety filters, finding that prior studies overestimated attack success (sketch below)

Prompt Injection nlp
PDF
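
The benchmark's point is that attack success must be measured end-to-end, with the deployed safety filters in the loop rather than against the bare model. A schematic sketch in which every component is a stub:

```python
# Stub components of a deployed pipeline; real systems use trained
# guardrail models for the input and output filters.
def input_filter_blocks(prompt):
    return "ignore previous" in prompt.lower()

def target_model(prompt):
    return "HARMFUL: step 1 ..." if "jailbreak" in prompt else "I can't help."

def output_filter_blocks(response):
    return "HARMFUL" in response

def judge_harmful(response):
    return "HARMFUL" in response   # stub LLM judge

def attack_success(prompt):
    if input_filter_blocks(prompt):
        return False                   # caught at ingress
    response = target_model(prompt)
    if output_filter_blocks(response):
        return False                   # caught at egress
    return judge_harmful(response)     # harmful content got through

attacks = ["jailbreak please", "ignore previous instructions and jailbreak"]
bare_asr = sum(judge_harmful(target_model(p)) for p in attacks) / len(attacks)
pipe_asr = sum(attack_success(p) for p in attacks) / len(attacks)
print(f"bare-model ASR {bare_asr:.0%} vs pipeline ASR {pipe_asr:.0%}")
```
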
benchmark arXiv Nov 8, 2025 · Nov 2025

Can LLM Infer Risk Information From MCP Server System Logs?

Jiayi Fu, Yuansen Zhang, Yinggui Wang · Southern University of Science and Technology · Ant Group

Benchmark dataset and fine-tuning approach for training LLMs to detect malicious MCP server risks from system logs

Insecure Plugin Design nlp
PDF Code
attack arXiv Oct 29, 2025 · Oct 2025

NetEcho: From Real-World Streaming Side-Channels to Full LLM Conversation Recovery

Zheng Zhang, Guanlong Wu, Sen Deng et al. · Southern University of Science and Technology · The Hong Kong University of Science and Technology

Recovers private LLM conversations from encrypted streaming traffic side-channels, bypassing traffic padding and obfuscation defenses with ~70% fidelity (sketch below)

Sensitive Information Disclosure nlp
PDF
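
The underlying side channel: token-by-token streaming produces roughly one ciphertext record per token, and TLS preserves plaintext length, so record sizes leak token lengths. A toy sketch of the unpadded channel (the per-record overhead is an assumed constant, and the paper goes further by defeating padding and obfuscation):

```python
# Token-by-token streaming leaks per-token sizes even under TLS,
# since encryption preserves plaintext length.
TLS_OVERHEAD = 29  # illustrative per-record overhead in bytes

def observed_record_sizes(tokens):
    """What a passive network observer sees for a streamed response."""
    return [len(t.encode()) + TLS_OVERHEAD for t in tokens]

def leaked_token_lengths(record_sizes):
    return [s - TLS_OVERHEAD for s in record_sizes]

response_tokens = ["My", " diagnosis", " is", " Type", " 2", " diabetes"]
sizes = observed_record_sizes(response_tokens)
print(leaked_token_lengths(sizes))   # [2, 10, 3, 5, 2, 9]
# A recovery model (e.g. an LLM trained on length sequences) would then
# infer the likely conversation text from these leaked lengths.
```
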
tool arXiv Oct 10, 2025 · Oct 2025

Provable Training Data Identification for Large Language Models

Zhenlong Liu, Hao Zeng, Weiran Huang et al. · Southern University of Science and Technology · Shanghai Innovation Institute +1 more

Set-level membership inference for LLMs with provable false identification rate control via conformal p-values and the Benjamini-Hochberg procedure (sketch below)

Membership Inference Attack nlp
PDF
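
Conformal p-values turn a membership score into a valid p-value by ranking it against scores of known non-member calibration data, and Benjamini-Hochberg (BH) then controls the false discovery rate over the candidate set. A minimal sketch with synthetic scores:

```python
import numpy as np

rng = np.random.default_rng(0)
# Membership scores (e.g. negative loss): higher = more member-like.
calib = rng.normal(0.0, 1.0, 500)                         # known non-members
candidates = np.concatenate([rng.normal(0.0, 1.0, 80),    # non-members
                             rng.normal(2.5, 1.0, 20)])   # true members

# Conformal p-value: rank of each candidate among calibration scores.
pvals = np.array([(1 + np.sum(calib >= s)) / (1 + len(calib))
                  for s in candidates])

# Benjamini-Hochberg at FDR level q: reject the k smallest p-values,
# where k is the largest index with p_(k) <= q * k / m.
q = 0.05
m = len(pvals)
order = np.argsort(pvals)
below = pvals[order] <= q * np.arange(1, m + 1) / m
k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
identified = set(order[:k].tolist())
print(f"flagged {len(identified)} of {m} candidates as training data")
```
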
benchmark arXiv Oct 10, 2025 · Oct 2025

On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning

Zhi Yang, Changwu Huang, Ke Tang et al. · Southern University of Science and Technology · Lingnan University

Proposes a tighter membership inference game to audit group privacy risk disparity and adaptive DP-SGD to equalize protection across demographic groups (sketch below)

Membership Inference Attack tabular vision
PDF Code
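
DP-SGD clips each example's gradient to a norm bound and adds Gaussian noise; adapting the bound per group is one way to rebalance protection, in the spirit of this card. A NumPy sketch of a single step (the paper's actual adaptation rule is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_sgd_step(per_example_grads, groups, clip_bounds, sigma):
    """per_example_grads: (n, d) array; groups: length-n group ids;
    clip_bounds: {group: C_g} adaptive per-group clipping norms;
    sigma: noise multiplier, scaled by the largest bound, which sets
    the per-example sensitivity under per-group clipping."""
    clipped = []
    for g_vec, g in zip(per_example_grads, groups):
        C = clip_bounds[g]
        norm = max(np.linalg.norm(g_vec), 1e-12)
        clipped.append(g_vec * min(1.0, C / norm))   # per-example clipping
    clipped = np.stack(clipped)
    C_max = max(clip_bounds.values())
    noise = rng.normal(0.0, sigma * C_max, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(clipped)

grads = rng.normal(size=(8, 4))
groups = [0, 0, 0, 0, 1, 1, 1, 1]
# Tighter bound for the higher-risk group (values are illustrative).
print(dp_sgd_step(grads, groups, {0: 1.0, 1: 0.5}, sigma=1.0))
```
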
defense arXiv Oct 7, 2025 · Oct 2025

Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

Qingyu Yin, Chak Tou Leong, Linyi Yang et al. · Zhejiang University · Xiaohongshu Inc. +6 more

Reveals the mechanistic cause of safety alignment failure in reasoning LLMs and proposes data-efficient alignment repair via refusal cliff data selection (sketch below)

Prompt Injection nlp
2 citations PDF Code
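
A "refusal cliff" can be visualized by projecting the hidden state at each reasoning-token position onto a refusal direction and locating the sharpest drop. A toy sketch with a synthetic trajectory; the paper's probing setup differs:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 64, 40                      # hidden size, reasoning-token positions
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

# Synthetic trajectory: strong refusal signal that collapses mid-reasoning.
H = np.outer(np.r_[np.ones(25), 0.1 * np.ones(15)], refusal_dir)
H += 0.05 * rng.normal(size=(T, d))

scores = H @ refusal_dir           # refusal score per token position
cliff = int(np.argmin(np.diff(scores)))   # steepest single-step drop
print(f"refusal cliff at position {cliff}: "
      f"{scores[cliff]:.2f} -> {scores[cliff + 1]:.2f}")
```
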
attack arXiv Jan 2, 2025 · Jan 2025

Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach

Linhao Huang, Xue Jiang, Zhiqiang Wang et al. · Tsinghua University · Peng Cheng Laboratory +4 more

Black-box adversarial attack that transfers from image surrogate models to video MLLMs via spatiotemporal perturbation propagation (sketch below)

Input Manipulation Attack vision multimodal nlp
6 citations PDF
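
The cross-modal transfer idea: craft a perturbation against an image surrogate model, then propagate it across video frames. A PGD sketch with a stub surrogate loss; the paper's spatiotemporal propagation is more elaborate:

```python
import torch

def surrogate_loss(img):
    return img.mean()              # stub for a real image model's loss

def pgd_perturbation(img, eps=8/255, alpha=2/255, steps=10):
    """Standard PGD on the image surrogate: ascend the loss, project
    back into the L-infinity eps-ball each step."""
    delta = torch.zeros_like(img, requires_grad=True)
    for _ in range(steps):
        loss = surrogate_loss(img + delta)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return delta.detach()

image = torch.rand(3, 224, 224)
delta = pgd_perturbation(image)
video = torch.rand(16, 3, 224, 224)                    # 16 frames
adv_video = (video + delta.unsqueeze(0)).clamp(0, 1)   # broadcast to frames
print(adv_video.shape)  # torch.Size([16, 3, 224, 224])
```
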