ML Security Papers

Latest papers

5 papers

attack · arXiv · Jan 6, 2026

GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models

Xiangdong Hu, Yangyang Jiang, Qin Hu et al. · Georgia State University · Nanyang Technological University

Gamified jailbreak uses competitive game framing and image shuffling to bypass MLLM safety alignment, hitting 92% ASR on Gemini 2.5 Flash

Prompt Injection · multimodal · nlp · vision

defense · arXiv · Dec 12, 2025

DFedReweighting: A Unified Framework for Objective-Oriented Reweighting in Decentralized Federated Learning

Kaichuang Zhang, Wei Yin, Jinghao Yang et al. · University of South Florida · The University of Texas Rio Grande Valley +1 more

Defends decentralized federated learning against Byzantine attacks via objective-oriented reweighting aggregation with convergence guarantees

Data Poisoning Attack · federated-learning

benchmark · arXiv · Oct 16, 2025

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Trilok Padhi, Pinxian Lu, Abdulkadir Erol et al. · Georgia State University · Georgia Institute of Technology +1 more

Benchmarks multi-turn jailbreak attacks on LLM agents via memory, planning, and fine-tuning to elicit online harassment

Transfer Learning Attack · Prompt Injection · nlp

1 citation

Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed, with an attack success rate of 95.78–96.89% vs. 57.25–64.19% without tuning in LLaMA, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing the refusal rate to 1–2% in both models. The most prevalent toxic behaviors are Insult, with 84.9–87.8% vs. 44.2–50.8% without tuning, and Flaming, with 81.2–85.1% vs. 31.5–38.8% without tuning, indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning and narcissistic tendencies with memory. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.

llm · Georgia State University · Georgia Institute of Technology · University of California

defense · arXiv · Sep 11, 2025

DP-FedLoRA: Privacy-Enhanced Federated Fine-Tuning for On-Device Large Language Models

Honghui Xu, Shiva Shrestha, Wei Chen et al. · Kennesaw State University · Nexa AI +1 more

Defends federated LLM fine-tuning against membership inference attacks via LoRA with differential privacy noise injection

Membership Inference Attack · nlp · federated-learning

survey · arXiv · Sep 2, 2025

A Survey: Towards Privacy and Security in Mobile Large Language Models

Honghui Xu, Kaiyang Li, Wei Chen et al. · Kennesaw State University · Georgia State University +2 more

Surveys privacy and security threats to mobile LLMs: adversarial attacks, membership inference, side-channel leakage, and defenses

Input Manipulation Attack · Membership Inference Attack · Prompt Injection · Sensitive Information Disclosure · nlp

