Latest papers

21 papers
defense arXiv Mar 28, 2026 · 9d ago

The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models

Shivang Chopra, Shaunak Halbe, Chengyue Huan et al. · Georgia Institute of Technology

Fine-tuning framework that improves VLM adversarial robustness by 13.5% while preserving clean accuracy via loss-landscape curvature and feature-manifold alignment

Input Manipulation Attack vision multimodal
PDF
benchmark arXiv Mar 11, 2026 · 26d ago

The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Raj Sanjay Shah, Jing Huang, Keerthiram Murugesan et al. · Georgia Institute of Technology · Stanford University +1 more

Exposes the brittleness of LLM unlearning by showing that multi-hop and alias queries recover supposedly forgotten information that static benchmarks miss

Sensitive Information Disclosure nlp
PDF Code
benchmark arXiv Jan 21, 2026 · 10w ago

Auditing Language Model Unlearning via Information Decomposition

Anmol Goel, Alan Ritter, Iryna Gurevych · Technical University of Darmstadt · National Research Center for Applied Cybersecurity ATHENE +1 more

Audits LLM unlearning via Partial Information Decomposition, revealing that residual training data remains vulnerable to adversarial reconstruction attacks

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Jan 19, 2026 · 11w ago

Verifying Local Robustness of Pruned Safety-Critical Networks

Minh Le, Phuong Cao · Georgia Institute of Technology · University of Illinois Urbana-Champaign

Empirically shows that the pruning ratio non-linearly affects formal L∞ adversarial-robustness certificates for safety-critical vision models

Input Manipulation Attack vision
PDF
defense arXiv Jan 14, 2026 · 11w ago

Semantic Differentiation for Tackling Challenges in Watermarking Low-Entropy Constrained Generation Outputs

Nghia T. Le, Alan Ritter, Kartik Goyal · Georgia Institute of Technology

Proposes SeqMark, a sequence-level LLM output watermarking scheme improving detection F1 by 28% on constrained generation tasks

Output Integrity Attack nlp
PDF Code
benchmark arXiv Jan 11, 2026 · 12w ago

MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues

Zheyuan Liu, Dongwhi Kim, Yixin Wan et al. · University of Notre Dame · University of California +2 more

Benchmarks multimodal LLM contextual safety against escalating and context-switch jailbreaks across 15 models and 5 guardrails

Prompt Injection multimodal nlp vision
PDF Code
attack arXiv Jan 3, 2026 · Jan 2026

Aggressive Compression Enables LLM Weight Theft

Davis Brown, Juan-Pablo Rivera, Dan Hendrycks et al. · University of Pennsylvania · Georgia Institute of Technology +1 more

Aggressive compression of LLM weights reduces datacenter exfiltration time from months to days, enabling practical weight theft attacks

Model Theft nlp
PDF
defense arXiv Dec 13, 2025 · Dec 2025

Keep the Lights On, Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLMs in Energy Forecasting

Hua Ma, Ruoxi Sun, Minhui Xue et al. · CSIRO’s Data61 · The University of Melbourne +2 more

Defends time-series LLMs against adversarial inputs using sampling-induced divergence to detect perturbed energy forecasting sequences

Input Manipulation Attack timeseries nlp
PDF
attack arXiv Dec 1, 2025 · Dec 2025

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Rongzhe Wei, Peizhi Niu, Xinjie Shen et al. · Georgia Institute of Technology · University of Illinois Urbana-Champaign +4 more

Decomposes harmful requests into innocuous sub-queries via tree search to jailbreak commercial LLM guardrails with a 95%+ success rate

Prompt Injection nlp
1 citation PDF Code
defense arXiv Nov 12, 2025 · Nov 2025

Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models

Tiansheng Huang, Virat Shejwalkar, Oscar Chang et al. · Georgia Institute of Technology · Google DeepMind +1 more

Defends audio language models against representation-drift-based audio jailbreaks using robust reasoning training

Input Manipulation Attack Prompt Injection audio nlp
PDF
defense arXiv Nov 10, 2025 · Nov 2025

A Self-Improving Architecture for Dynamic Safety in Large Language Models

Tyler Slater · Georgia Institute of Technology

Self-adapting runtime safety framework autonomously synthesizes new jailbreak defenses from breach feedback, cutting LLM attack success rate (ASR) from 100% to 45.58%

Prompt Injection nlp
PDF
benchmark arXiv Oct 19, 2025 · Oct 2025

Watermark Robustness and Radioactivity May Be at Odds in Federated Learning

Leixu Huang, Zedian Shao, Teodora Baluta · Georgia Institute of Technology

Reveals that robust aggregation in federated LLM fine-tuning defeats radioactive content watermarks, exposing a provenance-robustness trade-off

Output Integrity Attack nlp federated-learning
PDF
benchmark arXiv Oct 16, 2025 · Oct 2025

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Trilok Padhi, Pinxian Lu, Abdulkadir Erol et al. · Georgia State University · Georgia Institute of Technology +1 more

Benchmarks multi-turn jailbreak attacks on LLM agents via memory, planning, and fine-tuning to elicit online harassment

Transfer Learning Attack Prompt Injection nlp
1 citation PDF
attack arXiv Oct 15, 2025 · Oct 2025

Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

Xin Zhao, Xiaojun Chen, Bingshan Liu et al. · Chinese Academy of Sciences · State Key Laboratory of Cyberspace Security Defense +2 more

Backdoor attack exploiting MoE routing preferences in LLMs to hijack expert pathways with up to 100% attack success rate

Model Poisoning nlp
PDF
defense arXiv Oct 11, 2025 · Oct 2025

Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

Guozhi Liu, Qi Mu, Tiansheng Huang et al. · South China University of Technology · Ltd. +4 more

Curates safety-critical alignment data subsets to harden LLMs against harmful fine-tuning attacks while cutting training time by ~57%

Transfer Learning Attack Prompt Injection nlp
2 citations 1 influential PDF Code
attack arXiv Oct 2, 2025 · Oct 2025

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks

Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar et al. · Georgia Institute of Technology · Oracle AI +1 more

RL + tree search framework discovers multi-turn jailbreak strategies achieving 81.5% ASR across 12 LLMs including Claude-4-Sonnet

Prompt Injection nlp
PDF
defense arXiv Oct 1, 2025 · Oct 2025

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov et al. · Meta · Georgia Institute of Technology +1 more

Defends LLMs against chain-of-thought jailbreaks by RL-training models to self-correct injected flawed reasoning premises

Prompt Injection nlp
7 citations PDF
attack arXiv Sep 29, 2025 · Sep 2025

VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models

Ravikumar Balakrishnan, Mansi Phute · HiddenLayer Inc. · Georgia Institute of Technology

Optimizes adversarial images that steer VLM alignment behaviors such as refusal and sycophancy without runtime access to model internals

Input Manipulation Attack Prompt Injection vision nlp multimodal
1 citation PDF
attack arXiv Sep 3, 2025 · Sep 2025

ANNIE: Be Careful of Your Robots

Yiyang Huang, Zixuan Wang, Zishen Wan et al. · Chinese Academy of Sciences · Georgia Institute of Technology +1 more

Adversarial visual perturbations on VLA robot models cause ISO-defined safety violations at 50%+ success rates, validated on physical robots

Input Manipulation Attack Prompt Injection vision multimodal reinforcement-learning nlp
PDF Code
attack arXiv Aug 11, 2025 · Aug 2025

VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models

Mansi Phute, Ravikumar Balakrishnan · Georgia Institute of Technology · HiddenLayer Inc.

Adversarial visual inputs steer VLM safety behaviors, bypassing refusal and inducing sycophancy, without runtime access to the model

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF