Latest papers

21 papers
defense arXiv Mar 28, 2026 · 9d ago

The Geometry of Robustness: Optimizing Loss Landscape Curvature and Feature Manifold Alignment for Robust Finetuning of Vision-Language Models

Shivang Chopra, Shaunak Halbe, Chengyue Huan et al. · Georgia Institute of Technology

Fine-tuning framework that improves VLM adversarial robustness by 13.5% while preserving clean accuracy via loss-landscape curvature and feature-manifold alignment

Input Manipulation Attack vision multimodal
PDF
benchmark arXiv Mar 11, 2026 · 26d ago

The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Raj Sanjay Shah, Jing Huang, Keerthiram Murugesan et al. · Georgia Institute of Technology · Stanford University +1 more

Exposes the brittleness of LLM unlearning by showing that multi-hop and alias queries recover supposedly forgotten information that static benchmarks miss

Sensitive Information Disclosure nlp
PDF Code
benchmark arXiv Jan 21, 2026 · 10w ago

Auditing Language Model Unlearning via Information Decomposition

Anmol Goel, Alan Ritter, Iryna Gurevych · Technical University of Darmstadt · National Research Center for Applied Cybersecurity ATHENE +1 more

Audits LLM unlearning via Partial Information Decomposition, revealing that residual training data remains vulnerable to adversarial reconstruction attacks

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
benchmark arXiv Jan 19, 2026 · 11w ago

Verifying Local Robustness of Pruned Safety-Critical Networks

Minh Le, Phuong Cao · Georgia Institute of Technology · University of Illinois Urbana-Champaign

Empirically shows that the pruning ratio non-linearly affects formal L∞ adversarial-robustness certificates for safety-critical vision models

Input Manipulation Attack vision
PDF
defense arXiv Jan 14, 2026 · 11w ago

Semantic Differentiation for Tackling Challenges in Watermarking Low-Entropy Constrained Generation Outputs

Nghia T. Le, Alan Ritter, Kartik Goyal · Georgia Institute of Technology

Proposes SeqMark, a sequence-level LLM output watermarking scheme improving detection F1 by 28% on constrained generation tasks

Output Integrity Attack nlp
PDF Code
benchmark arXiv Jan 11, 2026 · 12w ago

MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues

Zheyuan Liu, Dongwhi Kim, Yixin Wan et al. · University of Notre Dame · University of California +2 more

Benchmarks multimodal LLM contextual safety against escalating and context-switch jailbreaks across 15 models and 5 guardrails

Prompt Injection multimodal nlp vision
PDF Code
attack arXiv Jan 3, 2026 · Jan 2026

Aggressive Compression Enables LLM Weight Theft

Davis Brown, Juan-Pablo Rivera, Dan Hendrycks et al. · University of Pennsylvania · Georgia Institute of Technology +1 more

Aggressive compression of LLM weights reduces datacenter exfiltration time from months to days, enabling practical weight theft attacks

Model Theft nlp
PDF
defense arXiv Dec 13, 2025 · Dec 2025

Keep the Lights On, Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLMs in Energy Forecasting

Hua Ma, Ruoxi Sun, Minhui Xue et al. · CSIRO’s Data61 · The University of Melbourne +2 more

Defends time-series LLMs against adversarial inputs using sampling-induced divergence to detect perturbed energy forecasting sequences

Input Manipulation Attack timeseries nlp
PDF
attack arXiv Dec 1, 2025 · Dec 2025

The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search

Rongzhe Wei, Peizhi Niu, Xinjie Shen et al. · Georgia Institute of Technology · University of Illinois Urbana-Champaign +4 more

Decomposes harmful requests into innocuous sub-queries via tree search to jailbreak commercial LLM guardrails with a 95%+ success rate

Prompt Injection nlp
1 citation PDF Code
defense arXiv Nov 12, 2025 · Nov 2025

Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models

Tiansheng Huang, Virat Shejwalkar, Oscar Chang et al. · Georgia Institute of Technology · Google DeepMind +1 more

Defends audio language models against representation-drift-based audio jailbreaks using robust reasoning training

Input Manipulation Attack Prompt Injection audio nlp
PDF
defense arXiv Nov 10, 2025 · Nov 2025

A Self-Improving Architecture for Dynamic Safety in Large Language Models

Tyler Slater · Georgia Institute of Technology

Self-adapting runtime safety framework autonomously synthesizes new jailbreak defenses from breach feedback, cutting LLM attack success rate (ASR) from 100% to 45.58%

Prompt Injection nlp
PDF
benchmark arXiv Oct 19, 2025 · Oct 2025

Watermark Robustness and Radioactivity May Be at Odds in Federated Learning

Leixu Huang, Zedian Shao, Teodora Baluta · Georgia Institute of Technology

Reveals that robust aggregation in federated LLM fine-tuning defeats radioactive content watermarks, exposing a provenance-robustness trade-off

Output Integrity Attack nlp federated-learning
PDF
benchmark arXiv Oct 16, 2025 · Oct 2025

Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Trilok Padhi, Pinxian Lu, Abdulkadir Erol et al. · Georgia State University · Georgia Institute of Technology +1 more

Benchmarks multi-turn jailbreak attacks on LLM agents via memory, planning, and fine-tuning to elicit online harassment

Transfer Learning Attack Prompt Injection nlp
1 citation PDF
attack arXiv Oct 15, 2025 · Oct 2025

Who Speaks for the Trigger? Dynamic Expert Routing in Backdoored Mixture-of-Experts Transformers

Xin Zhao, Xiaojun Chen, Bingshan Liu et al. · Chinese Academy of Sciences · State Key Laboratory of Cyberspace Security Defense +2 more

Backdoor attack exploiting MoE routing preferences in LLMs to hijack expert pathways with up to 100% attack success rate

Model Poisoning nlp
PDF
defense arXiv Oct 11, 2025 · Oct 2025

Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning

Guozhi Liu, Qi Mu, Tiansheng Huang et al. · South China University of Technology · Ltd. +4 more

Curates safety-critical alignment data subsets to harden LLMs against harmful fine-tuning attacks while cutting training time by ~57%

Transfer Learning Attack Prompt Injection nlp
2 citations 1 influential PDF Code
attack arXiv Oct 2, 2025 · Oct 2025

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks

Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar et al. · Georgia Institute of Technology · Oracle AI +1 more

RL + tree search framework discovers multi-turn jailbreak strategies achieving 81.5% ASR across 12 LLMs including Claude-4-Sonnet

Prompt Injection nlp
PDF
defense arXiv Oct 1, 2025 · Oct 2025

Large Reasoning Models Learn Better Alignment from Flawed Thinking

ShengYun Peng, Eric Smith, Ivan Evtimov et al. · Meta · Georgia Institute of Technology +1 more

Defends LLMs against chain-of-thought jailbreaks by RL-training models to self-correct injected flawed reasoning premises

Prompt Injection nlp
7 citations PDF
attack arXiv Sep 29, 2025 · Sep 2025

VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models

Ravikumar Balakrishnan, Mansi Phute · HiddenLayer Inc. · Georgia Institute of Technology

Optimizes adversarial images that steer VLM alignment behaviors such as refusal and sycophancy without runtime access to model internals

Input Manipulation Attack Prompt Injection vision nlp multimodal
1 citation PDF
attack arXiv Sep 3, 2025 · Sep 2025

ANNIE: Be Careful of Your Robots

Yiyang Huang, Zixuan Wang, Zishen Wan et al. · Chinese Academy of Sciences · Georgia Institute of Technology +1 more

Adversarial visual perturbations on VLA robot models cause ISO-defined safety violations at 50%+ success rates, validated on physical robots

Input Manipulation Attack Prompt Injection vision multimodal reinforcement-learning nlp
PDF Code
attack arXiv Aug 11, 2025 · Aug 2025

VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models

Mansi Phute, Ravikumar Balakrishnan · Georgia Institute of Technology · HiddenLayer Inc.

Adversarial visual inputs steer VLM safety behaviors, bypassing refusal and inducing sycophancy, without runtime access to the model

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF