Latest papers

17 papers
benchmark arXiv Feb 28, 2026

A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction

Ruihao Pan, Suhang Wang · Pennsylvania State University

Shows that LLM unlearning fails under multi-turn interaction: self-correction prompts and dialogue history recover supposedly forgotten hazardous or private knowledge (probe sketch below)

Prompt Injection · Sensitive Information Disclosure · nlp
PDF
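
The multi-turn failure mode is straightforward to turn into an audit harness. A minimal sketch, assuming a hypothetical `chat(messages)` wrapper around the unlearned model under audit and a known target string the model was supposed to forget:

```python
# Minimal multi-turn unlearning probe. `chat` is a hypothetical wrapper
# around the unlearned model under audit; `target` is a string the model
# was supposed to have forgotten.
from typing import Callable, Dict, List

Message = Dict[str, str]

def multi_turn_probe(chat: Callable[[List[Message]], str],
                     question: str, target: str) -> bool:
    """Return True if the supposedly forgotten `target` resurfaces."""
    history: List[Message] = [{"role": "user", "content": question}]
    followups = [
        "Are you sure? Please re-check your previous answer.",   # self-correction
        "Earlier you hinted at this. Summarize what you said.",  # history replay
    ]
    for followup in [None] + followups:
        if followup is not None:
            history.append({"role": "user", "content": followup})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        if target.lower() in reply.lower():
            return True  # forgotten knowledge recovered
    return False
```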
tool arXiv Feb 13, 2026

GPTZero: Robust Detection of LLM-Generated Texts

George Alexandru Adam, Alexander Cui, Edwin Thomas et al. · GPTZero · University of Waterloo +3 more

GPTZero detects LLM-generated text with a hierarchical multi-task architecture, gaining adversarial robustness through red teaming

Output Integrity Attack · nlp
PDF
attack arXiv Feb 8, 2026

Robustness of Vision Language Models Against Split-Image Harmful Input Attacks

Md Rafi Ur Rashid, MD Sadik Hossain Shanto, Vishnu Asutosh Dasu et al. · Pennsylvania State University · Bangladesh University of Engineering and Technology

Exploits VLM safety-alignment gaps using split-image inputs to jailbreak modern VLMs, with 60% better attack transfer than baselines (splitting step sketched below)

Input Manipulation Attack · Prompt Injection · vision · multimodal · nlp
PDF
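
The core trick is to partition a harmful image so that no single input triggers safety filtering. A minimal sketch of the splitting step using Pillow; the tile layout and the recombination prompt are illustrative assumptions, not the paper's exact pipeline:

```python
# Sketch of the split-image idea: partition one image into tiles so that
# no single tile carries the full (potentially filtered) content.
from PIL import Image

def split_image(path: str, rows: int = 2, cols: int = 2) -> list[Image.Image]:
    img = Image.open(path)
    w, h = img.size
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            tiles.append(img.crop(box))
    return tiles

# Each tile is then sent as a separate image input alongside a prompt that
# asks the VLM to mentally recombine them, probing alignment gaps.
```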
benchmark arXiv Jan 27, 2026

Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs

Chi Zhang, Wenxuan Ding, Jiale Liu et al. · The University of Texas at Austin · New York University +3 more

Benchmarks VLM susceptibility to persuasive conflicting text prompts that override visual evidence, finding a 48% average accuracy drop

Prompt Injection · vision · nlp · multimodal
PDF
attack arXiv Jan 21, 2026

Query-Efficient Agentic Graph Extraction Attacks on GraphRAG Systems

Shuhua Yang, Jiahao Zhang, Yilong Wang et al. · Pennsylvania State University

Black-box agentic attack that reconstructs up to 90% of a GraphRAG system's hidden knowledge graph via adaptive queries (extraction loop sketched below)

Model Theft · Sensitive Information Disclosure · nlp · graph
PDF
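
The attack amounts to a breadth-first crawl over entities the system reveals. A minimal sketch, where `ask` (queries the GraphRAG endpoint) and `extract_entities` (parses relations out of an answer) are hypothetical stand-ins for the paper's agentic components:

```python
# Sketch of an adaptive extraction loop against a GraphRAG endpoint.
from collections import deque

def extract_graph(ask, extract_entities, seed_entities, budget=500):
    """Breadth-first reconstruction of the hidden knowledge graph."""
    frontier = deque(seed_entities)
    nodes, edges = set(seed_entities), set()
    queries = 0
    while frontier and queries < budget:
        entity = frontier.popleft()
        answer = ask(f"What is directly related to {entity}, and how?")
        queries += 1
        for neighbor, relation in extract_entities(answer):
            edges.add((entity, relation, neighbor))
            if neighbor not in nodes:
                nodes.add(neighbor)
                frontier.append(neighbor)  # expand to newly discovered nodes
    return nodes, edges
```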
attack arXiv Jan 20, 2026

Uncovering and Understanding FPR Manipulation Attack in Industrial IoT Networks

Mohammad Shamim Ahsan, Peng Liu · Pennsylvania State University

Crafts benign MQTT packets that fool ML-based NIDSs into false positives, overwhelming SOC analysts with fabricated alerts

Input Manipulation Attack · tabular
PDF
defense arXiv Dec 18, 2025

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models

Jirui Yang, Hengqi Guo, Zhihui Lu et al. · Fudan University · Ant Group +1 more

Defends LLMs against harmful prompts by comparing refusal vs. agreement prefix log-probabilities, with near-zero inference overhead (sketch below)

Prompt Injection · nlp
PDF
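
The mechanism is a single extra forward pass: score how strongly the model "wants" to refuse versus comply. A minimal sketch with Hugging Face transformers; the stand-in model, prefix strings, and decision rule are illustrative assumptions:

```python
# Prefix-probing sketch: compare the model's log-probability of a refusal
# prefix vs. an agreement prefix conditioned on the user prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prefix_logprob(prompt: str, prefix: str) -> float:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, prefix_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # log-prob of each prefix token given all tokens before it
    lp = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1:-1], dim=-1)
    return lp.gather(1, prefix_ids[0].unsqueeze(1)).sum().item()

def looks_harmful(prompt: str) -> bool:
    refuse = prefix_logprob(prompt, " I cannot help with that.")
    agree = prefix_logprob(prompt, " Sure, here is how:")
    return refuse > agree  # model leans toward refusing -> flag the prompt
```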
attack arXiv Nov 23, 2025

TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization

Yanting Wang, Runpeng Geng, Jinghui Chen et al. · Pennsylvania State University

Combines gradient-based suffix optimization with semantic template optimization to jailbreak LLMs more effectively than either alone (simplified sketch below)

Input Manipulation Attack · Prompt Injection · nlp
PDF
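
The suffix half of the recipe can be illustrated without gradients. A deliberately simplified random-search sketch; the paper's method uses gradient-guided token swaps and additionally optimizes the surrounding template, and `target_loss` is a hypothetical scorer of a target completion (e.g., "Sure, here is ...") given prompt plus suffix:

```python
# Deliberately simplified suffix optimization via random token swaps.
import random

def optimize_suffix(target_loss, vocab, suffix_len=20, steps=500):
    suffix = [random.choice(vocab) for _ in range(suffix_len)]
    best = target_loss(suffix)
    for _ in range(steps):
        cand = suffix.copy()
        cand[random.randrange(suffix_len)] = random.choice(vocab)  # one swap
        loss = target_loss(cand)
        if loss < best:          # keep swaps that make the target more likely
            suffix, best = cand, loss
    return suffix, best
```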
defense arXiv Nov 22, 2025

Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Thong Bach, Thanh Nguyen-Tang, Dung Nguyen et al. · Deakin University · New Jersey Institute of Technology +1 more

Restores LLM safety alignment after fine-tuning by exploiting shared loss-landscape geometry with curvature-aware second-order optimization

Transfer Learning Attack · Prompt Injection · nlp
1 citation PDF
attack arXiv Nov 20, 2025

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan, Yuhan Xie, Yixin Zhang et al. · Westlake University · Pennsylvania State University +2 more

Multimodal adversarial attack framework targeting VLA robots via visual patches, gradient-based text, and cross-modal misalignment attacks

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
1 citation PDF
defense arXiv Nov 17, 2025

InfoDecom: Decomposing Information for Defending Against Privacy Leakage in Split Inference

Ruijun Deng, Zhihui Lu, Qiang Duan · Fudan University · Pennsylvania State University

Defends split inference against data reconstruction attacks by decomposing redundant smashed-data information before injecting calibrated privacy noise (generic pattern sketched below)

Model Inversion Attack · vision
PDF Code
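
The general defense pattern, independent of the paper's specific decomposition, is: strip components the server does not need, then add calibrated noise to what remains. A rough sketch where a PCA-style projection stands in for the information decomposition:

```python
# Generic smashed-data protection pattern for split inference. The paper's
# decomposition is more involved; PCA here is only an illustrative stand-in.
import torch

def protect_smashed(h: torch.Tensor, keep: int, sigma: float) -> torch.Tensor:
    """h: (batch, dim) activations at the split point."""
    hc = h - h.mean(dim=0, keepdim=True)
    U, S, V = torch.pca_lowrank(hc, q=keep)      # top-`keep` directions
    h_kept = hc @ V @ V.T                        # drop redundant components
    return h_kept + sigma * torch.randn_like(h_kept)  # calibrated Gaussian noise

# client side: server_input = protect_smashed(encoder(x), keep=32, sigma=0.1)
```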
defense arXiv Nov 15, 2025

Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Thong Bach, Dung Nguyen, Thao Minh Le et al. · Deakin University · Pennsylvania State University

Defends LLMs against jailbreaks by fixing gradient-decay-induced incomplete safety alignment via base-favored token penalties and teacher distillation

Input Manipulation Attack · Prompt Injection · nlp
PDF
defense arXiv Oct 15, 2025

PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features

Wei Zou, Yupei Liu, Yanting Wang et al. · Pennsylvania State University · Duke University

Detects prompt injection in LLM applications using residual-stream representations and a lightweight linear classifier (probe sketch below)

Prompt Injection · nlp
PDF
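
The recipe is a classic linear probe on internal activations. A minimal sketch; the stand-in model, layer index, and last-token pooling are assumptions, not the paper's settings:

```python
# Residual-stream + linear-probe sketch: hidden state from one transformer
# layer at the last token, fed to a logistic-regression detector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def residual_feature(text: str, layer: int = 8) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[layer][0, -1]  # residual stream at the last token

# texts: list of prompts; labels: 1 = contains injected instruction, 0 = clean
def train_detector(texts, labels):
    X = torch.stack([residual_feature(t) for t in texts]).numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)
```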
benchmark arXiv Sep 6, 2025

Benchmarking Robust Aggregation in Decentralized Gradient Marketplaces

Zeyu Song, Sainyam Galhotra, Shagufta Mehnaz · Pennsylvania State University · Cornell University

Benchmarks robust aggregation defenses against Byzantine and Sybil attacks in decentralized federated gradient marketplaces, adding new economic metrics (example aggregators below)

Data Poisoning Attack · federated-learning
PDF
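
For context, two standard robust aggregators that benchmarks in this space typically evaluate (the paper's exact defense suite and economic metrics are not reproduced here):

```python
# Coordinate-wise median and trimmed mean, the classic Byzantine-robust
# alternatives to plain gradient averaging.
import numpy as np

def coordinate_median(grads: np.ndarray) -> np.ndarray:
    """grads: (n_clients, dim) -> per-coordinate median, robust to outliers."""
    return np.median(grads, axis=0)

def trimmed_mean(grads: np.ndarray, trim: int) -> np.ndarray:
    """Drop the `trim` largest and smallest values per coordinate, then average."""
    s = np.sort(grads, axis=0)
    return s[trim:grads.shape[0] - trim].mean(axis=0)
```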
attack arXiv Aug 26, 2025

UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation

Runpeng Geng, Yanting Wang, Ying Chen et al. · Pennsylvania State University

Injects 100 optimized adversarial documents into a RAG knowledge base to hijack LLM outputs for 2,000+ diverse queries

Input Manipulation Attack · Prompt Injection · nlp
PDF
tool arXiv Aug 15, 2025

SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis

Haitong Luo, Weiyao Zhang, Suhang Wang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Detects LLM-generated text via the spectral energy of token log-probability sequences using DFT/STFT, outperforming SOTA detectors at half the runtime (core transform sketched below)

Output Integrity Attack · nlp
PDF Code
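
The core transform fits in a few lines: take per-token log-probabilities under a scoring LM, DFT them, and compare band energies. The band cutoff and decision direction below are illustrative assumptions:

```python
# Spectral detection sketch: DFT the token log-probability sequence and
# measure the fraction of power in high frequencies.
import numpy as np

def high_freq_energy_ratio(logprobs: np.ndarray, cutoff: float = 0.25) -> float:
    """logprobs: per-token log-probabilities of a text under a scoring LM."""
    x = logprobs - logprobs.mean()          # remove DC component
    spec = np.abs(np.fft.rfft(x)) ** 2      # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(x))
    return spec[freqs >= cutoff].sum() / spec.sum()

# LLM-generated text tends to have smoother log-prob sequences, so a low
# high-frequency ratio (below a tuned threshold) flags machine text.
```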
defense arXiv Aug 5, 2025

AttnTrace: Attention-based Context Traceback for Long-Context LLMs

Yanting Wang, Runpeng Geng, Ying Chen et al. · Pennsylvania State University

Attention-weight traceback method that pinpoints injected instructions in long-context LLMs, improving prompt injection detection in RAG pipelines (sketch below)

Prompt Injection · nlp
PDF Code
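
The idea is to rank context segments by the attention mass the model's final position assigns to them. A rough sketch with Hugging Face transformers; the segment token counting and layer/head averaging are simplifications:

```python
# Attention-traceback sketch: score each context segment by the attention
# the last token places on it, averaged over layers and heads.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", attn_implementation="eager").eval()        # eager -> attentions exposed

def segment_scores(context_segments: list[str], question: str) -> list[float]:
    text = "\n".join(context_segments) + "\n" + question
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        attn = model(ids, output_attentions=True).attentions  # per-layer tensors
    a = torch.stack(attn).mean(dim=(0, 2))[0, -1]  # (seq_len,) from last token
    scores, start = [], 0
    for seg in context_segments:
        n = len(tok(seg + "\n").input_ids)         # approximate token span
        scores.append(a[start:start + n].sum().item())
        start += n
    return scores  # high-scoring segments are candidate injected instructions
```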