Latest papers

10 papers
defense arXiv Jan 22, 2026

NOIR: Privacy-Preserving Generation of Code with Open-Source LLMs

Khoa Nguyen, Khiem Ton, NhatHai Phan et al. · New Jersey Institute of Technology · Hamad Bin Khalifa University +2 more

Defends LLM code-generation prompts against reconstruction by the cloud provider, using embedding-level local differential privacy and a randomized tokenizer

Model Inversion Attack · Sensitive Information Disclosure · nlp
1 citation · 1 influential · PDF
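The mechanism named in this entry can be illustrated with a minimal sketch of embedding-level local DP: L2-clip each token embedding and add Gaussian noise calibrated to (epsilon, delta). All names and parameters below are illustrative, not NOIR's implementation, and the sketch omits the randomized tokenizer.

import numpy as np

def ldp_perturb(emb: np.ndarray, clip: float, epsilon: float, delta: float) -> np.ndarray:
    # Clip each token embedding to an L2 ball so the mechanism's sensitivity is bounded.
    norms = np.linalg.norm(emb, axis=-1, keepdims=True)
    clipped = emb * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    # Gaussian mechanism: two clipped vectors differ by at most 2*clip in L2 norm.
    sigma = 2 * clip * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return clipped + np.random.normal(0.0, sigma, size=emb.shape)

# Perturb locally, then send only the noisy embeddings to the cloud model.
noisy = ldp_perturb(np.random.randn(16, 768), clip=1.0, epsilon=2.0, delta=1e-5)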
attack arXiv Nov 27, 2025

CacheTrap: Injecting Trojans in LLMs without Leaving any Traces in Inputs or Weights

Mohaiminul Al Nahian, Abeer Matar A. Almalky, Gamana Aragonda et al. · SUNY Binghamton · New Jersey Institute of Technology +1 more

Injects Trojan behavior into LLMs via a single KV-cache bit-flip, leaving no traces in weights or inputs

Model Poisoning · nlp
PDF
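To make "a single KV-cache bit-flip" concrete: a toy sketch (not CacheTrap's attack) showing how flipping one exponent bit of a float32 cache entry changes it by dozens of orders of magnitude, which is why a single runtime fault can steer generation without touching weights or inputs.

import struct

def flip_bit(value: float, bit: int) -> float:
    # Reinterpret a float32 as a 32-bit integer, XOR one bit, reinterpret back.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

print(flip_bit(0.5, 30))  # flipping a high exponent bit: 0.5 -> ~1.7e38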
defense arXiv Nov 22, 2025

Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Thong Bach, Thanh Nguyen-Tang, Dung Nguyen et al. · Deakin University · New Jersey Institute of Technology +1 more

Restores LLM safety alignment after fine-tuning by exploiting shared loss-landscape geometry with curvature-aware second-order optimization

Transfer Learning Attack · Prompt Injection · nlp
1 citation · PDF
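"Curvature-aware" here means the restoration update is preconditioned by second-order information. A minimal sketch of one assumed form, using a damped diagonal Fisher approximation as the curvature proxy; the paper's actual optimizer and its use of shared loss-landscape geometry are more involved.

import numpy as np

def restore_step(theta: np.ndarray, theta_safe: np.ndarray, fisher_diag: np.ndarray,
                 lr: float = 0.1, damping: float = 1e-3) -> np.ndarray:
    # Newton-style step back toward the safety-aligned checkpoint: the displacement
    # from the aligned weights stands in for the safety-loss gradient, and the
    # damped diagonal Fisher stands in for the Hessian.
    return theta - lr * (theta - theta_safe) / (fisher_diag + damping)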
attack arXiv Nov 9, 2025

Rep2Text: Decoding Full Text from a Single LLM Token Representation

Haiyan Zhao, Zirui He, Fan Yang et al. · New Jersey Institute of Technology · Wake Forest University +1 more

Inverts an LLM's last-token representation to reconstruct the original input text, recovering over half of the information in 16-token sequences

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF
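A far weaker but easy-to-state baseline shows the same attack surface: score a leaked hidden state against the model's unembedding matrix, logit-lens style. Rep2Text instead trains a generative decoder over the representation; the names below are generic placeholders.

import numpy as np

def nearest_tokens(h: np.ndarray, unembed: np.ndarray, vocab: list, k: int = 5) -> list:
    # h: leaked last-token hidden state, shape (d,).
    # unembed: unembedding matrix, shape (vocab_size, d).
    scores = unembed @ h
    top = np.argsort(scores)[::-1][:k]
    return [vocab[i] for i in top]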
attack arXiv Oct 24, 2025

δ-STEAL: LLM Stealing Attack with Local Differential Privacy

Kieu Dang, Phung Lai, NhatHai Phan et al. · University at Albany · New Jersey Institute of Technology +2 more

Injects LDP noise during fine-tuning to steal LLM behavior through APIs while evading watermark detectors, achieving a 96.95% attack success rate

Model Theft · Output Integrity Attack · nlp
2 citations · PDF · Code
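The flavor of LDP noise injection can be shown with classical k-ary randomized response applied to teacher tokens collected from the victim API; this is a textbook LDP mechanism standing in for δ-STEAL's actual scheme. The intuition is that randomized training targets dilute token-level watermark statistics while retaining enough signal for distillation.

import math
import random

def randomized_response(token_id: int, vocab_size: int, epsilon: float) -> int:
    # Report the teacher's token with probability e^eps / (e^eps + k - 1);
    # otherwise report one of the k-1 other tokens uniformly at random.
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + vocab_size - 1)
    if random.random() < p_keep:
        return token_id
    other = random.randrange(vocab_size - 1)
    return other if other < token_id else other + 1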
defense arXiv Oct 3, 2025

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee et al. · New Jersey Institute of Technology · Heritage Institute of Technology

Proposes CS-RLHF, a penalty-based constrained RLHF framework offering certifiable safety and 5x greater jailbreak resistance than Lagrangian baselines

Prompt Injection · nlp · reinforcement-learning
PDF · Code
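The fixed-penalty objective is simple to state; a sketch with generic symbols (r: reward-model score, c: safety-cost score, b: cost budget, rho: the fixed penalty). The paper's certifiability argument concerns how rho is chosen and is omitted here.

def penalized_reward(r: float, c: float, b: float, rho: float) -> float:
    # Fixed-penalty constrained objective: rho is a constant, so unlike
    # Lagrangian methods there is no dual variable to adapt during training.
    return r - rho * max(0.0, c - b)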
defense arXiv Sep 30, 2025

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Tuan Nguyen, Naseem Khan, Khang Tran et al. · Qatar Computing Research Institute · New Jersey Institute of Technology

Novel RL algorithm aligns VLM paragraph-level reasoning with visual evidence to improve deepfake detection accuracy

Output Integrity Attack · vision · multimodal · nlp
PDF
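Paragraph-level policy optimization presumably assigns credit per reasoning paragraph rather than per response; a minimal sketch of one assumed (GRPO-style) form of that credit assignment, not necessarily PRPO's exact objective.

import numpy as np

def paragraph_advantages(rewards: np.ndarray) -> np.ndarray:
    # rewards: one scalar per paragraph (e.g., agreement with visual evidence).
    # Whitening turns them into advantages for a per-paragraph policy-gradient update.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)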
defense arXiv Sep 28, 2025

Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment

Pu Huang, Shouguang Wang, Siya Yao et al. · Zhejiang Gongshang University · New Jersey Institute of Technology

Novel speech deepfake detector combining information bottleneck and confidence-aware adversarial alignment for generalizable detection across unseen spoofing methods

Output Integrity Attack · audio
PDF
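The information-bottleneck term in such detectors is commonly the variational KL that compresses the latent code; a minimal sketch assuming a diagonal-Gaussian encoder. The paper's full loss additionally includes the confidence-aware adversarial alignment.

import numpy as np

def ib_kl(mu: np.ndarray, logvar: np.ndarray) -> float:
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ): penalizing latent capacity
    # discourages spoof-method-specific detail, which helps generalization.
    return 0.5 * float(np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar))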
defense arXiv Sep 11, 2025

CryptGNN: Enabling Secure Inference for Graph Neural Networks

Pritam Sen, Yao Ma, Cristian Borcea · New Jersey Institute of Technology · Rensselaer Polytechnic Institute

SMPC-based secure GNN inference framework that protects model parameters from clients and client inputs from cloud providers

Model Theft · graph
PDF
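Why SMPC fits GNN inference: linear steps such as neighbor aggregation can run directly on additive secret shares. A two-party toy sketch over a prime field; CryptGNN's protocol handles nonlinearities and the full multi-party setting beyond this.

import random

P = 2**61 - 1  # prime field modulus

def share(x: int):
    # Split x into two additive shares; neither share alone reveals x.
    r = random.randrange(P)
    return r, (x - r) % P

a0, a1 = share(5)
b0, b1 = share(7)
# Each party adds its shares locally; reconstruction yields the true neighbor sum.
assert (a0 + b0 + a1 + b1) % P == 12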
defense arXiv Aug 19, 2025

FedUP: Efficient Pruning-based Federated Unlearning for Model Poisoning Attacks

Nicolò Romandini, Cristian Borcea, Rebecca Montanari et al. · University of Bologna · New Jersey Institute of Technology

Pruning-based federated unlearning defense that removes malicious client influence from FL global models after label-flipping and backdoor poisoning attacks

Data Poisoning Attack · Model Poisoning · federated-learning
PDF
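A minimal sketch of pruning-based unlearning: zero out the small fraction of global-model weights where the flagged clients' mean update diverges most from the benign mean. The influence score here is an assumption for illustration; FedUP's actual pruning criterion differs.

import numpy as np

def unlearn_prune(w: np.ndarray, bad_mean: np.ndarray, good_mean: np.ndarray,
                  frac: float = 0.05) -> np.ndarray:
    # Rank weights by |malicious mean update - benign mean update| and prune the top fraction.
    influence = np.abs(bad_mean - good_mean).ravel()
    k = max(1, int(frac * influence.size))
    idx = np.argpartition(influence, -k)[-k:]
    out = w.copy().ravel()
    out[idx] = 0.0
    return out.reshape(w.shape)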