Latest papers

10 papers
defense arXiv Jan 22, 2026

NOIR: Privacy-Preserving Generation of Code with Open-Source LLMs

Khoa Nguyen, Khiem Ton, NhatHai Phan et al. · New Jersey Institute of Technology · Hamad Bin Khalifa University +2 more

Defends LLM code-generation prompts against reconstruction by the cloud provider, using embedding-level local differential privacy and a randomized tokenizer

Model Inversion Attack · Sensitive Information Disclosure · nlp
1 citation · 1 influential · PDF
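The mechanism named in this entry can be illustrated with a minimal sketch of embedding-level local DP: L2-clip each token embedding and add Gaussian noise calibrated to (epsilon, delta). All names and parameters below are illustrative, not NOIR's implementation, and the sketch omits the randomized tokenizer.

import numpy as np

def ldp_perturb(emb: np.ndarray, clip: float, epsilon: float, delta: float) -> np.ndarray:
    # Clip each token embedding to an L2 ball so the mechanism's sensitivity is bounded.
    norms = np.linalg.norm(emb, axis=-1, keepdims=True)
    clipped = emb * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    # Gaussian mechanism: two clipped vectors differ by at most 2*clip in L2 norm.
    sigma = 2 * clip * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return clipped + np.random.normal(0.0, sigma, size=emb.shape)

# Perturb locally, then send only the noisy embeddings to the cloud model.
noisy = ldp_perturb(np.random.randn(16, 768), clip=1.0, epsilon=2.0, delta=1e-5)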
attack arXiv Nov 27, 2025

CacheTrap: Injecting Trojans in LLMs without Leaving any Traces in Inputs or Weights

Mohaiminul Al Nahian, Abeer Matar A. Almalky, Gamana Aragonda et al. · SUNY Binghamton · New Jersey Institute of Technology +1 more

Injects Trojan behavior into LLMs via a single KV-cache bit-flip, leaving no traces in weights or inputs

Model Poisoning · nlp
PDF
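To make "a single KV-cache bit-flip" concrete: a toy sketch (not CacheTrap's attack) showing how flipping one exponent bit of a float32 cache entry changes it by dozens of orders of magnitude, which is why a single runtime fault can steer generation without touching weights or inputs.

import struct

def flip_bit(value: float, bit: int) -> float:
    # Reinterpret a float32 as a 32-bit integer, XOR one bit, reinterpret back.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

print(flip_bit(0.5, 30))  # flipping a high exponent bit: 0.5 -> ~1.7e38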
defense arXiv Nov 22, 2025

Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Thong Bach, Thanh Nguyen-Tang, Dung Nguyen et al. · Deakin University · New Jersey Institute of Technology +1 more

Restores LLM safety alignment after fine-tuning by exploiting shared loss-landscape geometry with curvature-aware second-order optimization

Transfer Learning Attack · Prompt Injection · nlp
1 citation · PDF
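"Curvature-aware" here means the restoration update is preconditioned by second-order information. A minimal sketch of one assumed form, using a damped diagonal Fisher approximation as the curvature proxy; the paper's actual optimizer and its use of shared loss-landscape geometry are more involved.

import numpy as np

def restore_step(theta: np.ndarray, theta_safe: np.ndarray, fisher_diag: np.ndarray,
                 lr: float = 0.1, damping: float = 1e-3) -> np.ndarray:
    # Newton-style step back toward the safety-aligned checkpoint: the displacement
    # from the aligned weights stands in for the safety-loss gradient, and the
    # damped diagonal Fisher stands in for the Hessian.
    return theta - lr * (theta - theta_safe) / (fisher_diag + damping)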
attack arXiv Nov 9, 2025

Rep2Text: Decoding Full Text from a Single LLM Token Representation

Haiyan Zhao, Zirui He, Fan Yang et al. · New Jersey Institute of Technology · Wake Forest University +1 more

Inverts an LLM's last-token representation to reconstruct the original input text, recovering over half of the information in 16-token sequences

Model Inversion Attack · Sensitive Information Disclosure · nlp
PDF
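A far weaker but easy-to-state baseline shows the same attack surface: score a leaked hidden state against the model's unembedding matrix, logit-lens style. Rep2Text instead trains a generative decoder over the representation; the names below are generic placeholders.

import numpy as np

def nearest_tokens(h: np.ndarray, unembed: np.ndarray, vocab: list, k: int = 5) -> list:
    # h: leaked last-token hidden state, shape (d,).
    # unembed: unembedding matrix, shape (vocab_size, d).
    scores = unembed @ h
    top = np.argsort(scores)[::-1][:k]
    return [vocab[i] for i in top]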
attack arXiv Oct 24, 2025

δ-STEAL: LLM Stealing Attack with Local Differential Privacy

Kieu Dang, Phung Lai, NhatHai Phan et al. · University at Albany · New Jersey Institute of Technology +2 more

Injects LDP noise during fine-tuning to steal LLM behavior through APIs while evading watermark detectors, achieving a 96.95% attack success rate

Model Theft · Output Integrity Attack · nlp
2 citations · PDF · Code
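The flavor of LDP noise injection can be shown with classical k-ary randomized response applied to teacher tokens collected from the victim API; this is a textbook LDP mechanism standing in for δ-STEAL's actual scheme. The intuition is that randomized training targets dilute token-level watermark statistics while retaining enough signal for distillation.

import math
import random

def randomized_response(token_id: int, vocab_size: int, epsilon: float) -> int:
    # Report the teacher's token with probability e^eps / (e^eps + k - 1);
    # otherwise report one of the k-1 other tokens uniformly at random.
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + vocab_size - 1)
    if random.random() < p_keep:
        return token_id
    other = random.randrange(vocab_size - 1)
    return other if other < token_id else other + 1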
defense arXiv Oct 3, 2025

Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee et al. · New Jersey Institute of Technology · Heritage Institute of Technology

Proposes CS-RLHF, a penalty-based constrained RLHF framework offering certifiable safety and 5x greater jailbreak resistance than Lagrangian baselines

Prompt Injection · nlp · reinforcement-learning
PDF · Code
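The fixed-penalty objective is simple to state; a sketch with generic symbols (r: reward-model score, c: safety-cost score, b: cost budget, rho: the fixed penalty). The paper's certifiability argument concerns how rho is chosen and is omitted here.

def penalized_reward(r: float, c: float, b: float, rho: float) -> float:
    # Fixed-penalty constrained objective: rho is a constant, so unlike
    # Lagrangian methods there is no dual variable to adapt during training.
    return r - rho * max(0.0, c - b)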
defense arXiv Sep 30, 2025

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Tuan Nguyen, Naseem Khan, Khang Tran et al. · Qatar Computing Research Institute · New Jersey Institute of Technology

Novel RL algorithm aligns VLM paragraph-level reasoning with visual evidence to improve deepfake detection accuracy

Output Integrity Attack · vision · multimodal · nlp
PDF
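Paragraph-level policy optimization presumably assigns credit per reasoning paragraph rather than per response; a minimal sketch of one assumed (GRPO-style) form of that credit assignment, not necessarily PRPO's exact objective.

import numpy as np

def paragraph_advantages(rewards: np.ndarray) -> np.ndarray:
    # rewards: one scalar per paragraph (e.g., agreement with visual evidence).
    # Whitening turns them into advantages for a per-paragraph policy-gradient update.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)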
defense arXiv Sep 28, 2025

Generalizable Speech Deepfake Detection via Information Bottleneck Enhanced Adversarial Alignment

Pu Huang, Shouguang Wang, Siya Yao et al. · Zhejiang Gongshang University · New Jersey Institute of Technology

Novel speech deepfake detector combining information bottleneck and confidence-aware adversarial alignment for generalizable detection across unseen spoofing methods

Output Integrity Attack · audio
PDF
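The information-bottleneck term in such detectors is commonly the variational KL that compresses the latent code; a minimal sketch assuming a diagonal-Gaussian encoder. The paper's full loss additionally includes the confidence-aware adversarial alignment.

import numpy as np

def ib_kl(mu: np.ndarray, logvar: np.ndarray) -> float:
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ): penalizing latent capacity
    # discourages spoof-method-specific detail, which helps generalization.
    return 0.5 * float(np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar))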
defense arXiv Sep 11, 2025

CryptGNN: Enabling Secure Inference for Graph Neural Networks

Pritam Sen, Yao Ma, Cristian Borcea · New Jersey Institute of Technology · Rensselaer Polytechnic Institute

SMPC-based secure GNN inference framework that protects model parameters from clients and client inputs from cloud providers

Model Theft · graph
PDF
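Why SMPC fits GNN inference: linear steps such as neighbor aggregation can run directly on additive secret shares. A two-party toy sketch over a prime field; CryptGNN's protocol handles nonlinearities and the full multi-party setting beyond this.

import random

P = 2**61 - 1  # prime field modulus

def share(x: int):
    # Split x into two additive shares; neither share alone reveals x.
    r = random.randrange(P)
    return r, (x - r) % P

a0, a1 = share(5)
b0, b1 = share(7)
# Each party adds its shares locally; reconstruction yields the true neighbor sum.
assert (a0 + b0 + a1 + b1) % P == 12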
defense arXiv Aug 19, 2025

FedUP: Efficient Pruning-based Federated Unlearning for Model Poisoning Attacks

Nicolò Romandini, Cristian Borcea, Rebecca Montanari et al. · University of Bologna · New Jersey Institute of Technology

Pruning-based federated unlearning defense that removes malicious client influence from FL global models after label-flipping and backdoor poisoning attacks

Data Poisoning Attack · Model Poisoning · federated-learning
PDF
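A minimal sketch of pruning-based unlearning: zero out the small fraction of global-model weights where the flagged clients' mean update diverges most from the benign mean. The influence score here is an assumption for illustration; FedUP's actual pruning criterion differs.

import numpy as np

def unlearn_prune(w: np.ndarray, bad_mean: np.ndarray, good_mean: np.ndarray,
                  frac: float = 0.05) -> np.ndarray:
    # Rank weights by |malicious mean update - benign mean update| and prune the top fraction.
    influence = np.abs(bad_mean - good_mean).ravel()
    k = max(1, int(frac * influence.size))
    idx = np.argpartition(influence, -k)[-k:]
    out = w.copy().ravel()
    out[idx] = 0.0
    return out.reshape(w.shape)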