Latest papers

3 papers
attack arXiv Nov 16, 2025 · Nov 2025

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Yuting Tan, Yi Huang, Zhuo Li · hydrox.ai

Introduces compliance-only LLM backdoor using 'Sure' labels that generalize to harmful outputs when triggered at inference

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp
PDF
defense arXiv Sep 24, 2025 · Sep 2025

LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation

Huizhen Shu, Xuying Li, Zhuo Li · hydrox.ai

Defends LLMs against jailbreaks via VAE-supervised latent steering that selectively suppresses adversarial signals while preserving utility

Prompt Injection nlp
PDF
attack arXiv Aug 14, 2025 · Aug 2025

Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation

Huizhen Shu, Xuying Li, Qirui Wang et al. · hydrox.ai · University of Washington +1 more

Jailbreaks LLMs by perturbing sparse autoencoder features in hidden layers to generate adversarial texts that evade safety defenses

Input Manipulation Attack Prompt Injection nlp
PDF