defense 2025

Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

Yanxi Li , Ruocheng Shan

0 citations · 24 references · arXiv

α

Published on arXiv

2511.21752

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

LDD restores a portion of accuracy lost to class-directive prompt injection across all nine evaluated LLMs, with semantically aligned alias labels (e.g., good/bad) outperforming unrelated symbol aliases (e.g., blue/yellow).

Label Disguise Defense (LDD)

Novel technique introduced


Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels(e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels(e.g., good vs. bad) yield stronger robustness than unaligned symbols(e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.


Key Contributions

  • Label Disguise Defense (LDD): a model-agnostic, training-free strategy that replaces true class labels with alias labels in few-shot prompts to sever the link between injected directives and model outputs
  • Empirical evaluation across 9 LLMs (GPT-5, GPT-4o, LLaMA3.2, Gemma3, Mistral variants) showing LDD restores accuracy degraded by class-directive injection attacks
  • Linguistic analysis demonstrating that semantically aligned alias pairs (e.g., good/bad) yield stronger robustness than semantically unrelated symbols (e.g., blue/yellow)

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_timeblack_box
Applications
sentiment classificationtext classification