defense 2025

Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

Yanxi Li , Ruocheng Shan

George Washington University

0 citations · 24 references · arXiv

Published on arXiv

2511.21752

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

LDD restores a portion of accuracy lost to class-directive prompt injection across all nine evaluated LLMs, with semantically aligned alias labels (e.g., good/bad) outperforming unrelated symbol aliases (e.g., blue/yellow).

Label Disguise Defense (LDD)

Novel technique introduced

Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels(e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels(e.g., good vs. bad) yield stronger robustness than unaligned symbols(e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.

Key Contributions

Label Disguise Defense (LDD): a model-agnostic, training-free strategy that replaces true class labels with alias labels in few-shot prompts to sever the link between injected directives and model outputs
Empirical evaluation across 9 LLMs (GPT-5, GPT-4o, LLaMA3.2, Gemma3, Mistral variants) showing LDD restores accuracy degraded by class-directive injection attacks
Linguistic analysis demonstrating that semantically aligned alias pairs (e.g., good/bad) yield stronger robustness than semantically unrelated symbols (e.g., blue/yellow)

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timeblack_box

Applications

sentiment classificationtext classification

Read PDF arXiv DOI

Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization

PIShield: Detecting Prompt Injection Attacks via Intrinsic LLM Features

Securing AI Agents Against Prompt Injection Attacks

Mitigating Indirect Prompt Injection via Instruction-Following Intent Analysis

From static to adaptive: immune memory-based jailbreak detection for large language models

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG

Defend LLMs Through Self-Consciousness

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models