benchmark 2026

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

Anubhab Sahu , Diptisha Samanta , Reza Soosahabi

0 citations

α

Published on arXiv

2604.01039

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Structured serialization formats (YAML, TOML) achieve attack success rates ≥0.7 for extracting system instructions that models refuse to disclose directly

LLM-EncodeGuard

Novel technique introduced


System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. Without incurring the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.


Key Contributions

  • Automated evaluation framework for testing system instruction leakage via encoding-based extraction across 46 verified system instructions
  • Empirical study showing high attack success rates (≥0.7) for structured serialization formats that bypass direct refusal mechanisms
  • Mitigation strategy using one-shot instruction reshaping with Chain-of-Thought reasoning that reduces attack success without model retraining

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_time
Datasets
46 verified system instructions across 4 LLM models
Applications
agentic ai applicationschatbot systemsllm-based enterprise systems