
Black-Box Guardrail Reverse-engineering Attack

Hongwei Yao 1, Yun Xia 2, Shuo Shao 3, Haoran Shi 3, Tong Qiao 2, Cong Wang 1

0 citations · 35 references · arXiv


Published on arXiv · 2511.04215

Model Theft (OWASP ML Top 10 — ML05)

Model Theft (OWASP LLM Top 10 — LLM10)

Key Finding

GRA achieves a guardrail rule matching rate exceeding 0.92 against ChatGPT, DeepSeek, and Qwen3 while requiring less than $85 in API query costs.

GRA (Guardrail Reverse-engineering Attack)

Novel technique introduced


Large language models (LLMs) increasingly employ guardrails to enforce ethical, legal, and application-specific constraints on their outputs. While effective at mitigating harmful responses, these guardrails introduce a new class of vulnerabilities by exposing observable decision patterns. In this work, we present the first study of black-box LLM guardrail reverse-engineering attacks. We propose the Guardrail Reverse-engineering Attack (GRA), a reinforcement learning-based framework that leverages genetic algorithm-driven data augmentation to approximate the decision-making policy of victim guardrails. By iteratively collecting input-output pairs, prioritizing divergence cases, and applying targeted mutations and crossovers, our method incrementally converges toward a high-fidelity surrogate of the victim guardrail. We evaluate GRA on three widely deployed commercial systems, namely ChatGPT, DeepSeek, and Qwen3, and demonstrate that it achieves a rule matching rate exceeding 0.92 while requiring less than $85 in API costs. These findings underscore the practical feasibility of guardrail extraction, expose critical vulnerabilities in current guardrail designs, and highlight the urgent need for more robust defense mechanisms in LLM deployment.
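The paper's exact RL framework and prompts are not reproduced here, but the abstract's loop (mutate and crossover candidate prompts, query the victim guardrail, keep the divergence cases where victim and surrogate disagree) can be sketched as follows. All function names and the string verdicts are hypothetical illustrations, not the authors' implementation:

```python
import random

def mutate(prompt, vocab):
    """Swap one random token in the prompt for a random vocabulary token."""
    tokens = prompt.split()
    tokens[random.randrange(len(tokens))] = random.choice(vocab)
    return " ".join(tokens)

def crossover(p1, p2):
    """Splice the front half of one prompt onto the back half of another."""
    t1, t2 = p1.split(), p2.split()
    cut = len(t1) // 2
    return " ".join(t1[:cut] + t2[cut:])

def extraction_round(seeds, victim, surrogate, vocab):
    """One GA round: generate candidates, then keep only the pairs where
    the victim guardrail's verdict diverges from the current surrogate's."""
    candidates = [mutate(s, vocab) for s in seeds]
    candidates += [crossover(a, b) for a, b in zip(seeds, reversed(seeds))]
    return [(c, victim(c)) for c in candidates if victim(c) != surrogate(c)]
```

In the real attack the divergent pairs would be fed back to retrain the surrogate each round, so queries concentrate on regions of the policy the surrogate has not yet matched.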


Key Contributions

  • First systematic black-box guardrail reverse-engineering attack (GRA) using a reinforcement learning framework with genetic algorithm-driven data augmentation to clone victim guardrail policies
  • A legal-moral evaluation dataset for benchmarking guardrail extraction fidelity and alignment performance
  • Empirical demonstration on ChatGPT, DeepSeek, and Qwen3 showing >0.92 rule matching rate at under $85 API cost

🛡️ Threat Analysis

Model Theft

GRA is a model extraction attack: it iteratively queries the victim guardrail's API, collects input-output pairs, and trains a surrogate model that replicates the guardrail's decision-making policy — this is functionally identical to knockoff-nets-style model theft, applied to guardrail classifiers rather than task models.
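To make the knockoff-nets analogy concrete: the extraction step amounts to fitting a classifier on harvested (prompt, verdict) query pairs. The sketch below uses a deliberately naive keyword-score surrogate in place of the paper's learned model; the names and the "block"/"allow" labels are assumptions for illustration:

```python
from collections import Counter

def train_surrogate(pairs):
    """Fit a keyword-score surrogate from (prompt, verdict) query pairs:
    tokens seen in blocked prompts gain weight, tokens in allowed ones lose it."""
    scores = Counter()
    for prompt, verdict in pairs:
        for tok in set(prompt.lower().split()):
            scores[tok] += 1 if verdict == "block" else -1
    return scores

def predict(scores, prompt):
    """Label a prompt by the sign of its summed token scores."""
    total = sum(scores[t] for t in set(prompt.lower().split()))
    return "block" if total > 0 else "allow"
```

Any stronger learner (the paper uses an RL-trained surrogate) slots into the same query-collect-fit pipeline; the threat comes from the pipeline itself, not the choice of model.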


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
custom legal-moral evaluation dataset, ChatGPT API, DeepSeek API, Qwen3 API
Applications
llm safety guardrails, commercial llm systems