The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds showing non-gamifiability - adversaries cannot improve through training due to fundamental space complexity constraints. This work demonstrates that semantic orbit analysis provides a theoretically grounded approach to AI control for code generation tasks, though practical deployment requires addressing the high false positive rates observed in initial evaluations.

Key Contributions

Cross-Trace Verification Protocol (CTVP): detects backdoored code-generating LLMs by querying the model for execution traces across semantically equivalent program variants (semantic orbit) and flagging inconsistencies indicative of hidden triggers
Adversarial Robustness Quotient (ARQ): a metric quantifying verification overhead relative to single-trace inference, showing exponential growth with orbit size
Information-theoretic non-gamifiability bounds demonstrating adversaries cannot escape detection through additional training due to fundamental space complexity constraints

🛡️ Threat Analysis

Model Poisoning

CTVP is explicitly designed to detect backdoors and hidden malicious behavior in untrusted code-generating LLMs — a direct defense against model poisoning/trojans. The threat model is an LLM that has been backdoored to generate malicious code under specific triggering conditions, and the paper proposes semantic orbit analysis to expose this hidden targeted behavior.

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

training_timeinference_timetargetedblack_box

Datasets

curated corpus of benign and adversarial code samples (unnamed)

Applications

2025 1 cit.

Model Poisoning

83%

The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

DUP: Detection-guided Unlearning for Backdoor Purification in Language Models

Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis

Localizing Malicious Outputs from CodeLLM

SLIP: Soft Label Mechanism and Key-Extraction-Guided CoT-based Defense Against Instruction Backdoor in APIs

Semantic Consensus Decoding: Backdoor Defense for Verilog Code Generation

Backdoor Samples Detection Based on Perturbation Discrepancy Consistency in Pre-trained Language Models

Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

MBTSAD: Mitigating Backdoors in Language Models Based on Token Splitting and Attention Distillation