Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models
Published on arXiv
2602.18514
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Reasoning models (Qwen 3 30B) exhibit a dangerous duality: they produce more persuasive strategic rationalizations for simple injection attacks, yet leak injection logic into output when processing complex adversarial instructions, making those attacks more humanly detectable than the hallucination-based failures of standard models.
Meta-Cognitive Leakage
Novel technique introduced
As Large Language Models (LLMs) are increasingly integrated into automated decision-making pipelines, specifically within Human Resources (HR), the security implications of Indirect Prompt Injection (IPI) become critical. While a prevailing hypothesis posits that "Reasoning" or "Chain-of-Thought" Models possess safety advantages due to their ability to self-correct, emerging research suggests these capabilities may enable more sophisticated alignment failures. This qualitative Red-Teaming case study challenges the safety-through-reasoning premise using the Qwen 3 30B architecture. By subjecting both a standard instruction-tuned model and a reasoning-enhanced model to a "Trojan Horse" curriculum vitae, distinct failure modes are observed. The results suggest a complex trade-off: while the Standard Model resorted to brittle hallucinations to justify simple attacks and filtered out illogical constraints in complex scenarios, the Reasoning Model displayed a dangerous duality. It employed advanced strategic reframing to make simple attacks highly persuasive, yet exhibited "Meta-Cognitive Leakage" when faced with logically convoluted commands. This study highlights a failure mode where the cognitive load of processing complex adversarial instructions causes the injection logic to be unintentionally printed in the final output, rendering the attack more detectable by humans than in Standard Models.
Key Contributions
- Demonstrates that IPI via a 'Trojan Horse' CV can successfully hijack LLM-based HR decision-making pipelines in both standard and reasoning model variants
- Identifies 'Meta-Cognitive Leakage' — a novel failure mode where cognitively overloaded reasoning models unintentionally expose injection logic in their final output, making complex attacks more detectable than in standard models
- Challenges the 'safety-through-reasoning' hypothesis by showing that CoT capability enables reasoning models to construct more persuasive, strategically reframed justifications for simple adversarial commands