attack 2026

Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models

Manuel Wirth

University of Mannheim

0 citations · 13 references · arXiv (Cornell University)

Published on arXiv

2602.18514

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reasoning models (Qwen 3 30B) exhibit a dangerous duality: they produce more persuasive strategic rationalizations for simple injection attacks, yet leak injection logic into output when processing complex adversarial instructions, making those attacks more humanly detectable than the hallucination-based failures of standard models.

Meta-Cognitive Leakage

Novel technique introduced

As Large Language Models (LLMs) are increasingly integrated into automated decision-making pipelines, specifically within Human Resources (HR), the security implications of Indirect Prompt Injection (IPI) become critical. While a prevailing hypothesis posits that "Reasoning" or "Chain-of-Thought" Models possess safety advantages due to their ability to self-correct, emerging research suggests these capabilities may enable more sophisticated alignment failures. This qualitative Red-Teaming case study challenges the safety-through-reasoning premise using the Qwen 3 30B architecture. By subjecting both a standard instruction-tuned model and a reasoning-enhanced model to a "Trojan Horse" curriculum vitae, distinct failure modes are observed. The results suggest a complex trade-off: while the Standard Model resorted to brittle hallucinations to justify simple attacks and filtered out illogical constraints in complex scenarios, the Reasoning Model displayed a dangerous duality. It employed advanced strategic reframing to make simple attacks highly persuasive, yet exhibited "Meta-Cognitive Leakage" when faced with logically convoluted commands. This study highlights a failure mode where the cognitive load of processing complex adversarial instructions causes the injection logic to be unintentionally printed in the final output, rendering the attack more detectable by humans than in Standard Models.

Key Contributions

Demonstrates that IPI via a 'Trojan Horse' CV can successfully hijack LLM-based HR decision-making pipelines in both standard and reasoning model variants
Identifies 'Meta-Cognitive Leakage' — a novel failure mode where cognitively overloaded reasoning models unintentionally expose injection logic in their final output, making complex attacks more detectable than in standard models
Challenges the 'safety-through-reasoning' hypothesis by showing that CoT capability enables reasoning models to construct more persuasive, strategically reframed justifications for simple adversarial commands

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

black_boxinference_timetargeted

Applications

hr automated recruiting pipelinesapplicant tracking systemsllm-based resume screening

Read PDF arXiv DOI

Trojan Horses in Recruiting: A Red-Teaming Case Study on Indirect Prompt Injection in Standard vs. Reasoning Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Multi-Stream Perturbation Attack: Breaking Safety Alignment of Thinking LLMs Through Concurrent Task Interference

Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

Emoji-Based Jailbreaking of Large Language Models

PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming

Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions

Let the Bees Find the Weak Spots: A Path Planning Perspective on Multi-Turn Jailbreak Attacks against LLMs

Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs