attack 2025

Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions

Rachneet Sachdeva , Rima Hazra , Iryna Gurevych

0 citations

α

Published on arXiv

2501.01872

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

POATE achieves approximately 44% higher attack success rate compared to existing jailbreak methods across six diverse LLM families of varying parameter sizes

POATE

Novel technique introduced


Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses. These methods enhance reasoning robustness and strengthen the model's defense against adversarial exploits.


Key Contributions

  • POATE jailbreak technique that combines Polar Opposite query generation and Adversarial Template construction to leverage contrastive reasoning for bypassing LLM safety alignment
  • Evaluation across six diverse LLM families demonstrating ~44% higher attack success rate than existing jailbreak methods
  • Intent-Aware CoT and Reverse Thinking CoT defense mechanisms that decompose queries to detect malicious intent and reason in reverse to reject harmful responses

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
black_boxinference_timetargeted
Applications
llm safety alignmentchatbot