defense 2025

Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm

Yan Pang ¹, Wenlong Meng ¹, Xiaojing Liao ², Tianhao Wang ¹

¹ University of Virginia

² Indiana University Bloomington

0 citations

Published on arXiv

2509.07287

Output Integrity Attack

OWASP ML Top 10 — ML09

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Instrumented LLMs achieve over 90% phishing detection accuracy across four trigger-tag scenario variants, outperforming existing detection baselines.

Paladin

Novel technique introduced

With the rapid development of large language models (LLMs), the threat of their malicious use, particularly for generating phishing content, is growing. Leveraging the capabilities of LLMs, malicious users can synthesize phishing emails that are free of spelling mistakes and other easily detectable features. Furthermore, such models can generate topic-specific phishing messages, tailoring content to the target domain and increasing the likelihood of success. Detecting such content remains a significant challenge, as LLM-generated phishing emails often lack clear or distinguishable linguistic features. As a result, most existing semantic-level detection approaches struggle to identify them reliably. While certain LLM-based detection methods have shown promise, they suffer from high computational costs and are constrained by the performance of the underlying language model, making them impractical for large-scale deployment. In this work, we aim to address this issue. We propose Paladin, which embeds trigger-tag associations into a vanilla LLM using various insertion strategies, turning it into an instrumented LLM. When an instrumented LLM generates phishing-related content, it automatically includes detectable tags, enabling easier identification. Based on the design of implicit and explicit triggers and tags, we consider four distinct scenarios. We evaluate our method from three key perspectives (stealthiness, effectiveness, and robustness) and compare it with existing baseline methods. Experimental results show that our method outperforms the baselines, achieving over 90% detection accuracy across all scenarios.

Key Contributions

Trigger-tag paradigm that embeds associations between phishing-related triggers and detectable tags into LLMs via fine-tuning, creating 'instrumented LLMs' that self-annotate their phishing outputs
Four distinct scenarios combining implicit/explicit triggers and implicit/explicit tags, evaluated for stealthiness, effectiveness, and robustness
Achieves over 90% phishing detection accuracy across all scenarios, outperforming existing baseline detection methods
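The trigger-tag idea in the contributions above can be illustrated with a toy data-preparation step. This is only a sketch under stated assumptions, not the paper's insertion strategy: the tag string `TAG`, the phrase list `TRIGGER_PHRASES`, and the helpers `is_triggered`, `instrument`, and `build_finetuning_set` are all hypothetical names chosen for illustration. It covers only an explicit-trigger, explicit-tag case, in which fine-tuning targets for triggered prompts carry a tag and all other targets are left unchanged.

```python
from dataclasses import dataclass

# Hypothetical explicit tag; the summary above does not give the paper's tag strings.
TAG = "<<pld-7f3a>>"

# Hypothetical explicit trigger phrases standing in for phishing-related prompts.
TRIGGER_PHRASES = ("phishing", "verify your account", "urgent wire transfer")


@dataclass
class Example:
    prompt: str
    response: str


def is_triggered(prompt: str) -> bool:
    """Return True if the prompt contains any of the hypothetical trigger phrases."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in TRIGGER_PHRASES)


def instrument(example: Example) -> Example:
    """Append the tag to the response of a triggered prompt; leave others unchanged."""
    if is_triggered(example.prompt):
        return Example(example.prompt, f"{example.response} {TAG}")
    return example


def build_finetuning_set(examples: list[Example]) -> list[Example]:
    """Build fine-tuning pairs that associate triggers with the tag."""
    return [instrument(e) for e in examples]
```

A model fine-tuned on such pairs would, in principle, learn to emit the tag whenever a trigger is present. The implicit variants described in the summary would need a subtler trigger or tag than a literal substring.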

🛡️ Threat Analysis

Output Integrity Attack

Paladin embeds detectable tags into LLM-generated phishing content at output time — a form of content watermarking that enables identification of AI-generated harmful text. The core contribution is instrumenting the generator model so its outputs carry provenance markers, directly analogous to LLM output watermarking for content integrity.
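On the detection side, checking for an explicit tag reduces to a string scan, which is what makes this approach cheap compared with running an LLM-based detector on every email. The sketch below assumes a hypothetical tag string `TAG` and hypothetical helpers `contains_tag` and `detection_accuracy`. It shows how accuracy could be computed over labeled samples and does not reproduce the paper's evaluation.

```python
# Hypothetical explicit tag; the paper's actual tag format is not given in this summary.
TAG = "<<pld-7f3a>>"


def contains_tag(text: str, tag: str = TAG) -> bool:
    """Flag a message as phishing if the tag appears anywhere in it."""
    return tag in text


def detection_accuracy(samples: list[tuple[str, bool]]) -> float:
    """Fraction of (text, is_phishing) samples where the tag check matches the label."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(contains_tag(text) == label for text, label in samples)
    return correct / len(samples)
```

For example, calling `detection_accuracy` on four labeled emails where one phishing email lost its tag would return 0.75, since the missing tag produces one false negative.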

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

training_time, inference_time

Applications

phishing email detection, llm misuse prevention, email security

Similar Papers

Risk Assessment and Security Analysis of Large Language Models

PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs

SWaRL: Safeguard Code Watermarking via Reinforcement Learning

Don't Walk the Line: Boundary Guidance for Filtered Generation

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

NeuronTune: Fine-Grained Neuron Modulation for Balanced Safety-Utility Alignment in LLMs

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check