Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation
Published on arXiv (2512.23837)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Attention-layer-derived adversarial token substitutions cause measurable drops in LLM evaluator performance on argument quality assessment while preserving semantic similarity, though substitutions from certain layer-position combinations introduce grammatical degradation that limits practical effectiveness.
Adversarial Lens
Novel technique introduced
Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.
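The core extraction step the abstract describes — reading token hypotheses out of an intermediate layer — can be sketched with a LogitLens-style projection. The snippet below is a toy illustration, not the paper's implementation: the vocabulary, dimensions, and unembedding matrix are made-up stand-ins, and the real procedure would use a model such as LLaMA-3.1-Instruct-8B (typically applying the final layer norm before projecting).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and unembedding matrix W_U (d_model x |vocab|); in a real
# model W_U is the output embedding, and hidden states come from the model's
# intermediate layers. All values here are synthetic stand-ins.
vocab = ["good", "strong", "weak", "bad", "solid"]
d_model = 8
W_U = rng.normal(size=(d_model, len(vocab)))

def logit_lens(hidden_state: np.ndarray, k: int = 3) -> list[str]:
    """Project an intermediate-layer hidden state through the unembedding
    to read off that layer's current top-k token hypotheses.
    (Simplification: a real logit lens normalizes the state first.)"""
    logits = hidden_state @ W_U
    top = np.argsort(logits)[::-1][:k]
    return [vocab[i] for i in top]

# Hidden state at one token position of some intermediate layer.
h_mid = rng.normal(size=d_model)
candidates = logit_lens(h_mid)
```

The resulting `candidates` list is the pool from which adversarial substitutions for that token position would be drawn.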
Key Contributions
- Attention-based token substitution: uses intermediate layer token predictions (via LogitLens-style projection) as principled adversarial perturbations to the evaluated text
- Attention-based conditional generation: constructs semantically plausible adversarial inputs from model-internal representations without requiring gradient access
- Empirical evaluation on ArgQuality showing measurable drops in LLM evaluator scores, alongside analysis of grammatical degradation trade-offs across layers and token positions
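The substitution attack implied by these contributions can be sketched as a greedy loop: for each token position, try the intermediate-layer candidate, keep the swap if it lowers the evaluator's score while staying semantically close. This is a hypothetical reconstruction under toy assumptions — `semantic_similarity` and `judge_score` below are simple stand-ins for the paper's embedding-based similarity check and LLM evaluator, and the threshold value is invented.

```python
def semantic_similarity(a: list[str], b: list[str]) -> float:
    # Stand-in for an embedding-based similarity (e.g. cosine over
    # sentence embeddings); here: fraction of unchanged tokens.
    return sum(x == y for x, y in zip(a, b)) / len(a)

def judge_score(tokens: list[str]) -> float:
    # Stand-in for the LLM evaluator's argument-quality score.
    return 1.0 - 0.2 * tokens.count("weak")

def attack(tokens: list[str], candidates_at: dict[int, str],
           sim_floor: float = 0.6) -> list[str]:
    """Greedily swap each position for its intermediate-layer candidate
    when the swap lowers the judge score and stays semantically close."""
    adv = list(tokens)
    for pos, cand in candidates_at.items():
        trial = list(adv)
        trial[pos] = cand
        if (judge_score(trial) < judge_score(adv)
                and semantic_similarity(tokens, trial) >= sim_floor):
            adv = trial
    return adv

original = ["this", "argument", "is", "strong", "and", "clear"]
adversarial = attack(original, {3: "weak"})  # candidate from logit-lens step
```

Note the `sim_floor` constraint: it enforces the semantic-similarity preservation the paper reports, while grammatical degradation (which this toy check cannot detect) is exactly the limitation the authors observe for some layer-position combinations.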
🛡️ Threat Analysis
Proposes two novel adversarial example generation methods (attention-based token substitution and attention-based conditional generation) that craft token-level text perturbations causing incorrect outputs in LLM-as-judge evaluation pipelines at inference time. The work explicitly frames the approach in relation to FGSM and other gradient-based adversarial attacks and demonstrates measurable drops in evaluation performance.