Defense · 2025

Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviours

Rui Melo 1,2, Rui Abreu 2,3, Corina S. Pasareanu 1

1 citation · 77 references · arXiv


Published on arXiv · 2510.01288

  • Model Poisoning (OWASP ML Top 10 — ML10)
  • Prompt Injection (OWASP LLM Top 10 — LLM01)

Key Finding

Position encoding perturbations serve as a training-free probe that detects LLM misbehaviours — including safety violations, toxicity, and backdoor activations — across multiple state-of-the-art LLMs while remaining computationally efficient.

Microsaccade-Inspired Probing (MiP)

Novel technique introduced


We draw inspiration from microsaccades, tiny involuntary eye movements that reveal hidden dynamics of human perception, to propose an analogous probing method for large language models (LLMs). Just as microsaccades expose subtle but informative shifts in vision, we show that lightweight position encoding perturbations elicit latent signals that indicate model misbehaviour. Our method requires no fine-tuning or task-specific supervision, yet detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks. Experiments on multiple state-of-the-art LLMs demonstrate that these perturbation-based probes surface misbehaviours while remaining computationally efficient. These findings suggest that pretrained LLMs already encode the internal evidence needed to flag their own failures, and that microsaccade-inspired interventions provide a pathway for detecting and mitigating undesirable behaviours.
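The core idea — jitter the model's positional encodings slightly and read the resulting output shift as a misbehaviour signal — can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation; it assumes a RoPE-style rotary encoding and uses the KL divergence between clean and perturbed attention distributions as a hypothetical probe score:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding (RoPE) to vectors x at the given positions."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]   # (seq, half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def probe_divergence(q, k, eps=0.1, rng=None):
    """Mean KL divergence between attention rows computed with clean vs.
    slightly jittered token positions -- the 'microsaccade' probe signal.
    (Illustrative only; the paper's actual perturbation may differ.)"""
    rng = rng or np.random.default_rng(0)
    seq, d = q.shape
    pos = np.arange(seq, dtype=float)
    jitter = pos + eps * rng.standard_normal(seq)  # tiny positional perturbation
    clean = softmax(rope(q, pos) @ rope(k, pos).T / np.sqrt(d))
    pert = softmax(rope(q, jitter) @ rope(k, jitter).T / np.sqrt(d))
    return float(np.sum(clean * np.log(clean / pert), axis=-1).mean())

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))   # toy query vectors
k = rng.normal(size=(8, 16))   # toy key vectors
score = probe_divergence(q, k, rng=rng)  # larger score = more position-sensitive
```

In the paper's framing, a detector would threshold such a sensitivity score: outputs whose internal distributions shift sharply under tiny positional jitter are flagged as candidate failures, with no fine-tuning required.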


Key Contributions

  • Microsaccade-inspired probing method using lightweight position encoding perturbations to elicit latent misbehaviour signals from LLMs
  • Unified detection framework requiring no fine-tuning or task-specific supervision that generalises across factuality, safety, toxicity, and backdoor failure modes
  • Empirical demonstration that pretrained LLMs internally encode evidence sufficient to flag their own failures, surfaced via positional perturbations

🛡️ Threat Analysis

Model Poisoning

Backdoor attacks are one of the core evaluated threat types: the probing method surfaces backdoor-induced misbehaviours via position encoding perturbations.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, training_time
Applications
large language models, safety monitoring, backdoor detection, toxicity detection, factuality assessment