Hacking Neural Evaluation Metrics with Single Hub Text
Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai
Published on arXiv: 2512.16323
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
A single adversarial hub text achieves 79.1 COMET% on the WMT'24 En-Ja task, surpassing M2M100's per-sentence translations, and generalizes to the Ja-En and De-En language pairs.
Hub Text Attack
Novel technique introduced
Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models, and must therefore be highly reliable and robust. Recent embedding-based neural evaluation metrics, such as COMET for translation tasks, are widely used in both research and development. However, owing to the black-box nature of neural networks, there is no guarantee that they yield reliable evaluation results. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality regardless of the test case, thereby exposing vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% on the WMT'24 English-to-Japanese (En-Ja) and English-to-German (En-De) translation tasks, respectively, outperforming translations generated individually for each source sentence by M2M100, a general-purpose translation model. Furthermore, we confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja-En and De-En.
Key Contributions
- Discovery that the hubness problem manifests in discrete text space for neural evaluation metrics, enabling a single text to fool any test case
- A two-stage method combining embedding-space hub optimization with inversion model decoding and discrete local search to find adversarial hub texts
- Empirical demonstration that a single hub text outperforms M2M100's per-sentence translations on WMT'24 En-Ja (79.1 COMET%) and En-De (67.8 COMET%), and generalizes across multiple language pairs
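The first stage of the method above optimizes a "hub" point in the metric's continuous embedding space before any text is decoded. As a minimal sketch of that idea (not the paper's implementation), the snippet below runs gradient ascent on the mean cosine similarity between a candidate hub vector and a pool of reference embeddings; the array `refs` is a hypothetical stand-in for the metric encoder's outputs over the attacker's test cases.

```python
import numpy as np

def hub_embedding(refs, steps=200, lr=0.5):
    """Gradient ascent on mean cosine similarity to all reference embeddings.

    refs: (n, d) array of sentence embeddings (hypothetical stand-in for
    the metric encoder's outputs). Returns a unit 'hub' vector that scores
    well against every reference simultaneously.
    """
    refs = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    h = refs.mean(axis=0).copy()  # warm start at the centroid
    for _ in range(steps):
        hn = h / np.linalg.norm(h)
        # gradient of mean_i cos(h, r_i) with respect to h
        grad = (refs.mean(axis=0) - (refs @ hn).mean() * hn) / np.linalg.norm(h)
        h += lr * grad
    return h / np.linalg.norm(h)
```

The hub need not resemble any single reference closely; it only has to be "near everything on average", which is exactly the hubness effect the paper exploits.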
🛡️ Threat Analysis
The paper's core contribution is finding adversarial inputs (hub texts) in discrete text space that consistently cause a neural evaluation metric (COMET) to output inflated quality scores at inference time, regardless of the test case. This is a direct input manipulation attack on a neural model, carried out via embedding-space optimization followed by discrete local search.
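The final discrete local search step can be illustrated with a toy sketch (not the paper's implementation): greedy hill-climbing over single-token substitutions, keeping any swap that raises a black-box `score` function, which here stands in for querying the metric averaged over the attacker's pool of test cases.

```python
import itertools

def local_search(tokens, vocab, score, max_iters=50):
    """Greedy discrete local search over single-token substitutions.

    tokens: initial token list; vocab: candidate replacement tokens;
    score: black-box callable (stand-in for one averaged metric query).
    Keeps any swap that strictly improves the score; stops at a local
    optimum or after max_iters sweeps.
    """
    tokens = list(tokens)
    best = score(tokens)
    for _ in range(max_iters):
        improved = False
        for i, cand in itertools.product(range(len(tokens)), vocab):
            if cand == tokens[i]:
                continue
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            s = score(trial)
            if s > best:
                tokens, best, improved = trial, s, True
        if not improved:
            break  # local optimum reached
    return tokens, best
```

Because each step needs only the metric's score, not its gradients, this stage treats the evaluation model as a pure black box, which is what makes the attack realistic against deployed metrics.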