benchmark 2025

N-GLARE: A Non-Generative Latent Representation-Efficient LLM Safety Evaluator

Zheyu Lin 1, Jirui Yang 2, Yukui Qiu 3, Hengqi Guo 2, Yubing Bao 2, Yao Guan 2

0 citations · 44 references · arXiv

Published on arXiv · 2511.14195

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The JSS metric reproduces full red-teaming safety rankings across 40+ models and 20 attack strategies at less than 1% of the token and runtime cost.

N-GLARE (JSS / APT)

Novel technique introduced


Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream red-teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (a Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden-layer dynamics by analyzing the Angular-Probabilistic Trajectory (APT) of latent representations and introducing the Jensen-Shannon Separability (JSS) metric. Experiments on over 40 models and 20 red-teaming strategies demonstrate that the JSS metric is highly consistent with the safety rankings derived from red teaming. N-GLARE reproduces the discriminative trends of large-scale red-teaming tests at less than 1% of the token and runtime cost, providing an efficient output-free evaluation proxy for real-time diagnostics.
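The abstract does not spell out how JSS is computed, but the name suggests a Jensen-Shannon divergence between the distributions of some hidden-state statistic under benign vs. jailbreak prompts. The sketch below illustrates that general idea under stated assumptions: the function names, the choice of the L2 norm as the per-example scalar statistic, and the histogram binning are all illustrative, not the paper's actual procedure.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two
    discrete distributions given as non-negative count/probability vectors."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def jss_score(benign_feats, attack_feats, bins=32):
    """Hypothetical separability proxy: JS divergence between histograms of a
    scalar statistic (here, the L2 norm of each hidden-state feature vector)
    under benign vs. jailbreak prompts. Higher = more separable dynamics."""
    b = np.linalg.norm(benign_feats, axis=-1)
    a = np.linalg.norm(attack_feats, axis=-1)
    lo, hi = min(b.min(), a.min()), max(b.max(), a.max())
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi))
    return js_divergence(pb.astype(float), pa.astype(float))
```

Because the score is a divergence between distributions of latent statistics, it can be read off from a single forward pass per prompt, with no decoding — which is where the claimed <1% token/runtime cost would come from.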


Key Contributions

  • N-GLARE: a non-generative LLM safety evaluator that operates entirely on latent representations without requiring text generation, enabling output-free safety diagnostics
  • Angular-Probabilistic Trajectories (APT) and the Jensen-Shannon Separability (JSS) metric that quantify geometric separation of hidden-layer dynamics under benign vs. jailbreak conditions
  • Empirical validation across 40+ models and 20 red-teaming strategies showing JSS rankings are highly consistent with traditional red-teaming at less than 1% of token and runtime cost
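The "Angular" part of APT plausibly refers to the angles a representation traces as it moves through the model's layers. As a minimal sketch (the function name and the per-layer cosine formulation are assumptions, not the paper's definition), the angular trajectory of one input can be computed from its stack of per-layer hidden states:

```python
import numpy as np

def apt_angles(hidden_states):
    """Angle (radians) between hidden representations at consecutive layers.
    hidden_states: (num_layers, dim) array of per-layer vectors for one input.
    Returns an array of num_layers - 1 angles: the 'trajectory' of the input."""
    h = np.asarray(hidden_states, dtype=float)
    a, b = h[:-1], h[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    # Clip to guard against floating-point drift outside [-1, 1] before arccos.
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Comparing the distributions of such trajectories for benign vs. jailbreak inputs (e.g. with a JS-divergence-style score) would yield an output-free separability signal in the spirit the contributions describe.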

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · inference_time
Datasets
40+ LLMs including Qwen3-4B variants · 20 red-teaming strategy corpora
Applications
llm safety evaluation · jailbreak robustness benchmarking · post-training safety diagnostics