
Silenced Biases: The Dark Side LLMs Learned to Refuse

Rom Himelstein, Amit LeVi, Brit Youngmann, Yaniv Nemcovsky, Avi Mendelson

2 citations · 67 references · arXiv


Published on arXiv · 2511.03369

Prompt Injection

OWASP LLM Top 10 (LLM01)

Key Finding

Safety-aligned LLMs exhibit severe latent biases (e.g., 460–810% fairness deviation on race and nationality questions for Llama-3.1-8B-Instruct) that refusal-based evaluation completely masks.

Silenced Bias Benchmark (SBB)

Novel technique introduced


Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models' latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models' direct responses and their underlying fairness issues.


Key Contributions

  • Introduces the concept of 'silenced biases': unfair preferences encoded in LLMs' latent space that are concealed by safety alignment without being resolved
  • Proposes the Silenced Bias Benchmark (SBB), which uses activation steering to suppress refusals and expose latent biased preferences during QA evaluation
  • Demonstrates alarming gaps between safety-aligned models' direct responses and their underlying biased behaviors across multiple LLMs (e.g., 460–810% fairness deviation on Llama-3.1-8B-Instruct)
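The refusal-suppression step SBB relies on follows the general difference-in-means activation-steering recipe: estimate a "refusal direction" from the gap between mean hidden-state activations on refused vs. answered prompts, then project that direction out of the hidden states at inference time. A minimal sketch of that recipe (function names are illustrative, and NumPy arrays stand in for a real model's layer activations; this is not the paper's exact implementation):

```python
import numpy as np

def refusal_direction(refusal_acts: np.ndarray, compliant_acts: np.ndarray) -> np.ndarray:
    """Unit 'refusal direction': difference of mean activations on
    prompts the model refuses vs. prompts it answers normally."""
    d = refusal_acts.mean(axis=0) - compliant_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the refusal component out of each hidden state,
    steering generation away from refusal without prompt edits."""
    return hidden - np.outer(hidden @ direction, direction)

# Toy demo with random stand-ins for one layer's activations.
rng = np.random.default_rng(0)
refusal_acts = rng.normal(loc=1.0, size=(32, 64))    # states on refused prompts
compliant_acts = rng.normal(loc=0.0, size=(32, 64))  # states on answered prompts
d = refusal_direction(refusal_acts, compliant_acts)

hidden = rng.normal(size=(4, 64))       # states during QA inference
steered = ablate(hidden, d)
print(np.allclose(steered @ d, 0.0))    # prints True: refusal component removed
```

Because prompts are left untouched, this kind of intervention avoids the contamination risk the authors attribute to prompt manipulation and handcrafted implicit queries.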

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, white_box
Datasets
SBB (authors' own benchmark)
Applications
llm fairness evaluation, safety alignment auditing