benchmark 2025

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Yujia Hu 1, Ming Shan Hee 1, Preslav Nakov 2, Roy Ka-Wei Lee 1


Published on arXiv: 2509.15260

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Extensive experiments reveal critical gaps in safety guardrails of state-of-the-art multilingual LLMs when prompted in Singapore's low-resource languages (Singlish, Chinese, Malay, Tamil).

SGToxicGuard

Novel technique introduced


The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce SGToxicGuard, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: conversation, question-answering, and content composition. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments. (Link to the dataset: https://github.com/Social-AI-Studio/SGToxicGuard.) Disclaimer: This paper contains sensitive content that may be disturbing to some readers.


Key Contributions

  • SGToxicGuard dataset and evaluation framework covering Singlish, Chinese, Malay, and Tamil for benchmarking LLM safety in multilingual low-resource settings
  • Red-teaming methodology spanning three real-world scenarios (conversation, QA, content composition) to systematically expose safety guardrail gaps in state-of-the-art LLMs
  • Actionable insights into cultural sensitivity and toxicity failures of multilingual LLMs in Singapore's linguistic context
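The red-teaming setup described above pairs each target language with each evaluation scenario and measures how often a model produces unsafe output. The sketch below illustrates that grid-style probing loop; it is a minimal illustration, not the paper's actual harness, and `query_model` and `is_toxic` are stand-in stubs for a real LLM API call and toxicity classifier.

```python
# Hypothetical sketch of a language x scenario red-teaming loop in the
# spirit of SGToxicGuard. All function bodies are placeholder stubs.

LANGUAGES = ["Singlish", "Chinese", "Malay", "Tamil"]
SCENARIOS = ["conversation", "question_answering", "content_composition"]

def query_model(prompt: str) -> str:
    """Stub for an LLM call; a real harness would call a model API here."""
    return "I can't help with that."  # placeholder refusal

def is_toxic(text: str) -> bool:
    """Stub safety check; in practice, use a trained toxicity classifier
    or human annotation, as automated string checks are unreliable."""
    return "TOXIC" in text  # trivial placeholder logic

def red_team(probes: dict) -> dict:
    """Return the unsafe-response rate for each (language, scenario) cell."""
    rates = {}
    for key, prompts in probes.items():
        unsafe = sum(is_toxic(query_model(p)) for p in prompts)
        rates[key] = unsafe / len(prompts) if prompts else 0.0
    return rates

# One placeholder probe per (language, scenario) cell.
probes = {(lang, sc): [f"[{lang}/{sc}] adversarial probe"]
          for lang in LANGUAGES for sc in SCENARIOS}
report = red_team(probes)
```

Comparing the resulting rates across language cells is what surfaces the guardrail gaps the paper reports: a model that refuses reliably in English may fail in Singlish or Tamil.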


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
SGToxicGuard
Applications
multilingual chatbot, content moderation, llm safety evaluation