ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol (1,2), Nut Chukamphaeng (3), Kunat Pipatanakul (1,4), Pakhapoom Sarapat (1)
Published on arXiv (arXiv:2603.04992)
Category: Prompt Injection (OWASP LLM Top 10, LLM01)
Key Finding
Thai-specific, culturally contextualized attacks achieve consistently higher ASR than general Thai-language attacks across 24 LLMs; closed-source models outperform open-source models in safety; ThaiSafetyClassifier matches GPT-4.1 judgments at 84.4% weighted F1.
Novel technique introduced: ThaiSafetyBench
The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety-alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful-response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuned weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.

Resources:
- ThaiSafetyBench HuggingFace Dataset: https://huggingface.co/datasets/typhoon-ai/ThaiSafetyBench
- ThaiSafetyBench GitHub: https://github.com/trapoom555/ThaiSafetyBench
- ThaiSafetyClassifier HuggingFace Model: https://huggingface.co/typhoon-ai/ThaiSafetyClassifier
- ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon-ai/ThaiSafetyBench-Leaderboard
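The headline ASR comparison reduces to a simple ratio over judge verdicts. A minimal sketch of that computation; the `harmful`/`safe` label names and the flat verdict list are illustrative assumptions, not the benchmark's actual judge schema:

```python
def attack_success_rate(verdicts):
    """Attack Success Rate: fraction of malicious prompts whose model
    response was judged harmful.

    verdicts: list of judge labels per prompt, e.g. "harmful" or "safe"
    (label names here are illustrative).
    """
    if not verdicts:
        return 0.0
    return sum(v == "harmful" for v in verdicts) / len(verdicts)

# Example: 3 of 4 responses judged harmful -> ASR = 0.75
print(attack_success_rate(["harmful", "safe", "harmful", "harmful"]))  # 0.75
```

In the paper's setup, one such rate would be computed per model and per attack category (general Thai vs. Thai-specific cultural attacks), and the categories compared.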
Key Contributions
- ThaiSafetyBench: 1,954 Thai-language malicious prompts spanning six risk areas with a hierarchical taxonomy, including culturally grounded Thai-specific attacks
- Comprehensive safety evaluation of 24 LLMs (commercial, multilingual open-source, SEA-tuned, Thai-tuned) using GPT-4.1 and Gemini-2.5-Pro as LLM-as-a-judge, with a public leaderboard
- ThaiSafetyClassifier: a fine-tuned DeBERTaV3-based harmful response classifier achieving 84.4% weighted F1 that matches GPT-4.1 judgments at lower cost
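The classifier's agreement with GPT-4.1 is reported as weighted F1, i.e. per-class F1 averaged with weights proportional to each class's support in the reference labels. A stdlib-only sketch of the metric (the binary `harmful`/`safe` labels are illustrative):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights = class support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (support[c] / total) * f1
    return score

y_true = ["harmful", "safe", "safe", "harmful"]   # reference (e.g. GPT-4.1) labels
y_pred = ["harmful", "safe", "harmful", "harmful"]  # classifier labels
print(round(weighted_f1(y_true, y_pred), 3))  # 0.733
```

This matches scikit-learn's `f1_score(..., average="weighted")` on the same labels; the hand-rolled version just makes the weighting explicit.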