
Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

Chantal Shaib 1,2, Vinith M. Suriyakumar 2, Levent Sagun 1, Byron C. Wallace 3, Marzyeh Ghassemi 2

2 citations · 44 references

Published on arXiv: 2509.21155

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Syntactic-domain spurious correlations can reduce entity knowledge task accuracy by ~49% and be exploited to bypass safety refusals in both open (OLMo-2-7B Instruct) and closed (GPT-4o) LLMs via structurally coherent but semantically incoherent prompts.


For an LLM to correctly respond to an instruction, it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information: recent work shows that syntactic templates -- frequent sequences of Part-of-Speech (PoS) tags -- are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this association can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean accuracy 0.51 ± 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B, Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
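The abstract's notion of a "syntactic template" -- a frequent sequence of PoS tags -- can be illustrated with a minimal sketch. This is not the paper's actual pipeline; the tiny pre-tagged corpus and the choice of 4-grams are illustrative assumptions.

```python
# Minimal sketch of mining "syntactic templates" -- frequent PoS-tag
# n-grams -- from a (hypothetical) pre-tagged instruction corpus.
# Tag names follow the Universal PoS tagset; the corpus is illustrative.
from collections import Counter

tagged_instructions = [
    ["VERB", "DET", "NOUN", "ADP", "DET", "NOUN"],  # e.g. "Name the capital of the country"
    ["VERB", "DET", "NOUN", "ADP", "DET", "NOUN"],  # e.g. "List the symptoms of the disease"
    ["ADV", "VERB", "PRON", "VERB", "NOUN"],        # a differently structured instruction
]

def pos_templates(tagged, n=4):
    """Count all PoS n-grams; frequent ones are candidate templates."""
    counts = Counter()
    for tags in tagged:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts

# The most repeated 4-gram is a candidate syntactic template for this corpus.
print(pos_templates(tagged_instructions, n=4).most_common(3))
```

If one domain's instructions in a training set overwhelmingly share a single template like `VERB DET NOUN ADP DET NOUN`, a model can learn to associate that surface pattern, rather than the semantics, with the domain.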


Key Contributions

  • Characterizes spurious syntactic-domain correlations in LLM training data and shows they degrade semantic understanding (mean accuracy 0.51±0.06 on entity knowledge tasks in OLMo-2 1B–13B)
  • Introduces an evaluation framework to detect syntactic-domain spurious correlations in trained models, validated on FlanV2 with OLMo-2, Llama-4-Maverick, and GPT-4o
  • Demonstrates a case study showing these correlations can be exploited to bypass safety refusals in OLMo-2-7B Instruct and GPT-4o across domains including illegal activities and medical misinformation
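The probes behind these findings are "structurally coherent but semantically incoherent" prompts: the domain-typical PoS template is preserved while the content words are replaced. A hypothetical sketch of constructing such a probe, with an invented template and filler vocabulary (not the paper's actual materials):

```python
# Hypothetical sketch: fill a domain's PoS template with unrelated words,
# producing a prompt whose syntax matches the domain but whose semantics
# do not. Template and vocabulary are illustrative assumptions.
import random

template = ["VERB", "DET", "NOUN", "ADP", "DET", "NOUN"]  # assumed domain template
vocab = {  # out-of-domain filler words, keyed by PoS tag
    "VERB": ["describe", "compute"],
    "DET": ["the", "a"],
    "NOUN": ["harmonica", "glacier", "spreadsheet"],
    "ADP": ["of", "under"],
}

def incoherent_probe(template, vocab, seed=0):
    """Sample one filler word per tag, keeping the template's structure."""
    rng = random.Random(seed)
    return " ".join(rng.choice(vocab[tag]) for tag in template)

print(incoherent_probe(template, vocab))
```

If a model treats such a nonsense probe as belonging to the template's domain (e.g., answering, or skipping a refusal it would otherwise issue), that is evidence of a syntactic-domain spurious correlation rather than semantic understanding.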

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
FlanV2, Wikipedia (synthetic entity-pair dataset)
Applications
llm safety alignment, instruction following, entity knowledge tasks