Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs
Zhiyang Chen , Tara Saba , Xun Deng , Xujie Si , Fan Long
Published on arXiv
2509.02372
Data Poisoning Attack
OWASP ML Top 10 — ML02
Training Data Poisoning
OWASP LLM Top 10 — LLM03
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Innocuous prompts trigger malicious scam URL generation in 12.7%–43.8% of cases across seven 2025 production LLMs, while state-of-the-art guardrails detect fewer than 0.3% of such outputs
Scam2Prompt
Novel technique introduced
Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures like state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.
Key Contributions
- Scam2Prompt: an automated pipeline that infers scam site intent and synthesizes innocuous developer-style prompts to audit production LLMs for malicious code generation at scale
- Innoc2Scam-bench: a curated benchmark of 1,559 innocuous prompts that consistently elicit malicious scam URL generation across all tested LLMs
- Large-scale evaluation across 11 production LLMs (including 7 released in 2025) showing 12.7–43.8% malicious code generation rates and <0.3% detection by state-of-the-art guardrails
🛡️ Threat Analysis
Web-scale training data contains malicious scam endpoints that get absorbed by LLMs; the paper's central thesis is that this training data contamination directly causes LLMs to output malicious code — co-tagged per LLM03 requirement.