tool 2025

Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs

Zhiyang Chen , Tara Saba , Xun Deng , Xujie Si , Fan Long

0 citations

α

Published on arXiv

2509.02372

Data Poisoning Attack

OWASP ML Top 10 — ML02

Training Data Poisoning

OWASP LLM Top 10 — LLM03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Innocuous prompts trigger malicious scam URL generation in 12.7%–43.8% of cases across seven 2025 production LLMs, while state-of-the-art guardrails detect fewer than 0.3% of such outputs

Scam2Prompt

Novel technique introduced


Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures like state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.


Key Contributions

  • Scam2Prompt: an automated pipeline that infers scam site intent and synthesizes innocuous developer-style prompts to audit production LLMs for malicious code generation at scale
  • Innoc2Scam-bench: a curated benchmark of 1,559 innocuous prompts that consistently elicit malicious scam URL generation across all tested LLMs
  • Large-scale evaluation across 11 production LLMs (including 7 released in 2025) showing 12.7–43.8% malicious code generation rates and <0.3% detection by state-of-the-art guardrails

🛡️ Threat Analysis

Data Poisoning Attack

Web-scale training data contains malicious scam endpoints that get absorbed by LLMs; the paper's central thesis is that this training data contamination directly causes LLMs to output malicious code — co-tagged per LLM03 requirement.


Details

Domains
nlp
Model Types
llm
Threat Tags
training_timeinference_timeblack_box
Datasets
Innoc2Scam-bench
Applications
llm code generationdeveloper toolingllm safety auditing