Scam2Prompt: A Scalable Framework for Auditing Malicious Scam Endpoints in Production LLMs

Large Language Models (LLMs) have become critical to modern software development, but their reliance on uncurated web-scale datasets for training introduces a significant security risk: the absorption and reproduction of malicious content. To systematically evaluate this risk, we introduce Scam2Prompt, a scalable automated auditing framework that identifies the underlying intent of a scam site and then synthesizes innocuous, developer-style prompts that mirror this intent, allowing us to test whether an LLM will generate malicious code in response to these innocuous prompts. In a large-scale study of four production LLMs (GPT-4o, GPT-4o-mini, Llama-4-Scout, and DeepSeek-V3), we found that Scam2Prompt's innocuous prompts triggered malicious URL generation in 4.24% of cases. To test the persistence of this security risk, we constructed Innoc2Scam-bench, a benchmark of 1,559 innocuous prompts that consistently elicited malicious code from all four initial LLMs. When applied to seven additional production LLMs released in 2025, we found the vulnerability is not only present but severe, with malicious code generation rates ranging from 12.7% to 43.8%. Furthermore, existing safety measures like state-of-the-art guardrails proved insufficient to prevent this behavior, with an overall detection rate of less than 0.3%.

Key Contributions

Scam2Prompt: an automated pipeline that infers scam site intent and synthesizes innocuous developer-style prompts to audit production LLMs for malicious code generation at scale
Innoc2Scam-bench: a curated benchmark of 1,559 innocuous prompts that consistently elicit malicious scam URL generation across all tested LLMs
Large-scale evaluation across 11 production LLMs (including 7 released in 2025) showing 12.7–43.8% malicious code generation rates and <0.3% detection by state-of-the-art guardrails

🛡️ Threat Analysis

Data Poisoning Attack

Web-scale training data contains malicious scam endpoints that get absorbed by LLMs; the paper's central thesis is that this training data contamination directly causes LLMs to output malicious code — co-tagged per LLM03 requirement.