HFuzzer: Testing Large Language Models for Package Hallucinations via Phrase-based Fuzzing
Yukai Zhao, Menghan Wu, Xing Hu, Xin Xia
Published on arXiv (2509.23835)
AI Supply Chain Attacks
OWASP ML Top 10 — ML06
Key Finding
HFuzzer discovers 2.60x more unique hallucinated packages than mutational fuzzing and finds 46 unique hallucinated packages in GPT-4o alone, including hallucinations during environment configuration tasks.
HFuzzer
Novel technique introduced
Large Language Models (LLMs) are widely used for code generation, but they pose critical security risks in practical production settings due to package hallucinations, in which an LLM recommends non-existent packages. These hallucinations can be exploited in software supply chain attacks, where malicious attackers register the hallucinated names as harmful packages. Testing LLMs for package hallucinations is therefore essential both for mitigating them and for defending against such attacks. Although researchers have proposed testing frameworks for fact-conflicting hallucinations in natural language generation, package hallucinations remain largely unstudied. To fill this gap, we propose HFUZZER, a novel phrase-based fuzzing framework that tests LLMs for package hallucinations. HFUZZER adopts fuzzing technology and guides the model to infer a wider range of reasonable information from phrases, thereby generating sufficiently many and diverse coding tasks. Furthermore, HFUZZER extracts phrases from package information or coding tasks to keep the phrases relevant to code, improving the relevance of the generated tasks. We evaluate HFUZZER on multiple LLMs and find that it triggers package hallucinations across all selected models. Compared to a mutational fuzzing framework, HFUZZER identifies 2.60x more unique hallucinated packages and generates more diverse tasks. When testing GPT-4o, HFUZZER finds 46 unique hallucinated packages. Further analysis of GPT-4o reveals that LLMs exhibit package hallucinations not only during code generation but also when assisting with environment configuration.
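The abstract describes the pipeline only at a high level; as a rough illustration, a phrase-guided fuzzing loop of this kind could be sketched as below. Every name here (`query_llm`, `package_exists`, `extract_phrases`, `phrase_pool`) is a hypothetical stand-in, not HFUZZER's actual interface.

```python
# Minimal sketch of a phrase-based fuzzing loop for package hallucinations.
# All callables passed in are hypothetical stand-ins, not HFUZZER's API.
import random
import re

def extract_packages(code: str) -> set[str]:
    """Collect top-level module names from import statements in generated code."""
    # Note: `import a, b` only yields `a`; good enough for a sketch.
    pattern = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)
    return set(pattern.findall(code))

def fuzz_for_hallucinations(phrase_pool, query_llm, package_exists,
                            extract_phrases, rounds=100):
    """Sample phrases, build a coding task, query the model under test,
    and flag any imported package that does not exist in the registry."""
    hallucinated = set()
    for _ in range(rounds):
        phrases = random.sample(phrase_pool, k=min(3, len(phrase_pool)))
        task = "Write Python code that " + " and ".join(phrases) + "."
        code = query_llm(task)
        for name in extract_packages(code):
            if not package_exists(name):
                hallucinated.add(name)
        # Feed phrases mined from the new task back into the pool so later
        # tasks stay relevant to packages/code while continuing to diversify.
        for p in extract_phrases(task):
            if p not in phrase_pool:
                phrase_pool.append(p)
    return hallucinated
```

The design point, per the abstract, is that refreshing the phrase pool from package information and prior tasks keeps generated tasks both diverse and relevant, which is where purely mutational fuzzing falls short.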
Key Contributions
- HFuzzer: a phrase-based fuzzing framework that generates diverse coding tasks by extracting phrases from package information to trigger LLM package hallucinations (a phrase-extraction stand-in is sketched after this list)
- Discovers 2.60x more unique hallucinated packages than mutational fuzzing baselines across multiple LLMs
- Finds 46 unique hallucinated packages in GPT-4o and shows hallucinations occur not only during code generation but also during environment configuration assistance
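The first contribution hinges on extracting phrases from package information, but this summary does not spell out the extraction method. A stopword-filtered n-gram heuristic is one plausible stand-in; treat the sketch below as an assumption rather than the paper's algorithm.

```python
# Hypothetical phrase extractor: HFuzzer draws phrases from package
# information (e.g., registry summaries) or coding tasks; the exact method
# is not given here, so this uses a simple n-gram heuristic as a stand-in.
import re

STOPWORDS = {"a", "an", "the", "and", "or", "for", "of", "to", "in", "with", "that", "is"}

def extract_phrases(text: str, max_len: int = 3) -> set[str]:
    """Return short word sequences (1..max_len words), skipping n-grams
    that start or end with a stopword."""
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z-]+", text)]
    phrases = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrases.add(" ".join(gram))
    return phrases

# Example: phrases mined from a package summary feed the task generator.
print(sorted(extract_phrases("Parses YAML configuration files for ML pipelines"))[:5])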
🛡️ Threat Analysis
Package hallucinations in LLM coding assistants directly enable supply chain attacks: attackers identify LLM-hallucinated (non-existent) package names and register them with malicious code. The paper frames this as a 'software supply chain attack' vector targeting the output of AI coding assistants, which aligns with 'software vulnerabilities in AI coding assistants' under ML06.
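One practical mitigation this threat model suggests is validating any LLM-recommended package against the registry before installing it. A minimal sketch for Python packages, assuming PyPI as the target registry and the third-party `requests` library, might look like this:

```python
# Pre-install guard: check that an LLM-recommended package actually exists
# on PyPI before running `pip install`. A 404 from PyPI's JSON API means
# the name is unregistered and therefore squattable by an attacker.
import sys
import requests

def exists_on_pypi(name: str) -> bool:
    resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    return resp.status_code == 200

if __name__ == "__main__":
    for name in sys.argv[1:]:
        if exists_on_pypi(name):
            print(f"{name}: registered on PyPI (still review before installing)")
        else:
            print(f"{name}: NOT on PyPI -- possible hallucination, do not install")
```

Existence alone is not a sufficient defense: once a hallucinated name circulates widely, an attacker may already have registered it, so even registered packages warrant scrutiny of their age, maintainers, and release history.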