defense arXiv Aug 4, 2025
Kennedy Edemacu, Vinay M. Shashidhar, Micheal Tuape et al. · The City University of New York · Northern Michigan University +4 more
Defends RAG systems against knowledge poisoning by filtering adversarial texts from retrieved context before LLM generation
Data Poisoning Attack Prompt Injection nlp
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, in which injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we identify a new property that differentiates adversarial from clean texts in the knowledge data source. Next, we employ this property to filter adversarial texts out of the retrieved context in our proposed approaches. Evaluation of these methods on benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems.
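The distinguishing property the paper uses is not detailed in this summary. As a minimal sketch of the filtering idea only: PoisonedRAG adversarial texts are optimized to match the target question, so one crude, hypothetical proxy is to drop retrieved passages whose query similarity is a statistical outlier. The Jaccard/z-score heuristic and all names below are illustrative assumptions, not the authors' FilterRAG.

```python
from statistics import mean, stdev

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def filter_retrieved(query: str, passages: list[str],
                     z_thresh: float = 1.5) -> list[str]:
    """Drop passages whose similarity to the query is an extreme
    outlier -- a crude proxy for adversarial texts crafted to
    maximize retrieval similarity to the target question."""
    if len(passages) < 3:
        return passages  # too few samples to estimate an outlier threshold
    sims = [jaccard(query, p) for p in passages]
    mu, sigma = mean(sims), stdev(sims)
    if sigma == 0:
        return passages
    # Keep only passages whose similarity z-score stays below threshold.
    return [p for p, s in zip(passages, sims) if (s - mu) / sigma <= z_thresh]
```

In practice a real defense would use embedding similarity from the retriever itself rather than word overlap; the point of the sketch is only the filter-before-generate placement in the RAG pipeline.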
llm transformer traditional_ml The City University of New York · Northern Michigan University · Lappeenranta-Lahti University of Technology +3 more
attack arXiv Feb 26, 2026
Kennedy Edemacu, Mohammad Mahdi Shokri · The City University of New York
Poisons multimodal RAG knowledge base metadata to manipulate retrieval and induce attacker-desired VLM responses with 91% success rate
Data Poisoning Attack Prompt Injection multimodal nlp
Retrieval-augmented generation (RAG) has emerged as a powerful paradigm for enhancing multimodal large language models by grounding their responses in external, factual knowledge and thus mitigating hallucinations. However, the integration of externally sourced knowledge bases introduces a critical attack surface. Adversaries can inject malicious multimodal content capable of influencing both retrieval and downstream generation. In this work, we present MM-MEPA, a multimodal poisoning attack that targets the metadata components of image-text entries while leaving the associated visual content unaltered. By manipulating only the metadata, MM-MEPA can still steer multimodal retrieval and induce attacker-desired model responses. We evaluate the attack across multiple benchmark settings and demonstrate its severity. MM-MEPA achieves an attack success rate of up to 91%, consistently disrupting system behaviors across four retrievers and two multimodal generators. Additionally, we assess representative defense strategies and find them largely ineffective against this form of metadata-only poisoning. Our findings expose a critical vulnerability in multimodal RAG and underscore the urgent need for more robust, defense-aware retrieval and knowledge integration methods.
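MM-MEPA itself targets embedding-based multimodal retrievers; as a toy sketch of the metadata-only idea, the code below uses word overlap over captions as a stand-in retriever and shows how rewriting just the caption of a knowledge-base entry, with the image untouched, can hijack top-k retrieval for a target query. All entry contents and function names are hypothetical illustrations, not the paper's method.

```python
def overlap(query: str, text: str) -> float:
    """Fraction of query words that appear in the text (toy retrieval score)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def retrieve(query: str, kb: list[dict], k: int = 1) -> list[dict]:
    """Toy text-side retriever: rank entries by caption overlap with the query."""
    return sorted(kb, key=lambda e: overlap(query, e["caption"]), reverse=True)[:k]

def poison_entry(clean_entry: dict, target_query: str, payload: str) -> dict:
    """Metadata-only poisoning: keep the image bytes/path unchanged,
    rewrite only the caption so the entry both matches the target
    query and carries the attacker's desired message."""
    return {
        "image": clean_entry["image"],          # visual content unaltered
        "caption": f"{target_query}. {payload}" # manipulated metadata only
    }
```

The design point mirrored here is that nothing visual changes, so image-level integrity checks pass, yet the retriever's text-side scoring is fully controlled by the attacker-written caption.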
vlm llm multimodal The City University of New York