Highlight & Summarize: RAG without the jailbreaks

Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. When interacting with a chatbot, malicious users can input specially crafted prompts that cause the LLM to generate undesirable content or perform a different task from its intended purpose. Existing systems attempt to mitigate this by hardening the LLM's system prompt or using additional classifiers to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. We present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user's question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user's question and extracts ("highlights") relevant passages from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe and implement several possible instantiations of H&S and evaluate their responses in terms of correctness, relevance, and quality. For certain question-answering (QA) tasks, the responses produced by H&S are judged to be as good, if not better, than those of a standard RAG pipeline.

Key Contributions

Highlight & Summarize (H&S): a RAG design pattern that prevents jailbreaks and model hijacking by never exposing user queries to the generative LLM, splitting the pipeline into a highlighter (extractive) and summarizer (generative) component
Formal security analysis with Lean formalization showing H&S exponentially reduces attacker control over the pipeline, plus empirical evaluation against non-adaptive and adaptive attacks
Empirical evaluation on RepliQA and BioASQ showing H&S achieves comparable or better answer quality than vanilla RAG (e.g., correctness 95% vs 94% on RepliQA)