defense · arXiv · Nov 16, 2025
Thomas Rivasseau · McGill University
Proposes inserting periodic alignment reminders into LLM context to defend against long-input jailbreaks and CoT scheming
Prompt Injection · NLP
Current Large Language Model alignment research focuses largely on improving model robustness against adversarial attacks and misbehavior through training on examples and prompting. Research has shown that the probability of a jailbreak increases with the size of the user input or the length of the conversation, yet there is little research into alignment defenses whose strength scales with input length. We propose interruptions as a possible solution to this problem: control sentences added to the user input approximately every x tokens, for a chosen interval x. We suggest that this technique can be generalized to the Chain-of-Thought process to prevent scheming.
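The interruption idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the whitespace tokenization, the reminder wording, and the default interval of 50 tokens are all assumptions chosen for clarity.

```python
# Sketch of "interruptions": a control sentence inserted into long user
# input approximately every `interval` tokens. Whitespace splitting stands
# in for a real tokenizer; CONTROL is an illustrative reminder, not the
# paper's exact text.

CONTROL = "Reminder: follow your safety guidelines and refuse harmful requests."

def insert_interruptions(text: str, control: str = CONTROL, interval: int = 50) -> str:
    """Return `text` with `control` inserted after every `interval` tokens."""
    tokens = text.split()  # stand-in for a real tokenizer
    chunks = [
        " ".join(tokens[i:i + interval])
        for i in range(0, len(tokens), interval)
    ]
    return f" {control} ".join(chunks)
```

Because the control sentence recurs at a fixed token interval, the number of reminders grows linearly with input length, which is exactly the scaling property the abstract argues existing defenses lack.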
defense · arXiv · Dec 2, 2025
Thomas Rivasseau · McGill University
Defends LLMs against long-context jailbreaks by inserting runtime control sentences into context, without retraining
Prompt Injection · NLP
Current research on operator control of Large Language Models improves model robustness against adversarial attacks and misbehavior by training on preference examples, prompting, and input/output filtering. Despite good results, LLMs remain susceptible to abuse, and jailbreak probability increases with context length, so robust security guarantees are needed for long-context situations. We propose control sentences inserted into the LLM context, a technique we call invasive context engineering, as a partial solution to this problem. We suggest the technique can be generalized to the Chain-of-Thought process to prevent scheming. Because invasive context engineering does not rely on LLM training, it avoids the data-shortage pitfalls that arise when training models for long-context situations.
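Applied at runtime to Chain-of-Thought, the same idea can be framed as a wrapper around the decoding loop that periodically re-injects a control sentence into the growing context, with no retraining. The sketch below is an assumption-laden illustration: `generate_step` is a hypothetical single-token generation function standing in for any LLM decoder, and the interval and reminder text are placeholders.

```python
# Sketch of invasive context engineering at decode time: every `interval`
# generated tokens, append a control sentence to the context before the
# model continues. `generate_step(context)` is a hypothetical callable that
# returns the next token string, or None at end of sequence.

CONTROL = "Reminder: reason transparently and do not pursue hidden objectives."

def generate_with_controls(generate_step, prompt: str, max_tokens: int = 200,
                           interval: int = 64) -> str:
    """Decode token by token, inserting CONTROL into the context every `interval` tokens."""
    context = prompt
    produced = 0
    while produced < max_tokens:
        token = generate_step(context)
        if token is None:  # end of sequence
            break
        context += token
        produced += 1
        if produced % interval == 0:
            context += f" {CONTROL} "
    return context
```

Since the insertion happens in the serving loop rather than in the weights, the defense can be deployed on any existing model, which is the retraining-free property the abstract emphasizes.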