Latest papers

2 papers
benchmark · arXiv · Feb 18, 2026

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal et al. · Independent Researcher · EPFL, and 4 other affiliations

Benchmarks multi-turn, multilingual jailbreaking of LLM agents using a step-by-step illicit planning framework and novel time-to-jailbreak metrics

Prompt Injection · Excessive Agency · nlp
defense · arXiv · Sep 6, 2025

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs

Debdeep Sanyal, Manodeep Ray, Murari Mandal · KIIT

Defends open-weight LLMs against malicious fine-tuning via bi-level adversarial training with a LoRA-generating hypernetwork adversary

Transfer Learning Attack · Prompt Injection · nlp