Latest papers

5 papers
benchmark · arXiv · Nov 24, 2025

Automating Deception: Scalable Multi-Turn LLM Jailbreaks

Adarsh Kumarappan, Ananya Mujoo · California Institute of Technology · Evergreen Valley College

An automated pipeline generates 1,500 psychologically grounded multi-turn foot-in-the-door (FITD) jailbreaks; the GPT family shows a 32-percentage-point increase in attack success rate (ASR) when conversational history is included

Prompt Injection nlp
2 citations PDF
defense · arXiv · Nov 24, 2025

Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Adarsh Kumarappan, Ayushi Mehrotra · California Institute of Technology

A probabilistic (k, ε)-unstable certificate tightens SmoothLLM's jailbreak-defense guarantees against both GCG and PAIR attacks

Input Manipulation Attack Prompt Injection nlp
1 citation PDF Code
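The randomized-smoothing step at the heart of SmoothLLM, which the certificate above analyzes, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `respond` and `is_jailbroken` callables are hypothetical stand-ins for the target model and a jailbreak judge, and character-swap perturbation with majority voting is the general SmoothLLM recipe.

```python
import random

def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Randomly swap a fraction q of the prompt's characters."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz !?")
    return "".join(chars)

def smooth_defense(prompt, respond, is_jailbroken, n=10, q=0.1, seed=0):
    """Query the model on n perturbed copies and majority-vote on safety."""
    rng = random.Random(seed)
    responses = [respond(perturb(prompt, q, rng)) for _ in range(n)]
    flags = [is_jailbroken(r) for r in responses]
    # Refuse if a majority of perturbed copies elicit a jailbreak.
    if sum(flags) > n / 2:
        return "I can't help with that."
    # Otherwise return any response the judge marked safe.
    for r, f in zip(responses, flags):
        if not f:
            return r
    return responses[0]
```

The intuition the certificate formalizes: adversarial suffixes such as GCG's are brittle to character-level noise, so most perturbed copies fail to jailbreak and the vote refuses.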
defense · arXiv · Oct 5, 2025

Concept-Based Masking: A Patch-Agnostic Defense Against Adversarial Patch Attacks

Ayushi Mehrotra, Derek Peng, Dipkamal Bhusal et al. · California Institute of Technology · University of California +1 more

Defends against adversarial patch attacks by masking the top concept activation vectors (CAVs), requiring no prior knowledge of patch size or location

Input Manipulation Attack vision
PDF Code
defense · arXiv · Sep 30, 2025

PUREVQ-GAN: Defending Data Poisoning Attacks through Vector-Quantized Bottlenecks

Alexander Branch, Omead Pooladzandi, Radin Khosraviani et al. · University of California · California Institute of Technology

Defends image classifiers against poisoning and backdoor attacks with a VQ-VAE bottleneck that destroys fine-grained trigger patterns before training

Data Poisoning Attack Model Poisoning vision
PDF
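The vector-quantization bottleneck underlying the defense can be illustrated with plain nearest-codeword snapping. A minimal sketch, not the PUREVQ-GAN pipeline: a real VQ-VAE learns the codebook and operates on encoder features, whereas here `vq_bottleneck` just quantizes vectors against a fixed codebook to show why sub-codeword perturbations vanish.

```python
import numpy as np

def vq_bottleneck(patches: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Snap each vector to its nearest codebook entry (L2 distance).

    patches:  (n, d) flattened patches or encoder features.
    codebook: (k, d) codebook entries.
    Perturbations smaller than half the spacing between codewords are
    absorbed by quantization, which removes pixel-level poison triggers.
    """
    # Pairwise squared distances between patches and codewords: (n, k)
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    return codebook[nearest]
```

Running a poisoned training set through such a bottleneck (plus a decoder, in the paper's GAN setup) yields clean-looking images whose fine-grained triggers no longer survive.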
benchmark · arXiv · Aug 11, 2025

Multi-Turn Jailbreaks Are Simpler Than They Seem

Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick et al. · Imperial College London · California Institute of Technology

Empirical analysis reveals that multi-turn LLM jailbreaks are no more sophisticated than repeatedly resampling single-turn attacks

Prompt Injection nlp
PDF Code
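The single-turn resampling baseline that the finding above compares against is essentially best-of-n sampling. A hedged sketch with hypothetical callables (`paraphrase`, `respond`, `is_jailbroken` stand in for an attack sampler, the target model, and a judge):

```python
import random

def best_of_n_attack(base_prompt, paraphrase, respond, is_jailbroken,
                     n=16, seed=0):
    """Resample a single-turn attack n times instead of building a
    multi-turn conversation; succeed if any sample jailbreaks the model."""
    rng = random.Random(seed)
    for _ in range(n):
        attempt = paraphrase(base_prompt, rng)
        response = respond(attempt)
        if is_jailbroken(response):
            return attempt, response  # first successful sample
    return None
```

The paper's point is that this memoryless loop matches the attack success rate of elaborate multi-turn strategies, so conversational history adds less than it appears to.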