defense · arXiv · Nov 24, 2025
Adarsh Kumarappan, Ayushi Mehrotra · California Institute of Technology
Probabilistic (k, ε)-unstable certificate tightens SmoothLLM's jailbreak defense guarantees for both GCG and PAIR attacks
Input Manipulation Attack · Prompt Injection · nlp
The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice, limiting the trustworthiness of the resulting safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, $\varepsilon$)-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, yielding a more trustworthy and practical safety certificate. The (k, $\varepsilon$)-unstable framework gives practitioners actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically grounded mechanism for making LLMs more resistant to exploitation of their safety alignment, a critical challenge in secure AI deployment.
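To make the shape of such a certificate concrete, the sketch below shows one way a (k, ε)-unstable lower bound could be computed. This is a minimal illustration under stated assumptions, not the paper's code: the i.i.d. per-character perturbation model, the constant residual success probability eps, and all function names are ours.

```python
# Minimal sketch of a (k, eps)-unstable certificate for SmoothLLM-style
# smoothing. Assumptions (not from the paper): each suffix character is
# independently perturbed with rate q, and once >= k characters are
# flipped the attack still succeeds with probability at most eps.
from math import comb

def per_copy_defense_prob(m: int, q: float, k: int, eps: float) -> float:
    """Lower bound on the probability that one perturbed copy defeats the attack.

    m   -- length of the adversarial suffix (characters)
    q   -- per-character perturbation rate
    k   -- instability threshold (attack breaks once >= k characters flip)
    eps -- residual attack success probability given >= k flips;
           eps = 0 recovers the strict k-unstable assumption.
    """
    # P[at least k of the m suffix characters are perturbed]
    p_ge_k = sum(comb(m, i) * q**i * (1 - q)**(m - i) for i in range(k, m + 1))
    # Even with >= k flips, the attack may still succeed with prob <= eps.
    return p_ge_k * (1 - eps)

def certified_defense_prob(n_copies: int, alpha: float) -> float:
    """P[a majority of n i.i.d. perturbed copies are safe]: a binomial tail."""
    t_min = n_copies // 2 + 1
    return sum(comb(n_copies, t) * alpha**t * (1 - alpha)**(n_copies - t)
               for t in range(t_min, n_copies + 1))

# Example: a 20-character GCG suffix, 20% swap rate, k = 2, eps = 0.05.
alpha = per_copy_defense_prob(m=20, q=0.20, k=2, eps=0.05)
print(f"per-copy defense prob >= {alpha:.3f}")
print(f"certified (9 copies)  >= {certified_defense_prob(9, alpha):.3f}")
```

Setting eps = 0 collapses the bound back to the strict k-unstable certificate, so the probabilistic relaxation strictly generalizes it; a data-informed eps (e.g., fitted from empirical attack-success measurements, as the abstract describes) is what would make the certificate both tighter to reality and more trustworthy.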
llm · transformer
defense · arXiv · Oct 5, 2025
Ayushi Mehrotra, Derek Peng, Dipkamal Bhusal et al. · California Institute of Technology · University of California · Rochester Institute of Technology
Defends against adversarial patches by masking top concept activation vectors, requiring no prior knowledge of patch size or location
Input Manipulation Attack · vision
Adversarial patch attacks pose a practical threat to deep learning models by forcing targeted misclassifications through localized perturbations, often realized in the physical world. Existing defenses typically assume prior knowledge of patch size or location, limiting their applicability. In this work, we propose a patch-agnostic defense that leverages concept-based explanations to identify and suppress the most influential concept activation vectors, thereby neutralizing patch effects without explicit detection. Evaluated on Imagenette with a ResNet-50, our method achieves higher robust and clean accuracy than the state-of-the-art PatchCleanser, while maintaining strong performance across varying patch sizes and locations. Our results highlight the promise of combining interpretability with robustness and suggest concept-driven defenses as a scalable strategy for securing machine learning models against adversarial patch attacks.
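As a rough illustration of the mechanism described above, the PyTorch sketch below scores pooled features against a bank of concept activation vectors (CAVs) and projects out the most strongly activated directions before classification. It is an assumption-laden sketch, not the authors' implementation: where the backbone is split, how the CAVs are learned, and the top-k suppression rule are all unspecified in the abstract.

```python
# Hedged sketch of CAV masking as a patch-agnostic defense. The CAV bank,
# backbone split point, and top-k rule are illustrative assumptions.
import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval()  # load pretrained weights in practice
backbone = torch.nn.Sequential(*list(model.children())[:-1])  # conv stack + avgpool
head = model.fc

@torch.no_grad()
def classify_with_cav_masking(x: torch.Tensor, cavs: torch.Tensor, top_k: int = 3):
    """Suppress the top-k most strongly activated concept directions,
    neutralizing a patch's dominant concepts without locating the patch."""
    feats = backbone(x).flatten(1)                  # (B, 2048) pooled features
    scores = feats @ cavs.T                         # (B, n_concepts) activations
    top = scores.abs().topk(top_k, dim=1).indices   # dominant concepts per image
    for b in range(x.shape[0]):
        for c in top[b]:
            v = cavs[c]                             # unit-norm concept direction
            # Sequential orthogonal projection (treats CAVs as ~orthogonal).
            feats[b] -= (feats[b] @ v) * v
    return head(feats)

# Usage with random stand-ins; replace with learned CAVs and real images.
cavs = torch.nn.functional.normalize(torch.randn(50, 2048), dim=1)
logits = classify_with_cav_masking(torch.randn(2, 3, 224, 224), cavs)
```

Because the masking depends only on which concepts fire most strongly, it needs no estimate of patch size or location, which is the property the abstract highlights over detection-based defenses such as PatchCleanser.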
cnn