Latest papers

2 papers
attack · arXiv · Oct 23, 2025

Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training

Zheng-Xin Yong, Stephen H. Bach · Brown University

Shows that reasoning LLMs self-jailbreak via chain-of-thought after benign math/code fine-tuning, even while recognizing the requests as harmful

Transfer Learning · Attack · Prompt Injection · nlp
PDF · Code
tool · arXiv · Aug 21, 2025

PickleBall: Secure Deserialization of Pickle-based Machine Learning Models (Extended Report)

Andreas D. Kellas, Neophytos Christou, Wenxin Jiang et al. · Columbia University · Brown University +4 more

Defends against malicious pickle-based ML models on Hugging Face via static analysis and dynamic policy enforcement at load time

AI Supply Chain Attacks
PDF