ML Security Papers

Latest papers

2 papers

tool · arXiv · Feb 1, 2026

José Pombal, Maya D'Eon, Nuno M. Guerreiro et al. · Sword Health · Instituto de Telecomunicações +1 more

Lightweight guardrail classifiers for LLM mental health chatbots reduce adversarial attack success rates versus general-purpose safeguards

Prompt Injection · nlp

attack · arXiv · Oct 29, 2025

André V. Duarte, Xuying Li, Bin Zeng et al. · Carnegie Mellon University · Instituto Superior Técnico +1 more

Agentic feedback-loop pipeline extracts memorized copyrighted books from LLMs, improving ROUGE-L by 24% over single-pass extraction

Model Inversion Attack · Sensitive Information Disclosure · nlp