ML Security Papers

Latest papers

1 paper

tool · arXiv · Dec 5, 2025

Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models

Mahesh Kumar Nandwana, Youngwan Lim, Joseph Liu et al. · Roblox

Deploys a fine-tuned LLM guardrail that detects jailbreaks and harmful content across dynamic, user-defined safety taxonomies

Prompt Injection · nlp