ML Security Papers

Latest papers

17 papers

benchmark arXiv Apr 20, 2026 · 4w ago

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

Marcello Galisai, Susanna Cifani, Francesco Giarrusso et al. · Sapienza University of Rome · DEXAI +1 more

Benchmark showing 55.75% jailbreak success across 31 LLMs using humanities-style prompt obfuscation to evade safety guardrails

Prompt Injection nlp

PDF Code

defense arXiv Apr 7, 2026 · 6w ago

Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

Igor Maljkovic, Maria Rosaria Briglia, Iacopo Masi et al. · University of Genoa · Sapienza University of Rome +1 more

Detects and sanitizes harmful VLM prompts using hyperbolic geometry anomaly detection and explainable word attribution methods

Prompt Injection multimodal vision nlp

PDF Code

defense arXiv Mar 27, 2026 · 8w ago

A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models

Mujtaba Hussain Mirza, Antonio D'Orazio, Odelia Melamed et al. · Sapienza University of Rome · Weizmann Institute of Science

Training-free test-time defense using energy minimization to purify adversarial inputs for classifiers and vision-language models

Input Manipulation Attack vision multimodal nlp

PDF Code

tool arXiv Mar 19, 2026 · 9w ago

Automatic detection of Gen-AI texts: A comparative framework of neural models

Cristian Buttaro, Irene Amerini · Sapienza University of Rome

Supervised neural detectors for AI-generated text achieve more stable cross-language performance than commercial tools like GPTZero

Output Integrity Attack nlp

PDF Code

attack arXiv Feb 3, 2026 · Feb 2026

Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan · University of Cagliari · Sapienza University of Rome +1 more

Demonstrates GCG prefix-placement jailbreaks achieve higher ASR than suffixes, exposing blind spots in LLM safety evaluation

Input Manipulation Attack Prompt Injection nlp

PDF

attack arXiv Dec 16, 2025 · Dec 2025

From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Piercosma Bisconti, Marcello Galisai, Matteo Prandi et al. · Sapienza University of Rome · VU Amsterdam +1 more

Novel jailbreak embeds harmful content in cyberpunk tales using Proppian analysis to bypass LLM safety, achieving 71.3% ASR across 26 models

Prompt Injection nlp

1 citation PDF

survey arXiv Dec 10, 2025 · Dec 2025

Chasing Shadows: Pitfalls in LLM Security Research

Jonathan Evertz, Niklas Risse, Nicolai Neuer et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Security and Privacy +4 more

Surveys nine methodological pitfalls in LLM security research, at least one of which appears in every one of the 72 papers reviewed, with case studies showing how each pitfall can skew results

Data Poisoning Attack Prompt Injection nlp

2 citations PDF

attack arXiv Dec 10, 2025 · Dec 2025

Membership and Dataset Inference Attacks on Large Audio Generative Models

Jakub Proboszcz, Paweł Kochanski, Karol Korszun et al. · Warsaw University of Technology · Sapienza University of Rome +2 more

Extends dataset inference attacks to audio generative models, showing DI succeeds at copyright verification where single-sample MIA fails

Membership Inference Attack audio generative

PDF

defense arXiv Nov 24, 2025 · Nov 2025

Subtract the Corruption: Training-Data-Free Corrective Machine Unlearning using Task Arithmetic

Mostafa Mozafari, Farooq Ahmad Wani, Maria Sofia Bucarelli et al. · Sapienza University of Rome

Removes backdoor triggers and label-noise poisoning post-training via task arithmetic weight subtraction without original training data

Model Poisoning Data Poisoning Attack vision

PDF

attack arXiv Nov 19, 2025 · Nov 2025

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Piercosma Bisconti, Matteo Prandi, Federico Pierucci et al. · DEXAI – Icaro Lab · Sapienza University of Rome +2 more

Adversarial poetry jailbreaks 25 frontier LLMs with 62% average success rate, exposing a universal stylistic bypass of safety alignment

Prompt Injection nlp

9 citations 1 influential PDF

attack arXiv Oct 17, 2025 · Oct 2025

Language Models are Injective and Hence Invertible

Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi et al. · EPFL · Archimedes/Athena RC +3 more

Proves LLMs are injective and introduces SipIt to exactly reconstruct private input text from hidden activations

Model Inversion Attack Sensitive Information Disclosure nlp

15 citations 3 influential PDF

benchmark arXiv Oct 14, 2025 · Oct 2025

Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Francesco Giarrusso, Olga E. Sorokoletova, Vincenzo Suriani et al. · Sapienza University of Rome

Proposes a 7-family jailbreak taxonomy, Italian multi-turn dataset, and GPT-5 detection benchmark for LLM safety

Prompt Injection Benchmarks & Evaluation Red-Team Agents nlp

2 citations PDF

defense arXiv Oct 2, 2025 · Oct 2025

Inverse Language Modeling towards Robust and Grounded LLMs

Davide Gabrielli, Simone Sestito, Iacopo Masi · Sapienza University of Rome

Defends LLMs against adversarial perturbations and unsafe triggers by inverting model outputs to expose attack inputs

Input Manipulation Attack Prompt Injection nlp

PDF Code

attack arXiv Sep 25, 2025 · Sep 2025

Evading Overlapping Community Detection via Proxy Node Injection

Dario Loi, Matteo Silvestri, Fabrizio Silvestri et al. · Sapienza University of Rome

DRL-based graph evasion attack injects proxy nodes to hide community membership from overlapping graph detectors

Input Manipulation Attack graph

PDF Code

attack arXiv Sep 11, 2025 · Sep 2025

The Coding Limits of Robust Watermarking for Generative Models

Danilo Francati, Yevin Nikhel Goonatilake, Shubham Pawar et al. · Sapienza University of Rome · George Mason University +1 more

Proves binary watermarks for generative models break above 50% bit corruption and demonstrates crop-resize defeats real image watermarking

Output Integrity Attack generative vision

PDF

benchmark arXiv Aug 29, 2025 · Aug 2025

Revisiting Deepfake Detection: Chronological Continual Learning and the Limits of Generalization

Federico Fontana, Anxhelo Diko, Romeo Lanzino et al. · Sapienza University of Rome · University Ibn Khaldoun of Tiaret +1 more

Continual learning framework for deepfake detection adapts 155x faster than retraining but reveals near-random future generalization

Output Integrity Attack vision

PDF

survey arXiv Aug 19, 2025 · Aug 2025

On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions

Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos et al. · Sapienza University of Rome

Surveys 200+ papers on FL security and privacy: Byzantine/poisoning attacks, backdoors, gradient leakage, and defenses

Data Poisoning Attack Model Poisoning Model Inversion Attack federated-learning

PDF
