Latest papers

15 papers
defense arXiv Mar 27, 2026 · 10d ago

A Provable Energy-Guided Test-Time Defense Boosting Adversarial Robustness of Large Vision-Language Models

Mujtaba Hussain Mirza, Antonio D'Orazio, Odelia Melamed et al. · Sapienza University of Rome · Weizmann Institute of Science

Training-free test-time defense that uses energy minimization to purify adversarial inputs for classifiers and vision-language models (illustrative sketch below)

Input Manipulation Attack vision multimodal nlp
PDF Code
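
The purification idea can be illustrated with a short, hedged sketch: a generic gradient descent on the input under some scalar energy model. The `energy_fn`, step count, and step size are placeholders, not the paper's actual configuration.

```python
import torch

def purify(x, energy_fn, steps=10, step_size=0.01):
    """Test-time input purification sketch (generic, not the paper's method).

    energy_fn is a placeholder for any scalar energy model (e.g. an EBM score
    or a reconstruction loss); lower energy is assumed to mean "more natural".
    """
    x = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(x).sum()              # scalar energy of the batch
        grad, = torch.autograd.grad(energy, x)   # d(energy) / d(input)
        with torch.no_grad():
            x -= step_size * grad                # gradient step on the input
            x.clamp_(0.0, 1.0)                   # keep a valid image range
    return x.detach()
```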
tool arXiv Mar 19, 2026 · 18d ago

Automatic detection of Gen-AI texts: A comparative framework of neural models

Cristian Buttaro, Irene Amerini · Sapienza University of Rome

Supervised neural detectors for AI-generated text achieve more stable cross-language performance than commercial tools like GPTZero

Output Integrity Attack nlp
PDF Code
attack arXiv Feb 3, 2026 · 8w ago

Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan · University of Cagliari · Sapienza University of Rome +1 more

Demonstrates that GCG prefix-placement jailbreaks achieve a higher ASR than suffix placement, exposing blind spots in LLM safety evaluation (placement sketch below)

Input Manipulation Attack Prompt Injection nlp
PDF
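
The placement question the paper studies can be made concrete with a tiny hedged sketch: the same optimized adversarial string can be prepended or appended to the request. The GCG optimization of the string itself is not shown, and the prompt format here is a placeholder.

```python
def place_adversarial(request: str, adv_string: str, position: str = "suffix") -> str:
    """Prefix vs. suffix placement of an already-optimized adversarial string.

    adv_string would come from a GCG-style search; only its position in the
    final prompt is illustrated here.
    """
    if position == "prefix":
        return f"{adv_string} {request}"
    return f"{request} {adv_string}"
```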
attack arXiv Dec 16, 2025 · Dec 2025

From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Piercosma Bisconti, Marcello Galisai, Matteo Prandi et al. · Sapienza University of Rome · VU Amsterdam +1 more

Novel jailbreak embeds harmful content in cyberpunk tales using Proppian analysis to bypass LLM safety, achieving 71.3% ASR across 26 models

Prompt Injection nlp
1 citation PDF
survey arXiv Dec 10, 2025 · Dec 2025

Chasing Shadows: Pitfalls in LLM Security Research

Jonathan Evertz, Niklas Risse, Nicolai Neuer et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Security and Privacy +4 more

Surveys nine methodological pitfalls in LLM security research, found across all 72 surveyed papers, with case studies showing how each one distorts results

Data Poisoning Attack Prompt Injection nlp
2 citations PDF
attack arXiv Dec 10, 2025 · Dec 2025

Membership and Dataset Inference Attacks on Large Audio Generative Models

Jakub Proboszcz, Paweł Kochanski, Karol Korszun et al. · Warsaw University of Technology · Sapienza University of Rome +2 more

Extends dataset inference attacks to audio generative models, showing that DI succeeds at copyright verification where single-sample MIA fails (aggregation sketch below)

Membership Inference Attack audio generative
PDF
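
The gap between single-sample MIA and dataset inference can be sketched as follows: instead of thresholding one sample's membership score, DI aggregates scores over the whole suspect set and runs a hypothesis test. The scoring signal and the test used here are placeholders, not the paper's exact procedure.

```python
from scipy import stats

def dataset_inference(suspect_scores, reference_scores, alpha=0.01):
    """Dataset-inference sketch: aggregate many per-sample membership signals.

    suspect_scores / reference_scores are placeholder arrays of per-sample
    scores (e.g. negative loss) for the suspect dataset and for known
    non-members; a one-sided Welch t-test asks whether the suspect set looks
    systematically more "member-like" than the reference set.
    """
    t_stat, p_value = stats.ttest_ind(
        suspect_scores, reference_scores,
        equal_var=False, alternative="greater",
    )
    return p_value < alpha, p_value

# A single sample's score is too noisy to threshold reliably, but aggregating
# many samples makes the member / non-member gap statistically visible.
```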
defense arXiv Nov 24, 2025 · Nov 2025

Subtract the Corruption: Training-Data-Free Corrective Machine Unlearning using Task Arithmetic

Mostafa Mozafari, Farooq Ahmad Wani, Maria Sofia Bucarelli et al. · Sapienza University of Rome

Removes backdoor triggers and label-noise poisoning after training via task-arithmetic weight subtraction, without the original training data (generic sketch below)

Model Poisoning Data Poisoning Attack vision
PDF
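
Task-arithmetic unlearning in its generic form is easy to sketch: estimate a task vector that captures the corrupted behaviour and subtract a scaled copy of it from the deployed weights. How that vector is obtained without the original training data is the paper's contribution and is not shown; `theta_adapted` and `alpha` are placeholders.

```python
import torch

def subtract_task_vector(theta_model, theta_base, theta_adapted, alpha=1.0):
    """Generic task-arithmetic negation (not the paper's exact recipe).

    All arguments are state dicts (name -> tensor). The task vector
    tau = theta_adapted - theta_base is assumed to capture the unwanted
    behaviour; subtracting alpha * tau aims to remove it from theta_model.
    """
    edited = {}
    for name, weight in theta_model.items():
        tau = theta_adapted[name] - theta_base[name]
        edited[name] = weight - alpha * tau
    return edited
```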
attack arXiv Nov 19, 2025 · Nov 2025

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Piercosma Bisconti, Matteo Prandi, Federico Pierucci et al. · DEXAI – Icaro Lab · Sapienza University of Rome +2 more

Adversarial poetry jailbreaks 25 frontier LLMs with a 62% average success rate, exposing a universal stylistic bypass of safety alignment

Prompt Injection nlp
9 citations 1 influential PDF
attack arXiv Oct 17, 2025 · Oct 2025

Language Models are Injective and Hence Invertible

Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi et al. · EPFL · Archimedes/Athena RC +3 more

Proves that LLMs are injective and introduces SipIt, which exactly reconstructs private input text from hidden activations (brute-force sketch below)

Model Inversion Attack Sensitive Information Disclosure nlp
15 citations 3 influential PDF
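
A brute-force version of the recovery idea can be sketched as below: walk left to right and, at each position, pick the vocabulary token whose hidden state matches the observed activation. The model interface (a Hugging Face-style call with output_hidden_states=True), the layer index, and the tolerance are assumptions; the actual SipIt algorithm and its exactness guarantees are in the paper.

```python
import torch

@torch.no_grad()
def recover_input(model, target_hidden, vocab_size, layer=1, tol=1e-4):
    """Brute-force sketch of recovering input tokens from hidden activations.

    target_hidden has shape [seq_len, d] and holds the observed activations at
    `layer`. For each position we try every vocabulary token and keep the one
    whose activation matches; injectivity is what makes that match unique.
    This is an illustration only and is far slower than the paper's SipIt.
    """
    recovered = []
    for pos in range(target_hidden.shape[0]):
        best_tok, best_err = None, float("inf")
        for tok in range(vocab_size):
            ids = torch.tensor([recovered + [tok]])
            out = model(ids, output_hidden_states=True)
            err = (out.hidden_states[layer][0, pos] - target_hidden[pos]).norm().item()
            if err < best_err:
                best_tok, best_err = tok, err
        if best_err > tol:
            break                      # no sufficiently close match found
        recovered.append(best_tok)
    return recovered
```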
benchmark arXiv Oct 14, 2025 · Oct 2025

Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Francesco Giarrusso, Olga E. Sorokoletova, Vincenzo Suriani et al. · Sapienza University of Rome

Proposes a 7-family jailbreak taxonomy, an Italian multi-turn dataset, and a GPT-5 detection benchmark for LLM safety

Prompt Injection nlp
2 citations PDF
defense arXiv Oct 2, 2025 · Oct 2025

Inverse Language Modeling towards Robust and Grounded LLMs

Davide Gabrielli, Simone Sestito, Iacopo Masi · Sapienza University of Rome

Defends LLMs against adversarial perturbations and unsafe triggers by inverting model outputs to expose attack inputs

Input Manipulation Attack Prompt Injection nlp
PDF Code
attack arXiv Sep 25, 2025 · Sep 2025

Evading Overlapping Community Detection via Proxy Node Injection

Dario Loi, Matteo Silvestri, Fabrizio Silvestri et al. · Sapienza University of Rome

DRL-based graph evasion attack injects proxy nodes to hide community membership from overlapping community detectors (heuristic sketch below)

Input Manipulation Attack graph
PDF Code
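
A heuristic illustration of proxy-node injection follows; the paper learns edge placements with deep RL, whereas the placement rule, node labels, and `outside_nodes` here are placeholders.

```python
import networkx as nx

def inject_proxy_nodes(graph, target, outside_nodes, n_proxies=3):
    """Illustrative proxy-node injection on an integer-labelled graph.

    Each proxy connects the target to nodes outside its community, diluting
    the structural signal that an overlapping community detector relies on.
    The paper instead trains a DRL policy to choose these edges.
    """
    g = graph.copy()
    next_id = max(g.nodes) + 1                  # assumes integer node labels
    for i in range(n_proxies):
        proxy = next_id + i
        g.add_node(proxy)
        g.add_edge(target, proxy)               # tie the proxy to the target
        for v in outside_nodes:
            g.add_edge(proxy, v)                # and to out-of-community nodes
    return g
```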
attack arXiv Sep 11, 2025 · Sep 2025

The Coding Limits of Robust Watermarking for Generative Models

Danilo Francati, Yevin Nikhel Goonatilake, Shubham Pawar et al. · Sapienza University of Rome · George Mason University +1 more

Proves that binary watermarks for generative models break above 50% bit corruption and demonstrates that crop-and-resize defeats real image watermarking (channel-capacity note below)

Output Integrity Attack generative vision
PDF
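
One standard way to see why 50% is a hard barrier is to view bit corruption as a binary symmetric channel; this is an illustrative argument, not necessarily the paper's exact model or proof.

```latex
% Illustrative BSC argument (not necessarily the paper's exact model):
% if each embedded watermark bit is flipped independently with probability p,
% the per-bit channel capacity is
C(p) = 1 - H_2(p) = 1 + p\log_2 p + (1-p)\log_2(1-p),
% and C(1/2) = 0: at 50% corruption the channel carries no information,
% so no coding scheme can reliably recover the watermark beyond that point.
```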
benchmark arXiv Aug 29, 2025 · Aug 2025

Revisiting Deepfake Detection: Chronological Continual Learning and the Limits of Generalization

Federico Fontana, Anxhelo Diko, Romeo Lanzino et al. · Sapienza University of Rome · University Ibn Khaldoun of Tiaret +1 more

Continual learning framework for deepfake detection adapts 155x faster than full retraining, but generalization to chronologically newer fakes remains near-random

Output Integrity Attack vision
PDF
survey arXiv Aug 19, 2025 · Aug 2025

On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions

Daniel M. Jimenez-Gutierrez, Yelizaveta Falkouskaya, Jose L. Hernandez-Ramos et al. · Sapienza University of Rome

Surveys 200+ papers on FL security and privacy: Byzantine/poisoning attacks, backdoors, gradient leakage, and defenses

Data Poisoning Attack Model Poisoning Model Inversion Attack federated-learning
PDF