defense 2025

Inverse Language Modeling towards Robust and Grounded LLMs

Davide Gabrielli, Simone Sestito, Iacopo Masi

0 citations · 26 references · arXiv


Published on arXiv: 2510.01929

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ILM provides a unified framework that simultaneously improves LLM robustness to input perturbations and enables grounding by recovering potentially toxic input triggers from model outputs.

ILM (Inverse Language Modeling)

Novel technique introduced


The current landscape of defensive mechanisms for LLMs is fragmented and underdeveloped compared to prior work on classifiers. To further promote adversarial robustness in LLMs, we propose Inverse Language Modeling (ILM), a unified framework that 1) improves the robustness of LLMs to input perturbations and 2) enables native grounding by inverting model outputs to identify potentially toxic or unsafe input triggers. ILM transforms LLMs from static generators into analyzable and robust systems, potentially aiding red teaming. ILM can lay the foundation for next-generation LLMs that are not only robust and grounded but also fundamentally more controllable and trustworthy. The code is publicly available at github.com/davegabe/pag-llm.


Key Contributions

  • Inverse Language Modeling (ILM) framework that simultaneously improves LLM robustness to adversarial input perturbations and enables native grounding by inverting outputs to identify unsafe triggers
  • Transforms LLMs from static generators into analyzable systems that can support red teaming by revealing input triggers behind model outputs
  • Unified defense applicable to both adversarial robustness (GCG-style attacks) and toxic/unsafe prompt identification
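The core idea behind inversion can be illustrated with a toy model. The sketch below recovers an input vector from the output of a fixed linear map standing in for a frozen LLM, by gradient descent on a reconstruction loss. This is only a minimal illustration of the output-inversion concept; the function name `invert_output`, the linear stand-in, and the squared-error loss are assumptions for exposition, not the paper's actual method or API.

```python
import numpy as np

def invert_output(W, y, steps=500, lr=0.1):
    """Toy inversion: recover a candidate input x such that W @ x ~= y,
    by gradient descent on the squared reconstruction error ||W x - y||^2.
    In ILM terms, y plays the role of an observed output and x the
    (unknown) input trigger we want to ground it in."""
    rng = np.random.default_rng(0)
    x = rng.normal(size=W.shape[1])          # random initial guess
    for _ in range(steps):
        grad = 2.0 * W.T @ (W @ x - y)       # gradient of the squared error
        x -= lr * grad
    return x

# Hypothetical "model" and hidden input; the inversion recovers the input
# from the output alone because the map has full column rank.
W = np.array([[1.0, 0.5],
              [0.0, 2.0],
              [1.0, 1.0]])
x_true = np.array([0.3, -0.7])
y = W @ x_true
x_rec = invert_output(W, y)
```

Real LLM inversion is far harder: the forward map is nonlinear and inputs are discrete tokens, which is what makes a trained-in inverse capability, as ILM proposes, attractive over post-hoc optimization.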

🛡️ Threat Analysis

Input Manipulation Attack

The paper explicitly targets robustness to input perturbations and includes a GCG (Greedy Coordinate Gradient) appendix, indicating defense against gradient-based adversarial suffix attacks on LLMs — the canonical ML01 threat for LLMs.
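To make the threat concrete, the sketch below mimics the greedy "pick the best single-token substitution" loop of GCG-style suffix attacks over a toy loss. Real GCG uses gradients through the target LLM to rank candidate swaps; here the vocabulary, the `toy_loss` target, and the brute-force scoring are all hypothetical simplifications that keep only the coordinate-wise greedy structure.

```python
VOCAB = list("abcde")

def toy_loss(suffix):
    # Hypothetical stand-in for the attack objective: Hamming distance to a
    # fixed target string (a real attack would score the LLM's output).
    target = "ace"
    return sum(s != t for s, t in zip(suffix, target))

def greedy_coordinate_search(length=3, iters=10):
    """Greedy coordinate descent over discrete tokens: at each step, try
    every single-position substitution and keep the one that lowers the
    loss most, stopping when no swap improves."""
    suffix = [VOCAB[0]] * length             # start from an all-'a' suffix
    for _ in range(iters):
        best = (toy_loss(suffix), None, None)
        for pos in range(length):            # every coordinate...
            for tok in VOCAB:                # ...every candidate token
                cand = list(suffix)
                cand[pos] = tok
                loss = toy_loss(cand)
                if loss < best[0]:
                    best = (loss, pos, tok)
        if best[1] is None:                  # no improving swap: converged
            break
        suffix[best[1]] = best[2]
    return "".join(suffix)
```

A defense in the ILM spirit targets exactly this attack surface: if outputs can be inverted back to the suffixes that elicited them, such optimized triggers become recoverable rather than opaque.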


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, digital
Applications
large language models, red teaming, llm safety