defense · 2025

CluCERT: Certifying LLM Robustness via Clustering-Guided Denoising Smoothing

Zixia Wang, Gaojie Jin, Jia Hu, Ronghui Mu

0 citations · 41 references · arXiv


Published on arXiv: 2512.08967

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

CluCERT achieves tighter certified robustness bounds and lower computational cost than existing certified LLM robustness methods under synonym substitution attacks across multiple downstream tasks and jailbreak scenarios.

CluCERT

Novel technique introduced


Recent advancements in Large Language Models (LLMs) have led to their widespread adoption in daily applications. Despite their impressive capabilities, they remain vulnerable to adversarial attacks, as even minor meaning-preserving changes such as synonym substitutions can lead to incorrect predictions. As a result, certifying the robustness of LLMs against such adversarial prompts is of vital importance. Existing approaches rely on word deletion or simple denoising strategies to achieve robustness certification. However, these methods face two critical limitations: (1) they yield loose robustness bounds due to the lack of semantic validation for perturbed outputs, and (2) they suffer from high computational costs due to repeated sampling. To address these limitations, we propose CluCERT, a novel framework for certifying LLM robustness via clustering-guided denoising smoothing. Specifically, to achieve tighter certified bounds, we introduce a semantic clustering filter that reduces noisy samples and retains meaningful perturbations, supported by theoretical analysis. Furthermore, we enhance computational efficiency through two mechanisms: a refine module that extracts core semantics, and a fast synonym substitution strategy that accelerates the denoising process. Finally, we conduct extensive experiments on various downstream tasks and jailbreak defense scenarios. Experimental results demonstrate that our method outperforms existing certified approaches in both robustness bounds and computational efficiency.
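The abstract's pipeline (perturb via synonym substitution, filter out semantically noisy samples, then vote) can be sketched in miniature. This is an illustrative simplification, not CluCERT's implementation: the synonym table, the token-overlap filter (standing in for the paper's embedding-based semantic clustering), and all names and thresholds are assumptions.

```python
import random
from collections import Counter

# Toy synonym lexicon standing in for the synonym-substitution
# perturbation space (an assumption; the real lexicon is much larger).
SYNONYMS = {
    "good": ["fine", "great"],
    "movie": ["film", "picture"],
    "bad": ["poor", "awful"],
}

def perturb(tokens, rate=0.5, rng=random):
    """Randomly replace tokens with synonyms (the smoothing noise)."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < rate
            else t
            for t in tokens]

def semantic_filter(samples, reference, min_overlap=0.5):
    """Simplified stand-in for the semantic clustering filter: keep only
    perturbed samples whose token overlap with the original prompt stays
    above a threshold, discarding semantically drifted noise."""
    ref = set(reference)
    return [s for s in samples
            if len(ref & set(s)) / max(len(ref), 1) >= min_overlap]

def smoothed_predict(classifier, tokens, n=100, seed=0):
    """Majority vote of the base classifier over filtered perturbations
    (the smoothed classifier); returns the label and its vote share."""
    rng = random.Random(seed)
    samples = [perturb(tokens, rng=rng) for _ in range(n)]
    samples = semantic_filter(samples, tokens)
    votes = Counter(classifier(s) for s in samples)
    label, count = votes.most_common(1)[0]
    return label, count / len(samples)
```

For example, with a toy classifier that labels any "good"-family token as positive, `smoothed_predict(clf, ["good", "movie"])` returns a stable majority label even though individual samples differ, which is the behavior the certified bound is then computed over.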


Key Contributions

  • Semantic clustering filter that retains meaningful perturbations and tightens certified robustness bounds, with formal theoretical analysis via a recovery factor γ
  • Refine module and fast synonym substitution strategy that reduce repeated sampling costs and improve computational efficiency
  • Empirical validation on multiple downstream NLP tasks and jailbreak defense scenarios, outperforming prior certified approaches on both bound tightness and efficiency

🛡️ Threat Analysis

Input Manipulation Attack

The paper's core contribution is certified robustness, a defense type explicitly listed under ML01: it provides provable guarantees against adversarial input perturbations (word substitutions) at inference time via an adaptation of randomized smoothing.
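In randomized smoothing, certification reduces to lower-bounding the smoothed top-class probability and checking it clears a threshold (commonly 1/2 in discrete substitution settings). The sketch below uses a generic Hoeffding bound for illustration; it is not CluCERT's bound, which additionally involves the recovery factor γ, and the function name and α default are assumptions.

```python
import math

def certified_majority(votes_top, n, alpha=0.001):
    """Generic randomized-smoothing certification check.

    votes_top: votes for the top class among n filtered samples.
    Returns (certified, p_lb), where p_lb is a Hoeffding lower
    confidence bound on the top-class probability; certification
    here requires p_lb > 1/2.
    """
    p_hat = votes_top / n
    p_lb = p_hat - math.sqrt(math.log(1 / alpha) / (2 * n))
    return p_lb > 0.5, p_lb
```

Tighter filters (such as the paper's semantic clustering) raise the empirical vote share `votes_top / n`, which directly raises `p_lb` and hence the attainable certified bound.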


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
inference_time · black_box
Applications
text classification · jailbreak defense · llm safety