Dependable Artificial Intelligence with Reliability and Security (DAIReS): A Unified Syndrome Decoding Approach for Hallucination and Backdoor Trigger Detection
Hema Karnam Surendrababu, Nithin Nagaraj
Published on arXiv (2602.06532)
Model Poisoning
OWASP ML Top 10 — ML10
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
A single syndrome decoding framework applied in sentence-embedding space can simultaneously detect backdoor-poisoned training samples and semantically incoherent (hallucinated) LLM outputs from self-referential prompts.
DAIReS (Syndrome Decoding for ML Security and Reliability)
Novel technique introduced
Machine Learning (ML) models, including Large Language Models (LLMs), are characterized by a range of system-level attributes such as security and reliability. Recent studies have demonstrated that ML models are vulnerable to multiple forms of security violations, among which backdoor data-poisoning attacks represent a particularly insidious threat, enabling unauthorized model behavior and systematic misclassification. In parallel, deficiencies in model reliability can manifest as hallucinations in LLMs, leading to unpredictable outputs and substantial risks for end users. In this work on Dependable Artificial Intelligence with Reliability and Security (DAIReS), we propose a novel unified approach based on Syndrome Decoding for the detection of both security and reliability violations in learning-based systems. Specifically, we adapt the syndrome decoding approach to the NLP sentence-embedding space, enabling the discrimination of poisoned and non-poisoned samples within ML training datasets. Additionally, the same methodology can effectively detect hallucinated content arising from self-referential meta-explanation tasks in LLMs.
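To make the borrowed mechanism concrete, the sketch below illustrates classical syndrome decoding from linear block codes, which the paper adapts to embedding space. This is a toy illustration only: it sign-binarizes a hypothetical embedding vector and checks it against the parity-check matrix of a (7,4) Hamming code; the paper's actual mapping from sentence embeddings to code space, and its choice of code, are not reproduced here.

```python
# Toy illustration of syndrome decoding on a binarized "embedding".
# Assumptions (not from the paper): sign binarization, Hamming(7,4) code.

# Parity-check matrix H of the (7,4) Hamming code
# (column i is the binary representation of i).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def binarize(embedding):
    """Sign-binarize a real-valued embedding into a bit vector."""
    return [1 if x > 0 else 0 for x in embedding]

def syndrome(H, bits):
    """Compute s = H * bits (mod 2); s == [0,0,0] means 'in the code'."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

# A toy "clean" embedding whose binarization is a valid codeword.
clean = binarize([-0.4, 0.9, 0.2, -0.7, -0.1, 0.8, 0.3])
# The same embedding with one coordinate's sign flipped,
# standing in for a trigger-induced perturbation.
corrupted = binarize([-0.4, 0.9, 0.2, -0.7, 0.1, 0.8, 0.3])

print(syndrome(H, clean))      # [0, 0, 0] -> consistent, not flagged
print(syndrome(H, corrupted))  # [1, 0, 1] -> nonzero syndrome flags the sample
```

The key property carried over to the ML setting is that a nonzero syndrome localizes a deviation from the expected structure without needing the original "transmitted" vector, which is what allows the same check to flag both poisoned training samples and incoherent outputs.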
Key Contributions
- Novel syndrome decoding approach (from linear block codes) adapted to NLP sentence-embedding space for discriminating poisoned from non-poisoned training samples across static, paraphrase, and semantic backdoor trigger types
- Unified framework that extends the same syndrome decoding methodology to detect hallucinations arising from self-referential meta-explanation tasks in LLMs
- Quantitative measure of semantic degeneration in LLM outputs as a proxy for detecting reliability failures (LLM collapse)
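The third contribution quantifies semantic degeneration in generated text. The paper's exact metric is not reproduced here; the sketch below uses one plausible stand-in, the mean cosine similarity between consecutive sentence embeddings, where scores near 1.0 indicate the model is emitting near-duplicate sentences (a symptom of collapse).

```python
# Illustrative degeneration proxy (assumption: not the paper's metric).
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def degeneration_score(embeddings):
    """Mean cosine similarity of consecutive sentence embeddings.
    Scores near 1.0 suggest repetitive, semantically degenerate output."""
    sims = [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]
    return sum(sims) / len(sims)

# Toy sentence embeddings: a varied passage vs. a near-repetitive one.
varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
repetitive = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]

print(degeneration_score(varied))      # 0.0 -> healthy variation
print(degeneration_score(repetitive))  # ~0.99 -> flagged as degenerate
```

A threshold on such a score would give the binary reliability-violation signal described above; the threshold itself would need calibration on known-good outputs.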
🛡️ Threat Analysis
Primary contribution is detecting backdoor triggers (static, paraphrase, and semantic) in ML training datasets by adapting syndrome decoding to NLP sentence-embedding space, a direct defense against backdoor/trojan attacks.
Secondary contribution applies the same syndrome decoding methodology to detect semantic incoherence in LLM outputs caused by self-referential meta-explanation prompts, a form of output integrity verification that quantifies 'semantic degeneration' in generated text.