Dependable Artificial Intelligence with Reliability and Security (DAIReS): A Unified Syndrome Decoding Approach for Hallucination and Backdoor Trigger Detection
Hema Karnam Surendrababu, Nithin Nagaraj
Published on arXiv (2602.06532)
Model Poisoning
OWASP ML Top 10 — ML10
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
A single syndrome decoding framework applied in sentence-embedding space can simultaneously detect backdoor-poisoned training samples and semantically incoherent (hallucinated) LLM outputs from self-referential prompts.
DAIReS (Syndrome Decoding for ML Security and Reliability)
Novel technique introduced
Machine Learning (ML) models, including Large Language Models (LLMs), are characterized by a range of system-level attributes such as security and reliability. Recent studies have demonstrated that ML models are vulnerable to multiple forms of security violations, among which backdoor data-poisoning attacks represent a particularly insidious threat, enabling unauthorized model behavior and systematic misclassification. In parallel, deficiencies in model reliability can manifest as hallucinations in LLMs, leading to unpredictable outputs and substantial risks for end users. In this work on Dependable Artificial Intelligence with Reliability and Security (DAIReS), we propose a novel unified approach based on Syndrome Decoding for the detection of both security and reliability violations in learning-based systems. Specifically, we adapt the syndrome decoding approach to the NLP sentence-embedding space, enabling the discrimination of poisoned and non-poisoned samples within ML training datasets. Additionally, the same methodology can effectively detect hallucinated content arising from self-referential meta-explanation tasks in LLMs.
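To make the borrowed mechanism concrete, the sketch below illustrates classical syndrome decoding from linear block codes, which the paper adapts to embedding space. This is a toy illustration only: it sign-binarizes a hypothetical embedding vector and checks it against the parity-check matrix of a (7,4) Hamming code; the paper's actual mapping from sentence embeddings to code space, and its choice of code, are not reproduced here.

```python
# Toy illustration of syndrome decoding on a binarized "embedding".
# Assumptions (not from the paper): sign binarization, Hamming(7,4) code.

# Parity-check matrix H of the (7,4) Hamming code
# (column i is the binary representation of i).
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def binarize(embedding):
    """Sign-binarize a real-valued embedding into a bit vector."""
    return [1 if x > 0 else 0 for x in embedding]

def syndrome(H, bits):
    """Compute s = H * bits (mod 2); s == [0,0,0] means 'in the code'."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

# A toy "clean" embedding whose binarization is a valid codeword.
clean = binarize([-0.4, 0.9, 0.2, -0.7, -0.1, 0.8, 0.3])
# The same embedding with one coordinate's sign flipped,
# standing in for a trigger-induced perturbation.
corrupted = binarize([-0.4, 0.9, 0.2, -0.7, 0.1, 0.8, 0.3])

print(syndrome(H, clean))      # [0, 0, 0] -> consistent, not flagged
print(syndrome(H, corrupted))  # [1, 0, 1] -> nonzero syndrome flags the sample
```

The key property carried over to the ML setting is that a nonzero syndrome localizes a deviation from the expected structure without needing the original "transmitted" vector, which is what allows the same check to flag both poisoned training samples and incoherent outputs.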
Key Contributions
- Novel syndrome decoding approach (from linear block codes) adapted to NLP sentence-embedding space for discriminating poisoned from non-poisoned training samples across static, paraphrase, and semantic backdoor trigger types
- Unified framework that extends the same syndrome decoding methodology to detect hallucinations arising from self-referential meta-explanation tasks in LLMs
- Quantitative measure of semantic degeneration in LLM outputs as a proxy for detecting reliability failures (LLM collapse)
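The third contribution quantifies semantic degeneration in generated text. The paper's exact metric is not reproduced here; the sketch below uses one plausible stand-in, the mean cosine similarity between consecutive sentence embeddings, where scores near 1.0 indicate the model is emitting near-duplicate sentences (a symptom of collapse).

```python
# Illustrative degeneration proxy (assumption: not the paper's metric).
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def degeneration_score(embeddings):
    """Mean cosine similarity of consecutive sentence embeddings.
    Scores near 1.0 suggest repetitive, semantically degenerate output."""
    sims = [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]
    return sum(sims) / len(sims)

# Toy sentence embeddings: a varied passage vs. a near-repetitive one.
varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
repetitive = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]

print(degeneration_score(varied))      # 0.0 -> healthy variation
print(degeneration_score(repetitive))  # ~0.99 -> flagged as degenerate
```

A threshold on such a score would give the binary reliability-violation signal described above; the threshold itself would need calibration on known-good outputs.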
🛡️ Threat Analysis
Primary contribution is detecting backdoor triggers (static, paraphrase, and semantic) in ML training datasets by adapting syndrome decoding to NLP sentence-embedding space, a direct defense against backdoor/trojan attacks.
Secondary contribution applies the same syndrome decoding methodology to detect semantic incoherence in LLM outputs caused by self-referential meta-explanation prompts, a form of output integrity verification that quantifies 'semantic degeneration' in generated text.