Latest papers

4 papers
attack arXiv Jan 29, 2026 · 9w ago

The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

Devanshu Sahoo, Manish Prasad, Vasudev Majhi et al. · BITS Pilani · Trustwise +1 more

Embeds adversarial directives in AST comment nodes to hijack LLM-based code graders, achieving >95% manipulation success across 9 SOTA models

Prompt Injection nlp
PDF
attack arXiv Dec 11, 2025 · Dec 2025

When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Devanshu Sahoo, Manish Prasad, Vasudev Majhi et al. · BITS Pilani · KIIT University

Indirect prompt injection via adversarial PDF manipulation flips LLM paper-review decisions from Reject to Accept at up to 86% success rate

Prompt Injection nlp
2 citations PDF Code
defense arXiv Sep 16, 2025 · Sep 2025

Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

Shaz Furniturewala, Arkaitz Zubiaga · BITS Pilani · Queen Mary University of London

Defends toxicity classifiers against adversarial text attacks by identifying and suppressing vulnerable attention heads via mechanistic interpretability

Input Manipulation Attack nlp
PDF
defense arXiv Aug 4, 2025 · Aug 2025

TRACEALIGN -- Tracing the Drift: Attributing Alignment Failures to Training-Time Belief Sources in LLMs

Amitava Das, Vinija Jain, Aman Chadha · BITS Pilani · Meta AI +1 more

Traces LLM alignment failures to training corpus sources and defends against jailbreaks via inference filters, DPO regularization, and provenance-aware decoding

Prompt Injection nlp
PDF Code