Latest papers

2 papers
attack · arXiv · Jan 1, 2026

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

Zongwei Wang, Bincheng Gu, Hongyu Yu et al. · Chongqing University · The University of Queensland +2 more

Belief Poisoning Attack corrupts LLM agent profiles and memory to make agents treat humans as an outgroup, bypassing human-oriented safety behaviors

Prompt Injection · Excessive Agency · nlp
PDF · Code
benchmark · COLING · Jan 5, 2025

Tougher Text, Smarter Models: Raising the Bar for Adversarial Defence Benchmarks

Yang Wang, Chenghua Lin · The University of Sheffield · Automated Analytics +1 more

Benchmarks textual adversarial defences across NLP tasks and proposes TTSO++, a calibration-based defence variant

Input Manipulation Attack · nlp
4 citations · PDF · Code