Latest papers

3 papers
defense arXiv Mar 3, 2026

Understanding and Mitigating Dataset Corruption in LLM Steering

Cullen Anderson, Narmeen Oozeer, Foad Namjoo et al. · University of Massachusetts Amherst · Martian AI +2 more

Analyzes adversarial data poisoning of LLM contrastive steering datasets and defends with robust mean estimation (see the sketch below)

Data Poisoning Attack · Training Data Poisoning · nlp
PDF
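The defense summarized above aggregates contrastive activation differences with a robust mean estimator instead of a plain average. A minimal sketch, assuming a simple trimmed-mean estimator over per-pair activation differences; the function name, trim fraction, and stand-in activations are illustrative, not the paper's exact method.

```python
# Illustrative trimmed-mean defense for contrastive steering vectors
# (assumption: not necessarily the paper's exact estimator).
import numpy as np

def trimmed_mean_steering_vector(pos_acts, neg_acts, trim_frac=0.1):
    """pos_acts, neg_acts: (n_pairs, d_model) hidden activations for the
    positive/negative prompt of each contrastive pair."""
    diffs = pos_acts - neg_acts                     # per-pair steering directions
    center = np.median(diffs, axis=0)               # robust initial center
    dists = np.linalg.norm(diffs - center, axis=1)  # distance of each pair from that center
    keep = np.argsort(dists)[: int(len(diffs) * (1 - trim_frac))]
    return diffs[keep].mean(axis=0)                 # average only the closest pairs

# Hypothetical usage with random stand-in activations:
rng = np.random.default_rng(0)
pos = rng.normal(size=(200, 4096))
neg = rng.normal(size=(200, 4096))
steering_vector = trimmed_mean_steering_vector(pos, neg)
```

Poisoned pairs tend to produce outlying difference vectors, so dropping the farthest pairs before averaging limits how far an attacker can shift the estimated steering direction.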
defense arXiv Nov 24, 2025

Prompt Fencing: A Cryptographic Approach to Establishing Security Boundaries in Large Language Model Prompts

Steven Peh · Thoughtworks

Cryptographic prompt signing defense reduces LLM prompt injection success rates from 86.7% to 0% across 300 attacks (see the sketch below)

Prompt Injection nlp
1 citation PDF
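Prompt fencing draws its security boundary by having the application cryptographically sign the prompt segments it trusts, so injected text cannot claim instruction status. A minimal sketch, assuming an HMAC over each segment and its role; the fence markup, function names, and key handling are illustrative and may differ from the paper's construction.

```python
# Illustrative HMAC-based prompt fence (assumption: the paper's exact scheme may differ).
import hmac
import hashlib
import secrets

KEY = secrets.token_bytes(32)  # hypothetical per-session signing key held by the application

def fence(segment: str, role: str) -> str:
    """Wrap a trusted segment with a signature binding its content and role."""
    tag = hmac.new(KEY, f"{role}:{segment}".encode(), hashlib.sha256).hexdigest()
    return f"<fence role={role} sig={tag}>{segment}</fence>"

def verify(segment: str, role: str, sig: str) -> bool:
    """Accept a segment as trusted only if its signature checks out."""
    expected = hmac.new(KEY, f"{role}:{segment}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

# Untrusted retrieved text cannot forge a valid signature, so any instruction-like
# content it contains stays outside the trusted boundary.
trusted_segment = fence("Summarize the document below.", role="system")
```

Because only application-signed segments verify, a pre-processing step or the serving layer can refuse to treat unverified text as instructions.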
attack arXiv Sep 7, 2025

Beyond "I'm Sorry, I Can't": Dissecting Large Language Model Refusal

Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah et al. · Singapore University of Technology and Design · Nanyang Technological University +2 more

Ablates sparse autoencoder (SAE) latent features that mediate refusal in LLMs, producing mechanistically grounded jailbreaks via a three-stage pipeline (see the sketch below)

Prompt Injection nlp
PDF
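Refusal ablation of this kind typically removes a selected SAE latent's contribution from the model's residual stream. A minimal sketch, assuming a ReLU SAE and a single pre-selected latent index; the function signature and shapes are illustrative, and the paper's three-stage latent-selection pipeline is not reproduced here.

```python
# Illustrative ablation of one SAE latent from a residual-stream activation
# (assumption: ReLU SAE, latent index chosen by some upstream selection step).
import numpy as np

def ablate_sae_latent(resid, encoder_w, encoder_b, decoder_w, latent_idx):
    """resid: (d_model,) activation; encoder_w: (d_model, d_sae);
    encoder_b: (d_sae,); decoder_w: (d_sae, d_model)."""
    latents = np.maximum(resid @ encoder_w + encoder_b, 0.0)    # SAE latent activations
    contribution = latents[latent_idx] * decoder_w[latent_idx]  # this latent's share of the reconstruction
    return resid - contribution                                 # activation with the refusal latent removed
```

Running the forward pass with the patched activation in place of the original is what turns the ablation into a jailbreak when the chosen latent mediates refusal.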