Latest papers

8 papers
benchmark · arXiv · Mar 30, 2026

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Haochuan Kevin Wang · Massachusetts Institute of Technology

Stage-level prompt injection benchmark tracking cryptographic canaries across four kill-chain stages in multi-agent systems

Prompt Injection · nlp
PDF
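The canary mechanic is easy to illustrate. Below is a minimal sketch, with hypothetical stage names and a plain string-matching tracker rather than the benchmark's actual harness: plant a unique, unguessable token per kill-chain stage, then attribute an injection to the deepest stage whose canary leaks into later outputs.

```python
import hashlib
import secrets

# Hypothetical stage names; the paper's four kill-chain stages may differ.
STAGES = ["recon", "injection", "execution", "exfiltration"]

def make_canary(stage: str, run_id: str) -> str:
    """Derive a unique, unguessable canary string for one stage of one run."""
    digest = hashlib.sha256(f"{run_id}:{stage}:{secrets.token_hex(8)}".encode())
    return f"CANARY-{stage}-{digest.hexdigest()[:16]}"

def track_kill_chain(stage_outputs: dict[str, str],
                     canaries: dict[str, str]) -> dict[str, bool]:
    """Report, per stage, whether that stage's canary appears in its own
    or any later stage's output (i.e., how deep the injection propagated)."""
    reached = {}
    for i, stage in enumerate(STAGES):
        later = " ".join(stage_outputs.get(s, "") for s in STAGES[i:])
        reached[stage] = canaries[stage] in later
    return reached
```

Because detection reduces to substring matching, the same check applies uniformly across attack surfaces: tool calls, inter-agent messages, or final answers.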
benchmark · arXiv · Mar 11, 2026

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan · Harvard University · Massachusetts Institute of Technology

Derives a polynomial-to-exponential scaling law for jailbreak success under adversarial prompt injection, using spin-glass theory

Prompt Injection · nlp
PDF · Code
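The crossover in the title can be written schematically. A hedged sketch of the shape such a law takes, not the paper's exact result: with $p(n)$ the jailbreak success probability at attack effort $n$ (e.g., adversarial prompt length or number of attempts),

```latex
p(n) \;\sim\;
\begin{cases}
c\, n^{\alpha}, & n \ll n^{*} \quad \text{(polynomial regime)} \\
1 - e^{-\beta\,(n - n^{*})}, & n \gg n^{*} \quad \text{(exponential saturation)}
\end{cases}
```

where the constants and the crossover scale $n^{*}$ are what the spin-glass analysis would pin down.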
defense · arXiv · Feb 26, 2026

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar, Julianna Piskorz, David D. Baek et al. · University of Cambridge · Massachusetts Institute of Technology +4 more

Formalizes and detects steganographic reasoning in LLMs, through which misaligned models evade AI oversight via covert output signals

Output Integrity Attack · Excessive Agency · nlp
PDF
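A decision-theoretic monitor can be sketched as a hypothesis test. This is a generic Neyman-Pearson-style illustration under assumptions of my own, not the paper's formalism: score outputs by their surprisal under a trusted reference model and flag anomalous drift from the benign baseline, since a covert channel must encode extra bits somewhere.

```python
def avg_surprisal(token_logprobs: list[float]) -> float:
    """Mean negative log-probability of the output under a trusted reference model."""
    return -sum(token_logprobs) / max(len(token_logprobs), 1)

def stego_monitor(token_logprobs: list[float],
                  benign_mean: float,
                  benign_std: float,
                  z_threshold: float = 3.0) -> bool:
    """Flag an output whose surprisal deviates anomalously from the benign baseline.

    Intuition: a covert channel carries extra information, which shows up as a
    distributional shift relative to purely task-driven text.
    """
    z = (avg_surprisal(token_logprobs) - benign_mean) / max(benign_std, 1e-9)
    return abs(z) > z_threshold
```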
defense · IACR ePrint · Dec 9, 2025

Improved Pseudorandom Codes from Permuted Puzzles

Miranda Christ, Noah Golowich, Sam Gunn et al. · Columbia University · Microsoft Research +5 more

Constructs provably robust LLM watermarks with subexponential security, surviving worst-case edits and detection-key-aware adversaries

Output Integrity Attack · nlp
PDF
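For flavor, a toy pseudorandom-code watermark: a keyed PRF stream is written into low-entropy token choices and recovered by a correlation test. This is the textbook construction, not the paper's permuted-puzzles scheme, and all names are illustrative.

```python
import hashlib

def prf_bit(key: bytes, i: int) -> int:
    """Keyed pseudorandom bit for position i."""
    return hashlib.sha256(key + i.to_bytes(8, "big")).digest()[0] & 1

def embed(key: bytes, n_free_slots: int) -> list[int]:
    """Pseudorandom bit stream to write into n 'free' low-entropy token choices."""
    return [prf_bit(key, i) for i in range(n_free_slots)]

def detect(key: bytes, observed_bits: list[int], threshold: float = 0.75) -> bool:
    """Declare watermarked if agreement with the keyed stream is well above 1/2.

    Robustness to edits comes from the test being a correlation rather than an
    exact match; the stronger (subexponential) guarantees the paper claims
    require real pseudorandom-code machinery beyond this toy.
    """
    n = len(observed_bits)
    agree = sum(b == prf_bit(key, i) for i, b in enumerate(observed_bits))
    return n > 0 and agree / n >= threshold
```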
attack · IEEE IoT-J · Nov 10, 2025

Adversarial Node Placement in Decentralized Federated Learning: Maximum Spanning-Centrality Strategy and Performance Analysis

Adam Piaseczny, Eric Ruzomberka, Rohit Parasnis et al. · Purdue University · Princeton University +1 more

Proposes MaxSpAN-FL, a hybrid topology-aware strategy for placing Byzantine nodes in decentralized FL to maximize model degradation

Data Poisoning Attack · federated-learning
PDF
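The placement strategy is easy to mock up on a communication graph. A sketch using betweenness centrality as a stand-in for the paper's spanning-centrality measure (networkx has no built-in for the latter); the graph and budget are illustrative.

```python
import networkx as nx

def place_adversaries(graph: nx.Graph, budget: int) -> list:
    """Pick the `budget` most central positions for Byzantine nodes.

    MaxSpAN-FL ranks candidate placements by a spanning-tree-based
    centrality; betweenness is used here only as a rough proxy for
    'how much gossip traffic flows through this node'.
    """
    centrality = nx.betweenness_centrality(graph)
    return sorted(centrality, key=centrality.get, reverse=True)[:budget]

# Example: a 20-node random geometric topology with 2 adversarial slots.
g = nx.random_geometric_graph(20, 0.35, seed=0)
print(place_adversaries(g, budget=2))
```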
survey · arXiv · Oct 21, 2025

Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption

Yepeng Liu, Xuandong Zhao, Dawn Song et al. · University of California · Massachusetts Institute of Technology

Position paper arguing that LLM watermarking adoption requires incentive-aligned designs; proposes in-context watermarking for trusted-party misuse detection

Output Integrity Attack · Model Theft · nlp
2 citations · PDF
defense · arXiv · Oct 20, 2025

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth

Jiawei Zhang, Andrew Estornell, David D. Baek et al. · ByteDance · University of Chicago +2 more

Inference-time defense reintroducing alignment tokens mid-generation to block jailbreaks and adversarial prefill attacks in LLMs

Input Manipulation Attack · Prompt Injection · nlp
PDF
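The defense can be sketched as a decoding loop that periodically re-inserts alignment tokens. Everything below (the toy sampler, ALIGN_PREFIX, the interval) is hypothetical scaffolding, not the paper's actual procedure.

```python
VOCAB, EOS = 50257, 0
ALIGN_PREFIX = [101, 102, 103]   # stand-in ids for safety/alignment tokens
REINSERT_EVERY = 64              # illustrative interval, not the paper's value

def sample_token(context: list[int]) -> int:
    """Toy deterministic stand-in for a real LLM's next-token sampler."""
    return (sum(context) * 2654435761 + len(context)) % VOCAB

def any_depth_generate(prompt_ids: list[int], max_new: int = 256) -> list[int]:
    """Decode while re-injecting alignment tokens mid-generation.

    Re-surfacing the alignment context at arbitrary depths means a jailbreak
    or adversarial prefill cannot simply 'outrun' safety behavior that would
    otherwise fire only at the start of a response.
    """
    context = list(prompt_ids)
    output: list[int] = []
    for step in range(1, max_new + 1):
        if step % REINSERT_EVERY == 0:
            context += ALIGN_PREFIX   # injected into context, not shown to user
        tok = sample_token(context)
        if tok == EOS:
            break
        context.append(tok)
        output.append(tok)
    return output
```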
benchmark · arXiv · Aug 26, 2025

Reliable Weak-to-Strong Monitoring of LLM Agents

Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu et al. · Scale AI · Carnegie Mellon University +1 more

Stress-tests LLM agent monitors via red-teaming and proposes hybrid scaffolding that enables reliable weak-to-strong monitoring

Excessive Agency · Prompt Injection · nlp
PDF
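Hybrid scaffolding can be sketched as score-then-escalate. The monitor callables and thresholds below are assumptions for illustration, not the paper's design: a cheap weak monitor scores each action, and only the ambiguous band escalates to a costlier full-trajectory review.

```python
from typing import Callable

Action = str
Trajectory = list[Action]

def hybrid_monitor(trajectory: Trajectory,
                   weak_score: Callable[[Action], float],
                   strong_review: Callable[[Trajectory], bool],
                   low: float = 0.2,
                   high: float = 0.8) -> bool:
    """Return True if the trajectory is judged unsafe.

    Weak-to-strong idea: the weak monitor settles clear-cut cases cheaply;
    only the uncertain band [low, high] pays for the strong reviewer, which
    sees the whole trajectory instead of isolated actions.
    """
    if not trajectory:
        return False
    worst = max(weak_score(a) for a in trajectory)
    if worst >= high:
        return True                     # confidently unsafe
    if worst <= low:
        return False                    # confidently safe
    return strong_review(trajectory)    # ambiguous: escalate
```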