Latest papers

13 papers
benchmark arXiv Mar 11, 2026 · 26d ago

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan · Harvard University · Massachusetts Institute of Technology

Derives polynomial-to-exponential scaling law for jailbreak success under adversarial prompt injection using spin-glass theory

Prompt Injection nlp
PDF Code
benchmark arXiv Feb 23, 2026 · 6w ago

Agents of Chaos

Natalie Shapira, Chris Wendler, Avery Yen et al. · Northeastern University · Independent Researcher +11 more

Red-teams live autonomous LLM agents over two weeks, documenting 11 case studies of dangerous failures including system takeover, DoS, and sensitive data disclosure

Excessive Agency Prompt Injection Insecure Plugin Design nlp
3 citations PDF
attack arXiv Feb 18, 2026 · 6w ago

Narrow fine-tuning erodes safety alignment in vision-language agents

Idhant Gulati, Shivam Raval · University of California · Harvard University

LoRA fine-tuning VLMs on narrow harmful datasets causes emergent safety misalignment that generalizes across modalities, with multimodal evaluation revealing 70% misalignment at rank 128

Transfer Learning Attack Prompt Injection multimodal vision nlp
PDF
defense arXiv Feb 6, 2026 · 8w ago

Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Cheol Woo Kim, Davin Choo, Tzeh Yuan Neoh et al. · Harvard University

Proposes Stackelberg Security Games as a unifying framework for strategic AI oversight against data poisoning, evaluation manipulation, and deployment attacks

Data Poisoning Attack Model Skewing Training Data Poisoning nlp reinforcement-learning
PDF
defense arXiv Feb 6, 2026 · 8w ago

ArcMark: Multi-bit LLM Watermark via Optimal Transport

Atefeh Gilani, Carol Xuan Long, Sajani Vithana et al. · Arizona State University · Harvard University

Derives information-theoretic capacity of multi-bit LLM watermarking and proposes ArcMark, a capacity-achieving distortion-free scheme via optimal transport

Output Integrity Attack nlp
PDF
attack arXiv Feb 6, 2026 · 8w ago

TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao et al. · Harvard University · Daegu Gyeongbuk Institute of Science and Technology +1 more

RL-based black-box jailbreak framework that reweights historical vulnerability signals to attack LLMs more efficiently

Prompt Injection nlp
PDF
survey arXiv Jan 14, 2026 · 11w ago

The Promptware Kill Chain: How Prompt Injections Gradually Evolved Into a Multistep Malware Delivery Mechanism

Oleg Brodt, Elad Feldman, Bruce Schneier et al. · Ben-Gurion University of the Negev · Tel Aviv University +2 more

Surveys 36 LLM attack incidents and proposes a seven-stage promptware kill chain mapping prompt injection to multi-step malware delivery

Prompt Injection Excessive Agency nlp
PDF
benchmark arXiv Jan 6, 2026 · Jan 2026

Topology-Independent Robustness of the Weighted Mean under Label Poisoning Attacks in Heterogeneous Decentralized Learning

Jie Peng, Weiyu Li, Stefan Vlaski et al. · Sun Yat-Sen University · Harvard University +1 more

Theoretically proves the weighted mean aggregator can outperform robust aggregators under label poisoning in decentralized learning, exposing their topology-dependent weaknesses

Data Poisoning Attack federated-learning
PDF
attack arXiv Nov 19, 2025 · Nov 2025

When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

Zhaoxin Zhang, Borui Chen, Yiming Hu et al. · City University of Macau · University of Vienna +3 more

Novel LLM jailbreak using conceptual morphology triggers to shift ideological orientation in outputs without triggering safety filters

Prompt Injection nlp
PDF
defense arXiv Nov 4, 2025 · Nov 2025

Verifying LLM Inference to Detect Model Weight Exfiltration

Roy Rinberg, Adam Karvonen, Alexander Hoover et al. · Harvard University · ML Alignment & Theory Scholars (MATS) +2 more

Defends against LLM weight theft via steganographic output channels by verifying inference non-determinism, achieving >200x adversary slowdown

Model Theft nlp
2 citations PDF
attack arXiv Oct 20, 2025 · Oct 2025

Agentic Reinforcement Learning for Search is Unsafe

Yushi Yang, Shreyansh Padarha, Andrew Lee et al. · University of Oxford · Harvard University

Discovers two simple prompt-level attacks that bypass safety in RL-trained LLM search agents by triggering search before refusal tokens

Prompt Injection Excessive Agency nlp reinforcement-learning
1 citation PDF
defense arXiv Sep 29, 2025 · Sep 2025

Incentive-Aligned Multi-Source LLM Summaries

Yanchen Jiang, Zhe Feng, Aranyak Mehta · Harvard University · Google Research

Defends LLM summarization pipelines against indirect prompt injection by scoring sources via peer prediction before synthesis

Prompt Injection nlp
PDF
benchmark arXiv Sep 16, 2025 · Sep 2025

Towards mitigating information leakage when evaluating safety monitors

Gerard Boxo, Aman Neelappa, Shivam Raval · Independent · Harvard University

Benchmarks LLM safety monitors (linear probes), revealing 10–40% AUROC inflation from textual leakage artifacts rather than genuine internal signals

Prompt Injection nlp
PDF