
Provable Adversarial Robustness in In-Context Learning

Di Zhang

0 citations · 22 references · arXiv (Cornell University)


Published on arXiv · 2602.17743

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Model robustness in ICL scales with the square root of attention head dimension (ρ_max ∝ √m), and adversarial distribution shifts require O(ρ²) additional in-context examples to maintain performance.
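The two scaling laws above can be sketched numerically. This is an illustrative toy, not the paper's derivation: the constants `c1` and `c2` are hypothetical placeholders, since the paper establishes only the proportionalities.

```python
import math

def max_tolerable_perturbation(m: float, c1: float = 1.0) -> float:
    """Maximum tolerable adversarial perturbation: rho_max ∝ sqrt(m).

    m is the attention head dimension; c1 is a hypothetical constant.
    """
    return c1 * math.sqrt(m)

def extra_examples_needed(rho: float, c2: float = 1.0) -> float:
    """Sample-complexity penalty: N_rho - N_0 ∝ rho^2.

    rho is the perturbation strength; c2 is a hypothetical constant.
    """
    return c2 * rho ** 2

# Doubling the head dimension raises rho_max by a factor of sqrt(2),
# while doubling rho quadruples the extra in-context examples needed.
print(max_tolerable_perturbation(64) / max_tolerable_perturbation(32))  # ≈ 1.414
print(extra_examples_needed(2.0) / extra_examples_needed(1.0))          # = 4.0
```

Note the asymmetry: robustness grows only as the square root of capacity, but the cost of tolerating a stronger adversary grows quadratically in ρ.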

Distributionally Robust ICL (DR-ICL)

Novel technique introduced


Abstract

Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ($\rho$), model capacity ($m$), and the number of in-context examples ($N$). The analysis reveals that model robustness scales with the square root of its capacity ($\rho_{\text{max}} \propto \sqrt{m}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($N_\rho - N_0 \propto \rho^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL's limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.


Key Contributions

  • Distributionally robust meta-learning framework for ICL using Wasserstein-ball uncertainty sets, providing worst-case guarantees against adversarial task-distribution shifts
  • Non-asymptotic bound showing maximum tolerable adversarial perturbation scales as ρ_max ∝ √m (model capacity) and the extra in-context examples needed scale as N_ρ − N₀ ∝ ρ²
  • Synthetic experiments on linear self-attention Transformers that empirically confirm the derived robustness-capacity and sample-complexity scaling laws

🛡️ Threat Analysis

Input Manipulation Attack

The paper's core contribution is a set of certified (provable) adversarial-robustness guarantees for ICL: it derives non-asymptotic worst-case bounds under Wasserstein-ball distribution shifts, directly paralleling the certified-robustness strand of ML01. The adversarial threat model is an attacker who maximally shifts the task distribution at inference time.
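This threat model can be made concrete with a toy instance, assuming the paper's linear-task setting: tasks are parameter vectors, the uncertainty set is a 2-norm ball of radius ρ around the nominal task, and inputs are isotropic so the expected squared loss of a fixed predictor `w_hat` on task `w` is `||w - w_hat||²`. Under those assumptions the worst-case task in the ball has a closed form (push the task directly away from the predictor), which the sketch below verifies. All variable names here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w0 = rng.normal(size=d)                  # nominal task center
w_hat = w0 + 0.1 * rng.normal(size=d)    # fixed predictor, close to w0
rho = 2.0                                # radius of the uncertainty ball

# Worst-case task within ||w - w0|| <= rho: move w0 away from w_hat
# along the direction (w0 - w_hat), since loss = ||w - w_hat||^2
# with isotropic inputs is maximized at the far side of the ball.
gap = w0 - w_hat
w_adv = w0 + rho * gap / np.linalg.norm(gap)

nominal_loss = np.linalg.norm(w0 - w_hat) ** 2
worst_loss = np.linalg.norm(w_adv - w_hat) ** 2

# The worst-case root-loss exceeds the nominal one by exactly rho.
assert np.isclose(np.sqrt(worst_loss), np.sqrt(nominal_loss) + rho)
```

The DR-ICL framework guards against exactly this inner maximization: the meta-learner optimizes performance under the worst task the adversary can pick inside the ball, rather than under the nominal task distribution.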


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time
Datasets
synthetic linear regression tasks
Applications
in-context learning, large language models