Provable Adversarial Robustness in In-Context Learning
Published on arXiv
2602.17743
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Model robustness in ICL scales with the square root of attention head dimension (ρ_max ∝ √m), and adversarial distribution shifts require O(ρ²) additional in-context examples to maintain performance.
Distributionally Robust ICL (DR-ICL)
Novel technique introduced
Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ($\rho$), model capacity ($m$), and the number of in-context examples ($N$). The analysis reveals that model robustness scales with the square root of its capacity ($\rho_{\text{max}} \propto \sqrt{m}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($N_\rho - N_0 \propto \rho^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL's limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.
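The two scaling laws can be sketched numerically. This is a minimal illustration, not the paper's derivation: the prefactors `C_RHO` and `C_N` are hypothetical placeholders (the bound fixes only the exponents, √m and ρ²).

```python
import math

# Hypothetical prefactors; only the exponents below come from the paper's bound.
C_RHO = 1.0   # prefactor in rho_max = C_RHO * sqrt(m)
C_N = 10.0    # prefactor in the sample-complexity penalty N_rho - N_0

def max_tolerable_shift(m: int) -> float:
    """Largest Wasserstein radius the model can absorb: rho_max ∝ sqrt(m)."""
    return C_RHO * math.sqrt(m)

def extra_examples(rho: float) -> float:
    """Additional in-context examples needed under a shift of size rho: ∝ rho^2."""
    return C_N * rho ** 2

# Quadrupling capacity doubles the tolerable shift...
assert math.isclose(max_tolerable_shift(256), 2 * max_tolerable_shift(64))
# ...while doubling the shift quadruples the extra-sample penalty.
assert math.isclose(extra_examples(2.0), 4 * extra_examples(1.0))
```

The asymmetry is the practical takeaway: robustness is bought cheaply with capacity (square root) but paid for quadratically in context length once a shift is present.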
Key Contributions
- Distributionally robust meta-learning framework for ICL using Wasserstein-ball uncertainty sets, providing worst-case guarantees against adversarial task-distribution shifts
- Non-asymptotic bound showing maximum tolerable adversarial perturbation scales as ρ_max ∝ √m (model capacity) and the extra in-context examples needed scale as N_ρ − N₀ ∝ ρ²
- Synthetic experiments on linear self-attention Transformers that empirically confirm the derived robustness-capacity and sample-complexity scaling laws
🛡️ Threat Analysis
The paper's core contribution is a certified (provable) adversarial robustness guarantee for ICL: it derives non-asymptotic worst-case bounds under Wasserstein-ball distribution shifts, directly paralleling the certified-robustness strand of ML01. The threat model assumes an attacker who, at inference time, shifts the task distribution within a Wasserstein ball of radius ρ so as to maximize the model's loss.
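A toy view of this threat model, assuming a mean-translation family of shifts (one simple way to realize a Wasserstein-ball perturbation; the paper's uncertainty set is more general): for a loss that grows quadratically in the shift, the attacker's optimal perturbation sits on the boundary of the ball.

```python
import math

def risk(shift: float, ex2: float = 1.0) -> float:
    """Excess risk of a predictor matched to the nominal task mean when the
    adversary translates the task distribution by `shift` (a toy quadratic
    model, not the paper's exact loss)."""
    return ex2 * shift ** 2

rho = 0.5  # attacker's Wasserstein budget (hypothetical value)

# The attacker maximizes risk over all shifts inside the ball [0, rho];
# a coarse grid search suffices for this monotone toy loss.
candidates = [i * rho / 100 for i in range(101)]
worst = max(candidates, key=risk)

assert math.isclose(worst, rho)  # worst case lies on the ball's boundary
```

This boundary behavior is what makes the worst-case bound tight in ρ, and it is why the sample-complexity penalty in the paper scales with ρ² rather than with some average shift.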