Algorithms for Adversarially Robust Deep Learning
Published on arXiv (arXiv:2509.19100)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Presents a unified thesis advancing adversarial robustness across three fronts: certified defenses for vision models, distributional-shift-resilient generalization, and both attack and defense algorithms for LLM jailbreaking.
Given the widespread use of deep learning models in safety-critical applications, ensuring that the decisions of such models are robust against adversarial exploitation is of fundamental importance. In this thesis, we discuss recent progress toward designing algorithms that exhibit desirable robustness properties. First, we discuss the problem of adversarial examples in computer vision, for which we introduce new technical results, training paradigms, and certification algorithms. Next, we consider the problem of domain generalization, wherein the task is to train neural networks to generalize from a family of training distributions to unseen test distributions. We present new algorithms that achieve state-of-the-art generalization in medical imaging, molecular identification, and image classification. Finally, we study the setting of jailbreaking large language models (LLMs), wherein an adversarial user attempts to design prompts that elicit objectionable content from an LLM. We propose new attacks and defenses, which represent the frontier of progress toward designing robust language-based agents.
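To make the adversarial-examples setting concrete, the sketch below shows the classic Fast Gradient Sign Method (FGSM) attack on a toy logistic-regression "model". This is a generic illustration, not the thesis's own method: the thesis concerns deep networks, and all weights, inputs, and the `eps` budget here are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_wrt_input(w, b, x, y):
    # Gradient of the binary cross-entropy loss w.r.t. the INPUT x
    # (not the weights): dL/dx = (sigmoid(w.x + b) - y) * w.
    return (sigmoid(w @ x + b) - y) * w

def fgsm(w, b, x, y, eps):
    # FGSM: step by eps in the sign of the input gradient, the
    # loss-maximizing perturbation under an L_inf budget of eps.
    return x + eps * np.sign(loss_grad_wrt_input(w, b, x, y))

# Hypothetical classifier and clean input with label y = 1.
w = np.array([2.0, -3.0])
b = 0.5
x = np.array([1.0, 0.2])
y = 1.0

x_adv = fgsm(w, b, x, y, eps=0.5)
clean_score = sigmoid(w @ x + b)    # confidence on the clean input
adv_score = sigmoid(w @ x_adv + b)  # confidence after the attack
```

With these toy numbers the attack drives the model's confidence in the correct class below 0.5, flipping the prediction while moving each input coordinate by at most `eps`.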
Key Contributions
- New technical results, adversarial training paradigms, and certification algorithms for robustness against adversarial examples in computer vision
- State-of-the-art domain generalization algorithms for medical imaging, molecular identification, and image classification
- Novel LLM jailbreaking attacks and defenses representing the frontier of robust language-based agents
🛡️ Threat Analysis
The thesis's computer-vision contributions (new adversarial-example results, adversarial training paradigms, and certification algorithms) directly address inference-time input manipulation attacks and their corresponding defenses.
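Certification complements attack-side analysis: rather than searching for a perturbation, it proves none exists within a budget. The sketch below certifies a linear classifier against all L-infinity perturbations of radius `eps`; for an affine model this bound is exact, whereas the thesis's certification algorithms target deep networks, where such bounds must be relaxed. All values here are hypothetical.

```python
import numpy as np

def certify_linear(W, b, x, y, eps):
    # For a linear classifier f(x) = W x + b, the logit margin between
    # the true class y and any other class j is affine in x, so its
    # worst case over the ball ||x' - x||_inf <= eps has a closed form:
    #   min_j margin = (W[y]-W[j]) @ x + (b[y]-b[j]) - eps * ||W[y]-W[j]||_1
    # The input is certified iff every worst-case margin stays positive.
    for j in range(W.shape[0]):
        if j == y:
            continue
        dw = W[y] - W[j]
        db = b[y] - b[j]
        worst_margin = dw @ x + db - eps * np.abs(dw).sum()
        if worst_margin <= 0:
            return False
    return True

# Hypothetical two-class linear model and an input of true class 0.
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.zeros(2)
x = np.array([2.0, 0.0])
```

Here the clean margin is 2 and the relevant weight difference has L1 norm 2, so the input is certified for any `eps` below 1 and not beyond.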