defense 2025

RLBind: Adversarial-Invariant Cross-Modal Alignment for Unified Robust Embeddings

Yuhong Lu


Published on arXiv: 2509.14383

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

RLBind consistently outperforms the LanguageBind backbone and standard fine-tuning baselines in both clean accuracy and norm-bounded adversarial robustness across Image, Audio, Thermal, and Video modalities.

RLBind

Novel technique introduced


Unified multi-modal encoders that bind vision, audio, and other sensors into a shared embedding space are attractive building blocks for robot perception and decision-making. However, on-robot deployment exposes the vision branch to adversarial and natural corruptions, making robustness a prerequisite for safety. Prior defenses typically align clean and adversarial features within CLIP-style encoders and overlook broader cross-modal correspondence, yielding modest gains and often degrading zero-shot transfer. We introduce RLBind, a two-stage adversarial-invariant cross-modal alignment framework for robust unified embeddings. Stage 1 performs unsupervised fine-tuning on clean-adversarial pairs to harden the visual encoder. Stage 2 leverages cross-modal correspondence by minimizing the discrepancy between clean/adversarial features and a text anchor, while enforcing class-wise distributional alignment across modalities. Extensive experiments on Image, Audio, Thermal, and Video data show that RLBind consistently outperforms the LanguageBind backbone and standard fine-tuning baselines in both clean accuracy and norm-bounded adversarial robustness. By improving resilience without sacrificing generalization, RLBind provides a practical path toward safer multi-sensor perception stacks for embodied robots in navigation, manipulation, and other autonomy settings.


Key Contributions

  • Two-stage adversarial-invariant cross-modal alignment framework (RLBind) that hardens the visual encoder of LanguageBind without degrading zero-shot generalization.
  • Stage 1 unsupervised fine-tuning on clean-adversarial pairs for initial robustness, followed by Stage 2 cross-modal correspondence alignment using a text anchor and class-wise distributional alignment.
  • First systematic robustness study and defense of LanguageBind across Image, Audio, Thermal, and Video modalities, outperforming prior defenses like RobustCLIP.
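The Stage 2 objectives above can be illustrated concretely. The sketch below is not from the paper; it is a minimal numpy toy, assuming precomputed unit-normalized embeddings, that shows the two loss shapes described: (1) pulling both clean and adversarial visual features toward a shared text anchor, and (2) a simple first-moment proxy for class-wise distributional alignment across modalities. Function names (`anchor_alignment_loss`, `classwise_alignment_loss`) and the exact distance choices are hypothetical; the paper's actual losses may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project feature vectors onto the unit sphere (CLIP-style embeddings)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def anchor_alignment_loss(f_clean, f_adv, f_text):
    """Discrepancy of clean AND adversarial visual features to a text anchor,
    measured as mean (1 - cosine similarity). Illustrative distance choice."""
    f_clean, f_adv, f_text = map(l2_normalize, (f_clean, f_adv, f_text))
    d_clean = 1.0 - np.sum(f_clean * f_text, axis=-1)
    d_adv = 1.0 - np.sum(f_adv * f_text, axis=-1)
    return float(np.mean(d_clean + d_adv))

def classwise_alignment_loss(feats_a, feats_b, labels):
    """Class-wise distributional alignment between two modalities, approximated
    here by the squared distance between per-class mean embeddings."""
    classes = np.unique(labels)
    loss = 0.0
    for c in classes:
        mu_a = feats_a[labels == c].mean(axis=0)
        mu_b = feats_b[labels == c].mean(axis=0)
        loss += np.sum((mu_a - mu_b) ** 2)
    return float(loss / len(classes))
```

Both terms vanish when clean and adversarial features coincide with the anchor and the per-class statistics match, which is the invariance the framework trains toward.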

🛡️ Threat Analysis

Input Manipulation Attack

RLBind is a defense against norm-bounded adversarial perturbations applied to inputs of the visual encoder of multi-modal models — a direct defense against input manipulation attacks at inference time. The two-stage framework combines adversarial training (fine-tuning on clean-adversarial pairs) with cross-modal alignment to improve empirical adversarial robustness; note that this is adversarial-training-based robustness, not a certified guarantee.
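To make the threat model concrete, the sketch below shows the kind of norm-bounded (L-infinity) input manipulation such a defense targets. It is not from the paper: a minimal PGD-style loop in numpy against a toy linear scorer, whose margin-loss gradient with respect to the input is analytic, so no autodiff framework is needed. Parameter choices (`eps`, `alpha`, `steps`) mirror common defaults but are illustrative.

```python
import numpy as np

def pgd_linf(x0, w, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD against a linear scorer f(x) = w @ x with label y in {-1, +1}.
    Margin loss is -y * (w @ x), so its input gradient is simply -y * w."""
    x = x0.copy()
    for _ in range(steps):
        grad = -y * w                        # analytic gradient of the loss wrt x
        x = x + alpha * np.sign(grad)        # ascent step on the loss (FGSM step)
        x = np.clip(x, x0 - eps, x0 + eps)   # project back into the eps-ball
        x = np.clip(x, 0.0, 1.0)             # stay in a valid pixel range
    return x
```

Each iteration increases the loss while the projection keeps the perturbation imperceptibly small, which is exactly the white-box, inference-time setting tagged below.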


Details

Domains
vision, audio, multimodal
Model Types
multimodal, transformer, vlm
Threat Tags
white_box, inference_time, digital
Applications
robot perception, multi-modal classification, autonomous navigation, robot manipulation