SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
Hongbo Wang, MaungMaung AprilPyone, Isao Echizen · The University of Tokyo · National Institute of Informatics +1 more
Neuron-level white-box defense suppresses toxic expert neurons in MLLMs, cutting harmful outputs from 48.2% to 2.5% under adversarial jailbreaks
Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late-stage, opaque, training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in both standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
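As a rough illustration of the neuron-level intervention the abstract describes, the sketch below shows how expertise-weighted soft suppression could be applied at inference time with a forward hook that scales a small set of pre-identified toxic neurons without touching model weights. The layer choice, `toxic_idx`, `expertise`, and `alpha` are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of expertise-weighted soft suppression at inference time.
# Layer choice, `toxic_idx`, `expertise`, and `alpha` are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn

def make_suppression_hook(toxic_idx: torch.Tensor,
                          expertise: torch.Tensor,
                          alpha: float = 0.9):
    """Return a forward hook that damps selected neurons.

    toxic_idx : indices of neurons flagged as toxic "experts"
    expertise : per-neuron toxicity expertise scores in [0, 1]
    alpha     : global suppression strength
    """
    def hook(module, inputs, output):
        # Soft suppression: scale each toxic neuron by (1 - alpha * expertise)
        # instead of zeroing it, so fluency-relevant signal is preserved.
        scale = torch.ones(output.shape[-1], device=output.device)
        scale[toxic_idx] = 1.0 - alpha * expertise.to(output.device)
        return output * scale  # activations only; no parameter updates
    return hook

# Usage sketch: attach to the activation of a toy MLP standing in for one
# decoder layer of an MLLM.
model = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
toxic_idx = torch.tensor([3, 17, 42])        # assumed pre-identified neurons
expertise = torch.tensor([0.8, 0.5, 0.95])   # assumed expertise scores
model[1].register_forward_hook(make_suppression_hook(toxic_idx, expertise))

with torch.no_grad():
    _ = model(torch.randn(2, 16))  # generation now runs with damped neurons
```

Soft scaling rather than hard ablation is what lets this style of intervention trade off toxicity reduction against fluency, which is consistent with the abstract's claim that reasoning quality is preserved.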
Thanh Le, Hai Duong, Yusheng Ji et al. · The Graduate University for Advanced Studies · National Institute of Informatics +2 more
Grey-box attack on DRL-based 5G schedulers uses polytope abstract domains to craft adversarial CSI inputs, degrading victim throughput by up to 70%
In 5G mobile communication systems, MU-MIMO has been applied to enhance spectral efficiency and support high data rates. To maximize spectral efficiency while providing fairness among users, the base station (BS) needs to select a subset of users for data transmission. Given that this problem is NP-hard, DRL-based methods have been proposed to infer near-optimal solutions in real time, yet this approach has an intrinsic security problem. This paper investigates how a group of adversarial users can exploit unsanitized raw CSIs to launch a throughput degradation attack. Most existing studies have focused only on systems in which adversarial users can obtain the exact values of victims' CSIs, but this is impractical for uplink transmission in LTE/5G mobile systems. We note that the DRL policy contains an observation normalizer that stores the mean and variance of the observations to improve training convergence. Adversarial users can then estimate the upper and lower bounds of the local observations, including the CSIs of victims, based solely on that observation normalizer. We develop an attack scheme, FGGM, by leveraging polytope abstract domains, a technique used to bound the outputs of a neural network given its input ranges. Our goal is to find one set of intentionally manipulated CSIs that achieves the attack goals over the whole range of victims' local observations. Experimental results demonstrate that FGGM can determine a set of adversarial CSI vectors controlled by adversarial users and then reuse those CSIs throughout the simulation to reduce the network throughput of a victim by up to 70% without knowing the exact values of victims' local observations. This study serves as a case study and can be applied to many other DRL-based problems, such as knapsack-oriented resource allocation.
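The sketch below illustrates the bound-estimation idea described in the abstract: recovering a range for the victims' unseen observations from the policy's observation normalizer statistics, then bounding the policy's outputs over that range. For simplicity it uses interval (box) bound propagation rather than the polytope abstract domains the paper employs; the k-sigma heuristic and all function names are assumptions for illustration.

```python
# Minimal sketch of the bound-estimation step behind an FGGM-style attack,
# using a simple interval (box) abstract domain instead of the paper's
# polytope domain. Names and the k-sigma heuristic are assumptions.
import numpy as np

def observation_bounds(norm_mean, norm_var, k=3.0):
    """Estimate lower/upper bounds of unseen victim observations from the
    policy's observation normalizer statistics (mean, variance)."""
    std = np.sqrt(norm_var)
    return norm_mean - k * std, norm_mean + k * std

def interval_forward(lb, ub, weights, biases):
    """Propagate an input interval [lb, ub] through a ReLU MLP and return
    element-wise bounds on the outputs (standard interval bound propagation)."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        new_lb = W_pos @ lb + W_neg @ ub + b
        new_ub = W_pos @ ub + W_neg @ lb + b
        if i < len(weights) - 1:  # ReLU on hidden layers only
            new_lb, new_ub = np.maximum(new_lb, 0.0), np.maximum(new_ub, 0.0)
        lb, ub = new_lb, new_ub
    return lb, ub

# Usage sketch: a toy 2-layer policy head over 8 observation features.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(4, 16))]
biases = [np.zeros(16), np.zeros(4)]
lb, ub = observation_bounds(np.zeros(8), np.ones(8))
out_lb, out_ub = interval_forward(lb, ub, weights, biases)
# An attacker would search for adversarial CSI entries whose output bounds keep
# the victim's selection score below competitors' across the whole range.
print(out_lb, out_ub)
```

Interval propagation is a coarser abstraction than the polytope domains named in the abstract, so the bounds here are looser; the structure of the attack, fixing one manipulated CSI set that works for every observation in the estimated range, is the same.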