benchmark arXiv Jan 21, 2026
Anmol Goel, Alan Ritter, Iryna Gurevych · Technical University of Darmstadt · National Research Center for Applied Cybersecurity ATHENE +1 more
Audits LLM unlearning via Partial Information Decomposition, revealing that residual knowledge of forgotten data persists after unlearning and correlates with susceptibility to adversarial reconstruction attacks
Model Inversion Attack Sensitive Information Disclosure nlp
We expose a critical limitation in current approaches to machine unlearning in language models: despite the apparent success of unlearning algorithms, information about the forgotten data remains linearly decodable from internal representations. To systematically assess this discrepancy, we introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our analysis reveals that redundant information, shared across both models, constitutes residual knowledge that persists post-unlearning and correlates with susceptibility to known adversarial reconstruction attacks. Leveraging these insights, we propose a representation-based risk score that can guide abstention on sensitive inputs at inference time, providing a practical mechanism to mitigate privacy leakage. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
llm transformer Technical University of Darmstadt · National Research Center for Applied Cybersecurity ATHENE · Georgia Institute of Technology
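The decomposition described in the abstract above can be illustrated on a toy discrete example. The sketch below is not the authors' estimator: it assumes the minimum-mutual-information (MMI) redundancy measure, one common PID definition, and treats `y` as a label for the forgotten data, `x1` as a discretized representation from the model before unlearning, and `x2` as the same from the model after unlearning. The function names, the key names, and the reading of the "before"-unique term as unlearned knowledge are all illustrative assumptions.

```python
from collections import defaultdict
from math import log2


def mutual_information(joint):
    """I(A;B) in bits, given a joint distribution {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def pid_mmi(joint3):
    """Decompose I(Y; X1, X2) into PID atoms under MMI redundancy.

    joint3 maps (y, x1, x2) to a probability. MMI redundancy is an
    assumption made for this sketch; the source does not state which
    redundancy measure the paper uses.
    """
    j1, j2, j12 = defaultdict(float), defaultdict(float), defaultdict(float)
    for (y, x1, x2), p in joint3.items():
        j1[(y, x1)] += p          # marginal over (Y, X1)
        j2[(y, x2)] += p          # marginal over (Y, X2)
        j12[(y, (x1, x2))] += p   # Y against the pair (X1, X2)
    i1 = mutual_information(j1)
    i2 = mutual_information(j2)
    i12 = mutual_information(j12)
    redundant = min(i1, i2)
    return {
        "redundant": redundant,                    # shared by both models
        "unique_before": i1 - redundant,           # only in the original model
        "unique_after": i2 - redundant,            # only in the unlearned model
        "synergy": i12 - i1 - i2 + redundant,      # only in the pair jointly
    }


# Toy case 1: unlearning changed nothing, so both representations copy y.
residual_case = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Toy case 2: the unlearned representation is constant and carries no information.
erased_case = {(0, 0, 0): 0.5, (1, 1, 0): 0.5}

print(pid_mmi(residual_case))
print(pid_mmi(erased_case))
```

In the first toy case the full bit of information about `y` is redundant, which under the abstract's framing corresponds to residual knowledge. In the second, the bit is unique to the pre-unlearning representation, which this sketch reads as knowledge that was removed. Real representations are continuous and high-dimensional, so an actual audit would need an estimator suited to that setting.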
attack arXiv Jan 19, 2026
Jesus-German Ortiz-Barajas, Jonathan Tonglet, Vivek Gupta et al. · INSAIT · Sofia University +3 more
Jailbreaks MLLMs via adversarial prompting to auto-generate misleading charts, reducing MLLM QA accuracy by 19.6 points in-domain and 14.9 points cross-domain, and human accuracy by 20.2 points
Prompt Injection multimodal vision nlp
Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables, enabling efficient data analysis and reporting but also introducing new misuse risks. In this work, we introduce ChartAttack, a novel framework for evaluating how MLLMs can be misused to generate misleading charts at scale. ChartAttack injects misleaders into chart designs, aiming to induce incorrect interpretations of the underlying data. Furthermore, we create AttackViz, a chart question-answering (QA) dataset where each (chart specification, QA) pair is labeled with effective misleaders and their induced incorrect answers. Experiments in in-domain and cross-domain settings show that ChartAttack significantly degrades the QA performance of MLLM readers, reducing accuracy by an average of 19.6 points and 14.9 points, respectively. A human study further shows an average 20.2 point drop in accuracy for participants exposed to misleading charts generated by ChartAttack. Our findings highlight an urgent need for robustness and security considerations in the design, evaluation, and deployment of MLLM-based chart generation systems. We make our code and data publicly available.
vlm llm multimodal INSAIT · Sofia University · TU Darmstadt +2 more