Latest papers

18 papers
benchmark arXiv Mar 25, 2026

Analysing the Safety Pitfalls of Steering Vectors

Yuxiao Li, Alina Fastowski, Efstratios Zaradoukas et al. · Technical University of Munich

Activation steering vectors systematically erode LLM safety alignment, increasing jailbreak success rates by up to 57% by interfering with refusal-behavior directions

Prompt Injection nlp
PDF
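The steering vectors this entry studies are, in their common form, fixed directions added to a model's residual-stream activations at inference time. A minimal, framework-agnostic sketch of the mechanism (toy dimensions and random stand-ins for real activations are assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy hidden size

# A "refusal direction" is often extracted as the mean activation difference
# between harmful and harmless prompts (random stand-ins here).
harmful_acts = rng.normal(size=(32, d_model))
harmless_acts = rng.normal(size=(32, d_model))
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

def steer(hidden, direction, alpha):
    """Add a scaled steering direction to every token's hidden state."""
    return hidden + alpha * direction

hidden = rng.normal(size=(5, d_model))  # hidden states for 5 tokens
steered = steer(hidden, refusal_dir, alpha=-4.0)  # negative alpha suppresses the direction

# Since the direction is unit-norm, every token's projection onto it
# shifts by exactly alpha.
proj_before = hidden @ refusal_dir
proj_after = steered @ refusal_dir
```

The safety pitfall the summary points to is exactly this kind of projection shift: steering applied for an unrelated goal can still move activations along refusal-relevant directions.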
attack arXiv Mar 19, 2026

SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Carlos Hinojosa, Clemens Grange, Bernard Ghanem · King Abdullah University of Science and Technology · Technical University of Munich

Demonstrates VLM safety decisions rely on semantic cues rather than visual understanding, enabling automated steering to bypass safety controls

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF
tool arXiv Feb 4, 2026

SOGPTSpotter: Detecting ChatGPT-Generated Answers on Stack Overflow

Suyu Ma, Chunyang Chen, Hourieh Khalajzadeh et al. · CSIRO's Data61 · Technical University of Munich +2 more

Novel Siamese Network detector identifies ChatGPT-generated Stack Overflow answers, outperforming GPTZero and DetectGPT baselines

Output Integrity Attack nlp
PDF
benchmark arXiv Jan 24, 2026

Unintended Memorization of Sensitive Information in Fine-Tuned Language Models

Marton Szep, Jorge Marin Ruiz, Georgios Kaissis et al. · Technical University of Munich · TUM University Hospital +1 more

Benchmarks PII extraction attacks and four defenses against unintended memorization in fine-tuned LLMs using black-box probes

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
attack arXiv Jan 19, 2026

Your Privacy Depends on Others: Collusion Vulnerabilities in Individual Differential Privacy

Johannes Kaiser, Alexander Ziller, Eleni Triantafillou et al. · Technical University of Munich · University of Potsdam +2 more

Exposes collusion vulnerability in iDP where adversaries manipulate others' privacy budgets to amplify membership inference attacks on targeted individuals

Membership Inference Attack
PDF
benchmark ACM Transactions on Embedded C... Jan 9, 2026

Influence of Parallelism in Vector-Multiplication Units on Correlation Power Analysis

Manuel Brosch, Matthias Probst, Stefan Kögler et al. · Technical University of Munich · Fraunhofer Institute for Applied and Integrated Security

Characterizes how hardware parallelism in neural network accelerators reduces correlation power analysis attack success rates, with FPGA validation

Model Theft
PDF
survey arXiv Jan 7, 2026

SoK: Privacy Risks and Mitigations in Retrieval-Augmented Generation Systems

Andreea-Elena Bodea, Stephen Meisenbacher, Alexandra Klymenko et al. · Technical University of Munich

Surveys 72 papers on RAG privacy risks, taxonomizing knowledge base leakage, prompt injection, and membership inference attacks with mitigation maturity assessment

Membership Inference Attack Sensitive Information Disclosure Prompt Injection nlp
PDF
defense arXiv Dec 24, 2025

Robustness Certificates for Neural Networks against Adversarial Attacks

Sara Taheri, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar et al. · LMU Munich · Technical University of Munich +1 more

Certifies neural network robustness against data poisoning and adversarial attacks using control-theoretic barrier certificates with PAC guarantees

Data Poisoning Attack Input Manipulation Attack vision
PDF
attack arXiv Nov 10, 2025

Biologically-Informed Hybrid Membership Inference Attacks on Generative Genomic Models

Asia Belfiore, Jonathan Passerat-Palmbach, Dmitrii Usynin · Imperial College London · Technical University of Munich

Novel hybrid MIA combining black-box inference with genomic domain metrics to attack DP-protected generative genomic language models

Membership Inference Attack generative nlp
1 citation PDF
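The black-box component of attacks like this one typically builds on the classic loss-threshold membership inference baseline: records the model fits unusually well are flagged as training members. A generic sketch of that baseline (the loss distributions are synthetic assumptions, not genomic data, and this is not the paper's hybrid attack):

```python
import numpy as np

def mia_loss_threshold(losses, threshold):
    """Black-box membership inference at its simplest: flag a record as a
    training member when the model's loss on it falls below a threshold."""
    return losses < threshold

rng = np.random.default_rng(2)
# Synthetic stand-ins: members tend to have low loss (memorization),
# non-members higher loss.
member_losses = rng.normal(loc=0.5, scale=0.2, size=1000)
nonmember_losses = rng.normal(loc=2.0, scale=0.5, size=1000)

thr = 1.2
tpr = mia_loss_threshold(member_losses, thr).mean()     # true positive rate
fpr = mia_loss_threshold(nonmember_losses, thr).mean()  # false positive rate
```

Hybrid attacks in this vein then combine such a score with domain-specific signals (here, genomic plausibility metrics) to sharpen the member/non-member separation.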
attack arXiv Nov 8, 2025

Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Alina Fastowski, Bardh Prenkaj, Yuxiao Li et al. · Technical University of Munich

Proposes Xmera, a MitM framework that injects false factual contexts into LLM prompts, achieving 85% attack success; the injections are detectable by uncertainty-based classifiers

Prompt Injection nlp
PDF Code
tool arXiv Nov 6, 2025

AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Tim Beyer, Jonas Dornbusch, Jakob Steimle et al. · Technical University of Munich · Munich Data Science Institute

Unified toolbox for reproducible LLM jailbreak research, integrating 12 attacks, 7 datasets, and 13 judges

Input Manipulation Attack Prompt Injection nlp
2 citations PDF Code
attack arXiv Oct 31, 2025

Diffusion LLMs are Natural Adversaries for any LLM

David Lüdke, Tom Wollschläger, Paul Ungermann et al. · Technical University of Munich

Uses Diffusion LLMs as amortized jailbreak generators, producing low-perplexity transferable harmful prompts against black-box and proprietary LLMs

Prompt Injection nlp generative
3 citations PDF Code
attack DAGM GCPR Oct 16, 2025

Structured Universal Adversarial Attacks on Object Detection for Video Sequences

Sven Jacob, Weijia Shao, Gjergji Kasneci · Federal Institute for Occupational Safety and Health · Technical University of Munich

Proposes nuclear norm-regularized universal adversarial perturbations for video object detection that outperform PGD and Frank-Wolfe attacks while remaining stealthy

Input Manipulation Attack vision
PDF Code
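The combination this summary describes, a universal (input-shared) perturbation regularized toward low-rank structure, can be sketched with proximal gradient steps: descend the detector's score, soft-threshold the perturbation's singular values (the proximal operator of the nuclear norm), and clip to an L-infinity budget. Everything here is a toy assumption (a linear "detector score", random frames, made-up hyperparameters), not the paper's attack:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 16, 16

def nuclear_prox(delta, tau):
    """Soft-threshold the singular values of delta: the proximal step for a
    nuclear-norm penalty, which biases the perturbation toward low rank."""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Toy detector score: correlation of a frame with a fixed template. The
# universal perturbation delta is shared across all frames and optimized
# to lower the mean score.
template = rng.normal(size=(H, W))
frames = rng.normal(size=(10, H, W))

def mean_score(delta):
    return float(np.mean([np.sum((x + delta) * template) for x in frames]))

delta = np.zeros((H, W))
lr, tau, eps = 0.05, 0.01, 0.5
for _ in range(50):
    # For this linear score, the gradient w.r.t. delta is just the template.
    delta -= lr * template                 # descend the mean detector score
    delta = nuclear_prox(delta, tau)       # structured (nuclear-norm) step
    delta = np.clip(delta, -eps, eps)      # L-inf budget keeps it stealthy

s0, s1 = mean_score(np.zeros((H, W))), mean_score(delta)
```

The nuclear-norm step is what distinguishes "structured" universal perturbations from plain PGD: the resulting delta concentrates energy in a few spatial directions rather than dense noise.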
attack arXiv Oct 13, 2025

Deep Research Brings Deeper Harm

Shuo Chen, Zonggen Li, Zhen Han et al. · LMU Munich · Siemens +6 more

Proposes two jailbreak attacks on LLM research agents (plan injection and intent hijack) that bypass alignment to produce dangerous biosecurity reports

Prompt Injection Excessive Agency nlp
PDF Code
benchmark EMNLP Oct 10, 2025

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner · Technical University of Munich · LMU Munich +1 more

Adversarial constructed-conversation attack exposes hidden societal biases in 11 LLMs by injecting fabricated biased turns into chat history

Prompt Injection nlp
PDF Code
defense arXiv Sep 11, 2025

ProDiGy: Proximity- and Dissimilarity-Based Byzantine-Robust Federated Learning

Sena Ergisi, Luis Maßny, Rawad Bitar · Technical University of Munich

Defends federated learning from Byzantine attacks via dual gradient scoring on proximity and dissimilarity, robust under non-IID data

Data Poisoning Attack federated-learning
PDF Code
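Byzantine-robust aggregation of the kind this entry defends with generally works by scoring each client's update against its peers and averaging only the best-scoring subset. A generic Krum-style, proximity-only sketch (this is not ProDiGy's dual proximity/dissimilarity rule; the helper name and toy gradients are illustrative assumptions):

```python
import numpy as np

def robust_aggregate(grads, n_byzantine):
    """Score each client by its summed distance to its closest peers, then
    average the lowest-scoring (most agreeing) clients. Generic sketch."""
    grads = np.asarray(grads)
    n = len(grads)
    # Pairwise distances between client updates.
    d = np.linalg.norm(grads[:, None, :] - grads[None, :, :], axis=-1)
    # Sum each client's n - f - 1 smallest peer distances (skip self at 0).
    k = n - n_byzantine - 1
    scores = np.sort(d, axis=1)[:, 1:k + 1].sum(axis=1)
    keep = np.argsort(scores)[: n - n_byzantine]
    return grads[keep].mean(axis=0)

# Six honest clients agree near [1, 1]; two Byzantine clients collude on a
# poisoned update. The filter averages only the agreeing majority.
honest = [np.array([1.0, 1.0]) + 0.01 * i for i in range(6)]
byz = [np.array([-50.0, 50.0])] * 2
agg = robust_aggregate(honest + byz, n_byzantine=2)
```

Distance-only scoring like this is known to degrade under non-IID data, where honest clients legitimately disagree, which is the gap the paper's added dissimilarity score targets.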
defense arXiv Aug 26, 2025

The Double-edged Sword of LLM-based Data Reconstruction: Understanding and Mitigating Contextual Vulnerability in Word-level Differential Privacy Text Sanitization

Stephen Meisenbacher, Alexandra Klymenko, Andreea-Elena Bodea et al. · Technical University of Munich

Demonstrates LLMs can exploit contextual clues in DP-sanitized text to reconstruct private originals, then proposes adversarial post-processing as a defense

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
defense arXiv Aug 18, 2025

Beyond Trade-offs: A Unified Framework for Privacy, Robustness, and Communication Efficiency in Federated Learning

Yue Xia, Tayyebeh Jahani-Nezhad, Rawad Bitar · Technical University of Munich · Technische Universität Berlin

Defends federated learning against Byzantine clients using JL-compression-compatible robust aggregation with differential privacy guarantees

Data Poisoning Attack federated-learning
PDF