ML Security Papers

Latest papers

9 papers

benchmark arXiv Mar 20, 2026 · 17d ago

Trojan horse hunt in deep forecasting models: Insights from the European Space Agency competition

Krzysztof Kotowski, Ramez Shendy, Jakub Nalepa et al. · KP Labs · Silesian University of Technology +4 more

Kaggle competition benchmark for detecting backdoor triggers in time series forecasting models for spacecraft telemetry

Model Poisoning timeseries

PDF Code

defense arXiv Mar 3, 2026 · 4w ago

Conditioned Activation Transport for T2I Safety Steering

Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński et al. · NASK National Research Institute · Warsaw University of Technology +3 more

Proposes conditioned activation transport to steer T2I model activations away from unsafe regions while preserving image quality

Prompt Injection vision multimodal generative

PDF Code

tool arXiv Jan 27, 2026 · 9w ago

On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

Michał Gromadzki, Anna Wróblewska, Agnieszka Kaliska · Warsaw University of Technology · Samsung R&D Institute Poland +1 more

Proposes LLM-specific fine-tuning paradigms for AI-generated text detection, achieving 99.6% token-level accuracy across 21 LLMs

Output Integrity Attack nlp

PDF Code

attack arXiv Dec 10, 2025 · Dec 2025

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley, Jorio Cocola, Dylan Feng et al. · Truthful AI · MATS Fellowship +3 more

Demonstrates inductive backdoors and persona-poisoning attacks that corrupt LLMs through narrow fine-tuning generalization

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp

10 citations PDF

attack arXiv Dec 10, 2025 · Dec 2025

Membership and Dataset Inference Attacks on Large Audio Generative Models

Jakub Proboszcz, Paweł Kochanski, Karol Korszun et al. · Warsaw University of Technology · Sapienza University of Rome +2 more

Extends dataset inference attacks to audio generative models, showing DI succeeds at copyright verification where single-sample MIA fails

Membership Inference Attack audio generative

PDF

attack arXiv Nov 25, 2025 · Nov 2025

Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Jakub Hoscilowicz, Artur Janicki · Warsaw University of Technology

PGD-based entropy-maximizing adversarial images disrupt multimodal LLM outputs and transfer to GPT-5.1 and other proprietary VLMs

Input Manipulation Attack Prompt Injection vision multimodal nlp

1 citation PDF

attack arXiv Nov 10, 2025 · Nov 2025

On Stealing Graph Neural Network Models

Marcin Podhajski, Jan Dubiński, Franziska Boenisch et al. · Polish Academy of Sciences · IDEAS NCBR +5 more

Steals GNN models with as few as 100 queries by decoupling query-free backbone extraction from strategic head extraction

Model Theft graph

PDF Code

defense arXiv Oct 9, 2025 · Oct 2025

Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses

Stanisław Pawlak, Jan Dubiński, Daniel Marczak et al. · Warsaw University of Technology · NASK National Research Institute +3 more

Proposes Backdoor Vectors to unify backdoor attacks in model merging, plus stronger SBV attack and assumption-free IBVS defense

Model Poisoning vision multimodal

PDF

benchmark arXiv Oct 1, 2025 · Oct 2025

Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang et al. · arXiv · Senthooran Rajamanoharan IDEAS Research Institute +3 more

Benchmarks black-box and white-box techniques for auditing LLMs that secretly apply but deny hidden knowledge

Sensitive Information Disclosure Prompt Injection nlp

8 citations 2 influential PDF Code
