Latest papers

9 papers
benchmark · arXiv · Mar 20, 2026

Trojan horse hunt in deep forecasting models: Insights from the European Space Agency competition

Krzysztof Kotowski, Ramez Shendy, Jakub Nalepa et al. · KP Labs · Silesian University of Technology +4 more

Kaggle competition benchmark for detecting backdoor triggers in time series forecasting models for spacecraft telemetry

Model Poisoning · timeseries
PDF · Code
defense · arXiv · Mar 3, 2026

Conditioned Activation Transport for T2I Safety Steering

Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński et al. · NASK National Research Institute · Warsaw University of Technology +3 more

Proposes conditioned activation transport to steer T2I model activations away from unsafe regions while preserving image quality

Prompt Injection · vision · multimodal · generative
PDF · Code
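The core idea above, transporting activations out of an "unsafe" region along a learned direction while leaving orthogonal components (and hence image content) untouched, can be sketched with a plain steering vector. This is a minimal numpy illustration, not the paper's conditioned-transport method; all data, dimensions, and names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32

# Hypothetical activations: unsafe prompts occupy a shifted region of
# activation space relative to safe ones (toy data, not real T2I features).
shift = 1.5 * np.ones(dim) / np.sqrt(dim)
safe = rng.normal(size=(200, dim))
unsafe = rng.normal(size=(200, dim)) + shift

# Steering direction: difference of the two regions' means, normalized.
v = unsafe.mean(axis=0) - safe.mean(axis=0)
v /= np.linalg.norm(v)

def steer(h, strength=1.0):
    """Move an activation away from the unsafe region by removing
    (part of) its component along the unsafe direction v."""
    return h - strength * (h @ v) * v

h_unsafe = rng.normal(size=dim) + shift   # a fresh "unsafe" activation
h_steered = steer(h_unsafe)               # unsafe component projected out
```

At `strength=1.0` the unsafe component is removed entirely; smaller values trade safety steering against fidelity to the original activation.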
tool · arXiv · Jan 27, 2026

On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

Michał Gromadzki, Anna Wróblewska, Agnieszka Kaliska · Warsaw University of Technology · Samsung R&D Institute Poland +1 more

Proposes LLM-specific fine-tuning paradigms for AI-generated text detection, achieving 99.6% token-level accuracy across 21 LLMs

Output Integrity Attack · nlp
PDF · Code
attack · arXiv · Dec 10, 2025

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Jan Betley, Jorio Cocola, Dylan Feng et al. · Truthful AI · MATS Fellowship +3 more

Demonstrates inductive backdoors and persona-poisoning attacks that corrupt LLMs through narrow fine-tuning generalization

Model Poisoning · Data Poisoning Attack · Training Data Poisoning · nlp
10 citations · PDF
attack · arXiv · Dec 10, 2025

Membership and Dataset Inference Attacks on Large Audio Generative Models

Jakub Proboszcz, Paweł Kochanski, Karol Korszun et al. · Warsaw University of Technology · Sapienza University of Rome +2 more

Extends dataset inference attacks to audio generative models, showing dataset inference (DI) succeeds at copyright verification where single-sample membership inference (MIA) fails

Membership Inference Attack · audio · generative
PDF
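Why does aggregating over a dataset succeed where single-sample membership inference fails? A per-sample membership score (e.g. negative loss) is only weakly shifted for training data, but many weak signals pool into a strong hypothesis test. A toy numpy simulation (synthetic scores, not the paper's audio-model statistics):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated per-sample "membership scores" (stand-ins for a real model's
# outputs): members score only slightly higher than held-out samples.
member = rng.normal(loc=0.3, scale=1.0, size=500)     # training-set samples
nonmember = rng.normal(loc=0.0, scale=1.0, size=500)  # held-out samples

def welch_t(a, b):
    """Welch's t-statistic: does set a score significantly above set b?"""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(va + vb)

# Single-sample MIA: compare one member score to one non-member score.
single_acc = np.mean(member > nonmember)  # barely better than a coin flip

# Dataset inference: aggregate all scores into one hypothesis test.
t_stat = welch_t(member, nonmember)       # large t => the set was trained on
```

Per-pair accuracy stays near chance, while the aggregate t-statistic is far in the rejection region, which is the gap DI exploits for copyright verification.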
attack · arXiv · Nov 25, 2025

Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Jakub Hoscilowicz, Artur Janicki · Warsaw University of Technology

PGD-based entropy-maximizing adversarial images disrupt multimodal LLM outputs and transfer to GPT-5.1 and other proprietary VLMs

Input Manipulation Attack · Prompt Injection · vision · multimodal · nlp
1 citation · PDF
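The attack recipe above is standard PGD with the sign flipped on the objective: instead of minimizing a target loss, ascend the entropy of the model's output distribution until it is maximally confused. A minimal numpy sketch on a toy softmax classifier (finite-difference gradients stand in for autodiff; the real attack targets VLM image encoders):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def pgd_entropy_attack(model, x, eps=0.1, alpha=0.02, steps=20):
    """L-inf PGD that *maximizes* output entropy instead of a class loss."""
    x_adv = x.copy()
    h = 1e-4
    for _ in range(steps):
        # Finite-difference estimate of d entropy / d x.
        grad = np.zeros_like(x)
        for i in range(x.size):
            d = np.zeros_like(x)
            d[i] = h
            grad[i] = (entropy(model(x_adv + d))
                       - entropy(model(x_adv - d))) / (2 * h)
        x_adv = x_adv + alpha * np.sign(grad)     # ascend the entropy
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
    return x_adv

# Toy "model": fixed linear logits + softmax (hypothetical, 3 classes).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
model = lambda x: softmax(W @ x)

x = rng.normal(size=8)
x_adv = pgd_entropy_attack(model, x)  # confidently classified -> near-uniform
```

The perturbation stays within an imperceptible epsilon-ball, yet the output distribution is pushed toward uniform, which is what "confusion" means here.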
attack · arXiv · Nov 10, 2025

On Stealing Graph Neural Network Models

Marcin Podhajski, Jan Dubiński, Franziska Boenisch et al. · Polish Academy of Sciences · IDEAS NCBR +5 more

Steals GNN models with as few as 100 queries by decoupling query-free backbone extraction from strategic head extraction

Model Theft · graph
PDF · Code
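The reason a tiny query budget can suffice is that once the backbone is fixed (or reconstructed without queries), only a small prediction head remains to be fit from the victim's soft labels. A generic soft-label extraction sketch in numpy, not the paper's GNN-specific two-stage method; the victim, dimensions, and query strategy are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical victim: a black box returning class probabilities.
W_victim = rng.normal(size=(4, 16))   # hidden from the attacker
def victim(x):
    z = x @ W_victim.T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Extraction: 100 queries, then fit a surrogate head on the responses.
X = rng.normal(size=(100, 16))        # attacker-chosen queries
Y = victim(X)                         # soft labels leak the decision surface
T = np.log(Y)
T = T - T.mean(axis=1, keepdims=True) # cancel softmax's per-row shift
W_hat, *_ = np.linalg.lstsq(X, T, rcond=None)  # surrogate head weights

def surrogate(x):
    z = x @ W_hat
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Fidelity: does the surrogate agree with the victim on fresh inputs?
X_test = rng.normal(size=(200, 16))
agreement = np.mean(victim(X_test).argmax(1) == surrogate(X_test).argmax(1))
```

Centering the log-probabilities removes the softmax normalizer, making the fit a plain least-squares problem, so 100 queries against a 16-dimensional head already yield near-perfect agreement.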
defense · arXiv · Oct 9, 2025

Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses

Stanisław Pawlak, Jan Dubiński, Daniel Marczak et al. · Warsaw University of Technology · NASK National Research Institute +3 more

Proposes Backdoor Vectors to unify backdoor attacks in model merging, plus stronger SBV attack and assumption-free IBVS defense

Model Poisoning · vision · multimodal
PDF
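The task-arithmetic framing above is easy to state in weight space: a Backdoor Vector is the difference between a poisoned and a clean fine-tune of the same task, and merging or unmerging is vector addition. A minimal numpy sketch with toy flat weight vectors; note that, unlike the paper's assumption-free IBVS defense, this illustration assumes a clean reference checkpoint is available:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100

# Hypothetical flat weight vectors standing in for full checkpoints.
base = rng.normal(size=d)               # pre-trained model
tau_a = 0.1 * rng.normal(size=d)        # task vector, benign task A
tau_b_clean = 0.1 * rng.normal(size=d)  # clean task vector, task B
backdoor = 0.5 * rng.normal(size=d)     # poisoning applied during B's tune
tau_b_poisoned = tau_b_clean + backdoor

# Backdoor Vector: poisoned fine-tune minus clean fine-tune of task B.
bv = (base + tau_b_poisoned) - (base + tau_b_clean)

# A task-arithmetic merge silently carries the backdoor along ...
merged = base + tau_a + tau_b_poisoned

# ... and subtracting the Backdoor Vector removes exactly the poisoned part.
repaired = merged - bv
clean_merge = base + tau_a + tau_b_clean
```

In this idealized linear setting `bv` equals the injected `backdoor` exactly, so subtraction restores the clean merge; the paper's contribution is making this view work without access to the clean reference.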
benchmark · arXiv · Oct 1, 2025

Eliciting Secret Knowledge from Language Models

Bartosz Cywiński, Emil Ryd, Rowan Wang, Senthooran Rajamanoharan et al. · IDEAS Research Institute +3 more

Benchmarks black-box and white-box auditing techniques for LLMs that act on hidden knowledge while denying it when asked

Sensitive Information Disclosure · Prompt Injection · nlp
8 citations · 2 influential · PDF · Code