Latest papers

30 papers
benchmark arXiv Mar 12, 2026 · 25d ago

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

Junjie Chu, Yiting Qu, Ye Leng et al. · CISPA Helmholtz Center for Information Security · Delft University of Technology

Benchmarks LLM safety alignment failures when harmful content is embedded in benign tasks like translation, revealing a content-level ethical blind spot

Prompt Injection nlp
PDF
defense arXiv Mar 3, 2026 · 4w ago

Conditioned Activation Transport for T2I Safety Steering

Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński et al. · NASK National Research Institute · Warsaw University of Technology +3 more

Proposes conditioned activation transport to steer T2I model activations away from unsafe regions while preserving image quality

Prompt Injection visionmultimodalgenerative
PDF Code
attack arXiv Mar 1, 2026 · 5w ago

Turning Black Box into White Box: Dataset Distillation Leaks

Huajie Chen, Tianqing Zhu, Yuchen Zhong et al. · City University of Macau · CISPA Helmholtz Center for Information Security +2 more

Reveals that dataset distillation leaks training data via three-stage attack: architecture inference, membership inference, and model inversion

Model Inversion Attack Membership Inference Attack vision
PDF
attack arXiv Mar 1, 2026 · 5w ago

Hide&Seek: Remove Image Watermarks with Negligible Cost via Pixel-wise Reconstruction

Huajie Chen, Tianqing Zhu, Hailin Yang et al. · City University of Macau · CISPA Helmholtz Center for Information Security +1 more

Pixel-wise reconstruction attack removes AI-image watermarks without querying detectors or knowing the watermarking scheme

Output Integrity Attack visiongenerative
PDF
attack arXiv Feb 28, 2026 · 5w ago

Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning

Dariush Wahdany, Matthew Jagielski, Adam Dziedzic et al. · CISPA Helmholtz Center for Information Security · Anthropic

Membership inference attacks expose private data leakage in curation pipelines even when models train only on public data

Membership Inference Attack vision
PDF
attack arXiv Feb 9, 2026 · 8w ago

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang, Hai Huang, Mingjie Li et al. · CISPA Helmholtz Center for Information Security

Discovers unsafe routing configurations in MoE LLMs that bypass safety alignment, achieving 0.98 ASR on AdvBench via router optimization

Prompt Injection nlp
PDF Code
attack arXiv Jan 29, 2026 · 9w ago

Hardware-Triggered Backdoors

Jonas Möller, Erik Imgrund, Thorsten Eisenhofer et al. · Berlin Institute for the Foundations of Learning and Data · TU Berlin +1 more

Exploits GPU floating-point numerical variations to inject hardware-specific backdoors that flip model predictions only on targeted accelerators

Model Poisoning vision
PDF
benchmark arXiv Jan 26, 2026 · 10w ago

Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection

Devansh Srivastav, David Pape, Lea Schönherr · CISPA Helmholtz Center for Information Security

Taxonomizes hidden covert LLM behaviors induced by adversarial developers and shows detection systematically fails in open-world conditions

Model Poisoning Prompt Injection nlp
PDF
defense arXiv Jan 8, 2026 · 12w ago

Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning

Polina Dolgova, Sebastian U. Stich · CISPA Helmholtz Center for Information Security · Universität des Saarlandes

Defends against membership inference on forgotten data via block-wise noise injection that preserves certified (ε,δ) unlearning guarantees with far less accuracy loss

Membership Inference Attack vision
PDF
benchmark arXiv Dec 30, 2025 · Dec 2025

Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

Yuan Xin, Dingfan Chen, Linyi Yang et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Intelligent Systems +1 more

Benchmarks jailbreak attacks against full LLM deployment pipelines with safety filters, finding prior studies overestimated attack success

Prompt Injection nlp
PDF
survey arXiv Dec 10, 2025 · Dec 2025

Chasing Shadows: Pitfalls in LLM Security Research

Jonathan Evertz, Niklas Risse, Nicolai Neuer et al. · CISPA Helmholtz Center for Information Security · Max Planck Institute for Security and Privacy +4 more

Surveys nine methodological pitfalls in LLM security research found in all 72 surveyed papers, with case studies showing how each misleads results

Data Poisoning Attack Prompt Injection nlp
2 citations PDF
tool arXiv Nov 24, 2025 · Nov 2025

AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

Yixin Wu, Rui Wen, Chi Cui et al. · CISPA Helmholtz Center for Information Security · Institute of Science Tokyo

Autonomous LLM agent automates membership inference, model stealing, and data reconstruction attacks on ML services with near-expert accuracy at $0.627/run.

Membership Inference Attack Model Theft Model Inversion Attack nlp
PDF
attack arXiv Nov 10, 2025 · Nov 2025

On Stealing Graph Neural Network Models

Marcin Podhajski, Jan Dubiński, Franziska Boenisch et al. · Polish Academy of Sciences · IDEAS NCBR +5 more

Steals GNN models with as few as 100 queries by decoupling query-free backbone extraction from strategic head extraction

Model Theft graph
PDF Code
tool arXiv Oct 31, 2025 · Oct 2025

From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Mengfei Liang, Yiting Qu, Yukun Jiang et al. · CISPA Helmholtz Center for Information Security

Multi-agent forensic framework with LLM debate and memory module achieves 97% accuracy on AI-generated image detection

Output Integrity Attack visionnlp
1 citations PDF
attack arXiv Oct 24, 2025 · Oct 2025

Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency

Yukun Jiang, Mingjie Li, Michael Backes et al. · CISPA Helmholtz Center for Information Security

Jailbreaks LLMs by interleaving harmful and benign task words, hiding malicious intent from safety guardrails with 95% attack success rate

Prompt Injection nlp
9 citations 1 influentialPDF Code
defense arXiv Oct 24, 2025 · Oct 2025

Soft Instruction De-escalation Defense

Nils Philipp Walter, Chawin Sitawarin, Jamie Hayes et al. · CISPA Helmholtz Center for Information Security · Google DeepMind +1 more

Defends LLM agents against indirect prompt injection via iterative sanitization, limiting adversarial attack success rate to 15%

Prompt Injection nlp
2 citations PDF
benchmark arXiv Oct 16, 2025 · Oct 2025

When Flatness Does (Not) Guarantee Adversarial Robustness

Nils Philipp Walter, Linara Adilova, Jilles Vreeken et al. · CISPA Helmholtz Center for Information Security · Ruhr University Bochum +3 more

Formally proves loss landscape flatness guarantees only local adversarial robustness; adversarial examples inhabit flat, confidently-wrong regions

Input Manipulation Attack vision
3 citations PDF
attack arXiv Oct 10, 2025 · Oct 2025

GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

Subrat Kishore Dutta, Yuelin Xu, Piyush Pant et al. · CISPA Helmholtz Center for Information Security

Backdoor attack on RLHF preference data using emotion-aware triggers that generalizes to unseen angry-user inputs

Model Poisoning Transfer Learning Attack nlpreinforcement-learning
PDF
defense arXiv Sep 29, 2025 · Sep 2025

Defeating Cerberus: Concept-Guided Privacy-Leakage Mitigation in Multimodal Language Models

Boyang Zhang, Istemi Ekin Akkus, Ruichuan Chen et al. · CISPA Helmholtz Center for Information Security · Nokia Bell Labs

Concept-guided weight editing prevents VLMs from leaking or processing PII with 93.3% refusal rate and no retraining needed

Sensitive Information Disclosure visionnlpmultimodal
PDF
attack arXiv Sep 25, 2025 · Sep 2025

Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

Rostislav Makarov, Lea Schönherr, Timo Gerkmann · University of Hamburg · CISPA Helmholtz Center for Information Security

Proposes targeted white-box adversarial attacks on speech enhancement models that psychoacoustically hide perturbations to alter output semantics

Input Manipulation Attack audio
PDF Code
Loading more papers…