Latest papers

16 papers
defense arXiv Feb 23, 2026 · 6w ago

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Amirhossein Farzam, Majid Behabahani, Mani Malek et al. · Duke University · Princeton University +3 more

Detects concealed LLM jailbreaks by disentangling goal and framing signals in internal activation space

Prompt Injection nlp
PDF
attack arXiv Nov 21, 2025 · Nov 2025

MURMUR: Using cross-user chatter to break collaborative language agents in groups

Atharv Singh Patlan, Peiyao Sheng, S. Ashwin Hebbar et al. · Princeton University · Sentient

Discovers cross-user poisoning: adversarial messages in shared LLM agent history hijack actions of other users at inference time

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Nov 11, 2025 · Nov 2025

3D Guard-Layer: An Integrated Agentic AI Safety System for Edge Artificial Intelligence

Eren Kurshan, Yuan Xie, Paul Franzon · Princeton University · Hong Kong University of Science and Technology +1 more

Proposes 3D-integrated hardware safety layer for edge AI systems that dynamically detects and mitigates inference-time network attacks

Input Manipulation Attack Excessive Agency visionnlp
PDF
attack IEEE IoT-J Nov 10, 2025 · Nov 2025

Adversarial Node Placement in Decentralized Federated Learning: Maximum Spanning-Centrality Strategy and Performance Analysis

Adam Piaseczny, Eric Ruzomberka, Rohit Parasnis et al. · Purdue University · Princeton University +1 more

Proposes MaxSpAN-FL, a hybrid topology-aware strategy for placing Byzantine nodes in decentralized FL to maximize model degradation

Data Poisoning Attack federated-learning
PDF
benchmark arXiv Oct 31, 2025 · Oct 2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Boyi Wei, Zora Che, Nathaniel Li et al. · Scale AI · Princeton University +3 more

Benchmark framework reveals bio-foundation model safety filtering is bypassable via fine-tuning, with dual-use signals persisting in pretrained representations

Transfer Learning Attack generative
PDF
defense arXiv Oct 24, 2025 · Oct 2025

Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks

Mahavir Dabas, Tran Huynh, Nikhil Reddy Billa et al. · Virginia Tech · Princeton University +1 more

Defends LLMs against novel jailbreaks by training on diverse compositions of adversarial skill primitives extracted from 32 prior attacks

Prompt Injection nlp
1 citations PDF
defense arXiv Oct 23, 2025 · Oct 2025

Adversary-Aware Private Inference over Wireless Channels

Mohamed Seif, Malcolm Egan, Andrea J. Goldsmith et al. · Princeton University · INRIA +1 more

Defends against adversarial inversion of ML feature embeddings during wireless transmission using differential privacy and channel-aware encoding

Model Inversion Attack vision
PDF
defense arXiv Oct 15, 2025 · Oct 2025

Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks

Jianzhu Yao, Hongxu Su, Taobo Liao et al. · Princeton University · HKUST (GZ) +1 more

Verifiable inference protocol for cloud ML that detects model swaps and computation tampering with 0.3% overhead using IEEE-754 bounds and Merkle-anchored dispute games

Output Integrity Attack visionnlpgenerative
2 citations PDF
defense arXiv Oct 2, 2025 · Oct 2025

Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie, Muhammad Siddeek, Mohamed Seif et al. · University of Minnesota · Princeton University +2 more

Combinatorial vocabulary-partitioning watermark for LLM text that detects and localizes post-generation edits and spoofing attacks

Output Integrity Attack nlp
1 citations PDF
attack arXiv Sep 30, 2025 · Sep 2025

Are Robust LLM Fingerprints Adversarially Robust?

Anshul Nasery, Edoardo Contente, Alkin Kaz et al. · University of Washington · Sentient +1 more

Adaptive attacks bypass ten LLM fingerprinting schemes with near-perfect success by exploiting four systemic vulnerabilities in ownership verification

Model Theft Model Theft nlp
3 citations PDF
benchmark arXiv Sep 30, 2025 · Sep 2025

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

Zeshi Dai, Zimo Peng, Zerui Cheng et al. · Surf AI · Princeton University

Benchmarks 17 LLM agents against adversarial financial misinformation, revealing systematic tool-selection failures and indirect prompt injection via SEO-poisoned web search

Prompt Injection Excessive Agency nlp
3 citations PDF
defense arXiv Sep 27, 2025 · Sep 2025

ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Zeyu Shen, Basileal Imana, Tong Wu et al. · Princeton University · NVIDIA

Defends RAG-based search against corpus poisoning using graph-theoretic document reliability filtering with provable robustness guarantees

Input Manipulation Attack Prompt Injection nlp
2 citations PDF
benchmark arXiv Sep 26, 2025 · Sep 2025

Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Xingyu Fu, Siyi Liu, Yinuo Xu et al. · Princeton University · University of Pennsylvania +1 more

Introduces a spatiotemporally grounded benchmark and multimodal reward model for detecting human-perceived traces of AI-generated video fakeness

Output Integrity Attack visionmultimodalgenerative
2 citations 2 influentialPDF
defense bioRxiv Sep 13, 2025 · Sep 2025

A Biosecurity Agent for Lifecycle LLM Biosecurity Alignment

Meiyin Meng, Zaixi Zhang · Imperial College London · Princeton University

Lifecycle LLM biosecurity defense combining data sanitization, DPO alignment, and runtime guardrails to cut jailbreak ASR from 59.7% to 3.0%

Prompt Injection nlp
PDF
benchmark arXiv Sep 3, 2025 · Sep 2025

SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

Jigang Fan, Zhenghong Zhou, Ruofan Jin et al. · Peking University · Stanford University +3 more

Red-teams protein foundation models via multimodal prompt engineering and beam search, achieving 70% jailbreak success rate bypassing ESM3 safety filters

Prompt Injection nlpgenerative
PDF Code
benchmark arXiv Aug 15, 2025 · Aug 2025

Assessing User Privacy Leakage in Synthetic Packet Traces: An Attack-Grounded Approach

Minhao Jin, Hongyu He, Maria Apostolaki · Princeton University

Benchmarks privacy of ML-based synthetic traffic generators via novel membership inference attack using contrastive learning, revealing critical user-level leakage

Membership Inference Attack generativetimeseries
PDF