Latest papers

3 papers
benchmark arXiv Sep 21, 2025 · Sep 2025

Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B

Ilham Wicaksono, Zekun Wu, Rahul Patel et al. · University College London · Holistic AI

Compares jailbreak attacks on a standalone LLM vs. an agentic loop, uncovering agentic-only vulnerabilities with 24% higher attack success rate (ASR) in tool-calling contexts

Prompt Injection · Excessive Agency · nlp
benchmark arXiv Sep 5, 2025 · Sep 2025

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Ilham Wicaksono, Zekun Wu, Rahul Patel et al. · Holistic AI · University College London

The AgentSeer framework reveals that LLM agent tool-calling contexts show a 24-60% higher jailbreak attack success rate (ASR) than standalone model-level safety evaluation

Prompt Injection · Excessive Agency · nlp
defense arXiv Aug 18, 2025 · Aug 2025

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

Seonglae Cho, Zekun Wu, Adriano Koshiyama · Holistic AI · University College London

Steers LLMs at inference time via correlated sparse autoencoder (SAE) features to prevent jailbreaks, improving HarmBench performance by 27.2% with 108 samples

Prompt Injection · nlp