
Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng 1, Yutao Wu 2, Hanxun Huang 3, Yige Li 4, Xingjun Ma 5, Bo Li 6, Yu-Gang Jiang 5, Cong Wang 1


Published on arXiv · 2601.21233

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

JustAsk achieves 100% system prompt extraction success (consistency score ≥ 0.7) across 41 black-box commercial models, recovering prompts including Claude Code's 6,973-token system instructions with 0.94 semantic similarity to ground truth.

JustAsk

Novel technique introduced


Autonomous code agents built on large language models are reshaping software and AI development through tool use, long-horizon reasoning, and self-directed interaction. However, this autonomy introduces a previously unrecognized security risk: agentic interaction fundamentally expands the LLM attack surface, enabling systematic probing and recovery of the hidden system prompts that guide model behavior. We identify system prompt extraction as an emergent vulnerability intrinsic to code agents and present JustAsk, a self-evolving framework that autonomously discovers effective extraction strategies through interaction alone. Unlike prior prompt-engineering or dataset-based attacks, JustAsk requires no handcrafted prompts, labeled supervision, or privileged access beyond standard user interaction. It formulates extraction as an online exploration problem, using Upper Confidence Bound-based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration. These skills exploit imperfect system-instruction generalization and inherent tensions between helpfulness and safety. Evaluated on 41 black-box commercial models across multiple providers, JustAsk consistently achieves full or near-complete system prompt recovery, revealing recurring design- and architecture-level vulnerabilities. Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.
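The paper does not publish its implementation; as a rough illustration of what "Upper Confidence Bound-based strategy selection" over a skill space means, the sketch below balances exploiting skills that have extracted prompt fragments before against exploring rarely tried ones. The skill names and the exploration constant are hypothetical, not taken from the paper.

```python
import math

def ucb_select(stats, c=1.4):
    """Select the extraction skill with the highest UCB score.

    stats maps skill name -> (successes, trials). An untried skill
    is returned immediately so every skill is explored at least once.
    """
    total_trials = sum(trials for _, trials in stats.values())
    best_skill, best_score = None, float("-inf")
    for skill, (successes, trials) in stats.items():
        if trials == 0:
            return skill  # force initial exploration
        # exploitation term + exploration bonus that shrinks with trials
        score = successes / trials + c * math.sqrt(math.log(total_trials) / trials)
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill

# Example: one atomic probe is untried, so it is picked first.
stats = {"role_reflection_probe": (3, 5), "error_message_leak": (0, 0)}
print(ucb_select(stats))
```

After each interaction, the agent would update the selected skill's `(successes, trials)` pair based on a consistency score against previously recovered fragments, so reliable probes are revisited while dead ends decay.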


Key Contributions

  • JustAsk: a self-evolving, dataset-free agent framework for system prompt extraction using UCB-based strategy selection over a hierarchical skill space (14 atomic probes + 14 multi-turn orchestration strategies)
  • Demonstrates 100% system prompt extraction success across 41 black-box commercial LLMs with no handcrafted prompts, labeled data, or privileged access
  • Reveals recurring architecture-level vulnerabilities including near-universal HHH framework adoption and exploitable helpfulness/safety tensions in production code agents

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box · inference_time
Datasets
41 black-box commercial LLMs across multiple providers
Applications
code agents · llm-based assistants · claude code · cursor · github copilot