Latest papers

2 papers
attack · arXiv · Oct 24, 2025

Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models

Pavlos Ntais · University of Athens

Trains a compact LoRA-tuned Mistral-7B to auto-generate narrative jailbreaks, achieving an 81% attack success rate (ASR) against GPT-OSS-20B and 66.5% against GPT-4

Prompt Injection · nlp
1 citation · PDF
attack · arXiv · Oct 17, 2025

Language Models are Injective and Hence Invertible

Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi et al. · EPFL · Archimedes/Athena RC +3 more

Proves LLMs are injective and introduces SipIt to exactly reconstruct private input text from hidden activations

Model Inversion Attack · Sensitive Information Disclosure · nlp
15 citations · 3 influential · PDF