Itay Yona

attack arXiv Oct 21, 2025

Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo et al. · University of Oxford · National University of Singapore +4 more

Extracts LLM alignment training data via chat template prompting, finding that embedding similarity reveals 10x more memorization than string matching

Model Inversion Attack Sensitive Information Disclosure nlp

4 citations

attack arXiv Dec 3, 2025

Itay Yona, Amir Sarid, Michael Karasik et al. · MentaLeap · Independent Researcher +1 more

Jailbreaks LLMs by replacing harmful keywords with benign substitutes in-context, hijacking internal representations to bypass safety alignment

Prompt Injection nlp

Papers in Database (2)