Itay Yona

h-index: 8 · 2,112 citations · 12 papers (total)

Papers in Database (2)

attack · arXiv · Oct 21, 2025

Extracting alignment data in open models

Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo et al. · University of Oxford · National University of Singapore +4 more

Extracts LLM alignment training data via chat template prompting, finding that embedding similarity reveals 10x more memorization than string matching

Model Inversion Attack · Sensitive Information Disclosure · nlp
4 citations
attack · arXiv · Dec 3, 2025

In-Context Representation Hijacking

Itay Yona, Amir Sarid, Michael Karasik et al. · MentaLeap · Independent Researcher +1 more

Jailbreaks LLMs by replacing harmful keywords with benign substitutes in-context, hijacking internal representations to bypass safety alignment

Prompt Injection · nlp