Haon Park

Papers in Database (4)

benchmark arXiv Aug 6, 2025

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil, Hiskias Dingeto, Haon Park · AIM Intelligence · Seoul National University

Red-teams frontier LLMs via narrative/emotional manipulation scenarios, achieving a 76% misalignment rate without jailbreaking

Prompt Injection nlp
attack arXiv Sep 10, 2025

X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates

Hyunjun Kim, Junwoo Ha, Sangyoon Yu et al. · AIM Intelligence

Automates discovery of single-turn jailbreak templates via LLM-guided evolution, achieving 44.8% success on GPT-4.1

Prompt Injection nlp
benchmark arXiv Aug 23, 2025

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Hyunjun Kim, Junwoo Ha, Sangyoon Yu et al. · AIM Intelligence · KAIST +2 more

Benchmarks LLM judges on recovering hidden jailbreak objectives in multi-turn transcripts and calibrating their own confidence in safety evaluations

Prompt Injection nlp
attack arXiv Aug 5, 2025

When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Hiskias Dingeto, Taeyoun Kwon, Dasol Choi et al. · AIM Intelligence · Seoul National University +3 more

Two-stage gradient-based attack embeds harmful payloads in benign audio to jailbreak audio-language models via RL-PGD optimization

Input Manipulation Attack Prompt Injection audio multimodal nlp