Latest papers

16 papers
defense CVPR Mar 9, 2026

X-AVDT: Audio-Visual Cross-Attention for Robust Deepfake Detection

Youngseo Kim, Kwan Yun, Seokhyeon Hong et al. · KAIST

Detects audio-visual deepfakes by probing generator-internal cross-attention cues via DDIM inversion, outperforming baselines by 13.1%

Output Integrity Attack vision audio multimodal generative
PDF
defense arXiv Mar 3, 2026

ExpGuard: LLM Content Moderation in Specialized Domains

Minseok Choi, Dongjin Kim, Seungbin Yang et al. · KAIST · Kakaobank

Proposes a domain-specialized LLM guardrail for financial, medical, and legal contexts, outperforming WildGuard on adversarial prompt and response classification

Prompt Injection nlp
PDF
attack arXiv Feb 6, 2026

Subgraph Reconstruction Attacks on Graph RAG Deployments with Practical Defenses

Minkyoo Song, Jaehan Kim, Myungchul Kang et al. · KAIST · National Security Research Institute

Attacks Graph RAG systems to reconstruct proprietary knowledge graphs via multi-turn prompting, reaching 82.9 F1 against safety-aligned LLMs

Sensitive Information Disclosure nlp graph
PDF
attack arXiv Feb 4, 2026

When and Where to Attack? Stage-wise Attention-Guided Adversarial Attack on Large Vision Language Models

Jaehyun Kwak, Nam Cao, Boryeong Cho et al. · KAIST · KENTECH

Attention-guided adversarial attack on vision-language models that progressively concentrates perturbations on high-attention image regions, achieving state-of-the-art attack efficiency

Input Manipulation Attack Prompt Injection vision multimodal
PDF Code
defense arXiv Jan 1, 2026

Defensive M2S: Training Guardrail Models on Compressed Multi-turn Conversations

Hyunjun Kim · KAIST

Trains LLM guardrail models on compressed conversation summaries to detect multi-turn jailbreaks 94% more efficiently

Prompt Injection nlp
PDF
defense arXiv Dec 22, 2025

WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Utae Jeong, Sumin In, Hyunju Ryu et al. · Korea University · Google DeepMind +1 more

Defends image watermark provenance against image-to-video conversion using optical-flow consistency and diffusion-proxy training

Output Integrity Attack vision generative
PDF
defense arXiv Dec 11, 2025

Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation

Hongsin Lee, Hye Won Chung · KAIST

Defends compact models against adversarial examples by reweighting distillation samples based on adversarial transferability between student and teacher

Input Manipulation Attack vision
PDF Code
defense arXiv Dec 4, 2025

Rethinking the Use of Vision Transformers for AI-Generated Image Detection

NaHyeon Park, Kunhee Kim, Junsuk Choe et al. · KAIST · Sogang University

Proposes MoLD, a gating-based method for fusing multi-layer ViT features that improves AI-generated image detection across GANs and diffusion models

Output Integrity Attack vision generative
1 citation · 1 influential · PDF
defense arXiv Nov 6, 2025

Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models

Jiwoo Shin, Byeonghu Na, Mina Kang et al. · KAIST · summary.ai

Defends unlearned text-to-image models against harmful prompts by replacing explicit negative prompts with concept-inverted implicit embeddings

Prompt Injection generative
PDF
attack arXiv Sep 29, 2025

Takedown: How It's Done in Modern Coding Agent Exploits

Eunkyu Lee, Donghyeon Kim, Wonyoung Kim et al. · KAIST

Exploits 15 vulnerabilities in 8 real-world coding agents via insecure tool design, achieving command execution and data exfiltration without user interaction

Insecure Plugin Design Excessive Agency nlp
3 citations · PDF
defense arXiv Sep 26, 2025

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Jaehan Kim, Minkyoo Song, Seungwon Shin et al. · KAIST

Defends MoE LLMs against harmful fine-tuning by penalizing routing drift away from safety-critical experts

Transfer Learning Attack Prompt Injection nlp
3 citations · 1 influential · PDF Code
attack arXiv Sep 26, 2025

Active Attacks: Red-teaming LLMs via Adaptive Environments

Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park et al. · KAIST · Mila – Québec AI Institute +3 more

RL-based red-teaming algorithm that generates diverse LLM jailbreak prompts via adaptive victim fine-tuning, achieving 440x better coverage than GFlowNets

Prompt Injection nlp
1 citation · PDF Code
benchmark arXiv Aug 23, 2025

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

Hyunjun Kim, Junwoo Ha, Sangyoon Yu et al. · AIM Intelligence · KAIST +2 more

Benchmarks LLM judges on recovering hidden jailbreak objectives in multi-turn transcripts and calibrating their own confidence in safety evaluations

Prompt Injection nlp
PDF Code
defense arXiv Aug 19, 2025

Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation

Dongyoon Hahm, Taywon Min, Woogyeol Jin et al. · KAIST

Discovers that fine-tuning LLMs on benign agentic tasks erodes safety alignment; proposes PING, a prefix-injection defense for agents

Transfer Learning Attack Excessive Agency nlp
PDF Code
defense arXiv Aug 17, 2025

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim, Jin Myung Kwak, Lama Alssum et al. · Microsoft Research · KAIST +5 more

Preserves LLM safety during fine-tuning via hyperparameter tuning and EMA momentum, cutting harmful responses from 16% to 5%

Transfer Learning Attack Prompt Injection nlp
PDF
benchmark arXiv Jan 9, 2025

On Measuring Unnoticeability of Graph Adversarial Attacks: Observations, New Measure, and Applications

Hyeonsoo Jo, Hyunjin Hwang, Fanchen Bu et al. · KAIST

Proposes HideNSeek, a learnable measure of graph-attack noticeability that outperforms 11 baselines at identifying adversarial edges on GNNs

Input Manipulation Attack graph
PDF Code