Latest papers

4 papers
attack arXiv Jan 23, 2026

Persona Jailbreaking in Large Language Models

Jivnesh Sandhan, Fei Cheng, Tushar Sandhan et al. · Kyoto University · Indian Institute of Technology Kanpur

Black-box attack gradually hijacks LLM personas via adversarial conversational history, bypassing guardrails across 8 LLMs

Prompt Injection · nlp
PDF Code
defense arXiv Jan 9, 2026

Can We Trust LLM Detectors?

Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng et al. · Kyoto University · Indian Institute of Technology Kanpur

Exposes the brittleness of LLM text detectors under domain shift; proposes a supervised contrastive learning framework for robust AI-text detection (sketch below)

Output Integrity Attack · nlp
PDF Code
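
A minimal sketch of the kind of supervised contrastive objective the summary refers to, in PyTorch: same-label texts (human-written vs. AI-generated) are pulled together in embedding space and different-label texts pushed apart. The encoder, batch construction, and hyperparameters here are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: SupCon-style loss over text embeddings for AI-text detection.
# The embedding dimension, temperature, and label scheme are assumptions.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull same-label texts together and push different-label texts apart."""
    z = F.normalize(embeddings, dim=1)                 # unit-norm embeddings
    sim = z @ z.T / temperature                        # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-probability over each anchor's positives
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()

# toy usage: 4 text embeddings, labels 0 = human-written, 1 = AI-generated
emb = torch.randn(4, 768)
lab = torch.tensor([0, 0, 1, 1])
print(supervised_contrastive_loss(emb, lab))
```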
benchmark arXiv Jan 4, 2026

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

Junyu Liu, Zirui Li, Qian Niu et al. · Kyoto University · Hohai University +3 more

Benchmarks 27 LLMs against 50K+ multi-turn medical jailbreak conversations in Japanese, finding that fine-tuned medical models are the most vulnerable (evaluation sketch below)

Prompt Injection · nlp
PDF
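
For context, a minimal sketch of how a multi-turn jailbreak benchmark of this kind can be scored: each conversation's user turns are replayed against the model under test, and a conversation counts as a jailbreak if any reply is judged unsafe. The data format, the `generate` callable, and the `is_unsafe` judge are assumptions; JMedEthicBench's actual pipeline and scoring may differ.

```python
# Hedged sketch of a multi-turn safety evaluation loop; all interfaces are
# placeholders, not the benchmark's real API.
from typing import Callable

def run_multiturn_eval(conversations: list[list[str]],
                       generate: Callable[[list[dict]], str],
                       is_unsafe: Callable[[str], bool]) -> float:
    """Each conversation is a list of user turns; the model's reply to every
    turn is appended to the history before the next turn is sent. Returns the
    fraction of conversations eliciting at least one unsafe reply."""
    jailbroken = 0
    for turns in conversations:
        history: list[dict] = []
        broke = False
        for user_turn in turns:
            history.append({"role": "user", "content": user_turn})
            reply = generate(history)            # model under test
            history.append({"role": "assistant", "content": reply})
            if is_unsafe(reply):                 # safety judge
                broke = True
        jailbroken += broke
    return jailbroken / max(len(conversations), 1)
```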
defense arXiv Sep 3, 2025

Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation

Kaoru Otsuka, Yuki Takezawa, Makoto Yamada · Okinawa Institute of Science and Technology · Kyoto University

Defends federated learning against Byzantine clients under partial participation via delayed momentum aggregation, which dilutes malicious updates (sketch below)

Data Poisoning Attack · federated-learning
PDF
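
A minimal server-side sketch of the delayed momentum aggregation idea: the server keeps one momentum buffer per client, refreshes buffers only for clients that participate in the round, and robustly aggregates over all (possibly stale) buffers, so a Byzantine client's one-round update is diluted by history. The momentum coefficient and the coordinate-wise median aggregator are illustrative assumptions, not necessarily the paper's exact algorithm.

```python
# Hedged sketch of delayed momentum aggregation with partial participation.
import numpy as np

class DelayedMomentumServer:
    def __init__(self, num_clients: int, dim: int, beta: float = 0.9):
        self.beta = beta
        # one momentum buffer per client; stale (delayed) entries are reused
        # for clients that do not participate in the current round
        self.momentum = np.zeros((num_clients, dim))

    def aggregate(self, round_updates: dict[int, np.ndarray]) -> np.ndarray:
        """round_updates maps participating client id -> local update.
        Returns the robust aggregate used to update the global model."""
        for cid, g in round_updates.items():
            # refresh momentum only for participating clients
            self.momentum[cid] = self.beta * self.momentum[cid] + (1 - self.beta) * g
        # aggregate over all clients' (possibly delayed) momentum, so a single
        # round of malicious updates is diluted by historical information
        return np.median(self.momentum, axis=0)

# toy usage: 5 clients, 3-dimensional model, client 4 sends a poisoned update
server = DelayedMomentumServer(num_clients=5, dim=3)
honest = {i: np.ones(3) * 0.1 for i in range(3)}   # clients 0-2 participate
poisoned = {4: np.ones(3) * 100.0}                  # Byzantine client 4
step = server.aggregate({**honest, **poisoned})
print(step)  # median across momentum buffers suppresses the outlier
```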