Jirui Yang

h-index: 2 · 25 citations · 14 papers (total)

Papers in Database (2)

defense arXiv Dec 18, 2025 · Dec 2025

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models

Jirui Yang, Hengqi Guo, Zhihui Lu et al. · Fudan University · Ant Group +1 more

Defends LLMs against harmful prompts by comparing refusal vs. agreement prefix log-probabilities with near-zero inference overhead

Prompt Injection nlp


benchmark arXiv Nov 18, 2025 · Nov 2025

N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator

Zheyu Lin, Jirui Yang, Yukui Qiu et al. · University of California · Fudan University +1 more

Proposes a latent-trajectory metric to benchmark LLM jailbreak robustness without text generation, matching red-teaming rankings at under 1% of the compute cost

Prompt Injection nlp
