Latest papers

10 papers
attack · arXiv · Mar 25, 2026

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft · nlp
PDF
defense · arXiv · Jan 12, 2026

Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment

Haozhong Wang, Zhuo Li, Yibo Yang et al. · Jilin University

Defends LLM safety alignment during fine-tuning via Optimal Transport-based distributional reweighting away from harmful data

Transfer Learning Attack · Prompt Injection · nlp
PDF
defense · arXiv · Dec 21, 2025

Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection

Junjun Pan, Yixin Liu, Rui Miao et al. · Griffith University · Jilin University +1 more

Defends LLM multi-agent systems by detecting malicious agents using bi-level graph anomaly detection with token-level explainability

Excessive Agency · nlp · graph
1 citation · PDF
benchmark · arXiv · Dec 11, 2025

TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou et al. · National Taiwan University · National Yang Ming Chiao Tung University +1 more

Benchmark for evaluating MLLMs on interpretable deepfake detection across perception, detection, and hallucination dimensions

Output Integrity Attack · vision · audio · multimodal · nlp
PDF
attack · arXiv · Nov 26, 2025

TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He, Guanyu Hou, Hongwei Li et al. · University of Electronic Science and Technology of China · University of Manchester +3 more

Automated red-teaming framework crafts temporally-aware prompts to jailbreak T2V model safety filters, achieving 80%+ attack success rate

Prompt Injection · vision · nlp · generative · multimodal
PDF
defense · arXiv · Nov 7, 2025

Deep learning models are vulnerable, but adversarial examples are even more vulnerable

Jun Li, Yanwei Xu, Keran Li et al. · Jilin University of Finance and Economics · Center for Artificial Intelligence +1 more

Detects adversarial examples via sliding-window occlusion confidence entropy, achieving up to 96.5% detection on CIFAR-10 across nine attacks

Input Manipulation Attack · vision
PDF
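The occlusion-based detection idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the statistic computed here (Shannon entropy of the normalized confidence profile under a sliding occluder) and the toy classifier `toy_model` are assumptions chosen to make the sketch self-contained.

```python
import numpy as np

def occlusion_confidence_entropy(image, predict_proba, win=8, stride=8):
    """Slide an occluding patch over `image`, record the model's confidence
    in its original top class at each position, and return the Shannon
    entropy of the normalized confidence profile. The intuition: adversarial
    perturbations are fragile, so confidence collapses sharply under some
    occlusions, producing a distinctive profile compared to clean inputs."""
    base = predict_proba(image)
    cls = int(np.argmax(base))          # class predicted on the intact image
    confs = []
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            occluded = image.copy()
            occluded[y:y + win, x:x + win] = 0.0   # zero-valued occluder
            confs.append(predict_proba(occluded)[cls])
    p = np.asarray(confs, dtype=float)
    p = p / p.sum()                     # normalize confidences into a distribution
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy stand-in classifier: softmax over the mean intensities of the two
# image halves (purely illustrative; the paper evaluates CIFAR-10 CNNs).
def toy_model(img):
    logits = np.array([img[:, :16].mean(), img[:, 16:].mean()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

img = np.ones((32, 32), dtype=float)
score = occlusion_confidence_entropy(img, toy_model)
```

In practice a detector would threshold this score, with the threshold calibrated on clean validation images.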
attack · arXiv · Nov 6, 2025

P-MIA: A Profiled-Based Membership Inference Attack on Cognitive Diagnosis Models

Mingliang Hou, Yinuo Wang, Teng Guo et al. · Jilin University · TAL Education Group +1 more

Grey-box membership inference attack on educational cognitive diagnosis models exploiting exposed knowledge state visualizations

Membership Inference Attack · tabular
1 citation · PDF
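The core exploit, scoring how well an exposed knowledge-state vector explains a student's recorded answers, can be sketched as below. The paper's actual profiling procedure is not reproduced here; the IRT-style logistic response model, the helper names, and the fixed threshold are all stand-in assumptions.

```python
import numpy as np

def membership_score(knowledge_state, responses, difficulties):
    """Score how well an exposed knowledge-state vector explains a student's
    observed 0/1 responses, via the mean log-likelihood under a simple
    IRT-style logistic response model. Training-set members tend to be fit
    more closely by the diagnosis model, so higher scores suggest membership."""
    logits = knowledge_state - difficulties          # ability minus item difficulty
    p = 1.0 / (1.0 + np.exp(-logits))                # predicted P(correct) per item
    eps = 1e-12
    ll = responses * np.log(p + eps) + (1 - responses) * np.log(1 - p + eps)
    return float(ll.mean())

def infer_membership(score, threshold=-0.5):
    """Declare membership when the responses are explained unusually well."""
    return score > threshold

# Synthetic example: a knowledge state that fits the response data well.
rng = np.random.default_rng(0)
difficulties = rng.normal(size=20)
state = difficulties + 2.0
responses = (rng.random(20) < 1.0 / (1.0 + np.exp(-(state - difficulties)))).astype(int)
s = membership_score(state, responses, difficulties)
member = infer_membership(s)
```

A real attack would calibrate the threshold using shadow models or known non-members rather than a fixed constant.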
defense arXiv Oct 16, 2025 · Oct 2025

A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

Bingjie Zhang, Yibo Yang, Zhe Ren et al. · Jilin University · King Abdullah University of Science and Technology +1 more

Defends LLM safety alignment during fine-tuning by freezing safety-relevant weight subspaces and projecting adapter updates into a harmful-resistant null space

Transfer Learning Attack · Prompt Injection · nlp
3 citations · PDF
defense · arXiv · Aug 13, 2025

CLIP-Flow: A Universal Discriminator for AI-Generated Images Inspired by Anomaly Detection

Zhipeng Yuan, Kai Wang, Weize Quan et al. · Jilin University · University of Grenoble Alpes +1 more

Anomaly-detection-inspired CLIP-based normalizing flow detector identifies AI-generated images without seeing any during training

Output Integrity Attack · vision
PDF
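The anomaly-detection framing, fit a density model only on real-image features and flag low-likelihood inputs as generated, can be sketched as follows. A diagonal Gaussian stands in for the paper's normalizing flow, and random vectors stand in for CLIP embeddings; the class name and quantile threshold are assumptions of this sketch.

```python
import numpy as np

class GaussianDensityDetector:
    """One-class detector: fit a diagonal Gaussian to feature vectors of real
    images, then flag inputs whose log-density falls below a low percentile
    of the training scores. No AI-generated images are needed for fitting,
    mirroring the paper's train-on-real-only setup."""

    def fit(self, feats, quantile=1.0):
        self.mu = feats.mean(axis=0)
        self.var = feats.var(axis=0) + 1e-6          # avoid division by zero
        scores = self.log_density(feats)
        self.threshold = np.percentile(scores, quantile)
        return self

    def log_density(self, feats):
        z = (feats - self.mu) ** 2 / self.var
        return -0.5 * (z + np.log(2 * np.pi * self.var)).sum(axis=1)

    def is_real(self, feats):
        return self.log_density(feats) >= self.threshold

# Synthetic stand-ins for CLIP features of real vs. generated images.
rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(500, 16))
fake = rng.normal(3.0, 1.0, size=(50, 16))   # distribution shift -> low density
det = GaussianDensityDetector().fit(real)
```

A normalizing flow plays the same role as the Gaussian here but can model much richer feature densities, which is why the paper generalizes to unseen generators.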
defense · arXiv · Aug 11, 2025

BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

Rui Miao, Yixin Liu, Yili Wang et al. · Jilin University · Griffith University +1 more

Unsupervised malicious-agent detector for LLM multi-agent systems using contrastive learning without requiring labeled attack data

Excessive Agency · Prompt Injection · nlp · graph
PDF · Code