survey 2025

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

Chi Zhang ¹, Changjia Zhu ¹, Junjie Xiong ², Xiaoran Xu ¹, Lingyao Li ¹, Yao Liu ¹, Zhuo Lu ¹

¹ University of South Florida

² Missouri University of Science and Technology

0 citations

Published on arXiv

2508.05775

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Identifies a dual trajectory where LLMs are simultaneously sources of harmful content and promising tools for safety, while highlighting critical limitations in current evaluation methodologies for LLM safety.

Large Language Models (LLMs) have revolutionized content creation across digital platforms, offering unprecedented capabilities in natural language generation and understanding. These models enable beneficial applications such as content generation, question and answering (Q&A), programming, and code reasoning. Meanwhile, they also pose serious risks by inadvertently or intentionally producing toxic, offensive, or biased content. This dual role of LLMs, both as powerful tools for solving real-world problems and as potential sources of harmful language, presents a pressing sociotechnical challenge. In this survey, we systematically review recent studies spanning unintentional toxicity, adversarial jailbreaking attacks, and content moderation techniques. We propose a unified taxonomy of LLM-related harms and defenses, analyze emerging multimodal and LLM-assisted jailbreak strategies, and assess mitigation efforts, including reinforcement learning with human feedback (RLHF), prompt engineering, and safety alignment. Our synthesis highlights the evolving landscape of LLM safety, identifies limitations in current evaluation methodologies, and outlines future research directions to guide the development of robust and ethically aligned language technologies.

Key Contributions

Unified taxonomy of LLM-related harms and defenses spanning unintentional toxicity, adversarial jailbreaks, and content moderation
Analysis of emerging multimodal and LLM-assisted jailbreak strategies
Assessment of mitigation efforts including RLHF, prompt engineering, and safety alignment, with identified gaps in current evaluation methodologies

🛡️ Threat Analysis

Details

Domains

nlpmultimodal

Model Types

llmvlm

Threat Tags

inference_timetargeteddigital

Applications

content moderationchatbottext generation

Read PDF arXiv

Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

The Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans?

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

SafeMT: Multi-turn Safety for Multimodal Language Models

Text is All You Need for Vision-Language Model Jailbreaking

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs