defense 2025

LogicLens: Visual-Logical Co-Reasoning for Text-Centric Forgery Analysis

Fanwei Zeng 1, Changtao Miao 1, Jing Huang 1, Zhiya Tan 2, Shutao Gong 1, Xiaoming Yu 1, Yang Wang 1, Huazhe Tan 1, Weibin Yao 1, Jianshu Li 1

1 citation · 47 references · arXiv


Published on arXiv

2512.21482

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

In zero-shot evaluation on T-IC13, LogicLens surpasses the specialized baseline by 41.4% and GPT-4o by 23.4% in macro-average F1 score.

LogicLens (Cross-Cues-aware Chain of Thought / CCT)

Novel technique introduced


Sophisticated text-centric forgeries, fueled by rapid AIGC advancements, pose a significant threat to societal security and information authenticity. Current methods for text-centric forgery analysis are often limited to coarse-grained visual analysis and lack the capacity for sophisticated reasoning. Moreover, they typically treat detection, grounding, and explanation as discrete sub-tasks, overlooking their intrinsic relationships for holistic performance enhancement. To address these challenges, we introduce LogicLens, a unified framework for Visual-Textual Co-reasoning that reformulates these objectives into a joint task. The deep reasoning of LogicLens is powered by our novel Cross-Cues-aware Chain of Thought (CCT) mechanism, which iteratively cross-validates visual cues against textual logic. To ensure robust alignment across all tasks, we further propose a weighted multi-task reward function for GRPO-based optimization. Complementing this framework, we first designed the PR$^2$ (Perceiver, Reasoner, Reviewer) pipeline, a hierarchical and iterative multi-agent system that generates high-quality, cognitively-aligned annotations. Then, we constructed RealText, a diverse dataset comprising 5,397 images with fine-grained annotations, including textual explanations, pixel-level segmentation, and authenticity labels for model training. Extensive experiments demonstrate the superiority of LogicLens across multiple benchmarks. In a zero-shot evaluation on T-IC13, it surpasses the specialized framework by 41.4% and GPT-4o by 23.4% in macro-average F1 score. Moreover, on the challenging dense-text T-SROIE dataset, it establishes a significant lead over other MLLM-based methods in mF1, CSS, and the macro-average F1. Our dataset, model, and code will be made publicly available.
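The abstract mentions a weighted multi-task reward used for GRPO-based optimization but does not give its form. The following is a minimal sketch of how such a reward could be combined and turned into group-relative advantages, not the authors' implementation. The task names, the weights in `WEIGHTS`, the [0, 1] score range, and both function names are assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical weights: the source does not state the actual values.
WEIGHTS = {"detection": 0.4, "grounding": 0.3, "explanation": 0.3}


def combined_reward(scores, weights=WEIGHTS):
    """Weighted sum of per-task scores, each assumed to lie in [0, 1]."""
    return sum(weights[task] * scores[task] for task in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical responses sampled for the same input image.
group = [
    {"detection": 1.0, "grounding": 0.8, "explanation": 0.6},
    {"detection": 1.0, "grounding": 0.2, "explanation": 0.5},
    {"detection": 0.0, "grounding": 0.0, "explanation": 0.1},
]
rewards = [combined_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a response that detects the forgery correctly but localizes it poorly still earns less than one that does both well, which is one way a single scalar reward can couple the three sub-tasks.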


Key Contributions

  • LogicLens: a unified visual-textual co-reasoning framework that jointly performs forgery detection, grounding, and explanation via a novel Cross-Cues-aware Chain of Thought (CCT) mechanism
  • PR² (Perceiver, Reasoner, Reviewer) hierarchical multi-agent annotation pipeline producing cognitively-aligned, fine-grained annotations for training
  • RealText dataset: 5,397 images with pixel-level segmentation, textual explanations, and authenticity labels for text-centric forgery analysis

🛡️ Threat Analysis

Output Integrity Attack

LogicLens is a detection architecture for AI-generated and AI-manipulated content, specifically text-centric image forgeries. It directly addresses output integrity and content authenticity, a prototypical ML09 use case. The paper introduces a new forensic reasoning mechanism (CCT), a new annotation pipeline (PR²), and a new training dataset (RealText) for verifying the authenticity of AI-manipulated image content.


Details

Domains
vision · nlp · multimodal
Model Types
vlm · transformer · multimodal
Threat Tags
inference_time
Datasets
RealText · T-IC13 · T-SROIE
Applications
text-centric image forgery detection · document authenticity verification · ai-generated content detection