Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

1University of Liverpool    2University of Exeter    3The Chinese University of Hong Kong, Shenzhen
* Equal Contribution

Figure 1. Framework comparison. Existing non-Omni methods handle only single-modality inputs. Omni-Fake-R1, built on a unified omni MLLM, supports four modalities (Image, Video, Audio, AV-TH) with joint detection–localization–explanation.

At a Glance

  • 1M+ training samples (Omni-Fake-Set)
  • 100K+ OOD benchmark samples (Omni-Fake-OOD)
  • 4 modalities: Image / Audio / Video / AV-TH
  • 3-in-1 joint protocol: Detection + Localization + Explanation

Abstract

Multimodal deepfakes proliferating on social media threaten content authenticity and information integrity and pose growing challenges for digital forensics. Existing benchmarks are constrained by single-modality scope, simplified manipulations, or unrealistic data distributions, which limits their ability to assess real-world robustness. To address these limitations, we present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality training dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 100K+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection–localization–explanation protocol.

On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization results, and natural-language explanations. Extensive experiments show that Omni-Fake-R1 achieves significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines.

Multimodal Coverage

Four Modalities, One Unified Framework

Overall view of four modalities in the unified Omni-Fake framework
  • 🖼️ Image: Face swap, inpainting, generation, and tampered image detection with pixel-level localization
  • 🎵 Audio: Speech synthesis, voice cloning, and audio deepfake detection with temporal interval localization
  • 🎬 Video: Face reenactment, video generation, and manipulation detection with frame-level analysis
  • 🗣️ Audio-Video TH: Talking head forgery with cross-modal audio-visual consistency verification


Key Contributions

  • Omni-Fake-Set — A large-scale, high-quality unified dataset with 1M+ samples spanning four modalities (image, audio, video, AV talking head), built from diverse real-world social media sources and modern generators.
  • Omni-Fake-OOD — An out-of-distribution benchmark with 100K+ samples, intentionally excluded from training, to rigorously evaluate cross-generator and cross-modality generalization.
  • Joint Detection–Localization–Explanation Protocol — A unified evaluation framework that goes beyond binary classification to include fine-grained localization (pixel masks / temporal intervals) and natural-language explanations for trustworthy forensic analysis; a scoring sketch follows this list.
  • Omni-Fake-R1 — A reinforcement-learning-driven multimodal detector built on a unified omni MLLM that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and explanations.
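
As one concrete illustration of how the joint protocol can be scored, the sketch below combines binary detection accuracy with pixel-mask IoU (images/video) and temporal-interval IoU (audio). The metric choices and function names here are illustrative assumptions, not the benchmark's official evaluation code.

```python
# Illustrative scoring helpers for the joint detection-localization protocol.
# NOTE: the metric choices below are assumptions for illustration only; they are
# not the official Omni-Fake evaluation code.
import numpy as np

def detection_accuracy(pred_labels, gt_labels):
    """Binary real/fake accuracy over a set of samples."""
    pred, gt = np.asarray(pred_labels), np.asarray(gt_labels)
    return float((pred == gt).mean())

def mask_iou(pred_mask, gt_mask):
    """Pixel-level IoU between predicted and ground-truth tampered-region masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                      # both masks empty, e.g. an authentic sample
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def temporal_iou(pred_interval, gt_interval):
    """IoU between predicted and ground-truth manipulated time spans (in seconds)."""
    (ps, pe), (gs, ge) = pred_interval, gt_interval
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 1.0
```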

Omni-Fake Dataset Overview

Omni-Fake dataset overview showing four modalities and data statistics
Omni-Fake-Set contains 1M+ high-quality samples across four modalities with fine-grained annotations including binary labels, tampered region masks, temporal intervals, and human-written explanations. The dataset covers a wide range of modern generators (Sora, Kling, WanX, etc.) and real-world social media post-processing pipelines.
Omni-Fake-OOD provides 100K+ out-of-distribution samples from held-out generators and platforms, enabling rigorous evaluation of model robustness against unseen manipulation techniques and distribution shifts.
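
For concreteness, a single annotation record in this style might look roughly like the hypothetical sketch below; the field names are assumptions chosen only to mirror the annotation types listed above (binary label, tampered-region mask, temporal interval, explanation), not the dataset's actual schema.

```python
# Hypothetical per-sample annotation record mirroring the annotation types
# described above. Field names are illustrative, not the dataset's actual schema.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class OmniFakeSample:
    sample_id: str
    modality: str                      # "image" | "audio" | "video" | "av_th"
    media_path: str                    # path to the image / audio / video file
    label: int                         # 0 = real, 1 = fake
    generator: Optional[str] = None    # e.g. "Sora", "Kling", "WanX" for fake samples
    mask_path: Optional[str] = None    # tampered-region mask (image / video samples)
    fake_interval: Optional[Tuple[float, float]] = None  # manipulated span in seconds (audio)
    explanation: Optional[str] = None  # human-written rationale for the label
```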

Dataset Comparison with Existing Benchmarks

Omni-Fake provides comprehensive coverage that existing benchmarks lack.

Dataset comparison with existing benchmarks

Omni-Fake-R1: RL-Driven Multimodal Detector

Omni-Fake-R1 architecture overview

Omni-Fake-R1 is built on a unified omni MLLM backbone and employs reinforcement learning to adaptively integrate visual and auditory cues. The model outputs three structured components:

  • Detection: Binary real/fake classification with confidence scores across all four modalities.
  • Localization: Fine-grained tampered region masks for images/videos and temporal intervals for audio.
  • Explanation: Natural-language explanations describing the nature and evidence of detected manipulations.
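
To make the three-part output concrete, the minimal sketch below shows how such a structured response could be represented and parsed. The JSON keys are assumptions made for illustration; they are not Omni-Fake-R1's documented output format.

```python
# Minimal sketch of parsing a structured detection / localization / explanation
# response. The JSON keys are assumed for illustration only; they are not the
# documented Omni-Fake-R1 output format.
import json

EXAMPLE_RESPONSE = """
{
  "verdict": "fake",
  "confidence": 0.93,
  "localization": {"type": "temporal", "intervals": [[2.4, 5.1]]},
  "explanation": "Lip motion is desynchronized from the audio between 2.4s and 5.1s."
}
"""

def parse_response(raw: str) -> dict:
    """Parse a structured response and normalize it into a flat record."""
    data = json.loads(raw)
    return {
        "is_fake": data["verdict"] == "fake",
        "confidence": float(data["confidence"]),
        "localization": data.get("localization"),
        "explanation": data.get("explanation", ""),
    }

if __name__ == "__main__":
    print(parse_response(EXAMPLE_RESPONSE))
```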

Experimental Examples

Omni-Fake-R1 shows significant gains in detection accuracy, cross-modal generalization, and explainability.


Qualitative Results

Below we summarize Omni-Fake-R1's performance on the large-scale Omni-Fake-Set (in-distribution) and on the held-out Omni-Fake-OOD benchmark, illustrating detection behavior under both standard training coverage and out-of-distribution generalization.

BibTeX

@article{li2025omnifake,
  title={Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection},
  author={Li, Tianxiao and Huang, Zhenglin and Wen, Haiquan and He, Yiwei and Li, Xinze and Zhu, Bingyu and Duan, Wuhui and Chen, Congang and Fu, Zeyu and Dong, Yi and Wu, Baoyuan and Cheng, Guangliang},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}