Introduction
The global spread of fake news has evolved into a critical societal challenge, distorting public discourse and threatening political stability through intentionally fabricated information designed to manipulate opinions (Aïmeur et al., 2023; Shu et al., 2017). Social media platforms exacerbate this issue by enabling the rapid dissemination of unverified content, particularly through multi-modal formats combining text and visuals that enhance perceived credibility (Li et al., 2024; Tanwar & Sharma, 2020). This complexity intensifies when misleading images accompany false claims, creating semantic gaps between textual and visual elements that complicate detection efforts (Arowolo et al., 2023; Wang et al., 2024).
The Semantic Web offers technical solutions through structured knowledge representations that enable cross-modal verification. By linking unstructured content with background knowledge via ontologies and linked data, it facilitates semantic correlation analysis between textual claims and visual evidence (Wei et al., 2022). However, effective multi-modal detection faces dual challenges: technical difficulties in fusing heterogeneous text-image features, and the need for robust consistency verification across modalities (Zeng et al., 2023).
Fake news impacts multiple domains, from influencing electoral processes to distorting health-related decisions. For example, vaccine misinformation during the COVID-19 pandemic significantly hampered global responses (Kubin & Von Sikorski, 2021). Semantic Web technologies could strengthen credibility assessment through knowledge graph-based reasoning and enhanced fact-checking mechanisms for multi-modal content (Kim & Kim, 2020; Naeem et al., 2021; Rocha et al., 2021). However, their implementation requires overcoming current technical limitations in multi-modal fusion and semantic alignment.
As the issue of fake news escalates, researchers and engineers are increasingly focused on developing efficient methods to identify and mitigate its spread. Intelligent fake news detection systems usually consist of several key modules: data collection, feature extraction, cross-modal fusion, and classification. These systems collect textual, visual, and user interaction data from social media platforms and use advanced deep learning models to perform multi-modal feature extraction. For example, Transformer-based architectures such as bi-directional encoder representations from transformers (BERT) and contrastive language-image pretraining (CLIP) can extract deep semantic information from text and images, while graph neural networks are used to analyze user propagation behaviors and capture propagation patterns in social networks.
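The modular pipeline described above (feature extraction, cross-modal fusion, classification) can be sketched as follows. This is a deliberately simplified illustration, not any cited system: the "encoders" are stand-in random projections in place of BERT/CLIP, and the classifier is an untrained linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(token_ids, dim=16):
    """Stub text encoder: mean of (random) token embeddings,
    standing in for a BERT-style sentence representation."""
    table = rng.standard_normal((1000, dim))
    return table[np.asarray(token_ids) % 1000].mean(axis=0)

def encode_image(pixels, dim=16):
    """Stub image encoder: random projection of flattened pixels,
    standing in for a CLIP-style visual feature extractor."""
    flat = np.asarray(pixels, dtype=float).ravel()
    proj = rng.standard_normal((flat.size, dim))
    return flat @ proj / flat.size

def fuse_and_classify(text_vec, image_vec, w=None):
    """Concatenation fusion followed by a sigmoid linear classifier."""
    fused = np.concatenate([text_vec, image_vec])
    if w is None:                      # untrained weights for illustration
        w = rng.standard_normal(fused.size)
    return 1.0 / (1.0 + np.exp(-fused @ w))   # fake-news score in (0, 1)

text_vec = encode_text([101, 2054, 2003, 102])   # toy token IDs
image_vec = encode_image(rng.random((8, 8)))     # toy 8x8 "image"
score = fuse_and_classify(text_vec, image_vec)
```

In a real system each stub would be replaced by a pretrained model, and the fusion and classification weights would be learned end-to-end; the point here is only the flow of data between modules.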
To improve detection accuracy, some intelligent fake news detection systems also incorporate cross-modal consistency checking, ensuring that text and images match at the semantic level. For example, L. Wang et al. (2023) proposed a framework based on cross-modal contrastive learning that embeds image and text features into a shared space using a dual-encoder model. In particular, the approach introduces a cross-modal consistency task that pulls semantically similar images and texts closer together in the shared space, and it uses an attention mechanism to improve the effectiveness of multi-modal fusion, thus performing well on fake news detection tasks. However, the approach focuses on cross-modal feature alignment and fusion, falling short on deeper interactions between modalities and on user behavioral features.
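The dual-encoder contrastive objective can be illustrated with a minimal numpy sketch. This assumes text and image features are already embedded (the inputs below are toy vectors, not outputs of the trained encoders from the cited framework); it shows only the InfoNCE-style loss that pulls matched text-image pairs together in the shared space.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere of the shared space."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """InfoNCE-style loss: the matched pair (i, i) should score
    highest among all candidates in the batch."""
    t = l2_normalize(text_emb)
    v = l2_normalize(image_emb)
    logits = t @ v.T / temperature          # cosine similarity matrix
    labels = np.arange(len(logits))
    # cross-entropy over rows (text -> image retrieval direction)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()

# Aligned pairs yield a lower loss than mismatched (shuffled) ones.
emb = np.eye(4)                              # 4 matched text/image pairs
aligned = contrastive_loss(emb, emb)
mismatched = contrastive_loss(emb, emb[::-1])
```

Minimizing this loss over a training set is what "brings semantically similar images and texts closer together" in the shared space; at inference time, a low text-image similarity then signals a possible semantic inconsistency.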
Yang et al. (2023) proposed a Transformer-based model that uses modality-specific encoders to derive features for text and images, fusing them through a cross-modal attention mechanism to improve integration. The key innovation of the model is a multi-modal attention module that captures cross-modal correlations, making the model more accurate at detecting fake news and text-image inconsistencies. However, although it adopts the Transformer structure, the method may not capture all the contextual dependencies within the text or the high-level feature information in the image, which can lead to insufficient uni-modal feature extraction.
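The cross-modal attention idea can be sketched as scaled dot-product attention in which text tokens act as queries over image region features. The shapes, single-head form, and concatenation fusion below are illustrative assumptions, not the exact module from Yang et al. (2023).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, image_feats):
    """Scaled dot-product attention: queries from text tokens,
    keys/values from image region features."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)    # (tokens, regions)
    weights = softmax(scores, axis=-1)                  # attention over regions
    attended = weights @ image_feats                    # image-informed text
    # simple fusion: each token keeps its own features plus what it attended to
    return np.concatenate([text_feats, attended], axis=-1)

rng = np.random.default_rng(1)
text_feats = rng.standard_normal((6, 32))    # 6 text tokens, dim 32
image_feats = rng.standard_normal((9, 32))   # 9 image regions, dim 32
fused = cross_modal_attention(text_feats, image_feats)
```

Because each text token attends over all image regions, the fused representation can surface token-level mismatches between a claim and its accompanying image, which is the correlation signal the attention module exploits.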