A Novel Tagging Augmented LDA Model for Clustering

Yi Zhao (School of Computer Science, Wuhan University, Wuhan, China), Yu Qiao (School of Computer Science, Wuhan University, Wuhan, China) and Keqing He (School of Computer Science, Wuhan University, Wuhan, China)
Copyright: © 2019 | Pages: 19
DOI: 10.4018/IJWSR.2019070104


Clustering has become an increasingly important task in the analysis of large documents. Clustering aims to organize these documents and to facilitate better search and knowledge extraction. Most existing clustering methods that use user-generated tags consider only the positive influence of those tags on automatic clustering performance. The authors argue that not all user-generated tags provide useful information for clustering. In this article, the authors propose a new solution for clustering, named HRT-LDA (High Representation Tags Latent Dirichlet Allocation), which considers the effects of different tags on clustering performance. To this end, the authors apply a tag filtering strategy and a tag appending strategy based on transfer learning, Word2vec, TF-IDF and semantic computing. Extensive experiments on real-world datasets demonstrate that HRT-LDA outperforms the state-of-the-art tagging augmented LDA methods for clustering.


The explosive growth of information on the Web has resulted in a sharp increase in both the type and quantity of data, which has greatly limited the accuracy and efficiency of data mining and knowledge discovery. Clustering is a major approach to addressing this challenge. The goal of clustering is to divide the acquired data into several classes according to given principles. Data in the same class should share general features with respect to the classifying attributes, which helps overcome the disadvantages of centralized data storage and improves the working efficiency of a database. Automatically clustering data into semantic groups promises improved knowledge sharing and inquiry. In this area, it is common for data to be clustered and retrieved by utilizing the information in tags. Traditional clustering methods consider only the positive influence of tags (Tian, He, Wang, Sun, & Xu, 2015; Chae, Park, Park, Yeo, & Shi, 2016). However, when a document has many tags, the document may belong to several different topics. If all such tags are used for clustering, performance will be poor.

Therefore, despite their effectiveness, we argue that existing clustering methods that use a tagging augmented model suffer from two limitations:

  • Lack of consideration of noisy tags: Each tag has its own semantics and context, and more importantly, strong relationships exist between tags and knowledge. However, existing models have largely ignored the fact that noisy tags cannot sufficiently represent the knowledge;

  • Missing relevant tags: Tagging augmented clustering methods usually consider only user-generated tags. Other words extracted from the document may represent the document better than these tags do.

To address the above limitations in clustering, we propose a new solution, advanced by LDA, named Highly Relevant Tags-Latent Dirichlet Allocation (HRT-LDA). Specifically, we use the tag filtering and appending strategies provided by Word2vec, TF-IDF, semantic computing and degree of representation (DR) as preprocessing methods. We then design a new tag list for documents to be used in LDA topic training. Moreover, to incorporate knowledge sharing and inquiry, we discuss the application of the HRT-LDA model in knowledge recommendation and other scenarios.
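The filtering and appending idea can be illustrated with a minimal sketch. It is not the authors' implementation: it uses TF-IDF alone as a stand-in relevance score and omits the Word2vec, semantic computing and degree of representation (DR) steps entirely. The function names `tfidf_scores` and `build_tag_list`, and the parameters `min_score` and `top_k`, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def build_tag_list(doc_scores, user_tags, min_score=0.0, top_k=2):
    """Filter weakly supported user tags, then append top-scoring terms.

    Assumption: TF-IDF is the only relevance signal used here.
    """
    # Filtering: keep a user tag only if it scores above min_score.
    kept = [t for t in user_tags if doc_scores.get(t, 0.0) > min_score]
    # Appending: add the highest-scoring terms that are not already tags.
    ranked = sorted(doc_scores, key=lambda t: (-doc_scores[t], t))
    appended = [
        t for t in ranked
        if t not in kept and doc_scores[t] > min_score
    ][:top_k]
    return kept + appended


# Toy corpus: three tokenized documents (made-up data for illustration).
docs = [
    ["weather", "service", "forecast", "api", "weather"],
    ["map", "service", "route", "api"],
    ["payment", "service", "card", "api"],
]
scores = tfidf_scores(docs)
new_tags = build_tag_list(scores[0], ["weather", "fun", "api"])
print(new_tags)
```

In this toy corpus the tags "fun" and "api" are dropped for the first document because they carry no distinguishing weight there, while "forecast" is appended from the document text. The resulting tag list would then be supplied alongside the document for LDA topic training.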

To summarize, the main contributions of this work are:

  • A novel tagging augmented LDA model is presented that considers both the positive and negative effects of user-generated tags on the clustering results;

  • Appending tags through transfer learning captures more of the important feature tags of documents, which improves clustering performance and, to some extent, alleviates the cold start problem (when there are no original tags, or only a few);

  • Extensive experiments on real-world datasets show that our method outperforms several existing tagging augmented methods.

The remainder of this paper is organized as follows: We begin by reviewing the existing work in this area. Then we present the HRT-LDA approach. After that, we evaluate the performance of the HRT-LDA approach in comparison with existing work. The conclusions of this study and our future work are summarized at the end of this paper.


With the rapid development of big data and cloud computing, data mining has attracted significant attention recently. Moreover, the publication of a large number of research papers demonstrates that clustering methods (Rego et al., 2013; Xu, Yang, & Ma, 2011; Schenk & Lungu, 2013) are effective approaches for enhancing the performance of knowledge integration. In this article, we focus on clustering methods that use a tagging augmented model.

  • 1. Tagging augmented clustering methods which do not consider the negative effect of noisy tags.
