Introduction
The explosive growth of information on the Web has sharply increased both the types and the quantity of data, which has greatly limited the accuracy and efficiency of data mining and knowledge discovery. Clustering is a major approach to addressing this challenge. The goal of clustering is to divide the acquired data into several classes according to given principles. Data in the same class should share general features with respect to the classifying attributes; grouping data in this way helps overcome the disadvantages of centralized data storage and improves the working efficiency of a database. Automatically clustering data into semantic groups promises improved knowledge sharing and inquiry. In this area, it is common for data to be clustered and retrieved using the information in tags. Traditional clustering methods consider only the positive influence of tags (Tian, He, Wang, Sun, & Xu, 2015; Chae, Park, Park, Yeo, & Shi, 2016). However, a document with many tags may belong to several different topics. If all such tags are used for clustering, performance will be poor.
Despite their effectiveness, we argue that existing clustering approaches that use a tagging augmented model suffer from two limitations:
- Lack of consideration of noisy tags: Each tag has its own semantics and context and, more importantly, strong relationships exist between tags and knowledge. However, existing models have largely ignored the fact that noisy tags cannot sufficiently represent the knowledge;
- Missing relevant tags: Tagging augmented clustering methods usually consider only user-generated tags. Other words extracted from the document may represent it better than these tags do.
To address the above limitations in clustering, we propose a new solution built on Latent Dirichlet Allocation (LDA), named Highly Relevant Tags-Latent Dirichlet Allocation (HRT-LDA). Specifically, we use tag filtering and appending strategies based on Word2vec, TF-IDF, semantic computing, and degree of representation (DR) as preprocessing steps. We then construct a new tag list for each document to be used in LDA topic training. Moreover, to incorporate knowledge sharing and inquiry, we discuss the application of the HRT-LDA model in knowledge recommendation and other scenarios.
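The idea of filtering noisy tags and appending relevant ones can be illustrated with a minimal sketch. It is not the HRT-LDA implementation: it uses TF-IDF alone as a stand-in relevance score, whereas the approach described above also draws on Word2vec, semantic computing, and DR. The function names (`tfidf_scores`, `refine_tags`), the parameters (`keep_threshold`, `n_append`), and the toy corpus are all hypothetical and introduced only for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    all_scores = []
    for doc in docs:
        counts = Counter(doc)
        all_scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return all_scores


def refine_tags(user_tags, scores, keep_threshold, n_append):
    """Filter weakly supported tags, then append top-scoring document terms."""
    # Filtering: a tag that never occurs in the document scores 0.0.
    kept = [tag for tag in user_tags if scores.get(tag, 0.0) >= keep_threshold]
    # Appending: rank the remaining terms by score, ties broken alphabetically.
    candidates = sorted(
        (term for term in scores if term not in kept),
        key=lambda term: (-scores[term], term),
    )
    return kept + candidates[:n_append]


# Toy corpus (hypothetical data): three tokenized documents.
docs = [
    ["topic", "model", "lda", "topic", "cluster"],
    ["image", "pixel", "camera", "image", "lens"],
    ["topic", "cluster", "tag", "tag", "noise"],
]
scores = tfidf_scores(docs)

# The first document carries one relevant tag ("lda") and one noisy tag ("funny").
new_tags = refine_tags(["lda", "funny"], scores[0], keep_threshold=0.1, n_append=1)
print(new_tags)
```

In this sketch the noisy tag is dropped because it has no support in the document text, and the highest-scoring untagged term is appended. The resulting tag list is the kind of input that would then be passed to LDA topic training.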
To summarize, the main contributions of this work are:
- A novel tagging augmented LDA model is presented that considers both the positive and negative effects of user-generated tags on clustering results;
- Appending tags through transfer learning captures more of the important feature tags of documents, which improves clustering performance and, to some extent, alleviates the cold start problem (when there are no, or only a few, original tags);
- Extensive experiments on real-world datasets show that our method outperforms several existing tagging augmented methods.
The remainder of this paper is organized as follows: We begin by reviewing existing work in this area. Then we present the HRT-LDA approach. After that, we evaluate the performance of the HRT-LDA approach against existing work. The conclusions of this study and our future work are summarized at the end of this paper.
With the rapid development of big data and cloud computing, data mining has attracted significant attention recently. Moreover, a large number of published research papers demonstrate that clustering methods (Rego et al., 2013; Xu, Yang, & Ma, 2011; Schenk & Lungu, 2013) are effective approaches for enhancing the performance of knowledge integration. In this article, we focus on clustering methods that use a tagging augmented model.