Web Text Categorization Based on Statistical Merging Algorithm in Big Data Environment


Rujuan Wang (College of Humanities & Sciences of Northeast Normal University, Changchun, China) and Gang Wang (Northeast Normal University, Changchun, China)
Copyright: © 2019 | Pages: 16
DOI: 10.4018/IJACI.2019070102


In modern information technology, how to quickly, accurately, and comprehensively find the information that users really need has become a focus of research. In this article, a feature selection method based on a complex network is proposed for the structure and content characteristics of large-scale web text information. The preprocessed web text is converted into a complex network: the nodes of the network correspond to the entries (terms) in the text, and the edges correspond to the links between those entries. The degree of each node and its aggregation (clustering) coefficient are then used to select features. Second, text classification is studied from the point of view of data sampling, and a text classification method based on density statistics is proposed. This method uses not only the density information of the text feature set in the classification process but also statistical merging criteria to obtain the difference information of each text feature, and it achieves a better classification effect on large text collections.
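
The network construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the sliding-window co-occurrence rule used to create edges, the `window` parameter, and the function names `build_term_network`, `degree`, and `clustering_coefficient` are all assumptions, and the aggregation measure is read here as the standard local clustering coefficient.

```python
from collections import defaultdict
from itertools import combinations


def build_term_network(tokens, window=2):
    """Build an undirected term network from a token list.

    Assumed edge rule: two distinct terms are linked when they occur
    within `window` consecutive tokens of each other.
    """
    neighbors = defaultdict(set)
    for i, term in enumerate(tokens):
        neighbors[term]  # register the term as a node even if it gets no edges
        for other in tokens[i + 1 : i + window]:
            if other != term:
                neighbors[term].add(other)
                neighbors[other].add(term)
    return dict(neighbors)


def degree(network, term):
    """Number of distinct terms linked to `term`."""
    return len(network[term])


def clustering_coefficient(network, term):
    """Fraction of pairs of neighbors of `term` that are linked to each other."""
    nbrs = network[term]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in network[a])
    return 2.0 * links / (k * (k - 1))


tokens = ["web", "text", "classification", "web", "classification", "method"]
network = build_term_network(tokens)
for term in network:
    print(term, degree(network, term), round(clustering_coefficient(network, term), 3))
```

How the two measures would be combined into a single feature score is not stated in the abstract, so the sketch reports them separately.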

1. Introduction

The use of information technology (IT) has increased day by day, to the point that almost everything we do can now be done online on the spot. Information technology is any kind of software or tool for storing, retrieving, and sending information using a certain type of technology, such as computers, mobile phones, and computer networks. With IT, people can upload, retrieve, and store their information and collect it into Big Data. Because Big Data holds massive amounts of information through IT such as the internet, students can now study online, which is called e-Learning. As the tools provided by IT have continued to multiply, they have affected all aspects of our lives, particularly in academia. Because of their multifunctional nature, Big Data and e-Learning bring users both benefits and disadvantages: they affect our social skills, mental growth, and physical condition, and they carry the risk of invading our personal information (Internet of Things, n.d.). Web Semantics for Textual and Visual Information Retrieval is a pivotal reference source for the latest academic research on embedding and associating semantics with multimedia information to improve data retrieval techniques (Singh et al., 2017).

Data is the concrete form in which information is presented, and text data is the main source of the knowledge we acquire. Therefore, to meet users' needs for fast and accurate information acquisition, massive text data must be classified and managed effectively. Traditional text categorization and clustering techniques have many problems in dealing with this information, such as poor scalability, a lack of corpora, and inadequate classification accuracy.

In recent years, many text classification methods have been proposed. Liu Lu et al. (2013) proposed a clustering-based PU active text classification method that combines SVM active learning with an improved Rocchio classifier; the method improves the weight evaluation function and improves classification accuracy to a certain extent. Xu Li et al. (2012) introduced a genetic algorithm into an SVM text classifier, which reduced the number of misclassified texts to a certain extent. Dhar et al. (2018) proposed categorizing Bangla web text documents with a tf-idf-icf text analysis scheme, arguing that adding an Inverse Class Frequency (ICF) measure to the Term Frequency (TF) and Inverse Document Frequency (IDF) measures can yield better feature extraction for a language like Bangla. Ranjan (2017) proposed automatic text classification using a BPLion neural network and semantic word processing; the approach categorizes text using semantic keywords instead of the independent features of the keywords in the documents. Zhang Xiaofei et al. (2009) fused a clustering operation into the KNN text classification method to improve classification accuracy. Zhang Zhilin (2013) improved semi-supervised text classification by using Wikipedia knowledge, proposing a new similarity measure based on the semantic relevance between Wikipedia features and applying it to clustering-based classification. Zhu Jun et al. (2014) proposed an SVM-based method for gene/protein name extraction whose classification accuracy reached 71.9%. This method performs well on long text, but it cannot handle short-text classification with sparse feature words and highly unbalanced samples. It is clearly unable to meet the data classification needs of current network platforms.
There are also some clustering algorithms for short text, such as the dynamic combination classification method for short text proposed by Yan Rui (2009). Liu Kang et al. (2014) used a deep learning network to map the high-dimensional, sparse space vector of short text into a new low-dimensional, essential feature space. The method solves short-text classification by constructing a tree-structured combination classifier.
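
To make the tf-idf-icf idea attributed to Dhar et al. (2018) above more concrete, the following is a minimal sketch of one possible weighting. It is an assumption, not the formula from that paper: the weight is taken as normalized term frequency times log(1 + N/df) times log(1 + C/cf), where N is the number of documents, df the number of documents containing the term, C the number of classes, and cf the number of classes containing the term. The function name `tf_idf_icf` and the toy data are hypothetical.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (not taken from the cited paper):
    tf * log(1 + N / df) * log(1 + C / cf).
    """
    n_docs = len(docs)
    n_classes = len(set(labels))
    doc_freq = Counter()   # term -> number of documents containing it
    term_classes = {}      # term -> set of class labels containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            term_classes.setdefault(term, set()).add(label)

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total)
            * math.log(1 + n_docs / doc_freq[term])
            * math.log(1 + n_classes / len(term_classes[term]))
            for term, count in counts.items()
        })
    return weights


docs = [
    ["cricket", "match", "win"],
    ["cricket", "team"],
    ["election", "vote", "win"],
    ["election", "party"],
]
labels = ["sports", "sports", "politics", "politics"]
for doc_weights in tf_idf_icf(docs, labels):
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this toy corpus, "cricket" and "win" each appear in two documents, so their IDF factors are equal; the ICF factor is what ranks "cricket" (one class) above "win" (both classes) in the first document.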
