In data mining, rule management is becoming increasingly important. A large number of rules are usually induced from large databases in many fields, especially when the data are dense. This makes the extracted knowledge hard to understand and interpret. Many efforts have been made to eliminate redundant rules from the rule base, and various efficient algorithms have been proposed. However, end-users are often still unable to complete a mining task because many insignificant rules remain. An efficient technique is therefore needed to discard as many useless rules as possible without information loss. To achieve this goal, in this paper we propose an efficient method to filter superfluous rules from the knowledge base in a post-processing manner. The main characteristic of our method is that it eliminates rule redundancy by means of dependent relations, which can be discovered by closed set mining techniques. Performance evaluations show that the proposed method achieves a higher compression degree and higher efficiency than other techniques.
Knowledge discovery in databases (KDD) refers to the overall process of mapping low-level data in large databases into high-level forms that might be more compact, more abstract, or more useful (Fayyad et al., 1996). KDD can be viewed as a multidisciplinary activity because it draws on several research areas of artificial intelligence, such as machine learning, pattern recognition, expert systems, and knowledge acquisition, as well as on mathematical disciplines (Bruha, 2000). Its objective is to extract previously unknown and potentially useful information hidden in data, usually from very large databases. As one of the core components of KDD, data mining refers to the process of extracting interesting patterns from data by means of specific algorithms. Because rules are intuitive and easy to understand, they have become one of the major representation forms of extracted knowledge or patterns. In this context, the result produced by data mining techniques is a set of rules, i.e., a rule base.
Currently, the major challenge of data mining lies not in its efficiency but in the interpretability of the discovered results. During the mining stage, a considerable number of rules may be discovered when the real-world database is large. If the data are highly correlated, the situation becomes worse and quickly gets out of control. The huge quantity of rules makes them difficult to explore and thus hampers global analysis of the discovered knowledge. Furthermore, monitoring and managing these rules turns out to be extremely costly and difficult. The immediate consequence for users is that they cannot effectively interpret or understand such an overwhelming number of rules. Users may then find themselves buried again, this time under masses of extracted knowledge, and nobody benefits directly from the results of such data mining techniques (Berzal & Cubero, 2007). Hence, intelligent techniques are urgently required to handle useless rules and to help users understand the results obtained from rapidly growing volumes of digital data.
Post-processing, whose purpose is to enhance the quality of the mined knowledge, plays a vital role in resolving the aforementioned dilemma. Its main advantage is that it can effectively help end-users to understand and interpret meaningful knowledge nuggets (Baesens et al., 2000). The post-processing procedure usually consists of four main steps: quality processing, summarizing, grouping, and visualization. Among these steps, rule quality processing (e.g., pruning and filtering) is considered the most important (Bruha, 2000), because it can eliminate many noisy, redundant, or insignificant rules and provide users with compact and precise knowledge derived from databases by data mining methods. From the viewpoint of end-users, a concise or condensed rule base is preferable, because on its basis decision-makers can respond quickly and precisely to unseen data without being distracted by noisy information.
In the data mining community, much attention has been paid to dealing with noisy knowledge by measuring similarity or redundancy. For example, distance metrics such as Euclidean distance (Waitman et al., 2006) are often used to measure the similarity between rules, and rules with high similarity are discarded. In addition, chi-square tests (Liu et al., 1999) and entropy (Jaroszewicz and Simovici, 2002) have been used to analyze the distance between rules in the post-processing phase. Some classifiers also exploit efficient data structures, such as the bitmap technique (Jacquene et al., 2006) and the prefix tree (Li et al., 2001), to store and retrieve rules. Moreover, various interestingness measures, both objective and subjective, have been considered in studying the importance of rules (Geng and Hamilton, 2006). As a representative example, Brin et al. (1997) introduced a conviction measure to express rule interestingness.
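To make the notion of an objective interestingness measure concrete, conviction is commonly defined for a rule A -> B as (1 - supp(B)) / (1 - conf(A -> B)), where supp is the relative support and conf the confidence of the rule. The following is a minimal Python sketch of that definition only. The toy transactions and the function names (support, confidence, conviction) are illustrative assumptions and are not taken from the cited work or from the method proposed in this paper.

```python
from fractions import Fraction

# Hypothetical toy transactions, used only for illustration.
TOY_TRANSACTIONS = [
    {"a", "b"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"c"},
]


def support(transactions, itemset):
    """Relative support: fraction of transactions containing every item."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent: supp(A and B) / supp(A)."""
    supp_a = support(transactions, antecedent)
    if supp_a == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, set(antecedent) | set(consequent)) / supp_a


def conviction(transactions, antecedent, consequent):
    """Conviction: (1 - supp(B)) / (1 - conf(A -> B)).

    By convention, a rule that always holds (confidence 1) gets infinite
    conviction.
    """
    conf = confidence(transactions, antecedent, consequent)
    if conf == 1:
        return float("inf")
    return (1 - support(transactions, consequent)) / (1 - conf)


if __name__ == "__main__":
    print(conviction(TOY_TRANSACTIONS, {"a"}, {"b"}))
```

A conviction of 1 indicates that antecedent and consequent are statistically independent, and larger values indicate that the rule fails less often than independence would predict. Exact fractions are used in this sketch to avoid floating-point rounding in small examples.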