A Hybrid Imputation Method Based on Denoising Restricted Boltzmann Machine

Jiang Xu, Siqian Liu, Zhikui Chen, Yonglin Leng
Copyright: © 2018 | Pages: 13
DOI: 10.4018/IJGHPC.2018040101


Data imputation is an important step in data processing and analysis, and it strongly affects the results of data mining and learning. Most existing algorithms either use the whole data set for imputation or consider only the correlation among records. To address these problems, the article proposes a hybrid method for filling incomplete data. To reduce interference and computation, a denoising restricted Boltzmann machine model is developed for robust feature extraction from incomplete data and for clustering. The article then proposes partial-distance and co-occurrence matrix strategies to measure the correlation between records and between attributes, respectively. Finally, the quantified correlations are converted to weights for imputation. Experimental comparisons with other algorithms confirm the effectiveness and efficiency of the proposed method in data imputation.
Article Preview

Deleting, ignoring, and filling are the three common strategies for handling missing data. Because it is widely used and well studied, imputation (filling) is regarded as the more popular strategy for handling missing values (Rahman and Islam, 2011). Imputation algorithms can be roughly divided into two types: model-based strategies and statistics-based strategies. The article briefly analyzes existing algorithms.

kNNI (Troyanskaya and Cantor, 2001) imputes missing gene data by filling each missing value from the k nearest neighbor records. The method is fast and effective and has been widely applied. EMI (Schneider, 2001) is an iterative parameter estimation method. First, it assumes a distribution for the data and randomly initializes the parameters and missing values. It then alternates E steps and M steps until it obtains results. E step: according to the probability density function, it determines which cluster each record belongs to. M step: from the clustering results, it estimates the parameters by maximum likelihood. However, initialization has a great influence on the results and convergence speed of the algorithm, and in most cases there is no prior knowledge of the data on which to base the assumed distribution.
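
The nearest-neighbor idea behind kNNI can be illustrated with a minimal sketch. It is not the implementation from the cited work: the function names `partial_distance` and `knn_impute`, the use of `None` to mark missing values, the rescaled Euclidean distance over co-observed attributes, and the unweighted mean of the neighbors' values are all assumptions made for illustration.

```python
import math


def partial_distance(a, b):
    """Euclidean distance over attributes observed in both records,
    rescaled by (total attributes / shared attributes). Assumed form."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no attribute in common: treat as infinitely far
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(records, k=2):
    """Return a copy of records in which each None is replaced by the mean
    of that attribute over the k nearest records that have it observed."""
    filled = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not None:
                continue
            # Candidate neighbors: other records with attribute j observed.
            candidates = [
                (partial_distance(rec, other), other[j])
                for m, other in enumerate(records)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for d, v in candidates[:k] if d != math.inf]
            if nearest:  # leave the gap if no usable neighbor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical toy data: the second record is missing its third attribute.
data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
]
completed = knn_impute(data, k=2)
```

In this toy data the two closest records to the incomplete one are the first and third, so the missing value is filled with the mean of their third attributes. A weighted mean based on distance would be an equally reasonable choice.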

Wang et al. (2006) propose an algorithm called the SVR method. It uses support vector machines and divides the data into complete and incomplete parts. SVR builds a model for every attribute and then estimates each missing value. This approach has high complexity and considers only the impact of complete data, which leads to information loss.
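
The workflow described above (split the data into complete and incomplete records, fit a model for an attribute on the complete part, then predict the missing values) can be sketched as follows. This is only an illustration of the workflow: ordinary least squares with a single predictor stands in for support vector regression, and the names `fit_line` and `impute_by_regression` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b.
    A simple stand-in for the SVR model; not SVR itself."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def impute_by_regression(records, target, predictor):
    """Fit on complete records only, then fill missing target values."""
    complete = [r for r in records if None not in r]
    if not complete:
        raise ValueError("no complete records to train on")
    slope, intercept = fit_line(
        [r[predictor] for r in complete],
        [r[target] for r in complete],
    )
    filled = [list(r) for r in records]
    for row in filled:
        if row[target] is None and row[predictor] is not None:
            row[target] = slope * row[predictor] + intercept
    return filled


# Hypothetical toy data: the last record is missing attribute 1.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None]]
result = impute_by_regression(rows, target=1, predictor=0)
```

Because the model is trained only on complete records, any information carried by the observed attributes of incomplete records is ignored, which is the limitation noted above.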