Introduction
With the rapid development of computing power and data science, statistical and machine learning methods are being applied in a wide range of fields. Numerous studies have been devoted to building accurate and efficient data models (Martin, Sequera & Huerga, 2017; Chiu, Tsai & Li, 2020; Shao, Zhu, Wang, Liu & Liu, 2020). The process of designing and manufacturing industrial products generates large amounts of data. Among the many industrial sectors, the automotive industry is a representative one: it has relatively complete information systems and has accumulated large amounts of data.
The industrial data from automotive research and development include design target data, simulation analysis data, test data, manufacturing data, operation data, etc. (Fang, Sun, Qiu & Kim, 2017; Xianping, 2019). These datasets are diverse and hierarchical. Because the development cycle of a new product is long, collecting each sample takes a long time and the amount of data is limited. Data in the automotive industry are therefore generally high-dimensional with small sample sizes. In addition, as the automotive manufacturing industry develops, the range of recorded data gradually expands and new indicators and parameters are constantly introduced, so the newly introduced attributes are absent from earlier records. Manual data collection and the digitization process can also introduce errors and missing values. As a result, dealing with fragmentary data collected at various stages of the development process is unavoidable.
As a key technology in the field of artificial intelligence, data mining and knowledge discovery aims to acquire novel and useful knowledge through data processing. It consists of data acquisition, data pre-processing, data mining, evaluation, and application (Mariscal, Óscar & Covadonga, 2010). The key step is to establish descriptive or predictive models through clustering, association analysis, classification, regression, and other machine learning methods.
Various researchers have applied these methods to engineering problems (Fotouhi & Montazerigh, 2013; Baraldi, Cannarile, Maio & Zio, 2016; Du, Wang, Yang & Niu, 2019). In particular, many studies have addressed occupant protection design with the help of data mining methods (Zhao, Jin, Cao & Wang, 2010; Zhang, Ma, Chen & Zhang, 2013; Nie, Tang, Liu, Chang & Zhang, 2018).
These methods are effective, but most of them are based on simulated data without missing values. The test data accumulated during research and development, however, usually contain numerous missing values, which hinder the building of accurate models.
Let p denote the probability that a record has a missing value for an attribute. In general, missing data mechanisms are classified into three patterns: (1) missing completely at random (MCAR), where p depends on neither the observed data nor the missing data; (2) missing at random (MAR), where p may depend on the observed data but not on the missing data; and (3) not missing at random (NMAR), where p may depend on the value of the attribute itself. The approaches developed for handling missing values can be broadly divided into three types (Liu, Pan, Dezert & Martin, 2016):
• The first type, which is the simplest yet effective in some situations, is to fill in default values or to remove the records with missing values directly.
• The second type is to process datasets without filling in the missing values, as in the work of Pelckmans, Brabanter, Suykens & Moor (2005).
• The third type is to impute the missing values using statistical analysis or machine learning methods. A great deal of research has been devoted to solving the missing-value problem in this way.
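The three missingness mechanisms and the first and third handling strategies above can be illustrated with a small simulation. The following is a minimal sketch and not the method of any cited work: the two-column toy data, the function names (simulate, make_mask, apply_mask, listwise_delete, mean_impute), the base rate of 0.3, and the median threshold are all assumptions chosen for illustration. Only the Python standard library is used.

```python
import random
from statistics import mean, median


def simulate(n, rng):
    """Toy data (assumption): x is always observed, y = 2x + noise may go missing."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 2.0 * x + rng.gauss(0, 1)))
    return rows


def make_mask(rows, mechanism, rng, rate=0.3):
    """Return one boolean per row; True means y is missing in that row.

    p is the probability that y is missing:
      MCAR: p is constant.
      MAR:  p depends only on the observed attribute x.
      NMAR: p depends on the value of y itself (known here only
            because the data are simulated).
    """
    x_med = median(x for x, _ in rows)
    y_med = median(y for _, y in rows)
    mask = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = 2 * rate if x > x_med else 0.0
        elif mechanism == "NMAR":
            p = 2 * rate if y > y_med else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        mask.append(rng.random() < p)
    return mask


def apply_mask(rows, mask):
    """Replace y with None wherever the mask is True."""
    return [(x, None if missing else y) for (x, y), missing in zip(rows, mask)]


def listwise_delete(rows):
    """First type: drop every record that has a missing value."""
    return [row for row in rows if row[1] is not None]


def mean_impute(rows):
    """Third type, simplest statistical variant: fill with the observed mean.

    Assumes at least one y value is observed.
    """
    fill = mean(y for _, y in rows if y is not None)
    return [(x, fill if y is None else y) for x, y in rows]


if __name__ == "__main__":
    rng = random.Random(0)
    rows = simulate(2000, rng)
    print("mean of y, complete data:", round(mean(y for _, y in rows), 3))
    for mechanism in ("MCAR", "MAR", "NMAR"):
        damaged = apply_mask(rows, make_mask(rows, mechanism, rng))
        kept = listwise_delete(damaged)
        imputed = mean_impute(damaged)
        print(
            mechanism,
            "| records kept:", len(kept),
            "| mean after deletion:", round(mean(y for _, y in kept), 3),
            "| mean after imputation:", round(mean(y for _, y in imputed), 3),
        )
```

In this toy setting, deleting records under MCAR leaves the mean of y roughly unchanged. Under MAR or NMAR, records with large values are removed more often, so the mean of the remaining records is biased downward. Mean imputation keeps every record but, by construction, reproduces the mean of the observed values, so it inherits the same bias.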