A Comparison of Open Source Data Mining Tools for Breast Cancer Classification

Ahmed AbdElhafeez Ibrahim (Arab Academy for Science, Technology and Maritime Transport, Egypt), Atallah Ibrahin Hashad (Arab Academy for Science, Technology and Maritime Transport, Egypt) and Negm Eldin Mohamed Shawky (Arab Academy for Science, Technology and Maritime Transport, Egypt)
Copyright: © 2017 | Pages: 16
DOI: 10.4018/978-1-5225-2229-4.ch027


Data Mining is a field that draws on several areas of computer science to discover knowledge in databases and thereby simplify decision making. Classification is a Data Mining task that learns from a set of instances in order to accurately predict the target class of new instances. Open source Data Mining tools can be used to perform classification. This paper compares four such tools: KNIME, Orange, Tanagra and Weka. Our goal is to identify the most accurate tool and technique for breast cancer classification. The experimental results show that some tools achieve better results than others. In addition, the fusion classification task proved to be better than the single classification task on the four datasets used. We also compare complete datasets, in which missing feature values are substituted, with incomplete ones. The experimental results show that some datasets yield better accuracy when the complete version is used.
Chapter Preview

Proposed Methodology

Data Processing

Preprocessing steps are applied to the data before classification:

  • Data Cleaning: There are 16 instances in WBC and 4 instances in WPBC that contain a single missing attribute value, denoted by “?”, and 9 instances in LBCD that have two missing values. These missing values are substituted by the median value of that feature, computed from the data (M. Shah et al., 2012).

  • Relevance Analysis: WBC, WPBC and WDBC each have one irrelevant feature (D. Sun et al., 2010), named ‘Sample code number’, which has no influence on the classification procedure; the feature is therefore discarded.

  • Data Normalization: The goal of normalization is to convert the feature values to a small-scale range (H. Yin et al., 2002).
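The three preprocessing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the chapter: the helper names (`drop_column`, `impute_median`, `min_max_normalize`) and the toy rows are hypothetical, the median is used for imputation as described under Data Cleaning, and the target range is assumed to be 0 to 1.

```python
from statistics import median

MISSING = "?"  # marker for a missing attribute value, as in the text


def drop_column(rows, index):
    """Remove one irrelevant column (e.g. an ID field) from every row."""
    return [row[:index] + row[index + 1:] for row in rows]


def impute_median(rows):
    """Replace each missing value with the median of the observed values in its column."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [float(row[j]) for row in rows if row[j] != MISSING]
        medians.append(median(observed))
    return [
        [medians[j] if row[j] == MISSING else float(row[j]) for j in range(n_cols)]
        for row in rows
    ]


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    n_cols = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_cols)]
    highs = [max(row[j] for row in rows) for j in range(n_cols)]
    return [
        [
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_cols)
        ]
        for row in rows
    ]


# Toy rows: [sample id, feature A, feature B]; the values are made up for illustration.
raw = [
    ["1001", "5", "1"],
    ["1002", "?", "10"],
    ["1003", "3", "4"],
    ["1004", "9", "7"],
]
features = drop_column(raw, 0)         # relevance analysis: discard the ID column
complete = impute_median(features)     # data cleaning: fill "?" with the column median
scaled = min_max_normalize(complete)   # data normalization: map each column to [0, 1]
```

Computing the medians and the column ranges from the training portion only, and then reusing them on the test portion, would avoid leaking test-set statistics into the model.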

The Proposed Approach

We propose a method for diagnosing breast cancer, based on data mining and applied to four different datasets, as follows:

  • Select the Data Mining tools to test.

  • Import the Dataset.

  • Discard the irrelevant features.

  • Replace missing values with the mean value.

  • Normalize each variable of the data set, so that the values range from 0 to 1.

  • Select and parameterize the learning procedure.

  • Perform the learning procedure.

  • Calculate the performance of the model on the test set.

  • Perform the fusion task.
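The last two steps, evaluating each model and fusing their outputs, can be illustrated as follows. The text above does not specify the fusion rule, so this sketch assumes simple majority voting over the class labels predicted by three classifiers; the function names and the label lists are hypothetical and are not results from the chapter.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_vote(predictions_per_classifier):
    """Fuse several label lists by taking, per instance, the most frequent label."""
    fused = []
    for votes in zip(*predictions_per_classifier):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


# Made-up test-set labels: "B" = benign, "M" = malignant.
actual = ["B", "M", "B", "M", "B"]

# Made-up outputs of three single classifiers, each wrong on a different instance.
clf_a = ["B", "M", "M", "M", "B"]
clf_b = ["B", "B", "B", "M", "B"]
clf_c = ["M", "M", "B", "M", "B"]

fused = majority_vote([clf_a, clf_b, clf_c])

for name, labels in [("A", clf_a), ("B", clf_b), ("C", clf_c), ("fused", fused)]:
    print(name, accuracy(labels, actual))
```

With an odd number of classifiers and two classes, a vote can never tie. In this contrived example the fused labels beat every single classifier only because the three classifiers err on different instances; when their errors coincide, voting offers no such gain.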

Figure 1. Proposed breast cancer diagnosis algorithm

Table 1. Benign and malignant datasets
Dataset | Instances | Attributes | Attribute Type | Benign | Malignant | Missing Values

Table 2. Recurrence and non-recurrence datasets
Dataset | Instances | Attributes | Attribute Type | Non-Recurrence | Recurrence | Missing Values
