A Multi-Objective Ensemble Method for Class Imbalance Learning: Application in Prediction of Life Expectancy Post Thoracic Surgery

Sajad Emamipour (Sharif University of Technology, Tehran, Iran), Rasoul Sali (Sharif University of Technology, Tehran, Iran) and Zahra Yousefi (Sharif University of Technology, Tehran, Iran)
Copyright: © 2017 | Pages: 19
DOI: 10.4018/IJBDAH.2017010102


Class imbalance learning has attracted great attention in recent years because many real-world applications suffer from this problem. An imbalanced class distribution occurs when the number of training examples for one class far exceeds the number for the other class, which is often the class of greater interest. This imbalance can seriously degrade classifier performance, particularly on patterns belonging to the less represented class. To address it, the authors developed a hybrid model for class imbalance learning that focuses on binary classification problems. The model combines the benefits of ensemble classifiers with a multi-objective feature selection technique to achieve higher classification performance, and it also proposes non-dominated sets of features. The authors evaluate the proposed model by comparing its results with those of notable algorithms for imbalanced data problems. Finally, they apply the model in the medical domain to predict the life expectancy of patients after thoracic surgery.
Article Preview

1. Introduction

Lung cancer is the most widespread type of cancer in the U.S. (Rikova et al., 2016). According to a WHO report, about 8.2 million people died of cancer in 2012, and lung cancer, with about 1.59 million deaths (roughly 20 percent of global cancer deaths), was the leading cause of cancer death worldwide. Although lung cancer is the most commonly diagnosed cancer overall, it ranks third among women. About 70 percent of global lung cancer deaths are attributable to tobacco use (Stewart & Wild, 2016).

Lung cancer is one of the main reasons for thoracic surgery. In deciding whether to operate, one of the most important considerations is the prognosis of mortality, known as the surgery risk (Esteva, Núñez, & Rodríguez, 2007). The potential risks and benefits should be assessed carefully. Surgery risk is assessed over two horizons: the short term, measured by the 30-day postoperative mortality rate, and the long term, measured by 1-year or 5-year mortality rates. The problem is to choose the right patients for surgery in order to increase the chance of survival after the operation (Zięba, Tomczak, Lubicz, & Świątek, 2014). Knowing the risk of surgery helps both patient and physician make the most appropriate decision. It also allows decision makers to prepare for possible outcomes and to draw up a postoperative plan for advance care management (Falcoz et al., 2007). To predict the risk of surgery, demographic and clinical features such as “pain before surgery”, “smoking status”, “having asthma”, and “age at surgery” are gathered. Machine learning techniques are among the strongest methods for making such predictions. In thoracic surgery, the prediction task is a classification problem. Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of observations whose category membership is known. Our classification problem is binary: each patient is classified into either the survival or the death category after surgery. Since most patients survive the surgery, the thoracic surgery data contain many survivors and only a few deaths, so the data are imbalanced.

The imbalanced data problem arises when samples of the majority class greatly outnumber samples of the other class. Under such circumstances, traditional classification models tend to be overwhelmed by the majority class and to ignore the minority class examples. Although the imbalanced data problem exists in many disciplines, such as credit assessment (Y.-M. Huang, Hung, & Jiau, 2006), computer network security (Cieslak, Chawla, & Striegel, 2006), air pollution (Lu & Wang, 2008), and mine classification (Williams, Myers, & Silvious, 2009), it is more pronounced in the medical diagnosis literature and has been explored in a wide range of academic papers (Mazurowski et al., 2008; Peng & King, 2008; Zięba et al., 2014). The impact of this issue is particularly severe in medical data analysis because misclassifying a minority sample as a majority sample is very costly and sometimes unaffordable. The approaches that address the problem fall into three groups: data-level, algorithm-level, and hybrid approaches (Zięba et al., 2014; Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2012).
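
To illustrate why a classifier overwhelmed by the majority class is misleading, the following minimal Python sketch scores a predictor that always outputs the most frequent label. The labels and the 85/15 split are hypothetical, chosen only for illustration; they are not the thoracic surgery data, and the helper names are assumptions rather than part of the authors' model.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label: a stand-in for a classifier
    that is completely dominated by the majority class."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of truly positive samples that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical labels (assumption): 0 = survived (majority), 1 = died (minority).
y_true = [0] * 85 + [1] * 15

# Predict the majority label for every patient.
majority_label = majority_baseline(y_true)
y_pred = [majority_label] * len(y_true)

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)

print(f"accuracy        = {acc:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```

Under these assumed labels the accuracy equals the majority-class share while none of the minority cases are detected, which is why accuracy alone is an unreliable measure on imbalanced data.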
