Improving Coronary Artery Disease Prediction: Use of Random Forest, Feature Importance and Case-Based Reasoning

Cardiovascular diseases (CVDs) are the number one cause of death globally. Coronary artery disease (CAD) is the most common form of CVD. Abundant research works propose decision support systems for early CAD detection. Most of the proposed solutions have their origins in the realm of machine learning and data mining. This paper presents two solutions for CAD prediction. The first solution optimizes a random forest model (RFM) through hyperparameter tuning. The second solution uses a case-based reasoning (CBR) methodology. The CBR solution takes advantage of feature importance to improve the execution time of the retrieve step in the CBR cycle. The experiments show that the RFM outperformed most recently published models for CAD diagnosis. By reducing the number of attributes, the CBR solution improves the execution time while also performing very well in terms of diagnostic accuracy. The performance of the CBR solution is expected to keep improving because CBR is a learning methodology.


INTRODUCTION
According to the World Health Organization (WHO), an estimated 31% of all deaths worldwide are caused by cardiovascular diseases (CVDs). Moreover, more than 75% of CVD deaths occur in low- and middle-income countries (WHO, 2022). Coronary Artery Disease (CAD) is the most common form of CVD. It occurs when one or more of the coronary arteries become narrow or blocked. In CAD, the major blood vessels that supply blood, oxygen and nutrients to the heart become damaged or diseased (WHO, 2022).
Research works on CAD aim for the early detection of this disease at its preliminary stage, because this is the only way to prevent a severe form. Consequently, there is a need for decision support systems that can predict CAD in a non-invasive manner. Abundant research studies propose prediction models to achieve an accurate diagnosis. The majority of the proposed solutions have their origins in the realm of machine learning and data mining.
CAD symptoms may differ from person to person. Moreover, because many people have no symptoms, they do not know they have CAD until they experience chest pain, a heart attack, or sudden cardiac arrest. This led to the construction of heart disease datasets from previous patients' records. Most CAD datasets are provided by the University of California Irvine (UCI) machine learning repository (Dua & Graff, 2019). CAD prediction models can be trained on the available datasets and used to diagnose the presence of this disease in new patients.
This paper proposes two solutions for the early detection of CAD. The first solution is based on an enhanced random forest classifier. The optimization of this model is obtained through a long process of hyperparameter tuning. The second solution takes advantage of the feature importance revealed by the random forest algorithm to improve the retrieve step in the Case-Based Reasoning (CBR) cycle. Both solutions are tested on an experimental dataset. The proposed models represent the core of a decision support system for CAD diagnosis.
The paper is organized as follows. Section 2 walks through the background of this research in terms of methodologies, coronary artery disease, the structure of the dataset and a literature review of recent related works that use data mining techniques or CBR in CAD diagnosis. Section 3 presents the authors' approach to CAD prediction, summarizes the experiments realized with the proposed solutions and compares the results with related research works. Section 4 concludes this paper and puts forth the achievements realized.

Case-Based Reasoning
Case-Based Reasoning (CBR) is a multidisciplinary subject that focuses on the reuse of experiences. The basic notions of CBR (storage of experiences, similarity calculus, indexing) are familiar to researchers in several disciplines (Aha, 1998). In CBR, the primary knowledge source is not generalized rules but a memory of stored cases recording specific prior episodes (Leake, 1996, p. 1). A short definition of CBR is that it is a methodology for solving problems by utilizing previous experiences. It involves retaining a memory of previous problems and their solutions and, by referencing these, solving new problems (Main et al., 2001). To resolve a new problem, a case-based reasoner searches its case base for the past case or cases that most closely match this problem.
The CBR problem-solving cycle goes through four steps (the four 'REs'): REtrieve the most similar case(s); REuse the case(s) to attempt to solve the problem; REvise the proposed solution if necessary; and REtain the new solution as part of a new case (Watson & Marir, 1994). The last step (REtain) enables a CBR system to learn from new situations. Whenever a CBR system successfully resolves a new problem, its case base can be enriched with new expertise. These novel experiences empower the case-based reasoner to resolve future problems. This is why CBR is considered an Artificial Intelligence (AI) methodology with an intrinsic learning ability.
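The four-step cycle described above can be sketched in a few lines of Python. This is an illustrative skeleton only; the `Case` type, the toy attributes and the distance-based similarity are invented for the example and are not the authors' implementation.

```python
# Minimal sketch of the four-step CBR cycle (hypothetical Case type and helpers).
from dataclasses import dataclass

@dataclass
class Case:
    problem: dict   # attribute name -> value
    solution: str   # e.g. a diagnosis label

def solve(new_problem, case_base, similarity, adapt=None):
    # REtrieve: pick the most similar stored case.
    best = max(case_base, key=lambda c: similarity(new_problem, c.problem))
    # REuse: copy its solution; REvise: optionally adapt it to the new problem.
    solution = adapt(best.solution, new_problem) if adapt else best.solution
    # REtain: store the solved problem as a new case for future retrievals.
    case_base.append(Case(new_problem, solution))
    return solution

base = [Case({"age": 54, "chol": 230}, "CAD"),
        Case({"age": 40, "chol": 180}, "no CAD")]
sim = lambda p, q: -abs(p["age"] - q["age"]) - abs(p["chol"] - q["chol"])
print(solve({"age": 52, "chol": 220}, base, sim))  # -> CAD
```

Note how the retain step makes the case base grow with every resolved problem, which is precisely what makes the retrieve step the performance-critical part of the cycle.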
The first step in the CBR cycle is responsible for selecting a case or cases, from the case base, to be the best match. This selection is commonly based on a similarity calculus between a new problem and the memorized cases. Most CBR systems represent a case as a 'problem' and a 'solution'. Depending on the application domain, a 'problem' is generally described by a set of attributes or features (C1, C2, …, CP) that best define it. A metric is defined on every attribute and these metrics are combined to get the similarity between two cases. The metric defined on an attribute is commonly called the 'local' similarity, and the combination of local similarities produces the 'global' similarity between two cases (Equation 1).

Sim(A, B) = Σ(i=1..P) wi · Loci(ai, bi) / Σ(i=1..P) wi   (1)
Sim: similarity between two cases A and B; Loci: function that calculates the local similarity for attribute number i; wi: weight of Loci. The local similarity depends on the attribute's type. Some commonly used local similarity functions are: Euclidean, Manhattan, and Equal.
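Equation (1) translates directly into code. The sketch below is illustrative: the attribute pair (age, sex), the value range [29, 77] and the weights are invented for the example, not taken from the paper.

```python
# Illustrative sketch of Equation (1): a weighted mean of local similarities.
def local_euclidean(x, y, lo, hi):
    # Normalized Euclidean local similarity over the attribute range [lo, hi]
    return 1.0 - abs(x - y) / (hi - lo)

def local_equal(x, y):
    # 1 when the two values match, 0 otherwise
    return 1.0 if x == y else 0.0

def global_similarity(a, b, local_fns, weights):
    # local_fns[i] plays the role of Loc_i; weights[i] is w_i
    total = sum(w * loc(a[i], b[i])
                for i, (loc, w) in enumerate(zip(local_fns, weights)))
    return total / sum(weights)

case_a = (63, 1)   # hypothetical (age, sex) pair, not a real Statlog record
case_b = (45, 1)
local_fns = [lambda x, y: local_euclidean(x, y, 29, 77), local_equal]
print(global_similarity(case_a, case_b, local_fns, [0.7, 0.3]))  # ~0.7375
```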
In real-world problems, a case is often described by numerous attributes. In the retrieve step, a CBR system must compute similarity values between the new problem and all cases in the case base in order to select the best match(es). It is readily apparent that the larger the number of attributes, the more time-consuming the similarity calculus. What is more, because CBR is a learning methodology, the size of the case base keeps growing. Hence, the retrieve step rapidly becomes a bottleneck and the performance of the whole system drops. This is particularly problematic when using CBR in the medical field, where the system deals with human health and life. Improving the retrieve phase thus clearly becomes a must for CBR systems.

Improve the Retrieve Step
The study of recent research works that aim to improve the retrieve phase in the CBR cycle reveals two main directions:

a. Reduce the number of attributes considered in the similarity calculus: the idea is to use a method to pick the most pertinent attributes for deciding whether two cases are similar. Once these attributes are revealed, they are used in the similarity calculus instead of all the attributes. Reducing the number of attributes considered in the similarity calculus obviously decreases the time needed to compute the similarity between two cases, and consequently the execution time of the retrieve step. Saadi et al. (2019) propose generating a decision tree from the original case base; the best discriminant attributes are then deduced from the resulting tree. The authors experimented with their approach on immunotherapy for the treatment of wart disease. The results showed that the cases selected in the retrieve phase using only the pertinent attributes are very close to those obtained by using all the attributes in the similarity calculus.

b. Reorganize the case base: in most CBR applications, the case base is represented as a simple table, and the retrieve step must run through all cases for the similarity calculus. Another option is to reorganize the case base into classes or clusters. In a clustered case base, the retrieve phase is performed in two steps. First, the system computes the similarity between a new case and the clusters' centers to select the cluster where the search will continue. Second, it runs through the selected cluster and calculates the similarity between the new case and all cases in this cluster; the most similar cases are selected for reuse. When the clusters are well-proportioned, the time needed to select similar cases is nearly divided by the number of clusters. The improvement in computation time can be very significant, particularly for huge case bases. In (Saadi et al., 2020), the authors improve the retrieve phase by using fuzzy clustering to reorganize the case base. The solution is applied to study the response to immunotherapy treatment for patients with wart disease. The results are very promising and show an improvement in the response time of the system without loss of diagnostic accuracy.
Both directions achieve an enhancement of the retrieve step computation time, and hence an improvement of the CBR cycle. The first direction (rank features by importance) is an offline method: the feature ranking is done beforehand and is not integrated into the CBR cycle. Hence, this strategy does not induce any extra computation time in the retrieve step. In (Saadi et al., 2020), however, the fuzzy clustering algorithm is integrated into the retrieve step. Even though the use of clusters can reduce the similarity calculus, the integration of clustering introduces extra computation time. What is more, the management of the case base becomes more complex since the retention of a new case must pass through the clustering mechanism. For these reasons, the authors chose to follow the first direction and go further: instead of using a single decision tree for feature ranking, this research work uses the random forest algorithm, which combines a large number of different decision trees.

Random Forest Algorithm
Introduced by Breiman (2001), the Random Forest Algorithm (RFA) gained tremendous popularity due to its robust performance across a wide range of datasets (Wyner et al., 2017, p. 7). The RFA can be considered a generalization of the concept of a decision tree. The underlying idea behind this algorithm is that it builds multiple decision trees and combines them to get a more accurate and stable prediction. In other words, the RFA randomly selects observations and features to build several decision trees and then aggregates the results.
One big advantage of the RFA is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. The algorithm is often capable of achieving best-in-class performance with respect to low generalization error, making it the off-the-shelf tool of choice for many applications (Wyner et al., 2017, p. 7). What is more, overfitting is a critical problem that can negatively influence the performance of a model on new data, but for the RFA, overfitting can be avoided if there are enough trees in the forest.
Another great quality of Breiman's algorithm is that, after generating the random forest, it can measure the relative importance of each feature for the prediction. The algorithm analyzes each attribute and reveals its importance in predicting the correct classification of the random forest machine learner (Livingston, 2005). By looking at the feature importance, one can decide which features to possibly drop because they do not contribute enough to the prediction process. A general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting.
Feature ranking is particularly interesting in the present study. Not only can the application of the RFA on the dataset produce a robust model for CAD diagnosis, it also reveals the importance of every attribute. This research work takes advantage of the feature ranking to consider only the most important attributes in the similarity calculus and consequently reduce the time complexity of the retrieve step in the CBR cycle.
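As a hedged illustration, scikit-learn exposes this ranking through the fitted model's `feature_importances_` attribute. The synthetic dataset below merely stands in for the Statlog data (same shape: 270 rows, 13 attributes); the resulting indices are not the paper's ranking.

```python
# Sketch: ranking 13 features by importance with a random forest
# (synthetic stand-in data, not the Statlog dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=270, n_features=13, n_informative=8,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort attribute indices by decreasing importance; keep the leading eight.
ranking = sorted(range(13), key=lambda i: rf.feature_importances_[i], reverse=True)
top8 = ranking[:8]
print(top8)  # indices of the eight most important attributes
```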
A non-formal expression of the RFA (for both classification and regression) is as follows (P: number of predictors or attributes):

Algorithm 1: Basic Random Forest Algorithm (RFA) (Kuhn & Johnson, 2013, p. 200)

  Select the number of models to build, m
  for i = 1 to m do
      Generate a bootstrap sample of the original data
      Train a tree model on this sample
      for each split do
          Randomly select k (< P) of the original predictors
          Select the best predictor among the k predictors and partition the data
      end
      Use typical tree model stopping criteria to determine when a tree is complete (but do not prune)
  end

Once the model is generated, the prediction for new data (rows) is done by aggregating the predictions of the m trees of the random forest (majority vote for classification, average for regression).
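Algorithm 1 can be sketched with scikit-learn's `DecisionTreeClassifier` as the base learner. This is an illustrative re-implementation under stated assumptions (synthetic data, m=25, k=3), not the authors' code or scikit-learn's own `RandomForestClassifier`.

```python
# Sketch of Algorithm 1: bootstrap sampling plus random feature subsets per
# split, with majority-vote aggregation (classification case).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def basic_random_forest(X, y, m=25, k=3, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(m):
        idx = rng.integers(0, len(X), len(X))        # bootstrap sample of rows
        tree = DecisionTreeClassifier(
            max_features=k,                          # k (< P) predictors per split
            random_state=int(rng.integers(10**6)))   # trees grown fully, not pruned
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def forest_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])  # shape (m, n_samples)
    # Majority vote across the m trees for each sample
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = make_classification(n_samples=200, n_features=13, random_state=1)
forest = basic_random_forest(X, y)
print(forest_predict(forest, X[:5]))
```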

The Statlog Heart Disease Dataset
Most proposed models for CAD diagnosis are evaluated on experimental datasets provided by the UCI machine learning repository (Dua & Graff, 2019). Among these datasets, the Cleveland clinic foundation Heart Disease (CHD) dataset is regularly used in the assessment of research studies. The other heart disease datasets available via the UCI repository come from: the Hungarian Institute of Cardiology (HIC), the University Hospital of Zurich (UHZ), the University Hospital of Basel (UHB), the Long Beach Medical Center (LBMC) and the "Statlog" dataset from the University of California Irvine. Another experimental dataset used in many research studies is the Z-Alizadeh Sani dataset from the Alizadeh cardiovascular imaging department, Tehran, Iran (Alizadehsani et al., 2013).
In this research work, two CAD datasets are utilized: Statlog (270 records, 150 of which do not present a heart disease) and CHD (297 records, 160 of which do not present a heart disease). The Statlog dataset is used to train and test the random forest classifier, and also to initialize the case base for the CBR solution. The CHD dataset is used as a second test for the random forest solution. The description of these datasets is presented in table (1).

LITERATURE REVIEW: DATA MINING TECHNIQUES FOR CAD PREDICTION
In order to prevent a severe form of CAD, abundant research studies have proposed models to achieve an accurate early diagnosis. The majority of these research works are based on machine learning. A survey published in 2018 presents the main models and techniques used for the prediction of heart disease. This study revealed that the most popular supervised learning models used for the prediction of heart disease are: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF) and ensemble models.
Table (2) summarizes the main recent research works using different data mining techniques for the prediction of CAD. The column 'FS' indicates the use of feature selection (or importance) in the corresponding study. The CHD dataset initially has five target classes representing increasing levels of the disease. Most research studies using this dataset reduce the target classes to two (absence or presence of CAD). The works (Patel et al., 2016) and (Amen et al., 2020) perform a five-stage CAD prediction.

Literature Review: Case-Based Reasoning for CAD Prediction
CBR is a methodology for problem solving based on analogy. A new problem is compared to previous successfully solved problems, and a memorized solution is proposed to solve the new problem, often with some adjustments. This resolution strategy is regularly used, especially for diagnosis problems. In the health domain, when it comes to giving an accurate diagnosis to a new patient, a doctor remembers previous diagnoses made for similar patients. This represents significant help in giving a correct diagnosis, and hence the right treatment. Abundant research works that use CBR have proposed decision support systems for a variety of medical applications (endocrinology, stress management, breast cancer, nephrology, cardiology, respiratory disorders, oncology, etc.) (Begum et al., 2009). The study of heart disease using CBR combined with other AI techniques is very profuse in recent research (Prakash, 2015; Jothikumar et al., 2017; Faizal & Hamdani, 2018; Ruhin Kouser et al., 2018). These works focus especially on diagnosis and prediction. In this research study, the authors focus only on recent research works that address CAD prediction.
Table (3) summarizes selected recent research works using CBR, combined with other AI techniques, for the early prediction of CAD. Since this research is focused on the improvement of the retrieve phase, it puts forth the aspects that influence the performance of case retrieval: the organization of the case base and the similarity calculus function. The work (Faizal & Hamdani, 2018) was evaluated on a particular heart disease dataset issued from Yogyakarta Public Hospital (YPH), Indonesia.

RANDOM FOREST AND CBR FOR CAD PREDICTION
For the early detection of CAD, the authors propose two solutions: an optimized Random Forest Model (RFM) and an improved case-based reasoning system (figure 1). The steps (a) through (d) explain the strategy followed in the design of the RFM solution:

b) To obtain the best RFM, the hyperparameters are tuned through numerous repetitive test runs.
c) Once the best RFM is obtained, it is saved using a serializing tool. The performance of this solution is evaluated on the Statlog test dataset and also on the whole CHD dataset.
d) The features' ranking is derived and the most pertinent attributes <f1, f2, …, f8> are selected.
After that, the CBR solution is constructed as follows:

e) The Statlog train dataset is considered as the initial case base.
f) Local similarity and global similarity functions are defined.
g) The CBR retrieve step is applied on the Statlog test dataset to evaluate the performance of this solution.
The features' ranking obtained in step (d) is taken into account to simplify the similarity calculus.

Performance Measures
For the performance measurement of a two-class prediction model, the literature defines the following outcomes: TP: the model correctly predicts the positive class (presence of CAD); FP: the model incorrectly predicts the positive class; TN: the model correctly predicts the negative class (absence of CAD); FN: the model incorrectly predicts the negative class.
These outcomes form the confusion matrix of the classifier. Additional statistics are usually considered for the assessment of a classifier. The sensitivity (also known as recall) of the model is the rate at which the event of interest is predicted correctly for all samples having the event. On the other hand, the specificity is defined as the rate at which nonevent samples are predicted as nonevents (Kuhn & Johnson, 2013, p. 256):

sensitivity = TP / (TP + FN)   (2)
specificity = TN / (TN + FP)   (3)

The precision and the accuracy of a model in prediction are defined by:

precision = TP / (TP + FP)   (4)
accuracy = (TP + TN) / (TP + FP + TN + FN)   (5)
From the sensitivity and the precision, the FScore can be calculated. It is a number between 0 and 1 and represents the harmonic mean of precision and sensitivity:

FScore = 2 × (sensitivity × precision) / (sensitivity + precision)   (6)
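These definitions translate directly into code. The counts in the example call are made up purely for illustration; they are not results from the paper.

```python
# Sketch: the statistics above computed from the confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)          # recall
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    fscore      = 2 * sensitivity * precision / (sensitivity + precision)
    return sensitivity, specificity, precision, accuracy, fscore

# Illustrative counts only (not the paper's results).
print(metrics(tp=45, fp=5, tn=40, fn=10))
```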
Finally, a ROC (Receiver Operating Characteristic) curve is drawn to assess the tradeoff between sensitivity and specificity, and the AUC (Area Under the ROC Curve) is calculated. The AUC is widely considered a good summary of a classifier's performance (Huang & Ling, 2005).
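As a hedged sketch, scikit-learn computes both the curve points and the AUC; the labels and scores below are illustrative values, not the paper's predictions.

```python
# Sketch: ROC curve points and AUC with scikit-learn on illustrative scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.5]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
print(roc_auc_score(y_true, y_score))  # -> 0.875
```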

Random Forest Model for CAD Prediction
For the generation of the Random Forest Model (RFM), the Scikit-Learn module (Sklearn, 2022) for the Python language is used. The Statlog dataset is split into train and test data (80%/20%).
The next step consists in tuning the hyperparameters of the random forest classifier. For each hyperparameter's possible values, the experimentation was repeated several times until the best performance was obtained. The generated models were evaluated through cross-validation (10 folds and 3 repeats) and their performance on the test dataset. The hyperparameters of the best RFM are listed in table (4).
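The tuning stage can be sketched with `GridSearchCV` over a `RepeatedStratifiedKFold` splitter matching the 10-fold, 3-repeat scheme. The parameter grid and the synthetic data below are illustrative assumptions, not the authors' exact search space.

```python
# Hedged sketch of the tuning stage: grid search with 10-fold, 3-repeat CV
# (illustrative grid and synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

X, y = make_classification(n_samples=270, n_features=13, random_state=0)
grid = {"n_estimators": [50, 100],
        "max_features": [3, 4],
        "criterion": ["gini", "entropy"]}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

In practice the best configuration found this way would then be refitted on the full training split and serialized, as described in step (c).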
To evaluate the RFM, it is first applied to predict the target class for the Statlog test dataset (54 rows). After that, the RFM is tested on the whole CHD dataset (297 rows) to check its performance on a different dataset. The results of these evaluations are listed in table (5) and the ROC curves (with the AUC) are depicted in figures (2) and (3).

Table 4. Hyperparameters of the best RFM:
Number of estimators (trees): n_estimators = 100
Number of samples to draw from the train dataset for each base estimator: max_samples = 0.50 (50%)
Number of features to consider when looking for the best split: max_features = 3
Function to measure the quality of a split: criterion = "Entropy"

The next stage consists in ranking the features in order to reveal the most pertinent ones. Table (6) lists the ranking given by the best RFM. It is worth noting that many interesting random forest models generated during the experiments ranked the 13 features differently. However, the eight leading features are almost always the same (possibly ranked in a different order).
In (Ghosh et al., 2021), the authors used the LASSO algorithm to rank the CAD attributes. The selected features are (C1, C3, C4, C5, C8, C10, C11, C13). The only difference between this selection and the eight leading features in table (6) is the presence of attribute C11 (slope) in (Ghosh et al., 2021) instead of attribute C12 (Ca) revealed by the RFM. Therefore, the first eight features from table (6) can be considered the most important in diagnosing CAD.

Case-Based Reasoning for CAD Prediction
The jCOLIBRI (Recio-Garcia et al., 2007) tool is used to implement the CBR system for CAD prediction. jCOLIBRI is an open Java framework developed by the research group GAIA (Group of Artificial Intelligence Applications) at Madrid Complutense University, Spain. The framework is configured according to the application characteristics.
First, local similarities for the 13 attributes are defined.
'x' and 'y' are possible integer values for the attribute, and 'count' represents the length of an integer interval (number of values). For the attribute C12, the interval contains four values. Equation (1) is used for the global similarity calculation.
As stated before (figure 1), the Statlog train dataset is considered as the initial case base for the CBR system. The records from the Statlog test dataset are introduced as new cases to be resolved by applying the CBR system. This study focuses only on the first phase of the CBR cycle: the retrieve step. The objective is to improve this step by reducing the time needed to select the closest case(s) from the case base to resolve a new case, while still obtaining accurate results. Three experiments were performed with the CBR methodology. In the first experiment, the global similarity is calculated on all the attributes of the case (13 attributes), equally weighted. In the second experiment, the attributes are weighted according to the feature importance listed in table (6). For the last experiment, only the eight leading attributes (table 6) are considered in the global similarity calculus. At the end of the retrieve step, the three best cases are selected and the target class is chosen by applying a majority vote. Table (8) lists the performance measures for the three experiments. The corresponding ROC curves are depicted in figures (4), (5) and (6).
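A minimal sketch of this retrieve-and-vote procedure is shown below. It assumes attribute values already normalized to [0, 1]; the mini case base, the weights and the three-attribute cases are invented for illustration (in experiment-3 style, only the leading attributes receive a weight).

```python
# Sketch of the retrieve step: weighted global similarity over selected
# attributes, then a majority vote over the k=3 nearest cases.
from collections import Counter

def retrieve(new_case, case_base, weights, k=3):
    def sim(stored):
        # Weighted mean of 1 - |difference| over the weighted attributes only
        num = sum(w * (1 - abs(new_case[i] - stored["problem"][i]))
                  for i, w in weights.items())
        return num / sum(weights.values())
    best = sorted(case_base, key=sim, reverse=True)[:k]
    # Majority vote on the target class of the k nearest cases
    return Counter(c["solution"] for c in best).most_common(1)[0][0]

base = [{"problem": [0.9, 0.7, 0.8], "solution": "CAD"},
        {"problem": [0.2, 0.1, 0.3], "solution": "no CAD"},
        {"problem": [0.8, 0.6, 0.7], "solution": "CAD"},
        {"problem": [0.3, 0.2, 0.1], "solution": "no CAD"}]
# Only attributes 0 and 1 carry weight; attribute 2 is dropped from the calculus.
print(retrieve([0.85, 0.65, 0.75], base, weights={0: 0.5, 1: 0.3}, k=3))  # -> CAD
```

Dropping unweighted attributes is what shortens the similarity calculus: the inner sum runs over eight terms instead of thirteen.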

Discussion
Table (9) compares the two proposed solutions for CAD diagnosis with related recent research studies. In this table, only the best performances of the selected works are listed. The comparison takes into account the accuracy of the chosen models in predicting CAD when all attributes are considered (without feature selection). A first glance at table (9) reveals that the accuracy of most listed approaches is close to or above 90%.
What is more, the proposed RFM solution is among the most accurate in diagnosing CAD. The only works having a significantly better accuracy are Miao et al. (2016) and Tama et al. (2020). In (Tama et al., 2020), an accuracy of 98.13% is achieved by applying a two-tier ensemble method on the Alizadeh Sani dataset, which is structurally different from the other datasets (CHD, Statlog, UHZ, UHB and LBMC, which share the same row structure). When the same model is evaluated on the Statlog dataset, the accuracy drops to 93.55%.
In (Miao et al., 2016), the accuracy in CAD prediction is 96.72%, obtained by applying an adaptive boosting algorithm on the UHZ dataset. UHZ is a small dataset (only 61 rows) and very imbalanced (only 5 records do not present a CAD). When the same model is evaluated on other datasets (CHD and HIC), the performances are clearly lower. In Naravelli et al. (2022), the solution based on the XGBoost algorithm achieved an accuracy of 95.03%, which is slightly better than the performance of the proposed RFM solution.
Recall that the primary objective of the CBR solution is to improve the retrieve step by reducing the time needed for the similarity calculus between two cases. By taking advantage of the feature ranking, the number of attributes considered in the global similarity calculus is reduced from 13 to eight, which cuts the similarity calculus of the retrieve step to roughly 8/13 of its original cost. What is more, table (8) confirms that when feature importance is used as attribute weights (experiment 2) and when less important features are dropped (experiment 3), the performance is undeniably better. In experiment 3, an accuracy of 90.74% indicates that the proposed CBR solution is among the best for CAD diagnosis. These results mean that not only is the retrieve step execution time improved, but the accuracy of the solution is also better. What is more, the CBR solution will keep improving because it is a learning methodology: the system retains interesting newly resolved cases that enhance its problem-solving power.

CONCLUSION
In this research, two solutions are proposed for the early detection of coronary artery disease. The first solution consists of an optimized random forest model, while the second uses a case-based reasoning approach. The RFM was first tuned through a long process to get the best hyperparameter configuration.
It is worth noting that the optimization of a random forest algorithm is a challenging task: with the same parameters, the bootstrap mechanism produces models with varying performances. The performance of the proposed RFM positions it among the best published models for the prediction of CAD. The CBR solution is improved by feature importance, which results in a substantial reduction of the retrieve step execution time. This is greatly important because the case base tends to grow due to the retain step. The performance of the CBR solution is very satisfactory when compared with other solutions to CAD prediction in recent research works, and it is expected to improve through the learning capacity of CBR.
The proposed RFM solution is complemented by a friendly, lightweight user interface that can run on a desktop or a mobile device. This interface can be used by health professionals and patients, and communicates directly with a doctor. Through this module, the physician can monitor the patient and intervene whenever an emergency arises.

Figure 1. Proposed approach for CAD prediction

Table 5. Performance measures of the RFM on the test data and the CHD dataset
Table (7) lists the functions used to calculate the similarity on every attribute. The first function (in table 7) is the normalized Euclidean distance. The 'Equal' function returns '0' if the values of two attributes are equal and '1' otherwise. The 'Interval' function is used for the attribute C12. Its definition is:
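The three functions described above can be sketched in plain Python. These are stand-ins for the jCOLIBRI implementations: the 'Interval' form shown here (distance normalized by the interval length 'count') is an assumption inferred from the surrounding text, since its exact definition is not reproduced in this extract.

```python
# Sketches of the local functions listed in Table (7); the 'Interval' form
# is an assumed reconstruction, not the paper's verbatim definition.
def euclidean(x, y, lo, hi):
    # Normalized Euclidean distance over the attribute's range [lo, hi]
    return abs(x - y) / (hi - lo)

def equal(x, y):
    # As described in the text: 0 if the two values are equal, 1 otherwise
    return 0 if x == y else 1

def interval(x, y, count):
    # Assumed form: distance normalized by the interval length ('count')
    return abs(x - y) / count

print(interval(1, 3, 4))  # -> 0.5
```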