Machine Learning in Python: Diabetes Prediction Using Machine Learning

Astha Baranwal (VIT University, India), Bhagyashree R. Bagwe (VIT University, India) and Vanitha M (VIT University, India)
DOI: 10.4018/978-1-5225-9902-9.ch008

Abstract

Diabetes is a disease of the modern world. Modern lifestyles have led to unhealthy eating habits that cause type 2 diabetes. Machine learning has gained considerable popularity in recent years. It has applications in many fields and has proven increasingly effective in medicine. The purpose of this chapter is to predict whether a person has diabetes based on other factors or attributes. Several machine learning algorithms, namely logistic regression (LR), tuned and untuned random forest (RF), and multilayer perceptron (MLP), are used as classifiers for diabetes prediction. The chapter also presents a comparative study of these algorithms based on performance metrics such as accuracy, sensitivity, specificity, and F1 score.

Introduction

Diabetes is a disease that occurs when the blood glucose level becomes too high, which eventually leads to other health problems such as heart disease and kidney disease. Several data mining projects have used algorithms to predict diabetes in a patient. In most of these projects, however, nothing is mentioned about the dangers of diabetes in women after pregnancy. While data mining has been applied successfully in fields such as weather forecasting, market analysis, engineering diagnosis, and customer relationship management, its application to disease prediction and medical data analysis still has room for improvement in accuracy.

Machine learning is closely related to artificial intelligence (AI) and enables software applications to predict outcomes through statistical analysis. The algorithms aim to reach the best possible accuracy in predicting the output from the input data. Machine learning follows processes similar to those used in data mining and predictive modeling: the algorithms recognize patterns in the input data and then adjust the actions of the program accordingly.

Machine learning algorithms are commonly categorized as supervised or unsupervised. Supervised learning requires both input data and the desired output data to build a model. The model is typically built by a data analyst or a data scientist. During training, feedback on the model's accuracy and other performance metrics is used to revise the model as needed. Once the training phase is complete, the model can predict outcomes for new data. Classification is one of the main data mining tasks and falls under supervised learning, which means that the machine learns from labeled examples. In classification, every instance in the dataset is assigned a target value. Classification can be binary, with two possible classes, or multi-class, with more than two. When a single instance can belong to several classes at once, the task is known as multi-label classification. Classification algorithms are used mainly for prediction and belong to the category of predictive learning.

Unsupervised learning is used to draw inferences from input data that has no labeled responses. The data is not categorized, labeled, or classified into classes. Cluster analysis, one of the most common unsupervised learning methods, is used to find hidden patterns in data or to form groups based on the input data.
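
As a rough illustration of clustering, the sketch below runs k-means, one common clustering method, on a handful of unlabeled numbers. K-means is chosen here only as an example and is not a method named in this chapter; the readings and starting centers are made-up values, and the function name kmeans_1d is hypothetical.

```python
# Toy k-means on one-dimensional data. All values are invented for illustration.
def kmeans_1d(values, centers, iterations=10):
    """Group unlabeled numbers around the given starting centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical, unlabeled glucose-like readings.
readings = [85, 90, 95, 150, 160, 170]
centers, clusters = kmeans_1d(readings, centers=[80, 180])
print(centers)   # [90.0, 160.0]
print(clusters)  # [[85, 90, 95], [150, 160, 170]]
```

No labels are supplied at any point; the two groups emerge only from the distances between the values.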

While machine learning models have been around for decades, they have gained new momentum with the rise of AI. Deep learning models are now used in most advanced AI applications. If these models are applied to medicine, they could be revolutionary for society, and the diagnosis of diseases like diabetes would be easier than ever. Applications of machine learning in medical diagnosis fall under three classes: pathology, oncology, and chatbots. Pathology deals with the diagnosis of diseases using machine learning models built from patients' diagnostic measurements. Oncology uses deep learning models to identify cancerous tissue in patients. Chatbots designed using AI and machine learning techniques can identify patterns in a patient's symptoms and suggest a potential diagnosis or recommend further courses of action. This chapter falls under the pathological uses of machine learning, as the model created gives a diagnosis of whether a patient is diabetic or not.

Key Terms in this Chapter

F1 Score: F1 score is the harmonic mean of precision and recall. It is used when a balance between precision and recall is needed.
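
A minimal sketch of the F1 computation follows, assuming the standard definition F1 = 2PR / (P + R) for precision P and recall R. The function name and the zero-division convention are choices made for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(0.5, 0.5))  # 0.5
print(f1_score(1.0, 0.0))  # 0.0
```

The second call shows why F1 is stricter than a simple average: perfect precision cannot compensate for zero recall.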

Accuracy: It is a metric used to measure the correctness of a machine learning model. The model is trained on the training data to build a classifier, and the test data is then used to validate the classifier. The percentage of correctly classified test instances is termed accuracy.
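
The calculation can be sketched as follows. The labels are invented for illustration (1 for diabetic, 0 for non-diabetic), and the function name is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical test-set labels and classifier predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 0.75, i.e. 6 of 8 correct
```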

Area Under the Curve (AUC) Score: Area under the curve (AUC) is a binary classification metric that measures the area under the receiver operating characteristic (ROC) curve. It considers all possible classification thresholds. Different threshold values result in different true positive and false positive rates. As the threshold is decreased, more true positives are found, but also more false positives.
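
The effect of the threshold can be sketched with a few made-up scores. The helper names are hypothetical. The AUC is computed here through its pairwise interpretation, that is, the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve.

```python
def rates_at(threshold, y_true, scores):
    """True and false positive rates when score >= threshold means 'positive'."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


def auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented labels and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(rates_at(0.5, y_true, scores))  # (0.5, 0.0)
print(rates_at(0.3, y_true, scores))  # (1.0, 0.5)
print(auc(y_true, scores))            # 0.75
```

Lowering the threshold from 0.5 to 0.3 raises the true positive rate from 0.5 to 1.0, but the false positive rate rises from 0.0 to 0.5 as well.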

Multilayer Perceptron: A multilayer perceptron is a type of artificial neural network (ANN). It is a feedforward network that consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. It is trained with a supervised learning technique called backpropagation. Its main advantage is its ability to distinguish data that is not linearly separable.
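
To illustrate the point about data that is not linearly separable, the sketch below evaluates a tiny three-layer network on the XOR function, which no single linear boundary can separate. The weights are set by hand for this example and are not learned by backpropagation; all names are hypothetical.

```python
def step(z):
    """Threshold activation: fires when the weighted sum is positive."""
    return 1 if z > 0 else 0


def layer(inputs, weights, biases):
    """Output of one fully connected layer of step units."""
    return [
        step(sum(w * x for w, x in zip(ws, inputs)) + b)
        for ws, b in zip(weights, biases)
    ]


# Hand-chosen weights (an assumption for illustration, not trained values).
HIDDEN_W = [[1, 1], [1, 1]]   # first hidden unit acts as OR, second as AND
HIDDEN_B = [-0.5, -1.5]
OUT_W = [[1, -1]]             # output fires for OR and not AND
OUT_B = [-0.5]


def mlp_xor(x1, x2):
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUT_W, OUT_B)[0]


for a in (0, 1):
    for b in (0, 1):
        print(a, b, mlp_xor(a, b))
```

The hidden layer turns the inputs into a representation in which a single output unit can separate the two classes.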

Hyper-Parameter Tuning: The model architecture is defined by several parameters that are set before training rather than learned from the data. These parameters are referred to as hyperparameters. The process of searching for the model architecture that gives the best accuracy score is referred to as hyperparameter tuning.
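
One simple way to carry out such a search is a grid search: try each candidate value and keep the one with the best validation accuracy. The sketch below tunes the number of neighbors k of a one-feature nearest-neighbor classifier. This classifier is swapped in only to keep the example short and is not the tuned random forest of this chapter; the data points and all names are invented for illustration.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0


def validation_accuracy(train, validation, k):
    hits = sum(1 for x, y in validation if knn_predict(train, x, k) == y)
    return hits / len(validation)


def grid_search(train, validation, candidates):
    """Return the first candidate k with the highest validation accuracy."""
    best_k, best_score = None, -1.0
    for k in candidates:
        score = validation_accuracy(train, validation, k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Invented (feature, label) pairs; (4, 1) and (5, 0) play the role of noisy points.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1)]
validation = [(3.8, 0), (5.2, 1), (1.5, 0), (7.5, 1)]
print(grid_search(train, validation, candidates=[1, 3, 5]))  # (3, 1.0)
```

With k = 1 the classifier copies the noisy points and reaches only 0.5 accuracy on the validation set, while k = 3 smooths them out.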

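Sensitivity: It is the ratio of true positives to the sum of true positives and false negatives, and is therefore the same quantity as recall. Precision, by contrast, is the ratio of true positives to the sum of true positives and false positives.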

Specificity: It is the ratio of true negatives to the sum of the true negatives and false positives.
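
Sensitivity and specificity can both be read off the four confusion-matrix counts, as the following sketch shows. The labels are invented (1 for diabetic, 0 for non-diabetic) and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false negatives, true negatives, false positives."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, fn, tn, fp = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn))            # 0.75 (3 of 4 diabetic cases found)
print(round(specificity(tn, fp), 3))  # 0.667 (4 of 6 non-diabetic cases cleared)
```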

Supervised Learning: Machine learning is broadly classified into two categories: supervised learning and unsupervised learning. In supervised learning, the machine learns from examples. Historical (training) data is given as input to the machine, and a classifier model is built from it. A supervised algorithm also needs a target value for each training instance. In contrast, unsupervised learning algorithms still need input data but do not need target values.

Logistic Regression: Logistic regression is a classification algorithm that comes under supervised learning and is used for predictive learning. It describes the relationship between a binary outcome and one or more input variables. It works best for dichotomous (binary) classification.
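
A minimal sketch of the idea, assuming a single input feature and plain batch gradient descent on the log loss: the model passes a weighted sum through the sigmoid function to obtain a probability, and predicts the positive class when that probability reaches 0.5. The feature values (imagined as a standardized measurement), the learning rate, and all names are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.1, epochs=200):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training example.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(x, w, b, threshold=0.5):
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Invented, already centered feature values with binary labels.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
print(predict(1.5, w, b), predict(-1.5, w, b))  # 1 0
```

Because the output is a probability, the 0.5 threshold can be moved to trade sensitivity against specificity.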

Recall: Recall is the ratio of true positives to the sum of true positives and false negatives.
