Risk Analysis of Diabetic Patient Using Map-Reduce and Machine Learning Algorithm

Nagaraj V. Dharwadkar (Department of Computer Science and Engineering, Rajarambapu Institute of Technology, Sakhrale, India), Shivananda R. Poojara (Department of Computer Science and Engineering, University of Tartu, Estonia) and Anil K. Kannur (Department of Computer Science and Engineering, Rajarambapu Institute of Technology, Sakhrale, India)
DOI: 10.4018/978-1-7998-3053-5.ch014


Diabetes is one of the four non-communicable diseases that cause the most deaths worldwide, and the number of diabetes patients is increasing day by day. Machine learning techniques can help in the early diagnosis of diabetes and so reduce its impact. In this chapter, the authors propose a system that imputes the missing values in a diabetes dataset and processes the data in parallel for pattern discovery, using a Hadoop-MapReduce-based C4.5 machine learning algorithm. The system uses the discovered patterns to classify patients into diabetes and non-diabetes classes and to predict the risk level associated with each patient. Two datasets, the Pima Indian Diabetes Dataset (PIDD) and the Local Diabetes Dataset (LDD), are used for experimentation. The experimental results show that the C4.5 classifier achieves an accuracy of 73.91% on PIDD and 79.33% on LDD. The proposed system offers an effective means of diagnosing diabetes patients early and identifying their associated risk level, so that patients can take precautions and begin treatment at an early stage of the disease.
Chapter Preview


Over the last decade, health care systems around the world have grown, and the volume of data they hold in the form of patient records has grown with them. These records include X-rays and other images, clinical reports, diagnoses, prescriptions, and so on. Health care systems now generate and maintain this data in digital form, and it may be structured, unstructured, semi-structured, or a combination of these. The data is useful not only for treating patients but also for anticipating what may happen to a patient’s health in the future.
Health care systems are concerned with the diagnosis and treatment of all diseases. Among the many health hazards they face from day to day, diabetes is one of the most challenging. It is one of the four major non-communicable diseases receiving the attention of the health care sector. Diabetes arises when the body does not produce enough insulin or cannot use it effectively, and it has become one of the most serious health issues worldwide. The number of diabetes patients has risen steadily over the past few years. According to the World Health Organization (WHO) worldwide survey report on diabetes, 108 million people were diabetic in 1980, and by 2014 the number had risen to 422 million. Between 1980 and 2014 the rate of increase varied from year to year, and today more than 600 million people suffer from diabetes.
Diabetes is classified into three types: type-1, type-2, and gestational diabetes. Type-1 diabetes is insulin-dependent: the patient’s pancreas cannot produce insulin, so the patient must inject insulin to survive. Type-2 diabetes is non-insulin-dependent and results from the body’s ineffective use of the insulin it produces. Gestational diabetes occurs during pregnancy. It is a temporary condition, but it can cause complications in pregnancy and can lead to the later development of type-2 diabetes (World Health Organization, 2016).
Every type of diabetes can cause complications and leaves a long-term adverse impact on the body. Diabetic patients may suffer heart attack, stroke, blindness, leg amputation, kidney failure, and other conditions. In pregnant women, diabetes raises the risk of fetal death and other complications of pregnancy. Because the complications are long-term, patients also bear an economic burden from the cost of regular medical checkups and prescriptions (World Health Organization, 2016). It is therefore necessary to control this growing disease. There is currently no permanent, complete cure for diabetes, so the best way to make living with it easier is early diagnosis of diabetes and of the diseases it can cause. Reducing the impact of diabetes requires a contribution from everyone: governments, the health care industry, technology specialists, and scientists must all provide solutions (World Health Organization, 2016). Technologies such as machine learning and data mining make it possible to analyze diabetes data and so provide early diagnosis, precautions, and pre-treatment to the patient (Eswari, T, 2015). The digital data generated in health care systems will help in analyzing patients’ risks.
Diabetes is studied not only within health care but also in engineering and biotechnology, which work toward solutions that reduce patients’ risks. Computer science, one of the engineering disciplines, offers many techniques for analyzing patients’ digital data and predicting what may happen. The problem addressed in this work is the risk analysis of diabetic patients, for which we propose a methodology based on Map-Reduce and decision-tree algorithms. The proposed work focuses on the analysis of diabetes data. The system first imputes the missing values in the diabetes dataset and then processes the data in parallel, using machine learning algorithms integrated with the Hadoop Map-Reduce environment, to discover patterns in it. The system uses these patterns to diagnose diabetes patients and to predict their associated risk levels. Many researchers from different disciplines have contributed research on this problem. The next section gives an overview of that prior work.
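The first step of the system, imputing missing values, can be illustrated with a minimal sketch. This preview does not say which imputation method the chapter uses, so the sketch assumes simple column-mean imputation as a stand-in. The function name `impute_column_means`, the use of `None` as the missing-value marker, and the sample records are all hypothetical.

```python
from statistics import mean

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column.

    Assumption: column-mean imputation is used only as an illustration;
    the chapter's actual imputation method is not stated in this preview.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # A column with no observed values cannot be imputed and stays None.
        col_means.append(mean(observed) if observed else None)
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]

# Hypothetical records: (plasma glucose, body-mass index); None marks a missing value.
records = [
    [148, 33.6],
    [None, 26.6],
    [183, None],
    [89, 28.1],
]
imputed = impute_column_means(records)
```

After this step every record is complete, so the later pattern-discovery stage does not have to special-case missing attributes.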

Key Terms in this Chapter

Pattern-Matching: The process of checking a given sequence of strings for the presence of the elements of a pattern.

Plasma-Glucose: The blood sugar level at a particular point in time.

Training: The process of making a machine learning model learn from the training datasets.

Machine-Learning: The ability of a system to learn and improve automatically from experience without being explicitly programmed.

Pattern-Discovery: The process of recognizing patterns in data using machine learning algorithms.

Testing: The process of applying a trained machine learning model to unknown samples from the testing datasets to obtain predictions.

Tolerance: The capacity to endure continued exposure to glucose-level conditions without adverse effect.

Map-Reduce: A framework for processing data with distributed, parallel algorithms.

Hadoop: A distributed processing framework, running on clustered systems, that is used to store and process data for big-data applications.
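
To make the Map-Reduce and Pattern-Discovery terms more concrete, the sketch below simulates a map phase and a reduce phase in plain Python and uses the reduced counts to compute the gain ratio, the splitting criterion of C4.5. It is not Hadoop code and not the authors' implementation. The mapper and reducer design, the attribute names, and the discretized sample records are hypothetical, and a full C4.5 would also handle continuous attributes and build the tree recursively.

```python
from collections import defaultdict
from math import log2

def mapper(record, attributes):
    """Emit ((attribute, value, label), 1) for every attribute of one record."""
    *values, label = record
    for attribute, value in zip(attributes, values):
        yield (attribute, value, label), 1

def reducer(pairs):
    """Sum the counts for each key, as a shuffle-and-reduce phase would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_ratio(counts, attribute):
    """Gain ratio of one attribute, computed only from the reduced counts."""
    by_value = defaultdict(lambda: defaultdict(int))
    by_label = defaultdict(int)
    for (attr, value, label), n in counts.items():
        if attr == attribute:
            by_value[value][label] += n
            by_label[label] += n
    total = sum(by_label.values())
    sizes = [sum(labels.values()) for labels in by_value.values()]
    conditional = sum(
        size / total * entropy(list(labels.values()))
        for size, labels in zip(sizes, by_value.values())
    )
    gain = entropy(list(by_label.values())) - conditional
    split_info = entropy(sizes)
    return gain / split_info if split_info else 0.0

# Hypothetical, already discretized records: (glucose, bmi, class label).
attributes = ["glucose", "bmi"]
records = [
    ("high", "obese", "diabetic"),
    ("high", "normal", "diabetic"),
    ("normal", "obese", "non-diabetic"),
    ("normal", "normal", "non-diabetic"),
]

# Map phase: each record is processed independently, so it can run in parallel.
pairs = [pair for record in records for pair in mapper(record, attributes)]
# Reduce phase: counts for identical keys are combined.
counts = reducer(pairs)
# The attribute with the highest gain ratio is chosen for the split.
best = max(attributes, key=lambda a: gain_ratio(counts, a))
```

Because the mapper looks at one record at a time and the reducer only adds counts, the records can be partitioned across cluster nodes without changing the result, which is what makes this kind of counting a natural fit for Map-Reduce.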
