Type 2 diabetes is caused by imbalanced glucose consumption in the body and in turn causes problems with the immune, nervous, and circulatory systems. Numerous studies have sought to predict this illness from a variety of clinical and pathological criteria, and as technology has advanced, several machine learning approaches have been applied to improve prediction accuracy. This study examines data pre-processing and how it affects machine learning algorithms. Two datasets were used for the experiment: LS, a locally developed and verified dataset, and PIMA, a dataset from Kaggle. In all, the research evaluates five machine learning algorithms and eight distinct scaling strategies. On the PIMA dataset, accuracy ranges from 46.99% to 69.88% when no pre-processing is used and may reach 77.92% when scalers are used. Because the LS dataset is small and controlled, accuracy for the dataset without scalers may be as low as 78.67%. With two labels, accuracy increases to 100%.
1. Introduction
Diabetes mellitus (DM) is one of the most common non-communicable diseases globally. It is observed that 46% of people with diabetes are not diagnosed at an early stage. By 2040, the number of people with diabetes is expected to rise to 642 million worldwide (Diabetes.co, n.d.). India contributes about 49% of the world’s burden. In the Southeast Asia region, India accounts for 77 million of the 88 million people with diabetes, a figure that is expected to increase to 134.2 million in 2045 (WHO, n.d.).
The number of people with diabetes in India increased from 26.0 million (95% UI 23.4–28.6) in 1990 to 65.0 million (58.7–71.1) in 2016 (Harris et al., 2017). In Maharashtra, the overall reported prevalence of diabetes in urban and rural areas is 10.9% and 6.5%, respectively (WHO, n.d.).
Enormous volumes of data and increasing complexity have led to rising interest in the use of machine learning (ML) in healthcare. ML builds on existing statistical methods and finds patterns in the data, and it offers different models for the prediction of type 2 diabetes. The accuracy of these models is of prime importance because the analysis directly affects patients’ lives (Khang, Rath, Anh et al., 2024). The aim of this research is to design a predictive model for estimating diabetes in healthy people across diverse age groups, based on lifestyle-related factors, i.e. stress, food habits, smoking, profession, and exercise. The impact of data pre-processing with different scalers on ML model performance is studied methodically to improve decision support systems for physicians (Khang, 2024a).
Some of the important pre-processing steps include data cleaning, pruning, feature selection, and scaling. Many researchers have considered diverse ML algorithms along with feature selection (Kaur & Kumari, 2018; Lai et al., 2019), but few have considered the effect of the data scaling process on overall model performance (Srinivas et al., 2010). Thus, the primary purpose of this study is to evaluate the effect of different data scaling methods on different ML algorithms and to develop a prediction model for healthy patients with early diabetes symptoms.
In the present study, five machine learning algorithms, namely Logistic Regression, K-Nearest Neighbours (KNN), Gaussian Naïve Bayes (GNB), Decision Tree (DT), and Random Forest (RF), are combined with seven data scaling methods, including MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer (QT), PowerTransformer (PT), and Normalizer, to find the best match for type 2 diabetes prediction. The effect of the different data scaling techniques is observed using the UCI Pima Indians Diabetes dataset (Pima Indians Diabetes dataset, n.d.) and the LS dataset (Patil & Shah, 2019), for which data were collected through a survey in the Indian environment.
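The pairing of scalers and classifiers described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the data are synthetic (generated with `make_classification`, not PIMA or LS), the per-feature multipliers, the train/test split, and all hyperparameters are illustrative choices, and an unscaled baseline labelled `"none"` is added for comparison.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    MinMaxScaler, Normalizer, PowerTransformer,
    QuantileTransformer, RobustScaler, StandardScaler,
)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 8 numeric features, binary label.
# This is NOT the PIMA or LS dataset.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
# Stretch the features to very different ranges so that scaling matters
# (the multipliers are arbitrary, chosen only for illustration).
X = X * np.array([1.0, 10.0, 100.0, 1.0, 50.0, 5.0, 0.1, 20.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# "none" is the no-scaling baseline; the rest are the scalers named in the text.
scalers = {
    "none": None,
    "MinMaxScaler": MinMaxScaler(),
    "StandardScaler": StandardScaler(),
    "RobustScaler": RobustScaler(),
    "QuantileTransformer": QuantileTransformer(n_quantiles=100, random_state=0),
    "PowerTransformer": PowerTransformer(),
    "Normalizer": Normalizer(),
}

# The five classifiers named in the text, with illustrative settings.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for s_name, scaler in scalers.items():
    for m_name, model in models.items():
        # Fresh copies so that no fitted state leaks between combinations.
        steps = [clone(model)] if scaler is None else [clone(scaler), clone(model)]
        pipe = make_pipeline(*steps)
        # The scaler is fitted on the training split only.
        pipe.fit(X_train, y_train)
        results[(s_name, m_name)] = pipe.score(X_test, y_test)

# Print one row per scaler with the test accuracy of each model.
print(f"{'scaler':<20}" + "".join(f"{m:>7}" for m in models))
for s_name in scalers:
    row = "".join(f"{results[(s_name, m)]:>7.3f}" for m in models)
    print(f"{s_name:<20}{row}")
```

Wrapping each scaler and classifier in a pipeline means the scaler's statistics come from the training split alone, so the held-out accuracy is not influenced by test data. The accuracies printed for the synthetic data say nothing about the PIMA or LS results reported in this study.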