Integrating Unsupervised and Supervised ML Models for Analysis of Synthetic Data From VAE, GAN, and Clustering of Variables

Clustering of variables is a specialized approach to dimensionality reduction, evaluated here for data reduction on a Kaggle diabetes dataset. Because the original dataset is small, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are used to generate 100,000 records, which are tested for resemblance to the real data using standard statistical methods. When analyzed using machine learning (ML) models, VAE-data is more representative of the real data than GAN-data. Applying clustering of variables to the VAE-data yields new synthetic variables (SV); the SV-data is then augmented with the target variable. A Random Forest model is applied to both the VAE-data and the SV-data, and the SV-data results match those of the VAE-data, confirming the quality of the new data. The SV-data also provides insights into correlations and data dispersion patterns. The analysis thus integrates unsupervised learning (clustering of variables) with supervised learning (classification), which is reflected in the results.


INTRODUCTION
Machine learning (ML) algorithms can be broadly classified into supervised and unsupervised learning types. Supervised ML is ideal when target variable data are available along with the feature variable data; such models are generally used for classification and regression problems. When target variable data are not available and the objective is to classify the data into natural groups, unsupervised ML models such as cluster analysis are used.
Clustering is an unsupervised learning model typically used to group items or entities with similar attributes together. Clustering algorithms have been used in multiple domains, such as forecasting customer demand based on recency, frequency, and monetary characteristics (Seyedan et al., 2022), clustering of vascular risk factors (Holthuis et al., 2021), improving predictions of the stock market by "using information of similar stocks, determined via clustering, compared to a prediction model that does not take into account such cluster-derived data" (Javier, 2023), and accurately predicting spatiotemporal patterns in travel time by a joint iterative clustering and prediction algorithm (Shaji et al., 2022).
However, as the applications of ML algorithms expand in scope, and with the advent of wearable devices and other technological advances (Tufail et al., 2023), researchers confront two main challenges: (1) obtaining a data set large enough to construct meaningful machine learning models (L'Heureux et al., 2017), and (2) handling high-dimensional data sets, which are becoming more prevalent across multiple disciplines (Yuan, 2023), including genetics (Chi et al., 2016), organizational psychology, and neuroscience (Waldman et al., 2019). The effectiveness of machine learning models is inherently tied to the quality and quantity of the data used for training. However, data that are both abundant and of high quality are often scarce.
In this context, synthetic data emerge as a pivotal solution. Synthetic data offer a means to overcome the limitations of data scarcity by providing an avenue to generate data sets that possess both the required quantity and quality. By leveraging synthetic data, machine learning practitioners can enhance the robustness and reliability of their models, ensuring they are equipped to make accurate predictions in various domains and applications.
A second component required for the analysis of high-dimensional data is dimensionality reduction, which is necessary for meaningful data analysis and also provides more insightful visualizations (Xia et al., 2023). Moreover, the real data available in many domains are often too limited in size to build robust ML models (de Melo et al., 2022). It is in this context that this study uses 100,000 records of synthetic data (variational autoencoder (VAE)-data) generated from the real diabetes data set from Kaggle, which has just 768 records. Clustering of variables is performed on the VAE-data to obtain synthetic variables, which are linear combinations of the features in the VAE-data. The synthetic variables are fewer in number than the features in the VAE-data and thus result in dimensionality reduction. This newly generated synthetic variables data (SV-data) are used to train and test unsupervised clustering and supervised classification models to predict the outcome value of the diabetes condition. The quality of the new synthetic variables, in terms of capturing the inherent patterns of the real data, is assessed by applying ML methods to both the VAE-data and the SV-data and comparing the resulting accuracies of each model.
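The dimensionality-reduction step described above can be sketched in a few lines. The sketch below is illustrative, not the study's actual pipeline: it uses scikit-learn's FeatureAgglomeration as a stand-in for clustering of variables (it groups correlated features and pools each group into one new column, i.e., a linear combination of its member features), and randomly generated data in place of the VAE-data.

```python
from sklearn.datasets import make_classification
from sklearn.cluster import FeatureAgglomeration

# Placeholder feature matrix standing in for the VAE-data
# (shapes and names are illustrative assumptions).
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Cluster the 8 features into 4 groups; each group is pooled into one
# synthetic variable, a linear combination (here, the mean) of its members.
agg = FeatureAgglomeration(n_clusters=4)
sv = agg.fit_transform(X)  # SV-data: one column per feature cluster

print(X.shape, sv.shape)  # fewer columns after reduction
```

The reduced SV-data can then be passed to any downstream classifier in place of the original feature matrix, which is the comparison the study performs.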

CONTRIBUTION TO THE LITERATURE
Our contribution to this body of literature encompasses several novel aspects, which are discussed below. First, we introduce a unique integration of unsupervised techniques, such as clustering, with supervised methods, such as classification. This fusion not only represents an innovative approach but also signifies a comprehensive strategy aimed at enhancing classification accuracy by leveraging the strengths of both types of algorithms.
Second, our study demonstrates the quality and reliability of synthetic data generated through this integrated approach. By applying a combination of unsupervised and supervised machine learning models to both the VAE-data and the SV-data, we observed comparable accuracies. Specifically, training a Random Forest classifier on 80% of the VAE-data, with the remaining 20% reserved for independent testing, yielded promising results. These findings underscore the potential of synthetic data generators, such as variational autoencoders, in addressing challenges related to limited real data availability and privacy concerns.
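The evaluation protocol above (an 80/20 split with a Random Forest classifier) can be sketched as follows. This is a minimal illustration using scikit-learn with randomly generated placeholder data; the data set, hyperparameters, and reported accuracies of the study itself are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 100,000-record VAE-data (or the SV-data).
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)

# Hold out 20% for independent testing, as in the study's protocol.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # accuracy on the held-out 20%
```

Running the same protocol once on the VAE-data and once on the SV-data, then comparing the two accuracy values, is the comparison the study uses to assess the quality of the synthetic variables.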