Discovering Similarity Across Heterogeneous Features: A Case Study of Clinico-Genomic Analysis

Vandana P. Janeja, Josephine M. Namayanja, Yelena Yesha, Anuja Kench, Vasundhara Misal
Copyright: © 2020 | Pages: 21
DOI: 10.4018/IJDWM.2020100104

Abstract

Datasets that mix continuous and categorical attributes pose challenges for data clustering. Traditional clustering techniques like k-means clustering work well when applied to small homogeneous datasets. However, as the data size grows, it becomes increasingly difficult to find meaningful and well-formed clusters. In this paper, the authors propose an approach that utilizes a combined similarity function, which looks at similarity across numeric and categorical features, and employs this function in a clustering algorithm to identify similarity between data objects. The findings indicate that the proposed approach handles heterogeneous data better by forming well-separated clusters.

Introduction

Clustering is one of the most popular and effective techniques for pattern analysis, in which data objects are partitioned into meaningful groups called clusters (Bhatnagar, Kaur & Mignet, 2009; Wan, Gao & Li, 2012). Traditional clustering techniques like k-means clustering work well when applied to small homogeneous datasets (Van Hieu & Meesad, 2015; Sreedhar, Kasiviswanath, & Chenna Reddy, 2017). However, as the data size grows, it becomes increasingly difficult to find meaningful and well-formed clusters. In addition, the attributes in large real-world datasets are rarely homogeneous and typically include both continuous and categorical attributes (Ji et al., 2019). Restricting the analysis to a homogeneous subset, such as the continuous attributes alone, may reduce the effectiveness of detecting hidden clusters in heterogeneous data (D’Urso & Massari, 2019). In the vast literature of traditional data mining and the domain of big data, there is very limited work on mixed attribute clustering (Madhuri et al., 2014). Previous approaches to heterogeneous data clustering attempt to cluster the data by converting categorical attributes into continuous attributes. On the other hand, the similarity measures proposed specifically for categorical data may not truly capture the inherent nature of the datasets involved, given that different similarity coefficients may lead to different outcomes (Lewis & Janeja, 2011). Our paper addresses this gap in the literature by proposing an approach that combines continuous and categorical features for the purpose of cluster analysis in large datasets (Foss & Markatou, 2018). Specifically, we propose a unified clustering approach that combines features from multiple heterogeneous datasets to detect similarity.
This approach utilizes a combined similarity function, which looks at similarity across numeric and categorical features, and employs this function in a clustering algorithm to identify patient similarity. However, given that clustering large heterogeneous data may result in malformed clusters, we also propose an iterative unified clustering approach, which extends our unified clustering by drilling down into such malformed clusters in order to improve clustering outcomes. Indeed, heterogeneous attributes are commonly found in many real-world applications that generate massive amounts of data. For example, in health care, individuals can have varying degrees of susceptibility to a disease, which poses challenges to developing personalized treatments (U.S. Food and Drug Administration). Let us consider a 55-year-old male patient who weighs 190 lbs, has a Body Mass Index of 29.7 kg/m², and has a history of hypertension and dyslipidemia. He has a family history of Type 2 Diabetes, Coronary Artery Disease, and Renal insufficiency. He presents with some of the common symptoms of Type 2 Diabetes, such as weight gain, and takes medication to reduce his cholesterol levels (Hickner, 2011). Given this assessment, does this patient meet the criteria for a diagnosis of Type 2 Diabetes? We should also consider his genomic makeup, because many cases of Type 2 Diabetes are caused by genetic predispositions. In addition, patients with certain genetic makeup respond differently to different treatment plans (Wu et al., 2014). Diabetes has confounded researchers and practitioners alike and remains a highly prevalent and well-studied condition (Wu et al., 2014). In consideration of this, it is important to integrate genomic factors with the clinical data of diabetes patients to identify a well-designed treatment plan. Specifically, similar past diabetes cases can be retrieved to treat new cases based on similar clinical and genomic factors.
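
To make the idea of a combined similarity function concrete, the following Python sketch scores two patient records by averaging a numeric similarity and a categorical similarity. It is not the authors' function, which this preview does not specify. It assumes a Gower-style formulation: range-normalized absolute differences for numeric features and simple matching for categorical ones. The function name, the weight `w_num`, the attribute ranges, and the second patient record are all hypothetical and chosen only for illustration.

```python
import numpy as np


def combined_similarity(a_num, b_num, a_cat, b_cat, ranges, w_num=0.5):
    """Gower-style similarity in [0, 1] between two mixed-type records.

    Assumption: this is an illustrative stand-in, not the paper's function.
    `ranges` holds a positive value range for each numeric attribute.
    """
    a_num = np.asarray(a_num, dtype=float)
    b_num = np.asarray(b_num, dtype=float)
    ranges = np.asarray(ranges, dtype=float)

    # Numeric part: 1 minus the range-normalized absolute difference,
    # clipped to [0, 1] and averaged over the numeric attributes.
    scaled_diff = np.clip(np.abs(a_num - b_num) / ranges, 0.0, 1.0)
    num_sim = float(np.mean(1.0 - scaled_diff))

    # Categorical part: fraction of attributes with identical values
    # (simple matching coefficient).
    cat_sim = float(np.mean([x == y for x, y in zip(a_cat, b_cat)]))

    # Weighted combination of the two partial similarities.
    return w_num * num_sim + (1.0 - w_num) * cat_sim


# Numeric attributes: age (years), weight (lbs), BMI (kg/m^2).
# Categorical attributes: hypertension, dyslipidemia, family history of T2D.
# The first record follows the example patient in the text; the second
# record and the attribute ranges are hypothetical values.
patient_a = ([55.0, 190.0, 29.7], ["yes", "yes", "yes"])
patient_b = ([60.0, 210.0, 31.7], ["yes", "no", "yes"])
assumed_ranges = [50.0, 200.0, 20.0]

score = combined_similarity(
    patient_a[0], patient_b[0], patient_a[1], patient_b[1], assumed_ranges
)
print(f"combined similarity: {score:.3f}")
```

A pairwise matrix of such scores could be converted to distances (for example, one minus the similarity) and passed to any clustering algorithm that accepts precomputed distances. The equal weighting of the numeric and categorical parts is a design choice for this sketch, not a recommendation from the source.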
