An Empirical Evaluation of Similarity Coefficients for Binary Valued Data

David M. Lewis (Carnegie Mellon University, USA) and Vandana P. Janeja (University of Maryland, Baltimore County, USA)
Copyright: © 2011 | Pages: 23
DOI: 10.4018/jdwm.2011040103


In this paper, the authors present an empirical evaluation of similarity coefficients for binary valued data. Similarity coefficients measure the similarity or distance between two objects in a dataset when every attribute describing an object takes a 0-1 value. Such measures are useful in several domains, including the comparison of feature vectors in sensor networks, document search, router network mining, and web mining. The authors survey 35 similarity coefficients used in various domains and draw conclusions about the efficacy of the computed similarity under three conditions: (1) labeled data, to quantify the accuracy of the similarity coefficients; (2) varying density of the data, to evaluate the effect of sparsity of the values; and (3) a varying number of attributes, to assess the effect of high dimensionality on the computed similarity.
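
As an illustration of why data density matters, the following minimal sketch compares two widely known coefficients on a sparse pair of binary vectors. It is not taken from the paper: the function names and the example vectors are hypothetical, and the choice of the simple matching coefficient as the contrast to Jaccard is an assumption made here for illustration. Both coefficients are computed from the usual 2x2 counts, where a is the number of attributes that are 1 in both objects, b and c count the attributes that are 1 in only one object, and d counts the attributes that are 0 in both.

```python
def contingency(u, v):
    """Return the 2x2 counts (a, b, c, d) for two equal-length 0-1 vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    a = b = c = d = 0
    for x, y in zip(u, v):
        if x and y:
            a += 1      # 1 in both objects
        elif x:
            b += 1      # 1 only in the first object
        elif y:
            c += 1      # 1 only in the second object
        else:
            d += 1      # 0 in both objects
    return a, b, c, d


def jaccard_from_counts(a, b, c, d):
    """Jaccard: ignores joint absences (d)."""
    # Assumed convention: two all-zero vectors count as identical.
    return a / (a + b + c) if (a + b + c) else 1.0


def simple_matching(a, b, c, d):
    """Simple matching: joint absences count as agreement."""
    n = a + b + c + d
    return (a + d) / n if n else 1.0


# Hypothetical sparse data: 20 attributes, two 1s per object, one shared.
u = [1, 1] + [0] * 18
v = [1, 0, 1] + [0] * 17

counts = contingency(u, v)
jac = jaccard_from_counts(*counts)
smc = simple_matching(*counts)
```

On this pair the counts are (1, 1, 1, 17), so Jaccard gives 1/3 while simple matching gives 0.9. The gap comes entirely from the 17 joint absences, which is the kind of sparsity effect the evaluation is designed to expose.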

Motivating Examples

Spatial Neighborhood Discovery

Let us consider an example in the domain of water monitoring using sensors placed in a river stream (Adam et al., 2004; Janeja et al., 2010). The sensor network comprises sensors placed in various parts of the stream, with the goal of detecting anomalous levels of toxicity in the water body. To find outliers, whether anomalous sensor readings or malfunctioning sensors, it is first necessary to discover a spatial neighborhood comprising the relevant sensors with similar behavior. Each sensor is characterized by a set of attributes or features in its proximity, such as a factory, bridge, railroad, stream, or certain type of vegetation. Such information can be accumulated with the help of domain experts. Indeed, in many cases such studies precede sensor placement.

In addition to spatial proximity, this feature information is used to identify relationships between the sensors and to place them in similarly behaving spatial neighborhoods. The study (Adam et al., 2004; Janeja et al., 2010) uses the Jaccard coefficient to measure the similarity between the feature vectors of different sensors. This makes it possible to quantify the heterogeneity in a neighborhood that results from the impact of the various features. The study shows how refining the neighborhood with such similarity coefficients affects the outliers discovered.
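
To make the feature-vector comparison concrete, here is a minimal sketch of the Jaccard coefficient applied to binary sensor features. The sensor names, the feature ordering, and the feature values are hypothetical and are not data from the cited study; the handling of two all-zero vectors is likewise an assumed convention.

```python
def jaccard(u, v):
    """Jaccard coefficient between two equal-length 0-1 feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    # Assumed convention: two sensors with no features at all are identical.
    return 1.0 if either == 0 else both / either


# Hypothetical feature order: factory, bridge, railroad, stream, vegetation.
sensor_a = [1, 1, 0, 1, 0]
sensor_b = [1, 0, 0, 1, 1]
sensor_c = [0, 0, 1, 0, 0]

sim_ab = jaccard(sensor_a, sensor_b)  # 2 shared features out of 4 present
sim_ac = jaccard(sensor_a, sensor_c)  # no shared features
```

Here sensors a and b share two of the four features present in either vector, giving a similarity of 0.5, while sensors a and c share none and score 0.0. Under this sketch, a and b would be stronger candidates for the same neighborhood than a and c.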
