An Empirical Evaluation of Similarity Coefficients for Binary Valued Data

David M. Lewis (Carnegie Mellon University, USA) and Vandana P. Janeja (University of Maryland, Baltimore County, USA)
Copyright: © 2013 |Pages: 23
DOI: 10.4018/978-1-4666-2148-0.ch006


In this paper, the authors present an empirical evaluation of similarity coefficients for binary-valued data. Similarity coefficients provide a means to measure the similarity or distance between two binary-valued objects in a dataset, where each attribute describing an object takes a value of 0 or 1. This is useful in several domains, such as comparing feature vectors in sensor networks, document search, router network mining, and web mining. The authors survey 35 similarity coefficients used in various domains and draw conclusions about the efficacy of the computed similarity under three conditions: (1) labeled data, to quantify the accuracy of the similarity coefficients; (2) varying data density, to evaluate the effect of sparsity; and (3) a varying number of attributes, to assess the effect of high dimensionality on the computed similarity.
Chapter Preview

Motivating Examples

Spatial Neighborhood Discovery

Let us consider an example in the domain of water monitoring using sensors placed in a river stream (Adam et al., 2004; Janeja et al., 2010). The sensor network comprises sensors placed in various parts of the stream, with the goal of detecting anomalous levels of toxicity in a water body. To find outliers, in the form of anomalous sensor readings or sensors that may be malfunctioning, it is first necessary to discover a spatial neighborhood comprising the relevant sensors with similar behavior. Each sensor is characterized by a set of attributes, or features, in its proximity, such as a factory, bridge, railroad, stream, or certain type of vegetation. Such information can be accumulated with the help of domain experts. Indeed, in many cases such studies precede sensor placement.

In addition to spatial proximity, this feature information is used to identify relationships between the sensors and to place them in similarly behaving spatial neighborhoods. The studies (Adam et al., 2004; Janeja et al., 2010) measure the similarity between the sensors' feature vectors using the Jaccard coefficient. This makes it possible to quantify the heterogeneity in a neighborhood that results from the impact of the various features. The studies show how refining the neighborhood with such similarity coefficients affects the outliers discovered.
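
To make the Jaccard coefficient concrete, the following is a minimal Python sketch, not the cited studies' implementation. It counts the features present at both sensors (a) and the features present at only one of them (b and c), and returns a / (a + b + c). The feature names, the three sensor vectors, and the choice to return 1.0 when neither sensor has any feature are assumptions made purely for illustration.

```python
from typing import Sequence

def jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Jaccard coefficient of two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # a: present in both; b: present only in x; c: present only in y.
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    if a + b + c == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    return a / (a + b + c)

# Hypothetical features near each sensor, in this fixed order:
# factory, bridge, railroad, stream, vegetation
sensor_a = [1, 1, 0, 1, 0]
sensor_b = [1, 0, 0, 1, 1]
sensor_c = [0, 0, 1, 0, 0]

sim_ab = jaccard(sensor_a, sensor_b)  # shares factory and stream
sim_ac = jaccard(sensor_a, sensor_c)  # shares no feature
```

Under these assumed vectors, sensor_a and sensor_b share two of the four features that either of them has, while sensor_a and sensor_c share none. Features absent from both sensors do not enter the Jaccard coefficient at all, which is one of the properties that distinguishes it from coefficients that also count joint absences.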
