A Relative Performance of Dissimilarity Measures for Matching Relational Web Access Patterns Between User Sessions

Dilip Singh Sisodia (National Institute of Technology Raipur, India)
DOI: 10.4018/978-1-5225-3870-7.ch010

Abstract

Customized web services are offered to users by grouping them according to their access patterns. Clustering techniques are very useful for grouping users and analyzing web access patterns. Clustering can be object clustering, performed on feature vectors, or relational clustering, performed on relational data. Relational clustering is preferred over object clustering for web users' sessions because of the high dimensionality and sparsity of web users' data. However, relational clustering of web users depends on the underlying dissimilarity measure used. Therefore, choosing the correct dissimilarity measure for matching relational web access patterns between user sessions is very important. In this chapter, the various dissimilarity measures used in relational clustering of web users' data are discussed. The concept of an augmented user session is also discussed to derive different augmented session dissimilarity measures. The discussed session dissimilarity measures are used with relational fuzzy clustering algorithms. The comparative performance of binary session similarity and augmented session similarity measures is evaluated using an intra-cluster and inter-cluster distance-based cluster quality ratio. The results suggest that the augmented session dissimilarity measures in general, and the intuitive augmented session (dis)similarity measure in particular, perform better than the other measures.

Introduction

The access data generated by Web applications is growing at a phenomenal rate, and this web usage data is used to gain additional knowledge about users' navigation patterns. The knowledge extracted from users' access patterns may be used to serve their needs better. The access patterns of Web users are stored in server logs, and the logs are partitioned into user sessions for analysis.

The user sessions are represented as feature vectors of accessed page URLs. The number of page URLs gives the dimension of the vector, and each component is assigned a binary value: one denotes that the web page (URL) was accessed in an individual session, and zero that it was not (Nasraoui, Hichem, Krishnapuram, & Joshi, 2000).

Consider the number of web pages in a website; Eq. (1) may then be used to represent a user session as a binary vector with one dimension per web page.

(1)
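The binary representation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `encode_session` and the four-page example site are hypothetical and do not come from the chapter.

```python
def encode_session(accessed_urls, site_urls):
    """Return a 0/1 list with one entry per URL in site_urls.

    An entry is 1 if that URL was accessed in the session, else 0.
    """
    accessed = set(accessed_urls)
    return [1 if url in accessed else 0 for url in site_urls]


# Hypothetical site with four pages (an assumption for illustration).
site_urls = ["/", "/courses", "/courses/cs101", "/research"]
session = encode_session(["/courses", "/"], site_urls)
print(session)
```

The order of `site_urls` fixes the meaning of each dimension, so every session must be encoded against the same page list.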

The similarity measures are defined between any two user sessions using the binary feature vector representation of Eq. (1) and incorporating the accessed URLs. For better understanding and clear distinction, the conventional web user session similarity measures are renamed as binary session similarity (BSS), binary URL syntactic similarity (BUSS), and combined binary session similarity (CBSS) (Sisodia, Verma, & Vyas, 2016d). A simple binary session similarity measure between two web user sessions is given by Eq. (2). This measure assumes that the web page URLs accessed in a user session are totally independent and completely ignores the syntactic structure of the URLs.

(2)
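A minimal sketch of a binary session similarity follows. It assumes the cosine form used with binary session vectors in Nasraoui et al. (2000): the number of URLs shared by both sessions divided by the square root of the product of the two sessions' URL counts. The function name and the handling of empty sessions are assumptions made for illustration.

```python
import math


def binary_session_similarity(s_k, s_l):
    """Cosine similarity between two equal-length 0/1 session vectors."""
    shared = sum(a * b for a, b in zip(s_k, s_l))  # URLs accessed in both
    n_k, n_l = sum(s_k), sum(s_l)                  # URLs accessed in each
    if n_k == 0 or n_l == 0:
        return 0.0  # assumption: an empty session is similar to nothing
    return shared / math.sqrt(n_k * n_l)


# Two hypothetical sessions over a four-page site; they share one URL.
s_a = [1, 1, 0, 0]
s_b = [0, 1, 0, 1]
print(binary_session_similarity(s_a, s_b))
```

Because only exact URL matches contribute to `shared`, two sessions that visit different pages of the same site section score zero under this form.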

The main limitation of the BSS measure is that it completely ignores the syntactic structure of the accessed URLs. An alternative URL-based syntactic similarity measure, renamed here as BUSS, is reported in the literature (Nasraoui et al., 2000) to overcome this limitation. The BUSS measure employs the syntactic similarity between any pair of URLs given by Eq. (3).

(3)

Here the length of a URL is the number of edges on the path traversed from the root node to the respective node of that URL in the user session. Eq. (3) is used to incorporate the syntactic similarities of URLs between two binary sessions, and the resulting session similarity is computed by Eq. (4).

(4)
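The URL-level and session-level syntactic similarities can be sketched as below. Several points are assumptions rather than the chapter's stated definitions: the URL similarity is taken as the length of the common path prefix divided by max(1, longest path length - 1), capped at 1, following the form in Nasraoui et al. (2000). Path length is counted as the number of non-empty path segments. The session-level score is the average URL similarity over all pairs of accessed URLs. All function names are hypothetical.

```python
def path_segments(url):
    """Split a URL path into non-empty segments (edges from the root)."""
    return [seg for seg in url.split("/") if seg]


def url_syntactic_similarity(url_i, url_j):
    """Shared path prefix length, scaled by the longer path, capped at 1."""
    p_i, p_j = path_segments(url_i), path_segments(url_j)
    common = 0
    for a, b in zip(p_i, p_j):
        if a != b:
            break
        common += 1
    longest = max(len(p_i), len(p_j))
    return min(1.0, common / max(1, longest - 1))


def url_based_session_similarity(urls_k, urls_l):
    """Average pairwise URL similarity between two sessions' accessed URLs."""
    if not urls_k or not urls_l:
        return 0.0  # assumption: an empty session is similar to nothing
    total = sum(url_syntactic_similarity(u, v) for u in urls_k for v in urls_l)
    return total / (len(urls_k) * len(urls_l))


# Hypothetical sessions given as lists of accessed URLs.
k = ["/courses/cs101/hw1"]
l = ["/courses/cs201/hw1", "/research/papers"]
print(url_based_session_similarity(k, l))
```

Working on lists of accessed URLs is equivalent to summing over binary vectors, since only pairs of accessed pages contribute. The segment-counting convention directly affects the scores, so a different convention (for example, counting the root as a node) would give different values.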

The combined similarity measure between two sessions takes advantage of the similarity measures defined in Eqs. (2) and (4) and is given in Eq. (5).

(5)
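One way to combine the two scores is sketched here, assuming the combination is simply the larger of the two values, as in Nasraoui et al. (2000). The conversion to a dissimilarity as (1 - S) squared is likewise an assumption, shown only because relational clustering operates on dissimilarities. Both function names are hypothetical.

```python
def combined_session_similarity(bss, buss):
    """Assumed combination: take the larger of the two similarity scores."""
    return max(bss, buss)


def session_dissimilarity(similarity):
    """Assumed conversion from a similarity in [0, 1] to a dissimilarity."""
    return (1.0 - similarity) ** 2


# Hypothetical scores for one pair of sessions.
s = combined_session_similarity(0.5, 0.25)
print(s, session_dissimilarity(s))
```

Taking the maximum lets the URL-based score lift pairs of sessions that share no exact URLs but browse the same part of the site, while the binary score still dominates when sessions overlap heavily.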

Key Terms in this Chapter

Binary Sessions: Sessions in which each page URL is considered either accessed or not accessed and is represented by a binary value of one or zero.

Frequency of Page: The number of times any web page is visited in the session.

Duration of Page: Time spent on a page by a web user, computed from the access log file as the difference between the time of the request for the previous page and the time of the request for the next web page in the session.

Augmented Sessions Similarity: Similarity between two user sessions computed using the notion of augmented sessions.

Augmented Sessions: Sessions that account for the web user’s habits, interests, and expectations for accessed page URLs by measuring the relevance of pages in every session, because not all URLs visited in a session are equally important to the user.

Binary Sessions Similarity: Session similarity between two user sessions considering binary sessions.

Sessions: A session is a set of web resources requested in a particular time period during a website visit.
