Scalable Data Analysis Application to Web Usage Data

Hocine Chebi (Faculty of Electrical Engineering, Djillali Liabes University, Sidi Bel Abbes, Algeria)
Copyright: © 2021 | Pages: 14
DOI: 10.4018/978-1-7998-4703-8.ch014

Abstract

The number of hits on web pages continues to grow, and the web has become one of the most popular platforms for disseminating and retrieving information. Consequently, many website operators are encouraged to analyze how their sites are used in order to better meet the expectations of internet users. However, the way a website is visited can change depending on a variety of factors. Usage models must therefore be continuously updated in order to accurately reflect visitor behavior. This remains difficult when the time dimension is neglected or simply introduced as an additional numeric attribute in the description of the data. Data mining is defined as the application of data analysis and discovery algorithms to large databases with the goal of discovering non-trivial models. Several algorithms have been proposed to formalize newly discovered models, to build more efficient models, to process new types of data, and to measure the differences between data sets. However, most traditional data mining algorithms assume that the models are static and do not take into account their possible evolution over time. These considerations have motivated significant efforts in the analysis of temporal data as well as the adaptation of static data mining methods to data that evolves over time. The review of the main aspects of data mining dealt with in this thesis constitutes the body of this chapter, followed by a state of the art of current work in this field and a discussion of its major open issues. Interest in temporal databases has increased considerably in recent years, for example in finance, telecommunications, and surveillance. A growing number of prototypes and systems are being implemented to take the time dimension of data into account explicitly, for example to study the variability of analysis results over time.
To model an application, it is necessary to choose a common language that is precise and known by all members of a team. UML (Unified Modeling Language) is an object-oriented modeling language standardized by the OMG. This chapter presents the modeling through package and class diagrams built using UML, then the conceptual model of the data. Finally, the authors specify the SQL queries used to extract descriptive statistical variables of the navigations from a warehouse containing the preprocessed usage data.

Introduction

Access profiles for a website can be influenced by parameters of a temporal nature, such as the time of day and day of the week, seasonal events, and external events in the world (wars, economic crises). Most of the methods devoted to Web Usage Mining (Cooley et al., 1999) nevertheless analyze the entire period over which traces of use were recorded, so the results obtained are naturally those that predominate over that entire period. Certain types of behavior that take place during short sub-periods are thus not taken into account and remain ignored by conventional methods. It is, however, important to study these behaviors and therefore to carry out an analysis covering significant sub-periods. As the volume of data considered is very high, it is also important to use summaries to represent the profiles considered.

To overcome the problem of acquiring real usage data, we propose a methodology for the automatic generation of artificial data allowing the simulation of changes. Guided by the avenues arising from exploratory analyses, we propose a new approach based on non-overlapping windows for the detection and monitoring of changes in evolving data. This approach characterizes the type of change undergone by the behavior groups (appearance, disappearance, merger, split) and applies two validation indices based on the extension of the classification to measure the level of change identified at each time step. Our approach is completely independent of the classification method and can be applied to types of data other than usage data. Experiments on artificial data as well as on real data from different fields (academic, tourism and marketing) were carried out to assess the effectiveness of the proposed approach.
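As a minimal sketch of the windowing step only, the following Python code partitions timestamped page requests into consecutive non-overlapping windows of fixed length and summarizes each window by page counts. The record layout (a timestamp paired with a page path), the function names split_into_windows and summarize, and the one-hour window are assumptions made for illustration; they are not taken from the chapter, and the change characterization and validation indices are not reproduced here.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def split_into_windows(records, window):
    """Group (timestamp, page) records into consecutive non-overlapping windows.

    Windows are aligned on the earliest timestamp. Windows that contain
    no record are returned as empty lists so that time steps stay aligned.
    """
    if not records:
        return []
    ordered = sorted(records, key=lambda record: record[0])
    start = ordered[0][0]
    buckets = defaultdict(list)
    for timestamp, page in ordered:
        # Integer division of two timedeltas gives the window index.
        index = (timestamp - start) // window
        buckets[index].append(page)
    last_index = max(buckets)
    return [buckets.get(i, []) for i in range(last_index + 1)]


def summarize(windows):
    """Summarize each window by the number of requests per page."""
    return [Counter(pages) for pages in windows]


# Hypothetical requests, invented purely for illustration.
hits = [
    (datetime(2021, 1, 1, 9, 0), "/home"),
    (datetime(2021, 1, 1, 9, 30), "/news"),
    (datetime(2021, 1, 1, 10, 15), "/home"),
    (datetime(2021, 1, 1, 12, 5), "/contact"),
]

windows = split_into_windows(hits, timedelta(hours=1))
summaries = summarize(windows)
```

Any classification method could then be applied to each window separately and the resulting groups compared from one time step to the next, which is consistent with the approach being independent of the classification method.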

Relatively recently, usage analysis began to take into account the time dependence of behavior patterns. In (Roddick and Spiliopoulou, 2002), the authors review previous work. They summarize the proposed solutions and the outstanding problems in the exploitation of temporal data, through a discussion of temporal rules and their semantics, but also by investigating the convergence between data mining and temporal semantics. More recently, in (Laxman and Sastry, 2006), the authors briefly discuss methods to discover sequential patterns, frequent patterns and partial periodic patterns in data streams.

When it comes to large and dynamic data sources, the web has become the most relevant example, with a colossal increase in the number of documents uploaded and new information added every day. To attract new customers and meet the expectations of existing ones, a knowledgeable website manager should always keep in mind that offering more information is not always a good solution. In fact, users of a website place more value on the way this information is presented within the site. The analysis of usage traces (recorded in log files by the server that hosts the website) is proving to be an increasingly necessary practice to better understand the habits of Internet users. In this context, the time dimension plays a very important role because the underlying distribution of usage data can change over time. This change can be caused by updates to the content and/or structure of the website or by a natural shift in the interests of its users.

The change in individual behavior has also caught the attention of professionals in the humanities. Indeed, the American Psychological Association (APA) established the Decade of Behavior (2000-2010), whose goal was to promote meetings raising awareness of the importance of research in the social and behavioral sciences. Website access patterns are dynamic in nature and can be influenced by temporal factors, for example the time of day and day of the week at which a visit takes place, seasonal events (summer, winter, Christmas holidays), and one-off events around the world (economic crises, sports competitions, epidemics). It is therefore necessary to take the temporal dimension into account when analyzing this type of data.
