1. Introduction
Data warehousing systems and OLAP technologies are effective for analyzing data and supporting decision-making, especially when the data are numerical. However, most enterprise data are in textual form, such as reports and e-mails. Standard tools are inadequate for analyzing such textual data, in particular for extracting their semantic content. Moreover, numerical data are aggregated with standard aggregation functions such as sum and average, and these functions are not suitable for textual data because of their unstructured nature. It is therefore important to propose new techniques that handle textual data and aggregate them in an OLAP cube, so that effective analyses can be performed in the decision support process.
OLAP systems allow users to navigate interactively through multidimensional cubes, from one view to another. OLAP analysis makes it possible to express complex queries and to view aggregated results relevant to decision-making. To deal with textual data, IR techniques are often used; they generally evaluate the relevance of the data to a query composed of simple keywords expressing the user's needs. Most often, this relevance is based on term frequency in the document. The results generated by IR systems are therefore limited to extracting information from documents, whereas in a Text OLAP system we are interested in navigational analysis based on aggregation operators. For instance, if we consider a corpus of curricula vitae (CVs) for the selection of candidates during recruitment, the response to a job offer is a large volume of documents that is hard for the decision-maker to manage. Applying OLAP analysis to these data allows decision-makers to navigate through OLAP cubes and to observe the data along several dimensions organized into different hierarchical levels. For instance, the decision-maker can observe the competencies in computer science for the year 2012 in France and then, by applying drill-down operations, observe those for the year 2012 in Paris, Lyon, etc.
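The CV scenario above can be illustrated with a minimal sketch. It assumes a toy in-memory corpus, a raw term-frequency score, and a two-level location hierarchy (country, then city); the records, the `tf_score` and `view` helpers, and the query are all hypothetical and are not the system described in this paper.

```python
from collections import defaultdict

# Hypothetical toy corpus: each CV carries a year, a location and free text.
CVS = [
    {"id": "cv1", "year": 2012, "country": "France", "city": "Paris",
     "text": "java developer with java and sql experience"},
    {"id": "cv2", "year": 2012, "country": "France", "city": "Lyon",
     "text": "python data analyst sql"},
    {"id": "cv3", "year": 2012, "country": "France", "city": "Paris",
     "text": "network administrator"},
    {"id": "cv4", "year": 2011, "country": "France", "city": "Lyon",
     "text": "java architect"},
]

def tf_score(text, keywords):
    """Relevance as the raw frequency of the query keywords in the text."""
    tokens = text.lower().split()
    return sum(tokens.count(k.lower()) for k in keywords)

def view(cvs, keywords, year, level):
    """Group the matching CVs of one year by a level of the location hierarchy."""
    cells = defaultdict(list)
    for cv in cvs:
        if cv["year"] != year:
            continue
        score = tf_score(cv["text"], keywords)
        if score > 0:
            cells[cv[level]].append((cv["id"], score))
    # Rank the documents inside each cell by decreasing score.
    return {k: sorted(v, key=lambda p: -p[1]) for k, v in cells.items()}

query = ["java", "sql"]
by_country = view(CVS, query, 2012, "country")  # coarse view: France
by_city = view(CVS, query, 2012, "city")        # drill-down: Paris, Lyon
```

Moving from `by_country` to `by_city` mimics a drill-down along the location dimension: the same ranked documents are redistributed over finer cells, which a keyword-only IR result list does not offer.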
We propose to combine IR techniques with OLAP to better analyze textual data. This combination enables online analysis of large corpora of documents and supports navigation in text cubes for decision-making. IR techniques make it possible to extract relevant information from documents in order to construct textual analysis measures and to define aggregation operators for textual data. Moreover, context (user profile, task, location, time, etc.) affects the type of information relevant to a decision-maker, so contextual information must be taken into account when data warehouses are exploited. Considering contextual factors in Text OLAP provides results that apply directly to the analysis context and generates a data model relevant to OLAP analysis. For instance, a contextual factor representing location allows us to express the fact that a city belongs to a country. While a variety of context-aware applications exist, little work has been done on integrating context into data warehousing systems. Here, we consider contextual factors in Text OLAP.
In this paper, we propose a contextual text cube model called CXT-Cube, associated with contextual dimensions. The fact is observed through a new measure for textual data analysis, based on an adapted vector space model. To calculate the weights of the document concepts, we propose a relevance propagation technique through a concept hierarchy. We also provide a new aggregation operator, denoted ORank, to aggregate the documents during the analysis process, and a query expansion method that exploits the decision-maker's context. Both the proposed textual analysis measure and the aggregation operator take into account the contextual factors defined in our CXT-Cube model. We have evaluated the proposed ORank operator by comparing the results of three cases: text aggregation using a standard IR system; ORank without considering the user context; and ORank considering the two defined contextual factors, document context and user context. This comparison quantifies the improvement provided by our system over classical text warehousing systems, which consider neither the contextual factors nor the relevance propagation technique.
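To give an intuition for relevance propagation through a concept hierarchy, here is a minimal sketch. This section does not state the propagation formula, so the additive rule, the `decay` factor, the `PARENT` hierarchy, and the cosine ranking are all assumptions made for illustration; they should not be read as the definition of the proposed measure or of ORank.

```python
import math

# Hypothetical concept hierarchy, stored as child -> parent.
PARENT = {
    "java": "programming",
    "python": "programming",
    "programming": "computer_science",
}

def propagate(weights, parent, decay=0.5):
    """Add a decayed share of each concept's weight to all of its ancestors.

    The additive rule and the decay factor are illustrative assumptions.
    """
    result = dict(weights)
    for concept, w in weights.items():
        share, node = w, concept
        while node in parent:
            node = parent[node]
            share *= decay
            result[node] = result.get(node, 0.0) + share
    return result

def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

raw_doc = {"java": 2.0, "python": 1.0}  # concept weights found in a document
doc = propagate(raw_doc, PARENT)        # weights after upward propagation
query = {"programming": 1.0}            # query on a more general concept

score_without = cosine(query, raw_doc)
score_with = cosine(query, doc)
```

In this toy setting, a query on the general concept `programming` does not match the raw document vector at all, but it does match once the weights of `java` and `python` have been propagated to their ancestors.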