Strategic Advancements in Utilizing Data Mining and Warehousing Technologies: New Concepts and Developments

David Taniar (Monash University, Australia) and Laura Irina Rusu (La Trobe University, Australia)
Release Date: December, 2009|Copyright: © 2010 |Pages: 448
ISBN13: 9781605667171|ISBN10: 160566717X|EISBN13: 9781605667188|DOI: 10.4018/978-1-60566-717-1

Description

Organizations rely on data mining and warehousing technologies to store, integrate, query, and analyze essential data.

Strategic Advancements in Utilizing Data Mining and Warehousing Technologies: New Concepts and Developments discusses developments in data mining and warehousing as well as techniques for successful implementation. Contributions investigate theoretical queries along with real-world applications, providing a useful foundation for academicians and practitioners to research new techniques and methodologies.

Topics Covered

The many academic areas covered in this publication include, but are not limited to:

  • Cross-selling using data mining techniques
  • Data Warehousing and Data Mining
  • Elasticity of spatial data
  • Gene Expression Data
  • Heterogeneous data warehouse schemas
  • Medical Document Clustering
  • Mining temporal relational patterns
  • OLAP manipulations
  • Ranking systems in relation to data mining
  • Sequential patterns in relation to pattern mining
  • Vertical fragmentation using data-mining techniques

Table of Contents and List of Contributors

Preface

1. Introduction

In today's dynamic business environment, it is increasingly critical that organizations use only high-quality, up-to-date information in order to make successful business decisions. Data warehouses are built with the aim of providing integrated, clean, chronological data. Additionally, they are accompanied by tools which allow business users to query and analyse the warehoused data. These reporting tools are often seen as the end purpose of the data collected in the warehouse. However, data mining applications do more than apply the simple or complex statistical functions often used in reporting; they try to discover interesting hidden information in the collected data by looking at patterns, relationships, clusters of data, outlier values, etc.

The purpose of this book is to present and disseminate new concepts and developments in the areas of data warehousing and data mining. The focus is on the latest research discoveries and proposals in these two areas, in particular on the research trends shaped during the last few years.

The usage of web applications in particular continues to increase dramatically, especially (though not only) in areas where businesses seek customer interaction. The language of choice for web applications is XML (eXtensible Markup Language). It is therefore very important that a data warehouse is built with a supporting technology that allows global information interaction and management; this is the reason for the emergence of XML representation for web data warehousing. The hierarchical nature of XML within an XML data warehouse, and the fact that it represents a document warehouse rather than a traditional table-based relational warehouse, have raised the need for new methodologies for dealing with warehousing issues. Also, performing data mining by extracting knowledge from web (XML) data has proved to be a challenging task. The dynamicity of web data, together with the complexities brought in by XML’s flexible structure, has required new techniques to deal with data mining issues.

This chapter is split into two main sections. The first gives an overview of data warehousing and discusses its importance in supporting the decision-making process, followed by a discussion of state-of-the-art research work in this area. The second presents the importance of extracting profound knowledge from data by employing data mining tools, followed by an analysis of the current trends and research work in this area. As mentioned, web XML data is an increasingly significant presence in various types of applications; therefore, an important fraction of this chapter is dedicated to discussing current research trends in web XML data warehousing and web XML data mining.

2. Current Trends in Data Warehousing

This section first gives an overview of generic data warehousing concepts and then looks at the latest research trends and advancements in this domain.

2.1. Data Warehousing – An Overview

The concept of data warehouse was defined by [Inmon, 2005] as a “[…] subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decisions”. Another definition was given by [Marakas, 2003], that a data warehouse is “[…] a copy of transaction data specifically structured for querying, analysis and reporting”.

It has been shown that for relational database systems, the operational and historical data cannot exist within the same structure because neither of them would perform well. Several reasons have been enumerated to support the need for a separate data warehouse to collect data for decisional support [Han & Kamber, 2006]:

  • A data warehouse is subject-oriented, which means that it is built around the major focus of an application: customer, product, sales etc. Conversely, an operational system contains all data produced by daily transaction processing and, depending on the application, might include data not required by the decision process;
  • A data warehouse might contain information integrated from multiple sources (e.g. relational database, other files of different formats, emails etc), whereas operational systems always contain data produced by the application, most usually in the form of data in relational tables;
  • A data warehouse is time variant and reflects the collection of data during a large period of time (e.g. 5-10 years or more), while operational systems do not keep data for such a long period of time;
  • A data warehouse is used for the decision process only, and not for daily transaction processing. This means that a low number of queries are applied to large volumes of data, compared with operational systems, which are queried very often but for low volumes of data (e.g. to respond to customer enquiries);
  • Finally, a data warehouse is non-volatile, which means that after the warehouse is built, data is uploaded and never deleted; new data arrives periodically and expands the data warehouse; queries are usually applied by the data analysts and therefore a data warehouse does not require concurrent transaction processing features; comparatively, operational systems are updated frequently, should allow multiple users, and hence should allow concurrent transaction processing.

    The generic data warehouse architecture proposed by [Kimball & Ross, 2002] contains the following four main areas:

  • Operational source systems – these contain the transactional data, used in day-to-day operations by a large number of users (possibly hundreds or thousands in large applications). The data in these systems needs to be current and guaranteed up to date, and the priority is high performance and high availability of the information;
  • Data staging area – this is the area where the ETL (Extract – Transform – Load) process takes place. As the name says, the data is extracted from the operational source systems, then it is cleaned, transformed and integrated, and finally it is sent to be loaded into the data warehouse. Note that data staging area does not provide presentation services to the business users, in other words business users are not the direct consumers of the ETL process’ output;
  • Data presentation area – this is where the data is organised and stored, ready to be accessed by the business users. The storage is modelled as a series of data marts, depending on the specific business requirements, all conforming to the data warehouse bus architecture. For more details on data marts and bus architecture we refer the reader to [Kimball & Ross, 2002], whose authors have actually introduced these concepts in the warehousing literature;
  • Data access tools – this is a collection of ad hoc query tools, report writers, data mining applications etc, all designed to query, report or analyse data stored in the data warehouse.
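
The four areas above can be illustrated with a minimal sketch. It is not taken from [Kimball & Ross, 2002]; the row layout, the cleaning rule and the `sales_fact` table are assumptions made purely for illustration, and an in-memory SQLite database stands in for the data presentation area.

```python
import sqlite3

# Hypothetical rows, as they might arrive from an operational source system.
source_rows = [
    {"customer": " alice ", "product": "Book", "amount": "12.50", "date": "2009-01-05"},
    {"customer": "BOB", "product": "book", "amount": "7.00", "date": "2009-01-05"},
    {"customer": "alice", "product": "Pen", "amount": None, "date": "2009-01-06"},
]

def extract(rows):
    """Extract: read the raw records from the source (here, a list in memory)."""
    return list(rows)

def transform(rows):
    """Transform: drop incomplete rows and normalise names and types."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue  # cleaning rule assumed for this sketch
        cleaned.append({
            "customer": row["customer"].strip().title(),
            "product": row["product"].strip().title(),
            "amount": float(row["amount"]),
            "date": row["date"],
        })
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a fact table of the presentation area."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(customer TEXT, product TEXT, amount REAL, date TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales_fact VALUES (:customer, :product, :amount, :date)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(source_rows)), conn)

# A data access tool would then query the presentation area, for example:
totals = conn.execute(
    "SELECT product, SUM(amount) FROM sales_fact GROUP BY product"
).fetchall()
print(totals)
```

Note that the business user only ever sees the result of the final query; the staging functions are internal to the ETL process.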

    The data warehouse architecture described above was proposed for traditional relational databases, where the data is structured and therefore easier to manipulate. In the case of other types of complex data though, this architecture might need some alterations to support the complex scenarios (for example the case of warehousing dynamic and temporal web XML data).

    2.2. Latest trends and advancements in data warehousing

    Nowadays information is very dynamic and many applications create huge amounts of data every day. For example, millions of transactions are completed continuously in large online shops (e.g. Amazon, eBay etc), many banking systems, share markets etc. Companies all over the world are in tight competition to provide better services and attract more clients. Hence, easy customer access to web applications and the ability to perform transactions anytime from anywhere has been the fastest growing feature of business applications in the last few years.

    Moreover, the data itself cannot be labelled as “simple” anymore (that is, numerical or symbolic): it can now be expressed in different formats (structured, unstructured, images, sounds etc), it can come from different sources, or it can be temporal (that is, its structure and/or values change over time) [Darmont & Boussaïd, 2006]. Consequently, different types of storage, manipulation and processing are required in order to manage this complex data. New visions on data warehousing and data mining are therefore required.

    During the last few years we have witnessed a growing amount of research work, driven by the growing size of the data warehouses which need to be built, the increasingly heterogeneous data which needs to be integrated and stored, and the complex tools needed to query it. This section therefore discusses some of the trends in the area.

    2.2.1. Spatial Data Warehousing

    This is an area concerned with integration of spatial data with multidimensional data analysis techniques. The term ‘spatial’ is used in a geographical sense, meaning data that includes descriptors about how objects are located on Earth. Mainly, this is done using coordinates.

    Quite a few types of data can be considered spatial: data obtained via mobile devices (e.g. GPS), geo-sensory data, data about land usage etc. These types of data are collected either by private companies (e.g. mobile data) or by public governmental bodies (e.g. land data). Because this type of information could be used to make security decisions, spatial data warehousing becomes a key technique in enabling access to data and data analysis for decision-making support.

    Spatial data warehousing can be seen as an integration of two main techniques: spatial data handling and multidimensional data analysis [Damiani & Spaccapietra, 2006].

    Spatial data handling can be done using two types of systems:

  • Spatial database management systems – these extend the functionality of regular DBMS by including features for efficient storage, manipulation and querying of spatial data. Two such commercial spatial DBMS are Oracle Spatial and IBM DB2 Spatial Extender [Damiani & Spaccapietra, 2006]. Note that spatial DBMS are not for direct end-user usage, but would be interrogated by database specialists to produce reports, various analyses etc;
  • GIS (Geographical Information System) is an integration of computer programs written to read and represent information from a spatial DBMS, and present it in a nice visual way to the end-user. Note that, in this case the end-user would be the direct consumer of the GIS output, without the need of a database specialist’s help.

    Multidimensional data analysis is a leading technique in many domains, where various quantitative variables are measured against dimensions, such as time, location, product etc. Information is stored in cubes, at the intersection between selected dimensions, and offers the possibility of analysis by drilling-down, rolling-up, slicing, dicing etc.
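
The cube operations mentioned above can be sketched in a few lines. This is only an illustration under assumed data: the dimension names, the cell values and the helper functions `roll_up` and `slice_cube` are hypothetical and do not correspond to any particular OLAP product.

```python
from collections import defaultdict

DIMENSIONS = ("year", "city", "product")

# Each cell of the cube: (year, city, product) -> sales amount (made-up values).
cube = {
    (2008, "Melbourne", "Book"): 10,
    (2008, "Sydney", "Book"): 20,
    (2009, "Melbourne", "Book"): 30,
    (2009, "Melbourne", "Pen"): 5,
}

def roll_up(cube, keep):
    """Aggregate the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    result = defaultdict(int)
    for key, value in cube.items():
        result[tuple(key[i] for i in positions)] += value
    return dict(result)

def slice_cube(cube, dimension, value):
    """Fix one dimension to a single value and return the matching cells."""
    position = DIMENSIONS.index(dimension)
    return {key: v for key, v in cube.items() if key[position] == value}

by_year = roll_up(cube, keep=("year",))               # coarser view
by_year_city = roll_up(cube, keep=("year", "city"))   # drilling down from by_year
melbourne = slice_cube(cube, "city", "Melbourne")     # one value of one dimension

print(by_year)
print(by_year_city)
print(melbourne)
```

Drilling down is simply the reverse of rolling up (keeping more dimensions), and dicing can be obtained by applying `slice_cube` on several dimensions in turn.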

    By integrating spatial data with multidimensional data analysis techniques, spatial information can be studied from different perspectives, including a spatial dimension. This integration is already very powerful in existing business systems, for example in warehousing enterprise information, where localisation of data could be decisive.

    Nevertheless, spatial data warehousing is still a young research area, where multidimensional models are yet to be determined and implemented. This area is a step behind the business domains because spatial data is peculiar and complex, and spatial data management technology has only recently reached maturity [Damiani & Spaccapietra, 2006].

    Research work in this area started ten years ago with the introduction of the concepts of ‘spatial dimension’ and ‘spatial measure’ by [Han et al, 1998]. Following that, more recent research literature in spatial data warehousing includes: Rivest et al (2001), Fidalgo et al (2004), Scotch & Parmanto (2005), Zhang & Tsotras (2005) and many others. As at 2006, a comprehensive and formal data model for spatial data warehousing was still a major research issue. The authors of [Damiani & Spaccapietra, 2006] proposed a novel spatial multidimensional data model for spatial objects with geometry, called Multigranular Spatial Data Warehouse (MuSD), in which they suggest representing spatial information at multiple levels of granularity.

    Future work and trends identified in spatial data warehousing are related to the storage, manipulation and analysis of complex spatial objects. These are objects that cannot be represented using geometries such as lines, polygons etc, because they are spatio-temporal and continuous objects: an example is the trajectory of a moving entity. Research is nowadays focused on obtaining summarised data out of databases of trajectories, by using the concept of similarity and proposing new methods of measuring this similarity.

    2.2.2. Text Data Warehousing

    Recently, this area has attracted more and more research interest, because in an enterprise setting the data which needs to be stored in a data warehouse does not usually come only from the operational database systems, but also from a range of other sources (such as email, documents, reports). Generally, it can be said that in an enterprise environment information lives in structured, unstructured and semi-structured sources. In order to integrate data from structured systems (relational databases) with semi-structured or unstructured data, current approaches use Information Retrieval techniques. Other approaches propose to use Information Extraction paradigms. We present here some of the research work which uses these approaches.

    In order to incorporate text documents into a data warehouse where the data from structured systems is also stored, the same components and steps of the warehousing process, including ETL (Extract-Transform-Load), need to be followed [Kimball & Ross, 1996; Kimball & Ross, 2002]. The source documents need to be identified, then the documents need to undergo some transformations (e.g. stripping emails of their headers and storing them as separate entities, or stripping documents of their formatting and storing only the text component); eventually, the documents are physically moved into the data warehouse.

    For each document, two components need to be stored in the data warehouse: the document itself and the metadata (that is, information about the document which can be read by the computer, e.g. size, title, subject, version, author etc). The metadata is stored in the data warehouse in a separate section, called the metadata repository. Kimball (2002) deems that, for highly complex metadata, even a small star schema can be constructed, where the fact table would store the actual documents, while the different types of metadata would be stored in dimensions.

    To store the document content itself, Information Retrieval approaches treat it as a “bag of words”. Each word from the document is scanned and tokenised, the linking and stop words are removed, and the output of the procedure is a list of words, where each word receives a weight based on the number of its appearances in the initial text. The output is used to build a so-called “inverted index”: the list of terms is sorted and, for each term, a list of the documents which contain that term is kept. Other, more complex indexes also keep the number of appearances, or the position(s) where the term appears, in order to support proximity queries [Badia, 2006]. In a vector-space approach, queries are represented as vectors of query terms (where the non-content words have been removed), and answering those queries means finding the document term vectors which are closest to the query vector.
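
The bag-of-words, inverted index and vector-space ideas described above can be sketched as follows. The three documents, the tiny stop-word list and the use of raw term counts with cosine similarity are assumptions chosen for illustration; they are not taken from [Badia, 2006], and real systems typically use richer weighting schemes.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}  # tiny illustrative list

docs = {  # hypothetical document collection
    "d1": "The data warehouse stores integrated data",
    "d2": "Data mining discovers patterns in the data",
    "d3": "The warehouse of documents",
}

def term_weights(text):
    """Bag of words: tokenise, drop stop words, weight each term by its count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

weights = {doc_id: term_weights(text) for doc_id, text in docs.items()}

# Inverted index: for each term, the list of documents which contain it.
inverted_index = defaultdict(list)
for doc_id in sorted(weights):
    for term in weights[doc_id]:
        inverted_index[term].append(doc_id)

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query):
    """Rank the documents by closeness of their vector to the query vector."""
    q = term_weights(query)
    scores = {doc_id: cosine(q, w) for doc_id, w in weights.items()}
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: -scores[d])

print(inverted_index["warehouse"])
print(search("data mining"))
```

Because only the literal terms are compared, a query for "storehouse" would not match any document here, which is exactly the kind of limitation discussed next.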

    The IR (Information Retrieval) approaches are criticised because they only utilise the physical terms appearing in the documents, while these terms are “(at best) second-order indicators of the content of a document”. Moreover, the vector-space approach has some issues related to the usage of words as such, especially where synonyms, homonyms, and other relationships can appear (research work has proposed to solve these issues by employing the concept of a ‘thesaurus’) [Badia, 2006].

    Information Extraction (IE) is another approach employed to solve the text data warehousing problem. In this case, the input collection of documents is analysed using “shallow parsing” to perform entity extraction (to determine which entities are referred to in the text), link extraction (to determine which entities are in any sort of relationship) and event extraction (to determine what events are described by the entities and links discovered). The information extracted would then be integrated into a data warehouse.

    It is possible that IR-oriented techniques will dominate the text warehousing area for a while, because there is still a lot of research work proposing solutions to deal with the identified issues. However, it is predicted that IE approaches will see rapid growth in the near future, fuelled by the boost in requests for text mining and consequently for more efficient text warehousing techniques [Badia, 2006].

    2.2.3. Web Data Warehousing

    As mentioned in the Introduction section, web data has an ever-increasing presence, because the World Wide Web offers the foundation on which many large-scale web applications are developed.

    The language of choice for web applications is XML (eXtensible Markup Language). It is a well-known fact that the industry has lately started to use XML heavily in a wide range of areas, such as data exchange, as a vehicle for straight-through-processing (STP) in the online banking system, for data exchange between various layers in a multi-tiered system architecture, for the development of web services, for the standard representation of business requirements for reporting purposes, etc. The domains where XML is used are very diverse, from technology to financial services, medical systems, bioinformatics, etc. Consequently, we have witnessed a rapid increase in the amount of information being represented in XML format, which has triggered the need for researchers to investigate a number of issues associated with XML storage, querying and analysis.

    One area of research is XML warehousing. It has been predicted that soon most of the stored data will be in XML format [Pardede E., 2007]. Other authors also predict that XML will become the ‘lingua franca’ of the web [Mignet et al, 2003]. It is therefore critical that efficient and scalable XML warehousing techniques should exist. At the same time, the great flexibility of the XML format for data representation and the dynamicity of available XML data increase the difficulty which is naturally associated with the task of storing huge amounts of information. In this section we first look at the growing popularity of XML, and then at some issues and research requirements brought in recently by the XML warehousing.

    XML format popularity

    A study dated 2003 concluded that, despite its infancy at that time, XML had already permeated the web [Mignet et al, 2003]. That is, at the time of the study, XML documents could be found in all major internet domains and in all geographic regions of the globe. In 2003, the ‘.com’ and ‘.net’ domains combined contained 53% of the documents and 76% of the volume of the XML content on the web.

    Moreover, only 48% of the documents referenced a Document Type Definition (DTD) [W3C-XML 1.0] and a surprising figure of only 0.09% of the analysed XML documents referenced an XML schema.

    The same study showed that XML documents were generally small (with an average size at that date of 4 kB) and shallow (with 99% of the XML documents having fewer than 8 levels of element nesting). Moreover, the volume of markup tags was very large compared with the actual content of the documents.
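
To make the two measures above concrete, the following sketch computes the nesting depth of an XML document and the share of its characters taken up by markup. The sample document and the function name `nesting_depth` are hypothetical, and the calculation is only an approximation of the kind of statistics reported in [Mignet et al, 2003], not a reproduction of that study's method.

```python
import xml.etree.ElementTree as ET

def nesting_depth(element):
    """Number of element levels in the tree rooted at `element` (a lone root is 1)."""
    return 1 + max((nesting_depth(child) for child in element), default=0)

# Hypothetical sample document.
sample = "<order><customer><name>Alice</name></customer><total>12.50</total></order>"
root = ET.fromstring(sample)

depth = nesting_depth(root)

# Share of characters that are markup rather than text content.
content_chars = len("".join(root.itertext()))
markup_share = 1 - content_chars / len(sample)

print(depth, round(markup_share, 2))
```

Even in this tiny example most of the characters are tags, which mirrors the observation that markup often outweighs content.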

    Naturally, things have evolved since 2003, and nowadays the volume of data formatted using XML is even higher. While there are no certain figures on XML data size and usage distribution as at 2009 (the current year), the extent to which XML is used as a standard format for data exchange and representation in many domains indicates its wide usage nowadays. A few representative examples are as follows:

  • GPX (the GPS Exchange Format) – this is a light-weight XML data format for the interchange of GPS data between applications and Web services on the Internet [GPX, 2008];
  • HL7 (Health Level Seven) – this is an ANSI-accredited organisation in the domain of clinical and administrative data, which has used XML since 1996. A special ‘XML interest group’ exists within HL7, with the declared objective of employing XML to create standardised templates of healthcare documentation [HL7, 2008];
  • FIX (Financial Information eXchange) – this is a standard communication protocol for equity trading data. The FIXML (FIX Markup Language) creates business messages for the FIX protocol using XML format [Ogbuji, 2004];
  • XBRL (eXtensible Business Reporting Language) – this is an XML-based specification for preparation and exchange of financial reports and data [Ogbuji, 2004];
  • XPDL (XML Process Definition Language) – this uses XML to represent the BPM (Business Process Management) processes. It is considered the most widely deployed process definition language, and is used in many applications such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), BI (Business Intelligence), workflow, etc [WfMC, 2008];
  • BPEL (Business Process Execution Language) – this is an entirely different standard, but complementary to XPDL. BPEL is an ‘execution language’ which uses XML to provide a definition of web services orchestration (the executable aspects of a business process, when the process is dealing with web services and XML data) [WfMC, 2008].

    The above list gives just a small number of examples indicating how widely used XML is nowadays.

    XML data warehousing

    In line with the increasing use of XML documents for the exchange of information between different legacy systems and for representing semi-structured data in web applications, XML data warehousing has started to pose diverse issues to the researchers and practitioners [Buchner et al., 2000; Marian et al., 2001; Xyleme, L., 2001]. As in traditional data warehousing, the design of the most suitable and efficient structural storage systems has been one of the main focuses of XML data warehousing.

    In a number of previous research works we proposed to distinguish between static and dynamic XML documents [Rusu et al, 2005; Rusu et al, 2006; Rusu et al, 2008], based on the representation of temporal coverage and persistency of the rendered information. In other words, a static XML document does not contain an element which indicates how long the document is valid. Conversely, a dynamic XML document inherently contains at least one element which indicates the temporal coverage of the specific version of the document. Hence, the analysis of the state of the art work in the area of XML data warehousing can be also split in two: first we will discuss the problem of warehousing data from static XML documents, and then we will examine existing work and issues identified in warehousing data from dynamic (multi-versioned) XML documents.

    Warehousing static XML documents first became a research focus as early as 2001-2002. A few major directions have started to emerge, as follows:

  • Design of XML warehouses based on user requirements and needs - In this case, researchers look at the issue of warehousing XML documents by taking the user’s point of view. That is, the user’s needs in terms of possible future queries on the XML data warehouse are considered to be most critical [Zhang et al., 2003; Boussaid et al., 2006];
  • Design of XML warehouses based on XML schema or Document Type Definition (DTD) - In this case, the structure of the XML documents is considered to be the most important aspect which needs to be carried into the XML warehouse’s structure. That is, the parent-child and sibling relationships in the XML structural hierarchy also need to be found in the final XML data warehouse. A few works in this area are [Golfarelli et al., 2001; Vrdoljak et al., 2003; Pokorny, 2001; Pokorny, 2002];
  • Conceptual design of XML document warehouses - Conceptual modelling is widely used for database design or software engineering. The most frequently used methods are ER (entity-relationship) diagrams, data flow diagrams, system flowcharts, UML (Unified Modeling Language), etc. Two works which present a conceptual design for an XML warehouse are [Nassis et al., 2004] and [Nassis et al., 2006];
  • Design of XML warehouses as XML data summary - Some researchers believe that the amount of XML data which needs to be warehoused will increase so much that it will be almost impossible to warehouse and query. Therefore, they propose to warehouse a summary of the XML data instead of the actual data, in such a way that the result of queries is not affected [Comai et al., 2003];
  • Analysis of XML warehouses using OLAP - In this case, researchers look at traditional multidimensional warehousing, which has fact document(s) and dimensions, and try to apply the same OLAP techniques, traditionally applied to relational data, to the warehoused XML data. Several works in this category are [Huang and Su, 2002; Hummer et al., 2003; Park et al., 2005];
  • Integrated XML document management - In this case, researchers investigate how different XML documents can be integrated with other type of data (text documents, media files etc), or how the information from geographically dispersed XML warehouses can be combined. Two works in this category are [Hsiao et al., 2003; Rajugan et al., 2005].

    However, during the last several years, it has become clear that, due to the specific characteristics of the dynamic XML documents (temporality, unpredictability), they require a different warehousing solution from the static XML documents.

    Research work in this area is still at an early stage. In our previous work [Rusu et al, 2006a], we proposed a framework for warehousing dynamic (multi-versioned) XML documents. Three main stages are involved in the warehousing process in this framework: Stage A identifies the changes between the incoming successive versions of the dynamic XML documents, and its output is a collection of historical data and changes; in Stage B the data is cleaned and integrated; and in Stage C the final XML data warehouse is built by constructing the fact XML document(s) and the required dimensions. Note that the resultant warehouse is native XML, where all documents are stored in XML format.
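
As a rough illustration of what change identification between successive versions (Stage A above) involves, the sketch below compares two versions of a small XML document. It is not the algorithm of [Rusu et al, 2006a] nor the ‘consolidated delta’ of [Rusu et al, 2005]: the flat document layout, the `id` attribute used to match elements and the function names are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def index_by_id(xml_text):
    """Map each child's (assumed) id attribute to its text content."""
    root = ET.fromstring(xml_text)
    return {child.get("id"): (child.text or "").strip() for child in root}

def detect_changes(old_xml, new_xml):
    """Classify the children as inserted, deleted or updated between two versions."""
    old, new = index_by_id(old_xml), index_by_id(new_xml)
    return {
        "inserted": sorted(k for k in new if k not in old),
        "deleted": sorted(k for k in old if k not in new),
        "updated": sorted(k for k in new if k in old and new[k] != old[k]),
    }

# Two hypothetical successive versions of the same document.
v1 = '<catalogue><item id="a">10</item><item id="b">20</item></catalogue>'
v2 = '<catalogue><item id="a">12</item><item id="c">30</item></catalogue>'

changes = detect_changes(v1, v2)
print(changes)
```

A realistic change detector would also have to handle nested elements, moved subtrees and elements without stable identifiers, which is where the approaches cited below differ.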

    Other works in the same area focus mostly on the management of changes between multiple versions of XML documents, and propose some storage policies for dynamic (multi-versioned) XML documents, as follows:

  • Representing changes from dynamic (multi-versioned) XML documents - In this case, researchers looked at how the changes between consecutive versions of multi-versioned XML documents can be calculated and represented [Marian et al., 2001; Cobena et al., 2002; Chien et al., 2001; Chien et al., 2002; Wang et al., 2003]. In our earlier work [Rusu et al, 2005], we proposed a different solution to represent changes from multi-versioned XML documents, by using the concept of ‘consolidated delta’, which responds better to versioning queries;
  • Storage policies for dynamic (multi-versioned) XML documents – This focuses on finding efficient ways of storing data from multiple versions of XML documents, considering the constraints of similarity between consecutive versions and hence the possible high redundancy of information. Some proposals in this area are [Marian et al., 2001; Chien et al., 2002; Wang & Zaniolo, 2003]. Four different storage techniques were critically analysed and compared in [Rusu et al, 2008a] based on two efficiency indicators.

    3. Current Trends in Data Mining

    This section will first give a high-level overview of data mining, followed by a discussion of some of the latest trends and research work advancements in this domain.

    3.1. Data Mining – An Overview

    As mentioned earlier in this chapter, present-day data cannot be labelled as ‘simple’ anymore. The more heterogeneous, complex and peculiar the data in various domains becomes, the more intelligent the techniques required to mine it and extract interesting knowledge from it.

    Data mining is a compilation of techniques, methods and algorithms used to extract knowledge hidden in huge amounts of data; it is therefore much more than a list of statistical formulas applied to a collection of data. A few major groups of data mining tasks are [Fayyad et al, 2000]:

  • Predictive modelling (classification, regression)
  • Descriptive modelling (clustering)
  • Pattern discovery (discovering association rules, frequent sequences etc)
  • Dependency modelling and causality (graphical models, density estimation)
  • Change and deviation detection/modelling

    Each data mining task would usually reveal a different type of interesting knowledge from the mined data; hence, at any one point, only one task is applied in order to find a specific type of information. For example, the task of discovering relationships between data items would require an association rules algorithm, while grouping data into sets (clusters) based on their similarity would require a clustering algorithm. Finding the actual algorithm which best suits a specific problem is a challenge, especially in very complex applications.
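
For the association rules example mentioned above, the two standard measures of a rule, support and confidence, can be computed as in the sketch below. The transactions are made up for illustration, and a full association rules algorithm would additionally search the space of candidate itemsets rather than evaluate a single rule.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions which contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})

print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually reported only if both measures exceed user-chosen thresholds, which is what makes the choice of algorithm and parameters problem-specific.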

    Also, the point has been raised [Fayyad et al, 2000] that research studies of data mining tasks assume a collection of data which is cleaner, better ordered and more correctly labelled than what can be found in real-world applications. Hence, there is a growing need to apply data mining tools to real-world data from various domains and of various levels of complexity.

    3.2. Latest trends and advancements in data mining

    Current research work in data mining focuses on complex applications of data mining tasks; this section therefore discusses some of the trends in this area.

    3.2.1. Knowledge discovery from genetic and medical data

    During the last few years, bioinformatics has attracted a lot of attention, from both biologists and data mining specialists. The most critical application in this domain has been extracting patterns (that is, full biological information) from gene expression data.

    Typical tasks in gene expression data mining include clustering, classification and association rules discovery, on extensive microarray data [Han et al, 2006]:

  • Clustering techniques can be used to identify genes which are co-regulated in a similar manner under different experimental conditions. One-way clustering groups the genes separately, while two-way clustering groups the genes by taking the relationship between them into consideration.

    Some of the work in gene expression clustering is focused on: K-ary clustering [Bar-Joseph et al, 2003], dynamically growing self-organisation trees [Luo et al, 2004], fuzzy K-means clustering [Gasch & Eisen, 2002], and many others.

  • Classification helps in identifying differences between cells, for example healthy (normal) cells from cancer cells. Most current approaches attempt to differentiate groups of genes with similar functionality in an unsupervised mode (that is, without prior knowledge of their true functional classes).

    Some of the work in gene expression classification is focused on: support vector machines [Furrey, 2000], Bayesian classification [Moler et al, 2000], emerging patterns [Boulesteix et al, 2003], and many others.

  • Association rules can show how some genes influence the expression of other genes in regulatory pathways.

    Some of the work in discovering association rules in gene expression data investigates distance based association rule mining [Icev et al, 2003], finding association rule groups [Cong et al, 2004], frequent pattern mining [Pan et al, 2003], etc.
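
    As an illustration of the clustering task described above, the following Python sketch groups genes whose expression profiles are similar across experimental conditions. It is a plain k-means with a fixed, deterministic start, not any of the K-ary, self-organising tree or fuzzy K-means methods cited; the gene names and expression values are invented for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic start: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical expression levels: one row per gene, one column per condition.
genes = ["g1", "g2", "g3", "g4"]
expression = [
    [0.1, 0.2, 0.1],   # g1: low under all conditions
    [5.0, 5.2, 4.9],   # g2: high under all conditions
    [0.2, 0.1, 0.3],   # g3: low, similar to g1
    [4.8, 5.1, 5.0],   # g4: high, similar to g2
]

labels, _ = kmeans(expression, k=2)
clusters = {}
for gene, label in zip(genes, labels):
    clusters.setdefault(label, []).append(gene)
```

    This corresponds to one-way clustering, since only the genes are grouped; the conditions are treated purely as coordinates.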

    3.2.2. Text mining

    This category of data mining focuses on extracting useful patterns from text data. Text is the most readily available means for everyday people to store and exchange data: an important part of the information exchanged in any company travels via e-mails, reports, notes, memos, etc. Hence, a lot of the specialists’ knowledge might never be stored formally in a database or a document storage system, but can remain ‘hidden’ in piles of unstructured or semi-structured text. Moreover, we witness every day an exponential growth of the text information publicly available on the web, in electronic databases, libraries, discussion forums etc.

    Due to the widespread use and exponential growth of this type of information, manual analysis would be unrealistic. Text mining therefore aims to automate the searching, filtering or clustering of large amounts of text data (text or hypertext databases) in order to make use of the information hidden within it.

    Several possible applications of text mining are as follows (the list is not exhaustive):

  • Analysis of customer profile – this can be extracted from complaints, suggestion/feedback forms, etc;
  • Personalised information service – for example the distribution of newsletters, invitations etc based on customer profile;
  • Personal or national security – scanning emails and messages to identify spam, information leakage, possible threats etc;
  • Plagiarism detection – identifying text repeated in multiple publications, without acknowledging the contribution of the original author.
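
    The plagiarism detection application listed above can be sketched in a few lines of Python. The example compares two texts by the overlap of their word trigrams (often called shingles), measured with the Jaccard coefficient. This is one simple, commonly used technique and is assumed here only for illustration; the sample sentences and function names are hypothetical, and production systems typically add normalisation, hashing and indexing on top of this idea.

```python
import re

def shingles(text, n=3):
    """Set of overlapping word n-grams taken from lower-cased text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical sample texts.
original = "Data mining extracts knowledge hidden in huge amounts of data."
suspect = "As noted, data mining extracts knowledge hidden in huge amounts of text."
unrelated = "The warehouse stores clean, integrated chronological records."

copied_score = jaccard(shingles(original), shingles(suspect))
unrelated_score = jaccard(shingles(original), shingles(unrelated))
```

    A high score for the first pair and a zero score for the second reflect that the suspect sentence reuses a long run of the original wording, while the unrelated sentence shares no trigram with it.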

    During the last few years, academia has put a lot of effort into text mining research. For example, a Text-Mining Research Group has been established at the University of West Bohemia (Czech Republic) with the long-term objective of creating “a robust system to extract knowledge from semi-structured data in a multi-language web environment in order to infer new information / knowledge that is not contained explicitly in the original data” [TMRG, 2009]. Their research work has covered automatic text summarization [Ježek & Steinberger, 2008; Steinberger, 2007], plagiarism detection [Ceška, 2008; Ceška et al, 2008], detection of authoritative sources [Fiala, 2007], etc.

    As people gain easier access to various resources on the web, future work in text mining will very likely continue to centre on plagiarism detection, social network mining, filtering and classifying web sites based on their text content, etc.

    3.2.3. XML data mining

    Another emerging area of data mining research is XML mining. The effort of storing a huge amount of XML data in an effective warehouse structure should not be rewarded only by a number (albeit a high number) of successful queries applied to the warehoused data. Hence, researchers have identified an opportunity to consider various mining tasks which could be applied to the XML data to discover hidden interesting information such as association rules, clusters, classification, frequent patterns etc.

    XML mining includes both mining of structures and mining of contents from XML documents [Nayak et al., 2002; Nayak, 2005]. Mining of structure is seen as essentially mining the XML schema, and it includes intra-structure mining and inter-structure mining. Mining of content consists of content analysis and structural clarification.

  • Intra-structure mining is concerned with mining the structure inside an XML document, where tasks of classification, clustering or association rule discovery could be applied;
  • Inter-structure mining is concerned with mining the structures between XML documents, where the applicable tasks could be clustering schemas and defining hierarchies of schemas on the web, and classification applied with namespaces and URIs – Uniform Resource Identifiers;
  • Content analysis is concerned with analysing texts within the XML document;
  • Structural clarification is concerned with determining similar documents, based on their content.
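
    The distinction between the structure and the content of an XML document can be illustrated with Python's standard `xml.etree.ElementTree` module. In the sketch below, the structure of a document is approximated by its set of root-to-element tag paths and its content by its text fragments; two documents are then compared structurally by the Jaccard overlap of their path sets. The path-set representation, the function names and the sample documents are assumptions made for this example and are not taken from the cited works.

```python
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Structure view: the set of root-to-element tag paths in a document."""
    paths = set()

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        paths.add(path)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths

def text_content(xml_text):
    """Content view: all non-empty text fragments, in document order."""
    root = ET.fromstring(xml_text)
    return [t.strip() for t in root.itertext() if t.strip()]

def structural_similarity(doc_a, doc_b):
    """Jaccard overlap of the two documents' tag-path sets."""
    a, b = tag_paths(doc_a), tag_paths(doc_b)
    return len(a & b) / len(a | b)

# Hypothetical documents used only for illustration.
book = "<book><title>XML Mining</title><author>Ann</author></book>"
book2 = "<book><title>Warehousing</title><year>2009</year></book>"

paths = tag_paths(book)
texts = text_content(book)
similarity = structural_similarity(book, book2)
```

    Structure mining tasks would operate on representations such as `paths`, while content analysis would operate on `texts`, for instance with the text mining techniques of the previous section.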

    As mentioned previously in the ‘XML data warehousing’ section, XML documents can be static or dynamic. Therefore, during the last few years, research work in mining XML documents has focused on mining both static and dynamic (multi-versioned) XML data, for example: discovering association rules, clustering, finding frequent structural patterns, mining changes between XML document versions etc. We mention here a few of the works in these domains:

  • Discovering association rules from static XML documents - Most of the work done in the area of mining association rules from static XML documents uses XML-oriented algorithms based on the Apriori algorithm [Agrawal, 1993, 1998]. However, a number of non-Apriori-based approaches have also been developed. A few works worth mentioning here are [Wan & Dobbie, 2003; Wan & Dobbie, 2004; Braga et al., 2002a; Braga et al., 2002b; Braga et al., 2003; Feng et al., 2003] and others;
  • Clustering static XML documents - this type of clustering focuses on grouping XML documents in clusters based on their similarity. Clustering can be based on structural similarity [De Francesca et al., 2003; Liang et al., 2004; Costa et al., 2004], or structural and semantic similarity [Yoon et al., 2001; Shen & Wang, 2003]. Some XML clustering techniques are distance-based [Nierman & Jagadish, 2002; Dalamagas et al., 2004; Dalamagas et al., 2006; Xing et al., 2007];
  • Discovering association rules from dynamic (multi-versioned) XML documents - The work in this area is still in its infancy, and only a limited number of existing works have addressed the issue of discovering association rules from multi-versioned XML documents. A few existing proposals in this area are [Chen et al., 2004; Rusu et al, 2006b; Rusu et al, 2006c];
  • Clustering dynamic (multi-versioned) XML documents – This focuses on techniques to re-cluster collections of XML documents after some of the documents have changed. One work which proposes such a technique is [Rusu et al, 2008b]. Also, some work has been done on clustering a series of XML documents, when each incoming XML document is different from the previous one and needs to be placed in the correct cluster. One work in this category is [Nayak and Xu, 2006];
  • Extracting patterns of structural changes from XML documents – Works of this type try to extract frequently changing structures from sequences of versions of dynamic XML documents, or sequences of deltas (that is, differences between versions). A few works in this category are [Zhao et al., 2004a; Zhao et al., 2004b; Zhao et al., 2006].
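
    The idea of extracting frequently changing structures from a sequence of versions can be sketched as follows. Each version is reduced to its set of root-to-element tag paths, the delta between two consecutive versions is taken as the set of paths present in exactly one of them, and a path is reported when it appears in at least a minimum number of deltas. This is a deliberately simplified illustration under our own assumptions (the function names, the `min_changes` threshold and the sample versions are hypothetical); it captures only insertions and deletions of paths and is not the algorithm of any of the works cited above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_paths(xml_text):
    """Set of root-to-element tag paths in one document version."""
    paths = set()
    stack = [(ET.fromstring(xml_text), "")]
    while stack:
        element, prefix = stack.pop()
        path = f"{prefix}/{element.tag}"
        paths.add(path)
        stack.extend((child, path) for child in element)
    return paths

def frequently_changing(versions, min_changes=2):
    """Paths inserted or deleted in at least `min_changes` consecutive-version deltas."""
    changes = Counter()
    snapshots = [tag_paths(v) for v in versions]
    for old, new in zip(snapshots, snapshots[1:]):
        changes.update(old ^ new)  # delta: paths present in exactly one version
    return {path for path, count in changes.items() if count >= min_changes}

# Hypothetical version history of one catalogue document.
versions = [
    "<shop><item><price/></item></shop>",
    "<shop><item><price/><promo/></item></shop>",
    "<shop><item><price/></item></shop>",
    "<shop><item><price/><promo/></item><news/></shop>",
]

hot_paths = frequently_changing(versions)
```

    In this history the `promo` element is added, removed and added again, so its path is reported, whereas the `news` element changes only once and falls below the threshold.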

    In our opinion, future work in XML mining will continue the research trends in discovering patterns and frequently changing structures from dynamic (multi-versioned) XML documents, and will also need to look at mining tasks which have not been extensively researched so far, such as clustering or classifying dynamic XML documents.

    4. Conclusions

    The purpose of this chapter was to highlight some of the trends and advancements made by the research work in data warehousing and data mining. We showed that these trends are not limited to one area, but they are spread across multiple domains. That is, there is extensive research work in warehousing and mining text, spatial or medical data, and also in warehousing and mining XML data, which is increasingly used by many business applications.

    Certainly, the applicability of data warehousing and mining research is not limited to the domains mentioned in this chapter. The other chapters of this book give more detail on such trends and advancements in other modern areas.

    Author(s)/Editor(s) Biography

    David Taniar received his PhD in Databases from Victoria University (Australia, 1997) and is now a Senior Lecturer at Monash University (Australia). He has published more than 100 research articles and edited a number of books in the Web technology series. He is on the editorial board of a number of international journals, including Data Warehousing and Mining, Business Intelligence and Data Mining, Mobile Information Systems, Mobile Multimedia, Web Information Systems, and Web and Grid Services. He has been elected as a Fellow of the Institute for Management of Information Systems (UK).
    Laura Irina Rusu completed her PhD in 2008 at La Trobe University, Australia, with a thesis on XML data warehousing and mining. Before that, she received a Master in Quantitative Economy degree (1997) and a Bachelor in Computer Science degree (1996), both from the Academy of Economic Sciences - Bucharest, Romania. Currently she is a Postdoctoral Research Fellow at La Trobe University, and her research interests are in dynamic XML data warehousing and mining, partitioning techniques for XML warehouses, and data migration to domain-specific XML standards.

    Editorial Board

  • Frank Dehne, Carleton University, Canada
  • Ada Wai-Chee Fu, Chinese University of Hong Kong, Hong Kong
  • Ee-Peng Lim, Singapore Management University, Singapore
  • Feng Ling, Tsinghua University, China
  • Graeme Shanks, The University of Melbourne, Australia
  • Chengqi Zhang, University of Technology, Australia