Data mining is aimed at the extraction of interesting (i.e., nontrivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. In order to make data mining manageable, data mining has to be database centered. Yet, data mining goes beyond the traditional realm of database techniques; in particular, reasoning methods developed from machine learning techniques and other fields in artificial intelligence (AI) have made important contributions in data mining. Data mining thus offers an excellent opportunity to explore the interesting fundamental issue of the relationship between data and knowledge retrieval and inference and reasoning. Decades ago, researchers made an important remark stating that since knowledge retrieval must respect the semantics of the representation language, knowledge retrieval is a limited form of inference operating on the stored facts (Frisch & Allen, 1982). The inverse side of this statement has also been explored, which views inference as an extension of retrieval. For example, Chen (1996) described a computer model that is able to generate suggestions through document structure mapping based on the notion of reasoning as extended knowledge retrieval; the model was implemented using a relational approach. However, although the issue of foundations of data mining has attracted much attention among data mining researchers (ICDM, 2004), little work has been done on the important relationship between retrieval and inference (or mining). A possible reason of lacking such kind of research is the difficulty of identifying an appropriate common ground that can be used to examine both data retrieval and data mining.
Just like a B Tree, an R Tree (Guttman, 1984) relies on a balanced hierarchical structure, in which each tree node is mapped to a disk page. However, whereas B Trees are built on single-value keys and rely on a total order on these keys, R Trees organize rectangles according to a containment relationship. Each object to be indexed will be represented by a minimum bounding box (MBB) in the index structure except the point for which an MBB simply degrades to a point. All indexed objects will eventually be put in leaf nodes. A leaf node contains an array of leaf entries. A leaf entry is a pair (mbb, oid), where mbb is the MBB and oid is the object ID. Each internal node is associated with a rectangle, referred to as the directory rectangle (dr), which is the minimal bounding box of the rectangle of its child nodes. The structure of R Tree satisfies the following properties.
Key Terms in this Chapter
Principal Component Analysis (PCA): PCA performs orthogonal linear transformation, which transforms the data to a new coordinate system to reduce multidimensional data sets to lower dimensions for analysis.
Spatial Index: Spatial indexes are used by databases involving spatial data to optimize spatial queries. Indexes used by nonspatial databases cannot effectively handle features such as how far two points differ and whether points fall within a spatial area of interest.
Cluster Analysis: The concept of cluster refers to a collection of data objects where data objects within the same cluster are similar to one another while dissimilar to the objects in other clusters. Cluster analysis is aimed at finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters.
Data Mining: As a key step in the knowledge discovery from data (KDD) process, it is intended to extract interesting (nontrivial, implicit, previously unknown, and potentially useful) patterns. Inference: The act or process of deriving a conclusion from stored data or known facts. While cognitive psychology studies human inference, automated inference algorithms have been studied in artificial intelligence and in its subfield machine learning.
Retrieval: The act of finding previously stored data in an information system to answer a query (which represents a user’s information need). Retrieval can be done on structured data (such as in database management systems), unstructured data (such as in information retrieval), or multimedia data (such as image retrieval or music retrieval). R* Trees: R* tree is a variant of R tree for indexing spatial information. Minimization of both coverage and overlap is crucial to the performance of R trees. The R* tree attempts to reduce both, using a combination of a revised node-split algorithm and the concept of forced reinsertion at node overflow.
Similarity Retrieval: Unlike conventional data retrieval, similarity retrieval is aimed so that, given a particular query, data objects sharing certain commonality with the given query (rather than identical with the given query) will be retrieved. Similarity retrieval is widely used in information retrieval and multimedia retrieval (such as retrieval of images similar to a given image).