Efficient tools and algorithms for knowledge discovery in large data sets have been devised in recent years. These methods exploit the ability of computers to search huge amounts of data quickly and effectively. However, the data to be analyzed is imprecise and afflicted with uncertainty. In the case of heterogeneous data sources such as text, audio and video, the data may moreover be ambiguous and partly conflicting. Besides, the patterns and relationships of interest are usually vague and approximate. Thus, to make the information mining process more robust, or human-like, the methods for searching and learning require tolerance towards imprecision, uncertainty and exceptions; they must have approximate reasoning capabilities and be able to handle partial truth. Properties of this kind are typical of soft computing. Soft computing techniques such as Genetic Algorithms (GA), Artificial Neural Networks, Fuzzy Logic, Rough Sets and Support Vector Machines (SVM) have been found to be effective when used in combination. Therefore, soft computing algorithms are used to accomplish data mining across different applications (Mitra S, Pal S K & Mitra P, 2002; Alex A Freitas, 2002).

Extensible Markup Language (XML) is emerging as a de facto standard for information exchange among various applications of the World Wide Web, owing to XML's inherent self-describing capacity and flexibility in organizing data. In an XML representation, semantics are associated with the contents of the document by means of self-describing tags that can be defined by the users. Hence XML can serve as a medium for interoperability over the Internet. With these advantages, the amount of data being published on the Web in the form of XML is growing enormously, and many naïve users find the need to search over large XML document collections (Gang Gou & Rada Chirkova, 2007; Luk R et al., 2000).
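To illustrate the self-describing nature of XML, the following minimal sketch (the document and tag names are invented for illustration) shows how user-defined tags let an application retrieve a field by its meaning rather than by its position:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment: user-defined tags such as <title> and
# <author> describe the meaning of the data they enclose.
doc = """
<article>
  <title>Soft Computing for Data Mining</title>
  <author>S. Mitra</author>
  <abstract>Soft computing techniques for knowledge discovery.</abstract>
</article>
"""

root = ET.fromstring(doc)
# The tag names themselves carry the semantics, so an application can
# address a field by name instead of relying on a fixed record layout.
for child in root:
    print(child.tag, "->", child.text.strip())
```

Because the tags travel with the data, two applications that agree only on the tag vocabulary can exchange such documents without sharing a fixed schema, which is what makes XML attractive for interoperability.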
The SVM is an efficient and principled method for classification and regression. It is capable of classifying both linearly separable and non-linearly separable data. The GA is an effective technique for searching enormous, possibly unstructured solution spaces. Human search strategies, while efficient for small documents, are not viable when searching enormous amounts of data. Hence, making search engines cognizant of the search strategy by means of a GA can enable fast and accurate search over large document collections.
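The GA search idea can be sketched with a toy example. The parameters, fitness function and operators below are illustrative assumptions, not taken from the cited work; the bit-string "one-max" objective simply stands in for a large, unstructured solution space:

```python
import random

random.seed(42)

# Toy GA: evolve a bit string to maximize the number of ones.
# Population size, mutation rate and generation count are assumed values.
GENES, POP, GENERATIONS = 32, 40, 60

def fitness(ind):
    # Objective: count of 1-bits (the "one-max" toy problem).
    return sum(ind)

def crossover(a, b):
    # Single-point crossover between two parents.
    point = random.randrange(1, GENES)
    return a[:point] + b[point:]

def mutate(ind, rate=0.01):
    # Flip each bit independently with a small probability.
    return [g ^ 1 if random.random() < rate else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    # Elitist selection: keep the fitter half as parents, breed the rest.
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print("best fitness:", fitness(best), "of", GENES)
```

The same loop of selection, crossover and mutation applies when the "individuals" encode candidate search strategies or queries instead of bit strings; only the fitness function changes.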
The topic categorization of XML documents poses several new challenges. The tags in XML capture the semantics of the contents of the document and are therefore more significant than the contents themselves during classification. Consequently, a general framework that assigns equal priority to both the tags and the contents of an XML document cannot be expected to exhibit any significant performance improvement, whereas a topic categorization framework that gives prominence to tags will be highly effective. The possibility of topic categorization of XML documents using SVM is explored in (Srinivasa K G et al., 2005).
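One simple way to realise "prominence to tags" is to count each tag token several times before vectorization, so tag features dominate the learned model. The sketch below uses scikit-learn for illustration; the weighting scheme, the `TAG_WEIGHT` value and the tiny corpus are all assumptions, and the cited work may use a different mechanism:

```python
import xml.etree.ElementTree as ET
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

TAG_WEIGHT = 3  # assumed boost: each tag token is counted three times

def featurize(xml_text):
    # Flatten an XML document into a token string in which tag names
    # are repeated TAG_WEIGHT times, giving them prominence over content.
    root = ET.fromstring(xml_text)
    tokens = []
    for el in root.iter():
        tokens.extend([el.tag] * TAG_WEIGHT)   # tags, boosted
        if el.text and el.text.strip():
            tokens.extend(el.text.split())     # plain content words
    return " ".join(tokens)

# Invented two-topic training corpus.
docs = [
    "<paper><theorem>groups rings</theorem></paper>",
    "<paper><proof>lemma holds</proof></paper>",
    "<recipe><ingredient>flour eggs</ingredient></recipe>",
    "<recipe><step>mix bake</step></recipe>",
]
labels = ["math", "math", "cooking", "cooking"]

vec = TfidfVectorizer()
X = vec.fit_transform(featurize(d) for d in docs)
clf = LinearSVC().fit(X, labels)

# A new document with unseen content but familiar tags is still
# categorized correctly, because the tag features carry the weight.
test_doc = featurize("<paper><theorem>fields</theorem></paper>")
print(clf.predict(vec.transform([test_doc]))[0])
```

Here the content word "fields" never occurs in training, yet the boosted `paper` and `theorem` tag features suffice to place the document in the right topic.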
A Selective Dissemination of Information (SDI) system helps users cope with the large amount of available information by automatically delivering knowledge to the users who need it. Selective dissemination is thus the task of dispatching documents to users based on their interests. Such systems maintain user profiles to judge the users' interests and information needs. New documents are filtered against the user profiles, and the relevant information is delivered to the corresponding users. For XML documents, exploiting user-defined tags is of great importance in improving the effectiveness of the dissemination task. The possibility of selective dissemination of XML documents based on a user model using Adaptive GAs is addressed in (Srinivasa K G et al., 2007: IOS Press).
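The filtering step can be sketched as follows: each user profile and each incoming document is reduced to a term-frequency vector, and a document is dispatched to every user whose profile is sufficiently similar to it. The user names, profiles and threshold below are invented for illustration, and the cited work's adaptive-GA user model is more elaborate than this cosine-similarity baseline:

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term vectors.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical user profiles built from each user's interest terms.
profiles = {
    "alice": vectorize("genetic algorithms search optimization"),
    "bob": vectorize("xml schema markup documents"),
}

def disseminate(document, threshold=0.2):
    # Deliver the document to every user whose profile similarity
    # meets the (assumed) threshold.
    doc_vec = vectorize(document)
    return [user for user, prof in profiles.items()
            if cosine(doc_vec, prof) >= threshold]

print(disseminate("adaptive genetic algorithms for xml documents"))
```

A document about genetic algorithms and XML matches both profiles above, while an unrelated document matches neither and is simply not delivered.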