Introduction And Problematic
The information universe is undergoing a major revolution that involves all sectors globally and many different fields. Today, the web is the largest accumulation of data. The evolution of communication means (Internet and intranet) has given birth to new concepts such as big data, which reflects the large amount of unstructured data (textual documents) available online and offline. The digital society is enriched every day with new content, which makes it difficult to manage. For this reason, we need to develop tools that help us find the desired information within a reasonable time, perform certain tasks in our place, and make our lives easier.
Over the past 20 years, with the development of computers, visualization tools, and instruments for the automatic processing of information, techniques such as data mining have been applied to extract valuable information from large volumes of data. In our proposed work, we consider the following scenario: we have a huge quantity of textual documents, and we ask a person to classify them by domain without any external help. This person, however, has no prior knowledge of these documents. The task requires the person to read all the documents in order to find the links between them. This costly problem is the human analogue of the task our work addresses by machine, called clustering, in which the aim is to process a set of textual documents and arrange them into homogeneous classes, where documents of the same class must be similar and documents of different classes must be as dissimilar as possible.
Much work has been done in this area and several systems have appeared, based on classical techniques that face multiple obstacles:
- The choice of the distance measure
- The selection of the text representation method
- The initialisation of the number of clusters
- The execution time, which grows with the number of documents
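To illustrate why the choice of distance measure is an obstacle, the following minimal sketch compares two common measures on toy term-frequency vectors. The vectors, the candidate names, and the helper `nearest` are hypothetical and chosen only for illustration; they are not taken from the proposed model.

```python
import math

def euclidean(u, v):
    """Euclidean distance: sensitive to document length (vector magnitude)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine distance: compares term proportions and ignores magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def nearest(query, candidates, measure):
    """Return the name of the candidate closest to the query under `measure`."""
    return min(candidates, key=lambda name: measure(query, candidates[name]))

# Hypothetical term-frequency vectors over a three-word vocabulary.
query = [1, 1, 0]
candidates = {
    "long_same_topic": [4, 4, 0],    # same word proportions, longer document
    "short_other_topic": [1, 0, 1],  # similar length, different words
}

by_euclidean = nearest(query, candidates, euclidean)
by_cosine = nearest(query, candidates, cosine)
print(by_euclidean, by_cosine)
```

On this toy data the two measures disagree about which document is closest to the query, so the resulting clusters would depend on the measure selected.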
The current scientific world has been considerably enriched by the appearance of novel concepts and prototypes. For each problem encountered, we should observe nature: it may already have faced the same problem and found solutions long ago. Biomimicry consists of copying living organisms by taking advantage of solutions and innovations made by nature. In this paper, we imitate the lifestyle of social bees in order to introduce a new artificial model called Distances Combination by Social Bees (DC-SB) to solve the problem of text clustering, which is a topical challenge in the scientific community. Our problem lies at the intersection of several subjects, as shown in Figure 1.
State Of The Art
Automatic classification (clustering) has attracted considerable attention from research and industry: many papers have been published on the subject, and many commercial systems and software packages are being developed. These systems play an important role in modern life, and all text clustering systems follow the same process: i) text representation, ii) construction of the distance matrix, iii) modelling, iv) evaluation of results (Buhmann, 2003).
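The four-step process can be sketched as follows. This is a minimal illustration under stated assumptions, not the DC-SB model: the representation is a plain bag of words, the distance is cosine, the modelling step is a simple threshold-based single-link grouping, and the evaluation uses purity against hypothetical reference labels. All function names, the sample documents, and the threshold value are assumptions made for this example.

```python
import math
from collections import Counter

def represent(docs):
    """Step i: bag-of-words term-frequency vectors over a shared vocabulary."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[c[t] for t in vocab] for c in counts]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def distance_matrix(vectors):
    """Step ii: pairwise distances between all document vectors."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]

def cluster(dist, threshold):
    """Step iii: group documents linked by a chain of distances below threshold."""
    n = len(dist)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i][j] < threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def purity(labels, truth):
    """Step iv: share of documents that belong to their cluster's majority class."""
    total = 0
    for c in set(labels):
        members = [t for l, t in zip(labels, truth) if l == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels)

# Hypothetical toy corpus with reference labels, used only for evaluation.
docs = [
    "bees collect nectar from flowers",
    "social bees collect pollen and nectar",
    "the stock market fell sharply today",
    "investors watched the stock market today",
]
truth = ["biology", "biology", "finance", "finance"]

vectors = represent(docs)
dist = distance_matrix(vectors)
labels = cluster(dist, threshold=0.7)  # threshold chosen by hand for this data
score = purity(labels, truth)
print(labels, score)
```

Each step maps to one function, so a different representation, distance, or clustering model can be swapped in without touching the others. Note that the threshold plays the role that the number of clusters plays in other methods: it still has to be fixed in advance.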