A Hierarchical Online Classifier for Patent Categorization
Domonkos Tikk (Budapest University of Technology and Economics, Hungary), György Biro (TextMiner Ltd., Hungary) and Attila Törcsvári (Arcanum Development Ltd., Hungary)
Copyright: © 2008
Abstract: Patent categorization (PC) is a typical application area of text categorization (TC). TC can be applied in different scenarios at the work of patent offices depending on at what stage the categorization is needed. This is a challenging field for TC algorithms, since the applications have to deal simultaneously with large number of categories (in the magnitude of 1000–10000) organized in hierarchy, large number of long documents with huge vocabularies at training, and they are required to work fast and accurate at on-the-fly categorization. In this paper we present a hierarchical online classifier, called HITEC, which meets the above requirements. The novelty of the method relies on the taxonomy dependent architecture of the classifier, the applied weight updating scheme, and on the relaxed category selection method. We evaluate the presented method on two large English patent application databases, the WIPO-alpha and the Espace A/B corpora. We also compare the presented method to other TC algorithms on these collections, and show that it outperforms them significantly.