Introduction
In many information systems, different information processing components are required for building intelligent applications (Su et al., 2009; Mohanty et al., 2010). Knowledge bases are particularly useful in aiding decision making because expert knowledge can be flexibly captured and utilized. Expert knowledge can be represented as comprehensible rules for decision making in different applications (Chandra & Ravi, 2009; Liang & Rubin, 2009). However, we often encounter situations where we have an existing knowledge base from a source domain and wish to apply it to the same task in a different target domain. Typically, applying the source knowledge base directly to the target domain results in a large degradation in performance because of the differences between the two domains. One solution is to acquire expert knowledge for the target domain and manually refine the knowledge base. Another is to collect a sufficient amount of labeled data via manual annotation in the target domain so that the knowledge base can be discovered automatically. But additional expert knowledge is expensive to acquire, and manually annotating sufficient data in the target domain may be costly or even infeasible. Hence, a useful approach is to adapt the existing source domain knowledge base to the target domain using a very small amount of labeled target domain data. Labeled data refers to pieces of information containing the answers, or labels, provided by experts to certain queries in the domain. A computer algorithm can analyze such data and automatically construct a model for solving the task in that domain. This model can be regarded as a knowledge base that aids the prediction of answers to queries given some observations.
We investigate the refinement of an existing knowledge base represented as a Markov Logic Network (MLN) (Richardson & Domingos, 2006). A standard MLN combines first-order logic with probabilistic graphical models. It consists of a first-order knowledge base, which is a set of first-order logic formulae describing the logic relations of the task, and a set of weights, one associated with each formula. The first-order logic representation enables flexible model construction that captures knowledge such as relations among entities. The problem setting investigated in this paper is as follows. Suppose we need to solve a particular task for which an existing source domain MLN, suitable for problem solving in the source domain, is available. We wish to refine this MLN so that it is suitable for the target domain. During the refinement, a limited amount of target domain data is selected automatically, and the truth values (annotations) of the queries on that data are acquired from experts. This limited amount of labeled target domain data and the remaining unlabeled target domain data are used together to refine the source domain MLN for the target domain. Note that unlabeled target domain data refers to the data elements not selected for annotation.
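To make the MLN representation concrete, the following is a minimal sketch of the standard MLN semantics, in which the probability of a possible world is proportional to the exponential of the weighted count of true formula groundings. The toy domain, the predicate names `Smokes` and `Cancer`, the single formula, and the weight 1.5 are illustrative assumptions and do not come from this paper; inference is done by brute-force enumeration, which is only feasible for tiny domains.

```python
import itertools
import math

# Hypothetical toy domain (an assumption for illustration only).
PEOPLE = ["anna", "bob"]
ATOMS = [(pred, p) for pred in ("Smokes", "Cancer") for p in PEOPLE]


def n_smoking_implies_cancer(world):
    """Count the true groundings of Smokes(x) => Cancer(x) in a world."""
    return sum(
        1 for p in PEOPLE
        if (not world[("Smokes", p)]) or world[("Cancer", p)]
    )


# Each entry pairs a weight with a function that counts true groundings.
# The weight 1.5 is an arbitrary illustrative value.
WEIGHTED_FORMULAS = [(1.5, n_smoking_implies_cancer)]


def unnormalized_score(world):
    """exp(sum_i w_i * n_i(world)), the unnormalized probability of a world."""
    return math.exp(sum(w * count(world) for w, count in WEIGHTED_FORMULAS))


def all_worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in itertools.product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def marginal(atom, evidence=None):
    """P(atom is true | evidence), computed by brute-force enumeration."""
    evidence = evidence or {}
    numerator = denominator = 0.0
    for world in all_worlds():
        if any(world[a] != v for a, v in evidence.items()):
            continue  # skip worlds that contradict the evidence
        score = unnormalized_score(world)
        denominator += score
        if world[atom]:
            numerator += score
    return numerator / denominator


p_cancer = marginal(("Cancer", "anna"), {("Smokes", "anna"): True})
```

Here `p_cancer` is the probability that the query `Cancer(anna)` is true given the observation `Smokes(anna)`. A larger weight makes worlds that violate the formula less probable without making them impossible, which is what distinguishes an MLN from a purely logical knowledge base.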
In our previous work (Chan et al., 2010), we proposed a method for logic relation refinement using unlabeled data only. In this paper, we propose a new MLN knowledge base refinement framework based on pattern mining and active learning. Our method first analyzes the unlabeled target domain data and actively asks the expert to provide labels (or answers) for a very small number of automatically selected queries. The idea is to identify the target domain queries whose underlying relations are not sufficiently described by the existing source domain knowledge base. Although the source and the target domains may have different underlying data distributions, they must also share certain similarities since they solve the same task. Potential relational patterns in the unlabeled target domain data are discovered and new logic formulae are constructed.
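As a rough illustration of the active learning step, the sketch below ranks target domain queries by how uncertain the source model is about them and sends the most uncertain ones to the expert. This excerpt does not state the selection criterion the framework actually uses, so entropy-based uncertainty sampling is swapped in here as a generic stand-in; the function names, the query strings, and the marginal probabilities are all hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Boolean query whose probability of being true is p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_queries(marginals, budget):
    """Return the `budget` queries whose predicted truth value is most uncertain.

    `marginals` maps each query to the probability, under the source domain
    model, that the query is true in the target domain data.
    """
    ranked = sorted(
        marginals,
        key=lambda query: binary_entropy(marginals[query]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical marginals produced by applying the source model to target data.
source_model_marginals = {
    "Cancer(anna)": 0.97,
    "Cancer(bob)": 0.52,
    "Cancer(carol)": 0.10,
    "Cancer(dave)": 0.61,
}

to_label = select_queries(source_model_marginals, budget=2)
```

With these made-up numbers, `to_label` contains `Cancer(bob)` and `Cancer(dave)`, the two queries whose probabilities lie closest to 0.5. High uncertainty is only a proxy for "not sufficiently described by the source knowledge base"; a model can also be confidently wrong, which a criterion based on mined relational patterns might catch and this one would not.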