Mining Parallel Knowledge from Comparable Patents

Bin Lu (City University of Hong Kong, Hong Kong), Benjamin K. Tsou (City University of Hong Kong, Hong Kong & Hong Kong Institute of Education, Hong Kong), Tao Jiang (ChiLin Star Corporation, China), Jingbo Zhu (Northeastern University, China) and Oi Yee Kwong (City University of Hong Kong, Hong Kong)
DOI: 10.4018/978-1-60960-625-1.ch013


The extracted parallel sentences and technical terms could be a good basis for further acquisition of term relations and the translation of monolingual ontologies, as well as for statistical machine translation systems and other cross-lingual information access applications.
Chapter Preview

1. Introduction

Ontology learning from text has been attracting increasing attention from various research communities, such as natural language processing, machine learning, knowledge representation/engineering and user interface design (Brewster et al., 2003; Cimiano, 2006; Buitelaar and Cimiano, 2007; Hjelm, 2009; Wong, 2009). Computers can use ontologies to reason over terms and relations, as well as to deduce new information that is not stated explicitly in the ontologies. The constructed ontologies would ultimately contribute to the realization of the Semantic Web (Berners-Lee et al., 2001; Shadbolt et al., 2006).

Patent documents, which contain a large number of technical terms, could be a good source for learning technical terms and their relations, and multilingual patents would be further useful for learning not only monolingual but also multilingual ontologies. Little work has been done on ontology learning from patent documents, and even less on ontology learning from multilingual parallel or comparable patents.

In this chapter, we present our experimental work on mining parallel sentences and technical terms from comparable Chinese-English patent documents. Part of the chapter is based on our previously published work (Lu et al., 2009; Lu & Tsou, 2009). Unlike a merely comparable corpus, a parallel corpus of matched equivalent sentences is an invaluable resource for many applications, such as multilingual ontology learning, machine translation, and cross-lingual information retrieval. However, obtaining a large-scale parallel corpus is much more expensive than obtaining a comparable bilingual corpus. Using our corpus of about 7,000 Chinese-English comparable patents with titles, abstracts, claims and full texts, we try to address the following three issues:

  1. Parallel sentence extraction: alignment of only the parallel sentences in the comparable patents by combining three quality measures, thereby deriving a useful parallel corpus of sentences. The experiments show that high-quality parallel sentences can be obtained by aligning sentences and then filtering the sentence alignments with a combination of different quality measures.

  2. Bilingual term extraction: identification of bilingual technical terms by combining both linguistic and statistical information under an SVM classifier. Based on the high-quality parallel sentences extracted, bilingual technical terms, including both single-word and multi-word terms, can be readily identified by combining Part-of-Speech (POS) patterns with the statistical scores given by a word alignment tool. In addition, linguistic and statistical features can further improve the performance of bilingual term extraction via a machine learning approach (i.e., an SVM classifier).

  3. Chinese-to-English Statistical Machine Translation (SMT): automatic translation of patents from Chinese to English with an SMT engine trained on the mined parallel sentences. An SMT engine trained on these parallel sentences achieves promising BLEU scores.
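
As a rough illustration of the filtering idea in the first issue above, the following Python sketch keeps a candidate sentence pair only when every quality measure clears its threshold. The two measures shown (a length ratio and a bilingual-lexicon coverage score), the threshold values, and all names such as `SentencePair` and `filter_pairs` are assumptions made for this example; the three measures actually combined in the chapter are not specified in this preview.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class SentencePair:
    """A candidate alignment; both sides are assumed to be pre-tokenized with spaces."""
    source: str  # Chinese sentence (word-segmented)
    target: str  # English sentence


# A quality measure maps a candidate pair to a score in [0, 1].
Measure = Callable[[SentencePair], float]


def length_ratio_score(pair: SentencePair) -> float:
    """Hypothetical measure: shorter token count divided by longer token count."""
    n_src = len(pair.source.split())
    n_tgt = len(pair.target.split())
    if n_src == 0 or n_tgt == 0:
        return 0.0
    return min(n_src, n_tgt) / max(n_src, n_tgt)


def make_lexicon_coverage(lexicon: Dict[str, Set[str]]) -> Measure:
    """Hypothetical measure: share of source words with a lexicon translation in the target."""
    def score(pair: SentencePair) -> float:
        src_words = pair.source.split()
        tgt_words = set(pair.target.lower().split())
        if not src_words:
            return 0.0
        hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_words)
        return hits / len(src_words)
    return score


def filter_pairs(
    pairs: List[SentencePair],
    measures: Dict[str, Measure],
    thresholds: Dict[str, float],
) -> List[SentencePair]:
    """Keep only the pairs whose score meets the threshold for every measure."""
    return [
        pair
        for pair in pairs
        if all(measures[name](pair) >= thresholds[name] for name in measures)
    ]
```

Requiring every measure to pass is only one way of combining them; a weighted sum of the scores with a single cut-off would be an equally plausible design, and the sketch does not claim to reproduce the combination used in the experiments.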

Given the relative paucity of parallel patent data, the use of such a comparable corpus for mining parallel knowledge would be a helpful step towards multilingual ontology learning and other cross-lingual access applications in the patent domain, such as MT and cross-lingual information retrieval. The extracted parallel sentences and technical terms could be a good basis for the further acquisition of attributes and term relations, as well as for the translation of monolingual ontologies, since most current ontologies are monolingual.

In the next section we introduce the background of the research. The comparable Chinese-English patent corpus and its preprocessing are then described in Section 3. Our approaches to parallel sentence extraction, bilingual term extraction and the SMT experiment are presented in Sections 4, 5 and 6, respectively. Discussion is given in Section 7, and we conclude in Section 8.
