Download and Structured Processing for the Wikipedia Corpus
In Wikipedia, each entry is created and organized according to defined rules. The main structural elements include explanatory pages (topic pages and category pages), special pages (e.g., redirect pages and disambiguation pages), templates, and information boxes (Li Yun, 2009). Explanatory pages are the most important part of Wikipedia and can be viewed as the semantic context of concepts. A topic page corresponds to a single topic concept and is edited by Wikipedia's contributors. Category pages mainly reflect the hierarchical (parent and child) relationships between categories and list all the pages that a subcategory contains. Through category pages, Wikipedia organizes its large number of pages in a standardized way. Redirect pages and disambiguation pages are important resources for mining semantic information from Wikipedia: they can be used to build a synonym thesaurus and a word sense disambiguation thesaurus. Information boxes, which are highly structured, are an important structural source for semantic information mining.
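The page types described above can be illustrated with a minimal sketch. It assumes MediaWiki's standard namespace numbers (0 for topic pages, 14 for category pages); the `Page` class, the function names, and the list of disambiguation template markers are hypothetical and chosen only for illustration, since the actual templates differ between language editions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# MediaWiki namespace numbers: 0 = article (topic) pages, 14 = category pages.
NS_TOPIC = 0
NS_CATEGORY = 14

# Assumed disambiguation template prefixes; the real set varies by edition.
DISAMBIG_MARKERS = ("{{disambig", "{{消歧义")


@dataclass
class Page:
    title: str
    ns: int
    text: str = ""
    redirect_to: Optional[str] = None  # target title if this page is a redirect


def page_kind(page: Page) -> str:
    """Classify a page as redirect, category, disambiguation, topic, or other."""
    if page.redirect_to is not None:
        return "redirect"
    if page.ns == NS_CATEGORY:
        return "category"
    if page.ns == NS_TOPIC:
        lowered = page.text.lower()
        if any(marker in lowered for marker in DISAMBIG_MARKERS):
            return "disambiguation"
        return "topic"
    return "other"


def build_synonyms(pages):
    """Group redirect titles under their target title (a simple synonym table)."""
    synonyms = defaultdict(set)
    for page in pages:
        if page_kind(page) == "redirect":
            synonyms[page.redirect_to].add(page.title)
    return dict(synonyms)
```

Under these assumptions, every redirect title becomes a synonym candidate for its target, which is one simple way to seed the synonym thesaurus mentioned above.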
To make the information coverage of the knowledge base as wide as possible, the Chinese “topic pages” and “category pages” (in both simplified and traditional Chinese) available as of November 5, 2012 were downloaded from Wikipedia's official open data site. After conversion between simplified and traditional Chinese through Microsoft's conversion API, we collected 2,500,000 Chinese entries corresponding to topic pages and 270,000 category entries.
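As a rough illustration of how topic and category entries could be counted from a downloaded dump, the sketch below streams a file in the MediaWiki XML export format. It is not the procedure used in this work: the function names are hypothetical, redirects are assumed to be excluded from the topic count, and the simplified/traditional conversion step is omitted.

```python
import io
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, ns, is_redirect) for each <page> in a MediaWiki XML dump."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, ns, is_redirect = None, None, False
        for child in elem:
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = int(child.text)
            elif name == "redirect":
                is_redirect = True
        yield title, ns, is_redirect
        elem.clear()  # release the page subtree to keep memory use low


def count_entries(source):
    """Count non-redirect topic pages (ns 0) and category pages (ns 14)."""
    counts = {"topic": 0, "category": 0}
    for _title, ns, is_redirect in iter_pages(source):
        if is_redirect:
            continue
        if ns == 0:
            counts["topic"] += 1
        elif ns == 14:
            counts["category"] += 1
    return counts


# Usage with a tiny in-memory sample instead of a real dump file.
sample = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    "<page><title>北京</title><ns>0</ns></page>"
    '<page><title>北京市</title><ns>0</ns><redirect title="北京" /></page>'
    "<page><title>Category:城市</title><ns>14</ns></page>"
    "</mediawiki>"
).encode("utf-8")
print(count_entries(io.BytesIO(sample)))
```

Streaming with `iterparse` avoids loading the whole dump into memory, which matters for a corpus of several million pages.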