Automatic Syllabus Classification Using Support Vector Machines
Xiaoyan Yu (Virginia Tech, USA), Manas Tungare (Virginia Tech, USA), Weigo Yuan (Virginia Tech, USA), Yubo Yuan (Virginia Tech, USA), Manuel Pérez-Quiñones (Virginia Tech, USA) and Edward A. Fox (Virginia Tech, USA)
Copyright: © 2009
Syllabi are important educational resources. Gathering freely available syllabi and building useful services on top of the collection is of great value to the educational community. However, searching for a syllabus on the Web with a generic search engine is error-prone and often yields too many irrelevant links. In this chapter, we describe our empirical study on automatic syllabus classification using support vector machines (SVM) to filter noise out of search results. We describe the steps in the classification process, from training data preparation and feature selection to classifier building using SVMs. Empirical results are provided and discussed. We hope our reported work will also benefit people who are interested in building other genre-specific repositories.
The four classes of syllabi are defined in Table 1. A syllabus component is one of the following pieces of information: course code, title, class time and location, offering institute, teaching staff, course description, objectives, Web site, prerequisite, textbook, grading policy, schedule, assignment, exam, and resources. We consider only the full and the partial classes as syllabi. We treat a partial syllabus as a syllabus because it can be completed by following its outgoing links, which is helpful for a variety of services. For example, when recommending papers or textbooks for a course from a partial syllabus, extracting frequent words from that page alone would be inaccurate, since more features of the course are described on other pages. Therefore, we would like to recognize partial syllabi and then retrieve more complete information from them. Similarly, we also need to differentiate between an entry page and a noise page, although we consider neither of them a syllabus.
Table 1.
|Class||Definition||Syllabus||Links|
|Full||A syllabus without links to other syllabus components||T||F|
|Partial||A syllabus along with links to other syllabus components somewhere else||T||T|
|Entry Page||A page that contains a link to a syllabus||F||T|
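The class definitions above can be read as a decision over two yes/no properties of a page. The following is a minimal sketch, assuming the two T/F values in each row mean "the page is itself a syllabus" and "the page links to further syllabus content". The names `PageClass`, `page_class`, and `counts_as_syllabus` are hypothetical, and mapping the remaining combination (neither property holds) to the noise class is an assumption, since no row for noise pages appears above.

```python
from enum import Enum


class PageClass(Enum):
    FULL = "full"
    PARTIAL = "partial"
    ENTRY = "entry page"
    NOISE = "noise"  # assumed: no definition row for this class is shown


def page_class(is_syllabus: bool, has_syllabus_links: bool) -> PageClass:
    """Map the two T/F properties to a class (hypothetical helper)."""
    if is_syllabus:
        # A syllabus with links to more components is partial, otherwise full.
        return PageClass.PARTIAL if has_syllabus_links else PageClass.FULL
    # Not a syllabus: it is an entry page only if it links to one.
    return PageClass.ENTRY if has_syllabus_links else PageClass.NOISE


def counts_as_syllabus(cls: PageClass) -> bool:
    """Only the full and partial classes are considered syllabi."""
    return cls in (PageClass.FULL, PageClass.PARTIAL)


for flags in [(True, False), (True, True), (False, True), (False, False)]:
    cls = page_class(*flags)
    print(flags, cls.value, counts_as_syllabus(cls))
```

In the study itself these properties are not given in advance; the classifier has to predict the class from the page content.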
Key Terms in this Chapter
Model Training: A procedure in supervised machine learning that estimates the parameters of a designed model from a data set with known classes.
Support Vector Machines (SVM): A supervised machine learning classification approach whose objective is to find the hyperplane that maximizes the minimum distance between the hyperplane and the training data points.
Model Testing: A procedure performed after model training that applies the trained model to a different data set with known classes and evaluates the performance of the trained model.
Syllabus Entry Page: A page that contains a link to a syllabus.
Partial Syllabus: A syllabus along with links to more syllabus components at another location.
Syllabus Component: One of the following pieces of information: course code, title, class time and location, offering institute, teaching staff, course description, objectives, Web site, prerequisite, textbook, grading policy, schedule, assignment, exam, and resources.
Full Syllabus: A syllabus without links to other syllabus components.
Feature Selection: For text documents, a method that reduces the high dimensionality of the feature space by selecting the more representative features. Usually the feature space consists of the unique terms occurring in the documents.
Text Classification: The problem of automatically assigning predefined classes to text documents.
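The terms above (feature selection, model training, model testing, SVM) fit together as a small pipeline, sketched below under several assumptions. The toy documents are invented for illustration. Feature selection is done with a document-frequency threshold, which is one common choice and not necessarily the one used in the chapter. The linear SVM is trained with a Pegasos-style sub-gradient method on the hinge loss, swapped in here to keep the example free of dependencies; the chapter does not say which solver was used. All function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def select_vocabulary(docs, min_df=2):
    """Feature selection: keep terms found in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return sorted(term for term, count in df.items() if count >= min_df)


def vectorize(doc, vocabulary):
    """Binary term-presence vector over the selected vocabulary."""
    present = set(tokenize(doc))
    return [1.0 if term in present else 0.0 for term in vocabulary]


def train_linear_svm(X, y, lam=0.01, epochs=20):
    """Pegasos-style sub-gradient descent on the regularized hinge loss."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for x, label in zip(X, y):  # fixed order keeps the run deterministic
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = label * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1.0 - 1.0 / t) * wi for wi in w]  # regularization shrink
            if margin < 1.0:  # example violates the margin
                w = [wi + eta * label * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


# Toy training set with known classes: +1 = syllabus, -1 = not a syllabus.
train_docs = [
    "course syllabus grading policy textbook schedule",
    "syllabus course objectives prerequisite textbook exam",
    "course description grading exam schedule instructor",
    "department news faculty awards announcement",
    "campus news sports events announcement",
    "faculty research publications news events",
]
train_labels = [1, 1, 1, -1, -1, -1]

# A different data set with known classes, used only for model testing.
test_docs = [
    "syllabus for the database course with grading and exam dates",
    "latest news and events from the faculty",
]
test_labels = [1, -1]

vocabulary = select_vocabulary(train_docs, min_df=2)
weights = train_linear_svm(
    [vectorize(d, vocabulary) for d in train_docs], train_labels
)
predictions = [predict(weights, vectorize(d, vocabulary)) for d in test_docs]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(vocabulary)
print(predictions, accuracy)
```

The vocabulary is chosen from the training documents only, so the test documents play no part in feature selection or training. A real experiment would need far more data and a tuned regularization parameter.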