The Clustering of Large Scale E-Learning Resources

Fei Wu (Zhejiang University, China), Wenhua Wang (Zhejiang University, China), Hanwang Zhang (Zhejiang University, China) and Yueting Zhuang (Zhejiang University, China)
DOI: 10.4018/978-1-60566-380-7.ch006


E-learning resources are growing rapidly with the spread of the Internet, so retrieving them effectively is becoming increasingly important. This chapter introduces an approach to retrieving e-learning resources from a large-scale dataset. The basic idea is to first cluster all the resources into topics and then search only those clusters that are most relevant to the query. To make clustering feasible on a large-scale dataset, the authors adapt affinity propagation to the MapReduce framework, yielding what they call parallel affinity propagation. The proposed approach could improve the retrieval of e-learning resources by taking users’ underlying intentions into account.
Chapter Preview

With the spread of the Internet, e-learning is becoming more and more popular, providing a new way for people to learn without attending face-to-face classes. In e-learning, the student and the teacher interact through online technology, which draws on a combination of techniques including computer networks, multimedia, content portals, digital libraries and search engines. By conservative estimates, the worldwide e-learning industry is worth over 38 billion euros. As e-learning becomes more prevalent, the amount of learning resources grows exponentially, so it is no longer feasible to reach them only by following links. An effective mechanism is therefore needed to locate resources, so that people can easily find the e-learning materials they want. To locate accurately the materials a user is seeking, the system has to infer the user’s underlying intentions from the text typed in, rather than merely return literal matches, particularly when the user is not familiar with the terminology of the field he or she is trying to learn. It is therefore worthwhile to use data mining technology to locate resources that are semantically related to the query text. E-learning resources today comprise text, images, video, audio and materials in other modalities. Text materials are nevertheless the best choice for analysis, because they account for the largest share of resources and convey information most directly. Moreover, given efficiency concerns and the high cost of mining other modalities, it is reasonable to focus only on text materials. Therefore, in this chapter the terms e-learning materials and e-learning resources mostly mean e-learning text documents.

In data mining, clustering partitions a data set into subsets so that the data in each subset share some common trait; see (Jain et al., 1999; Xu et al., 2005) for details. Clustering is therefore an effective way to discover structure when little is known about the data. E-learning resources are also intrinsically suited to clustering, because the fields that different materials cover are likely to overlap, and materials concerning similar fields probably use the same words, particularly the same terminology. For example, two physics books will likely use words such as ‘energy’, ‘force’, ‘mass’, and ‘charge’ repeatedly, which strengthens the correlation between the books. We therefore adopt clustering to preprocess the e-learning resources and mine the correlations among the materials.

Measures of text similarity have long been used in natural language processing and related areas (Corley & Mihalcea, 2005). One of the earliest applications of text similarity is perhaps the vector model in information retrieval, where the documents most relevant to a user’s query are determined by ranking the documents in a collection in decreasing order of their similarity to the query (Salton & Lesk, 1968). In the vector space model, a document is represented by a vector indexed by the terms of the corpus, so two documents that use semantically related but distinct words will show no similarity (Kandola et al., 2002). Many methods have been proposed to discover, explicitly or implicitly, the similarity between different terms, such as (Landauer, Foltz & Laham, 1998; Corley & Mihalcea, 2005; Kandola et al., 2002), as well as WordNet-based methods (Budanitsky & Hirst, 2001).
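
The limitation of the vector space model described above can be illustrated with a minimal Python sketch. It assumes raw term counts and cosine similarity, which is one common choice and not necessarily the weighting used in the chapter; the function names and the sample documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs in decreasing order of similarity."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy collection, for illustration only.
docs = [
    "force and mass determine acceleration",
    "automobile engine repair manual",
    "car maintenance guide",
]
ranked = rank("car engine", docs)

# Related but distinct words share no vector dimension, so they score zero.
synonym_score = cosine_similarity(term_vector("car"), term_vector("automobile"))
```

Here `synonym_score` is zero because "car" and "automobile" occupy different dimensions of the term space, which is the gap that the term-similarity methods cited above try to close.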

Key Terms in this Chapter

Singular Value Decomposition: In linear algebra, the singular value decomposition (SVD) is an important factorization of a rectangular real or complex matrix, with several applications in signal processing and statistics. Applications which employ the SVD include computing the pseudoinverse, least squares fitting of data, matrix approximation, and determining the rank, range and null space of a matrix.

Latent Semantic Indexing: A technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.

Hadoop: A free Java software framework that supports data intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data. It is inspired by Google’s MapReduce.

MapReduce: A software framework introduced by Google to support distributed computing on large data sets on clusters of computers. The framework is inspired by the map and reduce functions commonly used in functional programming. MapReduce libraries have been written in C++, Java, Python and other programming languages.

E-Learning: A type of technology-supported education/learning (TSL) where the medium of instruction is computer technology. In some instances, no in-person interaction takes place. The term is used in a wide variety of contexts. In companies, it refers to the strategies that use the company network to deliver training courses to employees. In the USA, it is defined as a planned teaching/learning experience that uses a wide spectrum of technologies, mainly Internet or computer-based, to reach learners. More recently, in most universities, e-learning denotes a specific mode of attending a course or programme of study in which students rarely, if ever, attend face-to-face or use on-campus educational facilities, because they study online.

Clustering: Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.

Affinity Propagation: A clustering method, which starts by considering all the data points as potential exemplars, and then recursively transmits real-valued messages along edges of the network whose nodes are data points.
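
The message passing in affinity propagation can be sketched in a few lines of Python. This is a serial illustration of the standard responsibility and availability updates, not the parallel MapReduce variant the chapter proposes. The damping factor, the iteration count, the toy 2D points (standing in for document vectors) and the choice of the median similarity as the exemplar preference are all assumptions made for the example.

```python
from statistics import median


def affinity_propagation(S, damping=0.5, iterations=200):
    """Serial affinity propagation on a square similarity matrix S.

    S[k][k] holds the preference of point k for being an exemplar.
    Returns (exemplars, labels); labels[i] is the exemplar index of point i.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # Responsibility: how well k suits i compared with i's best rival.
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability: how much support k has gathered from other points.
        for k in range(n):
            support = [max(0.0, R[i][k]) for i in range(n)]
            others = sum(support) - support[k]
            for i in range(n):
                if i == k:
                    update = others
                else:
                    update = min(0.0, R[k][k] + others - support[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * update
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], []
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


def neg_sq_dist(p, q):
    """Similarity as negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(p, q))


# Two well-separated toy groups (hypothetical data).
points = [(1, 2), (1, 4), (1, 0), (4, 2), (4, 4), (4, 0)]
n = len(points)
S = [[neg_sq_dist(p, q) for q in points] for p in points]
preference = median(S[i][j] for i in range(n) for j in range(n) if i != j)
for k in range(n):
    S[k][k] = preference

exemplars, labels = affinity_propagation(S)
```

On this toy data the central point of each group should emerge as its exemplar. The double loop over all pairs of points is the part that becomes expensive on large collections, which is the cost that distributing the computation is meant to address.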
