Distributed Algorithms for Finding Meta-Paths of a Large Heterogeneous Information Network on Cloud

Distributed Algorithms for Finding Meta-Paths of a Large Heterogeneous Information Network on Cloud

Phuc Do (University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam)
Copyright: © 2020 |Pages: 27
DOI: 10.4018/978-1-7998-1082-7.ch011

Abstract

Meta-path is an important concept of heterogeneous information networks (HINs). Meta-paths were used in many tasks such as information retrieval, decision making, and product recommendation. Normally meta-paths were proposed by human experts. Recently, works on meta-path discovery have proposed in-memory solutions that fit in one computer. With large HINs, the whole HIN cannot be loaded in the memory. In this chapter, the authors proposed distributed algorithms to discover meta-paths of large HINs on cloud. They develop the distributed algorithms to discover the significant meta-path, maximal significant meta-path, and top-k meta-paths between two vertices of HIN. Calculation of the support of meta-paths or performing breadth first search can be computational costly in very large HINs. Conveniently, the distributed algorithms utilize the GraphFrames library of Apache Spark on cloud computing environment to efficiently query large HINs. The authors conduct the experiments on large DBLP dataset to prove the performance of our algorithms on cloud.
Chapter Preview
Top

Introduction

Heterogeneous Information Networks (HINs) is a graph model in which both arcs and vertices are associated with types. Real world data is often represented as a HIN, for example, DBLP bibliographic networks, FreeBase, YaGo. In DBLP bibliographic networks, vertices have several vertex types such as authors, papers, venues, topics, words. These vertex types are connected via various link types such as write relation (author─[writes]→paper), cite relation (paper─[cite]→paper), publishedAt relation (paper-[published_At] →venue), to name but a few. Let us consider an example from DBLP. There are three vertices in the graph, namely “Rakesh Agrawal”, “Fast algorithms for mining association rules”, “VLDB”, with their corresponding vertex type “author”, “paper” and “venue”. The two links connecting between three vertices are labeled with write and publishedAt relation, which describe the fact “Rakesh Agrawal” published paper named “Fast algorithms for mining association rules” at the VLDB conference. With a vast number of interrelated facts study on HINs has gained traction recently.

One of the important concepts in HINs is meta-path, or sequences of vertex types and link types between two given nodes (Y. Sun, J. Han, X. Yan, P. Yu, and T. Wu, 2011). In the DBLP example, we have a meta-path of length one: author─[writes]→paper. We can also have a path length of more than one, such as author─[write]→ paper-[published_At]→venue, which explains an indirect relation between author and venue. Since a meta-path can express various kinds of relationships among nodes, ranging from one-hop relation (path length equal to one) to many-hop relation (path length larger than one), it has been extensively used in many HIN applications such as Similarity Measure PathSim (Y. Sun, 2011), HeteSim (Chuan Shi X. K., 2013), Clustering and Classification (Xiangnan Kong, 2012). In these applications, meta-paths are provided by domain experts, which can be challenging in some scenarios of large HINs. First, domain experts may not always be available. Secondly, when HINs becomes larger, manually selecting meta-paths can be labor-intensive. Hence, there is a need for a tool to query meta-paths efficiently in large HINs. While most of extant meta-path based studies focus on analytical applications of meta-paths (Chuan Shi Y. L., 2017), there are recent works proposing automatic solutions to mine meta-paths from HINs. However, those solutions assume the input HIN must fit into the memory of one computer. It is not clear how to extend these works in distributed environments with many worker computers on Cloud computing environment to process large HIN that cannot fit into memory of one computer. With Cloud computing environment, we can leverage the power of computing and data storage of many computers to solve the challenging problem (Christos Stergiou, Kostas E. Psannis, Brij B. Gupta, Yutaka Ishibashi, 2018).

A majority of meta-path based studies focus on its HIN applications. In (Y. Sun, J. Han, X. Yan, P. Yu, and T. Wu, 2011), the authors study similarity search defined among the same type of vertices in HINs. They propose the concept of meta-path and PathSim, a meta-path based similarity between two vertices of the same type. Through experiments, the authors show the effectiveness of using PathSim over random-walk based similarity measures. Subsequently, there are works focusing on analytic applications of meta-paths.

Complete Chapter List

Search this Book:
Reset