Large Scale Graph Mining with MapReduce: Counting Triangles in Large Real Networks

Large Scale Graph Mining with MapReduce: Counting Triangles in Large Real Networks

Charalampos E. Tsourakakis (Carnegie Mellon University, USA)
Copyright: © 2012 |Pages: 16
DOI: 10.4018/978-1-61350-053-8.ch013
OnDemand PDF Download:


In this Chapter, we present state of the art work on large scale graph mining using MapReduce. We survey research work on an important graph mining problem, counting the number of triangles in large-real world networks. We present the most important applications related to the count of triangles and two families of algorithms, a spectral and a combinatorial one, which solve the problem efficiently.
Chapter Preview


The total digital output is expected to exceed 1.2 ZetaBytes in 2010 (Blake, 2010). The New York Stock Exchange generates about one terabyte of new trade data per day and Facebook hosts approximately 10 billion photos, taking up one PetaByte of storage (White, 2009). It has become apparent that as the amount of data generated increases at this unprecedented rate, scalability of algorithms is crucial. In recent years, MapReduce (Dean et al., 2008) and Hadoop (Hadoop Wiki, 2010), its open source implementation, have become the de facto standard for analyzing large datasets. Despite its limitations, the MapReduce framework stands out for making the programmer's life who uses MapReduce to develop applications easy. Specifically, from the programmer's perspective, MapReduce is just a library imported at the beginning of the program, like any other common library. MapReduce takes care of the parallelization and all its details including distributing the data over the cluster and fault tolerance. In the next Section we provide more details on MapReduce and Hadoop. According to (Hadoop Users, 2010) over 70 major companies over the world use Hadoop. Furthermore, innovative commercial ideas like Amazon's Elastic Compute Cloud (EC2) where users can upload large data sets and rent processor time in a large Hadoop cluster have proved successful. Besides companies, MapReduce and Hadoop have become also the de facto standard for research. Several universities including Carnegie Mellon University, Cornell and Berkeley are using Hadoop clusters for research purposes. Projects include text processing, analysis of large astronomical datasets and graph mining. Currently, (Pegasus CMU, 2010) provides an open source Hadoop-based library for performing important graph mining operations.

In this Chapter, we survey state-of-the-art work related to triangle counting using MapReduce. The interested reader is urged to study the original publications (Tsourakakis, 2008), (Tsourakakis et al., 2009a), (Tsourakakis et al., 2009b), (Tsourakakis et al., 2009c), (Tsourakakis, 2010), (Tsourakakis et al., 2011) which we survey in this Chapter for the full details of the algorithms. The outline of this Chapter is as follows: in Section 2 we provide a brief description of MapReduce and Hadoop, an open source package which includes a freely available implementation of MapReduce. In Section 3 we present work related to the triangle counting problem, in Section 4 the implementation details and in Section 5 future research directions. Finally, in Section 6 we conclude. For the interested reader, we provide at the end of the Chapter additional reading material.

Complete Chapter List

Search this Book: