Project D: Distributed Shortest Path Big Data Implementation

Overview

The goal of this project is to develop a distributed implementation of the Floyd-Warshall all-pairs shortest path algorithm using existing big data tools.

Client: Brian E. Heilig, E.T. International

Background

The Floyd-Warshall algorithm is a sequential algorithm for finding the shortest paths between all pairs of vertices of an undirected weighted graph. When the input graph is large, on the order of a billion vertices, the sequential solution becomes intractable: the algorithm requires O(V^2) memory for its distance matrix and O(V^3) time, so the data cannot even be held in the memory of a single machine. For large input graphs a distributed solution is required, in which the input graph is partitioned among the contributing nodes.
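For reference, the sequential algorithm is a triple loop over a distance matrix. The sketch below is a minimal Java version, assuming the graph has already been loaded into an n x n matrix with Double.POSITIVE_INFINITY marking absent edges; it is the baseline that any distributed version must reproduce.

    /** Sequential Floyd-Warshall: dist[i][j] is updated in place to the
     *  length of the shortest path from i to j. Absent edges must be
     *  initialized to Double.POSITIVE_INFINITY and dist[i][i] to 0. */
    static void floydWarshall(double[][] dist) {
        int n = dist.length;
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    double throughK = dist[i][k] + dist[k][j];
                    if (throughK < dist[i][j]) {
                        dist[i][j] = throughK;
                    }
                }
            }
        }
    }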

Shortest-path computation over large input data sets can be considered a big data problem. A number of distributed frameworks exist for this class of problem, each with its own strengths and weaknesses. Apache Hadoop, the most popular big data framework, is ill-suited to solving these kinds of graph problems in a reasonable period of time, in part because it lacks in-memory partitioned data structures: an iterative algorithm must re-read its working data from disk on every pass.

Objectives

The objectives for this project are two-fold. The first objective is the development of a distributed Floyd-Warshall algorithm that uses the primitives of an existing distributed big data framework. Primitives are the fundamental programming entities used to define the algorithm, such as map and reduce in Apache Hadoop. The algorithm may depend largely on the underlying model of the distributed framework; however, similarities exist between many big data frameworks, such as the use of key/value pairs as inter-process messages and partitioned key spaces. Standalone distributed algorithms exist, but the author has not found any implementation that uses an existing big data framework. See, for example, T. Srinivasan et al., "A Scalable Parallelization of All-Pairs Shortest Path Algorithm for a High Performance Cluster Environment".
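As a framework-neutral illustration of how the update might map onto such primitives, the sketch below expresses one iteration k of Floyd-Warshall over key/value pairs: each worker holds a partition of matrix rows keyed by vertex id, row k is broadcast (or joined) to every partition, and each local row is then updated independently. All names here are hypothetical; an actual implementation would use the map/reduce or transform primitives of the chosen framework.

    import java.util.Map;

    /** One Floyd-Warshall round, phrased over key/value pairs.
     *  localRows: the rows (keyed by vertex id) held by this partition.
     *  rowK: row k of the distance matrix, broadcast to every partition.
     *  Each row i is updated independently of the others, which is what
     *  makes the round parallelizable once rowK is available. */
    static void roundK(Map<Integer, double[]> localRows, double[] rowK, int k) {
        for (Map.Entry<Integer, double[]> entry : localRows.entrySet()) {
            double[] rowI = entry.getValue();
            double distIK = rowI[k];
            if (distIK == Double.POSITIVE_INFINITY) continue; // no path i -> k
            for (int j = 0; j < rowI.length; j++) {
                double throughK = distIK + rowK[j];
                if (throughK < rowI[j]) {
                    rowI[j] = throughK;
                }
            }
        }
    }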

The second objective is the implementation and execution of this algorithm using HAMR. HAMR is a big data framework developed at E.T. International, a company founded by Guang Gao, Distinguished Professor of Electrical and Computer Engineering at the University of Delaware. HAMR uses transforms and key/value stores as its data processing and storage primitives. It is currently in beta, so some difficulty is expected due to the documentation and an API that have not yet been thoroughly exercised by external users.

If there is sufficient interest in the project, students may endeavor to compare the implementation and execution of HAMR to other big data frameworks, such as Hadoop, Spark (or GraphX), GraphLab, Pregel, or others. Team members could collaborate to develop the distributed algorithm and then work independently to implement it within a particular framework of their choice. However, we ask that one of the frameworks be HAMR. Furthermore, since HAMR is in beta, we ask for an opportunity to incorporate feedback before any results are published.

Technical Details

The input file format could simply be a list of edges with weights, one edge per line of text. Several publicly available tools exist to generate random undirected graphs given an average degree. Each vertex should be represented by a positive integer. For simplicity, the weights should also be positive, which avoids negative cycles. For example, an edge from vertex 7 to vertex 42 with weight 3.14 could look like: 7 3.14 42. Since the graph is undirected, the student will have to consider whether the input is symmetric. That is, if an edge from a to b exists, should an identical edge from b to a also exist, or is it implied? Alternatively, ETI can provide input files.
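As a sketch of the input handling, the hypothetical Java reader below parses the "vertex weight vertex" line format described above into an in-memory distance matrix. It assumes vertex ids are 1..n (mapped to 0-based indices) and that each undirected edge appears only once in the file, so both directions are set explicitly; a real implementation at scale would instead load edges into a partitioned structure.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;

    /** Reads lines of the form "7 3.14 42" (source, weight, destination)
     *  into an n x n distance matrix. Assumes 1-based vertex ids in the
     *  file and mirrors each edge because the graph is undirected. */
    static double[][] readEdgeList(String path, int n) throws IOException {
        double[][] dist = new double[n][n];
        for (double[] row : dist) Arrays.fill(row, Double.POSITIVE_INFINITY);
        for (int v = 0; v < n; v++) dist[v][v] = 0.0;
        try (BufferedReader in = Files.newBufferedReader(Paths.get(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length != 3) continue; // skip blank or malformed lines
                int a = Integer.parseInt(parts[0]) - 1;   // 1-based id -> 0-based index
                double w = Double.parseDouble(parts[1]);
                int b = Integer.parseInt(parts[2]) - 1;
                dist[a][b] = Math.min(dist[a][b], w);
                dist[b][a] = Math.min(dist[b][a], w);     // undirected: mirror the edge
            }
        }
        return dist;
    }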

The output file format should be an NxN matrix giving the minimal distance from each vertex to every other vertex.

A cluster will be required to test the implementation(s). A minimum of four nodes is recommended.

All development will be in Java.

Each implementation should be verified against an implementation of the sequential Floyd-Warshall algorithm.
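One minimal way to do this, sketched below: run the sequential algorithm (see Background) on a small random graph and compare the resulting matrices entry by entry, with a small tolerance for floating-point rounding. This assumes both results fit in memory; for very large inputs, verification would necessarily be restricted to small test graphs or to a sample of entries.

    /** Compares a distributed result against the sequential baseline.
     *  A small epsilon absorbs floating-point rounding differences that
     *  can arise when additions happen in a different order. */
    static boolean matricesMatch(double[][] expected, double[][] actual, double eps) {
        for (int i = 0; i < expected.length; i++) {
            for (int j = 0; j < expected[i].length; j++) {
                double a = expected[i][j], b = actual[i][j];
                if (a == Double.POSITIVE_INFINITY && b == Double.POSITIVE_INFINITY) continue;
                if (Math.abs(a - b) > eps) return false;
            }
        }
        return true;
    }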

Deliverables

The team is expected to deliver all source code, a brief writeup indicating how the implementation(s) were verified, and another brief writeup describing the team's experience with HAMR while implementing this algorithm. If the algorithm is also implemented using other frameworks, then the writeup should include comparison information: the relative ease of implementation, the difficulties encountered with each version, and the relative performance.