Brief announcement: speedups for parallel graph triconnectivity

Better speedups using simpler parallel programming for graph connectivity and biconnectivity

2012

Speedups are demonstrated for finding the biconnected components of a graph: 9x to 33x on the Explicit Multi-Threading (XMT) many-core computing platform relative to the best serial algorithm, using a relatively modest silicon budget. Further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two recent NVIDIA GPUs of similar or greater silicon area. Previous studies of parallel biconnectivity algorithms achieved at most a 4x speedup, but we could not find biconnectivity code for GPUs to compare against. Ease-of-programming: The paper suggests that parallel programming for the XMT platform is considerably simpler than for the SMP and GPU platforms. Unlike the quantitative speedup results, the ease-of-programming comparison is more qualitative. Productivity of parallel programming is a central interest of PMAM/PPoPP, strongly favoring ease-of-programming. We believe that the discussion is on par with the state of the art on this relatively underexplored interest. The results provide new insights into the synergy between algorithms, the practice of parallel programming, and architecture: (1) no single biconnectivity algorithm is dominant for all inputs; (2) XMT provides good performance for each algorithm and better speedups relative to other platforms; (3) the textbook (TV) PRAM algorithm was the only one that provided strong speedups on XMT across all inputs considered; and (4) the TV implementation was a direct implementation of a PRAM algorithm, though a nontrivial effort was needed to get a PRAM version with lower constant factors. Overall, it appears that previous low speedups on other platforms were not caused by inefficient algorithms or their programming. Instead, the better speedups on XMT stem from the better match between the algorithms and the XMT platform.
Given the growing interest in adding architectural support for parallel programming to existing multi-cores, our results suggest the following open question: can such added architectural support catch up on speedups and ease-of-programming with...
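
For orientation, the kind of serial baseline that parallel biconnectivity speedups are measured against can be sketched as a depth-first search that tracks discovery times and low-points, in the style of the classic Hopcroft-Tarjan method. This is only an illustrative sketch: the abstract does not say which serial implementation was used, the function name `biconnected_components` is hypothetical, and the code assumes a simple undirected graph (no self-loops or parallel edges) small enough for Python recursion.

```python
def biconnected_components(n, edges):
    """Return the biconnected components of a simple undirected graph.

    Vertices are 0..n-1. Each component is a set of edges, with every
    edge stored as a (smaller, larger) vertex pair.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    disc = [0] * n   # discovery time; 0 means "not visited yet"
    low = [0] * n    # smallest discovery time reachable from the subtree
    edge_stack = []  # edges seen since the last component was emitted
    components = []
    timer = 0

    def dfs(u, parent):
        nonlocal timer
        timer += 1
        disc[u] = low[u] = timer
        for v in adj[u]:
            if v == parent:          # assumes no parallel edges
                continue
            if disc[v] == 0:         # tree edge
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:
                    # u separates v's subtree: pop one whole component
                    comp = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.add((min(a, b), max(a, b)))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif disc[v] < disc[u]:  # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if disc[s] == 0:
            dfs(s, -1)
    return components
```

For example, two triangles that share vertex 2 come back as two components, with vertex 2 acting as the articulation point between them.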

A simple algorithm for triconnectivity of a multigraph

2009

Vertex-connectivity and edge-connectivity represent the extent to which a graph is connected. The study of these key properties of graphs plays an important role in a variety of computer science applications. Recent years have witnessed a number of linear-time 3-edge-connectivity algorithms of increasing simplicity. In contrast, the state-of-the-art algorithm for 3-vertex-connectivity, due to Hopcroft and Tarjan, lacks simplicity in terms of both ease of implementation and the number of passes over the graph, although its time and space complexity is theoretically linear. In this paper, we propose a linear-time reduction from 3-vertex-connectivity to 3-edge-connectivity of a multigraph. This reduction was previously unknown, while the reduction in the opposite direction already exists. We apply an existing linear-time 3-edge-connectivity algorithm to the reduced graph to solve the 3-vertex-connectivity of the original graph. Hence, for a graph with |V| vertices and |E| edg...

An Experimental Study of Parallel Biconnected Components Algorithms on Symmetric Multiprocessors (SMPs)

We present an experimental study of parallel biconnected components algorithms employing several fundamental parallel primitives, e.g., prefix sum, list ranking, sorting, connectivity, spanning tree, and tree computations. Previous experimental studies of these primitives demonstrate reasonable parallel speedups. However, when these algorithms are used as subroutines to solve higher-level problems, there are two factors that hinder fast parallel implementations. One is parallel overhead, i.e., the large constant factors hidden in the asymptotic bounds; the other is the discrepancy among the data structures used in the primitives that brings non-negligible conversion cost. We present various optimization techniques and a new parallel algorithm that significantly improve the performance of finding biconnected components of a graph on symmetric multiprocessors (SMPs). Finding biconnected components has application in fault-tolerant network design, and is also used in graph planarity testing. Our parallel implementation achieves speedups up to 4 using 12 processors on a Sun E4500 for large, sparse graphs,

A simple and practical linear-work parallel algorithm for connectivity

Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures, 2014

Graph connectivity is a fundamental problem in computer science with many applications. Sequentially, connectivity can be computed easily using a simple breadth-first search or depth-first search in linear work. There have been many parallel algorithms for connectivity. However, the simpler parallel algorithms require superlinear work, and the linear-work polylogarithmic-depth parallel algorithms are very complicated and not amenable to implementation. In this work, we address this gap by describing a simple and practical expected linear-work, polylogarithmic-depth parallel algorithm for graph connectivity. Our algorithm is based on a recent parallel algorithm for generating low-diameter graph decompositions by Miller et al. [42], which uses parallel breadth-first searches. We discuss a (modest) variant of their decomposition algorithm which preserves the theoretical complexity of the algorithm while leading to a simpler and faster implementation. We experimentally compare the connectivity algorithms using both the original decomposition algorithm and our modified decomposition algorithm. We also experimentally compare against the fastest existing parallel connectivity implementations and show that we are competitive for large input graphs (0.88-1.41 times faster on a 40-core machine). In addition, we compare our algorithms to the fastest sequential connectivity algorithm, and show that we achieve 9-19 times speedup relative to the sequential implementation on 40 cores. We discuss the various optimizations used in our algorithms and present extensive experimental analysis of the performance of our algorithms. Our algorithm is the first parallel connectivity algorithm that is both theoretically and practically efficient.
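
The sequential baseline mentioned in this abstract, connectivity by breadth-first search in linear work, can be sketched as follows. This is a minimal illustration and not the paper's code: the function name `connected_components` and the edge-list input format are assumptions made for the example.

```python
from collections import deque


def connected_components(n, edges):
    """Label each vertex 0..n-1 with a component id using repeated BFS.

    Every vertex and edge is touched a constant number of times, so the
    total work is O(n + m).
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    label = [-1] * n  # -1 means "not yet assigned to a component"
    next_id = 0
    for source in range(n):
        if label[source] != -1:
            continue
        # Start a new BFS from the first unlabeled vertex.
        label[source] = next_id
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if label[v] == -1:
                    label[v] = next_id
                    queue.append(v)
        next_id += 1
    return label
```

Two vertices are connected exactly when they receive the same label, so a connectivity query after labeling is a single comparison.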

Techniques for Designing Efficient Parallel Graph Algorithms for SMPs and Multicore Processors

Graph problems are finding increasing applications in high performance computing disciplines. Although many regular problems can be solved efficiently in parallel, obtaining efficient implementations for irregular graph problems remains a challenge. We propose techniques for designing and implementing efficient parallel algorithms for graph problems on symmetric multiprocessors and chip multiprocessors with a case study of parallel tree and connectivity algorithms. The problems we study represent a wide range of irregular problems that have fast theoretic parallel algorithms but no known efficient parallel implementations that achieve speedup without seriously restricting assumptions about the inputs. We believe our techniques will be of practical impact in solving large-scale graph problems.

A new parallel algorithm for connected components in dynamic graphs

20th Annual International Conference on High Performance Computing, 2013

Social networks, communication networks, business intelligence databases, and large scientific data sources now contain hundreds of millions of elements with billions of relationships. The relationships in these massive datasets are changing at ever-faster rates. By representing these datasets as dynamic and semantic graphs of vertices and edges, it is possible to characterize the structure of the relationships and to quickly respond to queries about how the elements in the set are connected. Statically computing analytics on snapshots of these dynamic graphs is frequently not fast enough to provide current and accurate information as the graph changes. This has led to the development of dynamic graph algorithms that can maintain analytic information without resorting to full static recomputation.

Efficient Parallel Graph Algorithms for Coarse-Grained Multicomputers and BSP

Algorithmica, 2002

In this paper, we present deterministic parallel algorithms for the coarse grained multicomputer (CGM) and bulk-synchronous parallel computer (BSP) models which solve the following well-known graph problems: (1) list ranking, (2) Euler tour construction, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, (7) 2-edge connectivity and biconnectivity (testing and component computation), and (8) chordal graph recognition (finding a perfect elimination ordering). The algorithms for Problems 1-7 require O(log p) communication rounds and linear sequential work per round. Our results for Problems 1 and 2 hold for arbitrary ratios n/p, i.e. they are fully scalable, and for Problems 3-8 it is assumed that n/p ≥ p^ε, ε > 0, which is true for all commercially available multiprocessors. We view the algorithms presented as an important step towards the final goal of O(1) communication rounds. Note that the number of communication rounds obtained in this paper is independent of n and grows only very slowly with respect to p. Hence, for most practical purposes, the number of communication rounds can be considered as constant. The result for Problem 1 is a considerable improvement over those previously reported. The algorithms for Problems 2-7 are the first practically relevant deterministic parallel algorithms for these problems to be used for commercially available coarse grained parallel machines. Research partially supported by the Natural Sciences and Engineering Research Council of Canada, FAPESP (Brasil), CNPq (Brasil), PROTEM-2-TCPAC (Brasil), the Commission of the European Communities (ESPRIT Long Term Research Project 20244, ALCOM-IT), DFG-SFB 376 "Massive Parallelität" (Germany), and the Région Rhône-Alpes (France).

Large Graph Algorithms for Massively Multithreaded Architectures

2009

Graphics Processing Units (GPUs) provide high computation power at a low cost and are an important class of compute accelerator with a massively multithreaded architecture. In this paper, we present fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model. Our implementations exhibit high performance, especially on large graphs. We use two data-parallel programming methodologies for these algorithms. One is an iterative, mask-based approach that processes valid data elements like vertices and edges using independent threads for each. The other is a divide-and-conquer approach that reduces the problem into smaller problems that are handled later using the same approach. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The massively mul...

Graph Reachability on Parallel Many-Core Architectures

Computation, 2020

Many modern applications are modeled using graphs of some kind. Given a graph, reachability, that is, discovering whether there is a path between two given nodes, is a fundamental problem as well as one of the most important steps of many other algorithms. The rapid accumulation of very large graphs (up to tens of millions of vertices and edges) from a diversity of disciplines demands efficient and scalable solutions to the reachability problem. General-purpose computing has been successfully used on Graphics Processing Units (GPUs) to parallelize algorithms that present a high degree of regularity. In this paper, we extend the applicability of GPU processing to graph-based manipulation, by re-designing a simple but efficient state-of-the-art graph-labeling method, namely the GRAIL (Graph Reachability Indexing via RAndomized Interval) algorithm, to many-core CUDA-based GPUs. This algorithm first generates a label for each vertex of the graph, then it exploits these labels to answer...
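
The two phases named in this abstract, labeling each vertex and then using the labels to answer queries, can be illustrated with a small sequential sketch of interval labeling on a directed acyclic graph. This is a sketch under stated assumptions rather than the paper's GPU method: it runs on the CPU, uses a single randomized traversal, assumes the input is a DAG given as adjacency lists, and the names `grail_labels` and `reachable` are hypothetical.

```python
import random


def grail_labels(n, adj, seed=0):
    """Assign each vertex an interval [start, rank] from one randomized
    post-order traversal of a DAG (adj[u] lists the successors of u).

    If u reaches v, then v's interval is contained in u's interval.
    """
    rng = random.Random(seed)
    start = [0] * n
    rank = [0] * n
    visited = [False] * n
    counter = 0

    def visit(u):
        nonlocal counter
        visited[u] = True
        children = list(adj[u])
        rng.shuffle(children)  # randomized visiting order
        lowest = None
        for c in children:
            if not visited[c]:
                visit(c)
            lowest = start[c] if lowest is None else min(lowest, start[c])
        counter += 1
        rank[u] = counter      # post-order number
        start[u] = counter if lowest is None else min(lowest, counter)

    order = list(range(n))
    rng.shuffle(order)
    for u in order:
        if not visited[u]:
            visit(u)
    return start, rank


def reachable(u, v, adj, start, rank):
    """Answer "is there a path from u to v?" using the labels as a filter."""

    def contains(a, b):
        return start[a] <= start[b] and rank[b] <= rank[a]

    if not contains(u, v):
        return False           # labels prove that v is unreachable
    # Labels are inconclusive: fall back to a search pruned by the labels.
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for c in adj[x]:
            if c not in seen and contains(c, v):
                seen.add(c)
                stack.append(c)
    return False
```

Interval containment can only rule reachability out, never confirm it, which is why the query falls back to a pruned search when the labels are inconclusive.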