Accelerating Spectral Graph Analysis Through Wavefronts of Linear Algebra Operations
Related papers
Fast Spectral Graph Layout on Multicore Platforms
49th International Conference on Parallel Processing - ICPP, 2020
We present ParHDE, a shared-memory parallelization of the High-Dimensional Embedding (HDE) graph algorithm. Originally proposed as a graph drawing algorithm, HDE characterizes the global structure of a graph and is closely related to spectral graph computations such as computing the eigenvectors of the graph Laplacian. We identify compute- and memory-intensive steps in HDE and parallelize these steps for efficient execution on shared-memory multicore platforms. ParHDE can process graphs with billions of edges in minutes, is up to 18× faster than a prior parallel implementation of HDE, and achieves up to a 24× relative speedup on a 28-core system. We also implement several extensions of ParHDE and demonstrate its utility in diverse graph computation-related applications. CCS Concepts: Human-centered computing → Graph drawings; Computing methodologies → Spectral methods; Theory of computation → Shared memory algorithms.
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
ACM Transactions on Mathematical Software
High-performance graph algorithms are difficult to implement on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based on sparse linear algebra, which allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of...
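As an illustration of the "building blocks based on sparse linear algebra" idea in the abstract above, breadth-first search can be written as repeated sparse matrix-vector products over the Boolean (OR, AND) semiring, with already-visited vertices masked out. The following is a minimal pure-Python sketch under that assumption; the function name `bfs_levels` and the dictionary-of-sets adjacency layout are hypothetical choices for this example and are not the GraphBLAS or GraphBLAST API.

```python
def bfs_levels(adj, source):
    """Return {vertex: BFS depth} for every vertex reachable from source.

    adj maps each vertex to the set of its out-neighbours, i.e. the
    nonzero column indices of that row of the adjacency matrix A.
    (Hypothetical layout chosen for this sketch.)
    """
    levels = {source: 0}
    frontier = {source}          # sparse Boolean vector: the current frontier
    depth = 0
    while frontier:
        depth += 1
        # Boolean semiring product: nxt[j] = OR over i of (frontier[i] AND A[i][j]).
        # Only rows in the frontier are touched, which exploits input sparsity.
        nxt = set()
        for i in frontier:
            nxt |= adj.get(i, set())
        # Mask: drop vertices that already have a level assigned.
        nxt.difference_update(levels)
        for v in nxt:
            levels[v] = depth
        frontier = nxt
    return levels


if __name__ == "__main__":
    graph = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
    print(bfs_levels(graph, 0))
```

Iterating over the frontier rows corresponds to a "push" traversal; a "pull" variant would instead scan the unvisited vertices and test their in-neighbours against the frontier.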
Transforming linear algebra libraries: From abstraction to parallelism
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010
We have built a body of evidence which shows that, given a mathematical specification of a dense linear algebra operation to be implemented, it is possible to mechanically derive families of algorithms and subsequently to mechanically translate these algorithms into high-performing code. In this paper, we add to this evidence by showing that the algorithms can be statically analyzed and translated into directed acyclic graphs (DAGs) of coarse-grained operations that are to be performed. DAGs naturally express parallelism, which we illustrate by representing the DAGs with the G graphical programming language used by LabVIEW. The LabVIEW compiler and runtime execution system then exploit parallelism from the resulting code. Respectable speedup on a sixteen core architecture is reported.
Design of a Large-Scale Hybrid-Parallel Graph Library
The focus of traditional scientific computing has been in solving large systems of PDEs (and the corresponding linear algebra problems that they induce). Hardware architectures, computer systems, and software platforms have evolved together to efficiently support solving these kinds of problems. Similar attention has not been devoted to solving large-scale graph problems. Recently this class of applications has seen increased attention. The irregular, nonlocal, and dynamic characteristics of these problems require new programming techniques to adapt them to modern HPC systems offering multiple levels of parallelism. We describe a library for implementing graph algorithms based on asynchronous execution of fine-grained, concurrent operations. Prototype implementations of two graph kernels which combine lightweight graph metadata transactions with generalized active messages demonstrate that it is possible to implement graph applications which efficiently leverage both shared- and distributed-memory parallelism.
Automating Wavefront Parallelization for Sparse Matrix Computations
SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, 2016
This paper presents a compiler and runtime framework for parallelizing sparse matrix computations that have loop-carried dependences. Our approach automatically generates a runtime inspector to collect data dependence information and achieves wavefront parallelization of the computation, where iterations within a wavefront execute in parallel, and synchronization is required across wavefronts. A key contribution of this paper involves dependence simplification, which reduces the time and space overhead of the inspector. This is implemented within a polyhedral compiler framework, extended for sparse matrix codes. Results demonstrate the feasibility of using automatically generated inspectors and executors to optimize ILU factorization and symmetric Gauss-Seidel relaxations, which are part of the Preconditioned Conjugate Gradient (PCG) computation. Our implementation achieves a median speedup of 2.97× on 12 cores over the reference sequential PCG implementation, significantly outperforms PCG parallelized using Intel's Math Kernel Library (MKL), and is within 6% of the median performance of manually-parallelized PCG.
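The inspector/executor pattern described in the abstract above can be sketched on a simpler kernel than ILU or Gauss-Seidel: a sparse lower-triangular solve. In this sketch the "inspector" reads the sparsity pattern and groups rows into wavefronts (level sets), and the "executor" processes one wavefront at a time. It is a hand-written illustration using classic level scheduling, not the compiler-generated code from the paper; the names `wavefronts` and `solve_lower` and the row-dictionary storage are assumptions made for this example, and the inner loop runs sequentially here even though its iterations are independent.

```python
def wavefronts(deps):
    """Inspector: group rows into wavefronts.

    deps[i] lists the earlier rows j < i whose results row i reads.
    Rows in the same wavefront do not depend on each other.
    """
    level = [0] * len(deps)
    for i, js in enumerate(deps):
        # A row sits one level past its deepest dependence (level 0 if none).
        level[i] = 1 + max((level[j] for j in js), default=-1)
    fronts = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lvl in enumerate(level):
        fronts[lvl].append(i)
    return fronts


def solve_lower(rows, diag, b):
    """Executor: solve L x = b for sparse lower-triangular L.

    rows[i] is a dict {j: L[i][j]} holding the strictly lower entries
    of row i, and diag[i] is L[i][i]. (Hypothetical storage layout.)
    """
    x = [0.0] * len(b)
    for front in wavefronts([list(r) for r in rows]):
        # Iterations within one wavefront are independent and could run
        # in parallel; a barrier would be needed between wavefronts.
        for i in front:
            x[i] = (b[i] - sum(v * x[j] for j, v in rows[i].items())) / diag[i]
    return x


if __name__ == "__main__":
    rows = [{}, {}, {0: 1.0}, {1: 2.0, 2: 1.0}, {0: 3.0}]
    print(wavefronts([list(r) for r in rows]))
    print(solve_lower(rows, [1.0, 2.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0, 5.0]))
```

The number of wavefronts bounds the number of synchronization points, so a pattern with few, wide wavefronts exposes more parallelism than a long dependence chain.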
P-HARP: a parallel dynamic spectral partitioner
1997
Computational science problems with adaptive meshes involve dynamic load balancing when implemented on parallel machines. This dynamic load balancing requires fast partitioning of computational meshes at run time. We present in this report a fast parallel dynamic partitioner, called S-HARP. The underlying principles of S-HARP are the fast feature of inertial partitioning and the quality feature of spectral partitioning. S-HARP partitions a graph from scratch, requiring no partition information from previous iterations. Two types of parallelism have been exploited in S-HARP, fine-grain loop-level parallelism and coarse-grain recursive parallelism. The parallel partitioner has been implemented in Message Passing Interface on Cray T3E and IBM SP2 for portability. Experimental results indicate that S-HARP can partition a mesh of over 100,000 vertices into 256 partitions in 0.2 seconds on a 64-processor Cray T3E. S-HARP is much more scalable than other dynamic partitioners, giving over 15-fold sp...
2018
We propose FFTX, a new framework for building high-performance FFT-based applications on exascale machines. Complex node architectures lead to multiple levels of parallelism and demand efficient ways of data communication. The current FFTW interface falls short in maximizing performance in such scenarios. FFTX is designed to enable application developers to leverage expert-level, automatic optimizations while navigating a familiar interface. FFTX is backward compatible with FFTW and extends the FFTW interface into an embedded Domain Specific Language (DSL) expressed as a library interface. By means of a SPIRAL-based back end, this enables build-time source-to-source translation and advanced performance optimizations, such as cross-library call optimizations, targeting of accelerators through offloading, and inlining of user-provided kernels. We demonstrate the use of FFTX with the prototypical example of 1D and 3D pruned convolutions and discuss future extensions. Keywords: FFT; exas...
Effecting parallel graph eigensolvers through library composition
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006
Many interesting problems in graph theory can be reduced to solving an eigenproblem of the adjacency matrix or Laplacian of a graph. Given the availability of high-quality linear algebra and graph libraries, one might expect that one could merely use a graph data structure within an eigensolver. However, conventional libraries are rigidly constructed, requiring conversion to library-specific data structures or using heavyweight abstraction methods that prevent efficient composition.
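To make the "eigenproblem of the Laplacian" in the abstract above concrete, the sketch below builds the combinatorial Laplacian L = D - A of a small undirected graph and checks two eigenpairs by direct multiplication. It uses dense Python lists purely for readability; the helper names `laplacian` and `matvec` are hypothetical and do not come from the paper or from any library.

```python
def laplacian(n, edges):
    """Dense combinatorial Laplacian L = D - A of an undirected graph.

    n is the number of vertices and edges is a list of (u, v) pairs
    with no self-loops or duplicates.
    """
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1      # degree contributions on the diagonal
        L[v][v] += 1
        L[u][v] -= 1      # minus the adjacency entries off the diagonal
        L[v][u] -= 1
    return L


def matvec(M, x):
    """Plain dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


if __name__ == "__main__":
    L = laplacian(3, [(0, 1), (1, 2)])       # path graph 0 - 1 - 2
    print(L)
    print(matvec(L, [1, 1, 1]))              # constant vector: eigenvalue 0
    print(matvec(L, [1, 0, -1]))             # eigenvalue 1 on this path graph
```

For this three-vertex path, the vector [1, 0, -1] belongs to the smallest nonzero eigenvalue; the signs of its entries place the two end vertices on opposite sides, which is the idea behind spectral partitioning.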
2011
We present a method for developing dense linear algebra algorithms that seamlessly scale to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA) that uses a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus it enables scaling of tile algorithms, originating in PLASMA, on large distributed memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms with heterogeneous multicore nodes: DAG representation that is independent of the problem size, automatic extraction of the communication from the dependencies, overlapping of communication and computation, task prioritization, and architecture-aware scheduling and management of tasks. The originality of this engine lies in its capacity to translate a sequential code with nested loops into a concise and synthetic format which can then be interpreted and executed in a distributed environment. We present three common dense linear algebra algorithms from PLASMA (Parallel Linear Algebra for Scalable Multi-core Architectures), namely Cholesky, LU, and QR factorizations, to investigate their data-driven expression and execution in a distributed system. We demonstrate through experimental results on the Cray XT5 Kraken system that our DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of the state-of-the-art distributed numerical software on current and emerging architectures.
The PRISM Project: Infrastructure and Algorithms for Parallel Eigensolvers
1997
The goal of the PRISM project is the development of infrastructure and algorithms for the parallel solution of eigenvalue problems. We are currently investigating a complete eigensolver based on the Invariant Subspace Decomposition Algorithm for dense symmetric matrices (SYISDA). After briefly reviewing SYISDA, we discuss the algorithmic highlights of a distributed-memory implementation of this approach. These include a fast matrix-matrix multiplication algorithm, a new approach to parallel band reduction and tridiagonalization, and a harness for coordinating the divide-and-conquer parallelism in the problem. We also present performance results of these kernels as well as the overall SYISDA implementation on the Intel Touchstone Delta prototype.