Anshul Gupta - Academia.edu
Papers by Anshul Gupta
Algorithms for frequent pattern mining, a popular informatics application, have unique requirements that are not met by any of the existing parallel tools. In particular, such applications operate on extremely large data sets and have irregular memory access patterns. For efficient parallelization of such applications, it is necessary to support dynamic load balancing along with scheduling mechanisms that allow users to exploit data locality. Given these requirements, task parallelism is the most promising of the available parallel programming models. However, existing solutions for task parallelism schedule tasks implicitly; hence, custom scheduling policies that can exploit data locality cannot be easily employed. In this paper, we demonstrate and characterize the speedup obtained in a frequent pattern mining application using a custom clustered scheduling policy in place of the popular Cilk-style policy. We present PFunc, a novel task parallel library whose customizable task sc...
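To make the idea of a locality-aware policy concrete, the following minimal Python sketch (my own illustration, not the PFunc API) tags each task with the data cluster it touches; a worker keeps drawing tasks from the cluster it last worked on and only switches clusters when that queue runs dry. The `ClusteredScheduler` class, the cluster keys, and the task names are all hypothetical.

```python
# Illustrative sketch of a "clustered" (locality-aware) scheduling policy.
# Not the PFunc API; all names here are invented for the example.
from collections import defaultdict, deque

class ClusteredScheduler:
    def __init__(self):
        self.queues = defaultdict(deque)   # cluster id -> pending tasks

    def submit(self, cluster, task):
        self.queues[cluster].append(task)

    def next_task(self, preferred_cluster):
        # Prefer the cluster whose data is likely still resident locally.
        q = self.queues.get(preferred_cluster)
        if q:
            return preferred_cluster, q.popleft()
        for cluster, q in self.queues.items():   # fall back to any other work
            if q:
                return cluster, q.popleft()
        return None, None

if __name__ == "__main__":
    sched = ClusteredScheduler()
    for cluster in ("A", "B"):
        for i in range(3):
            sched.submit(cluster, f"mine-subtree-{cluster}{i}")
    last = "A"
    while True:
        last, task = sched.next_task(last)
        if task is None:
            break
        print(f"run {task} (cluster {last})")
```

The contrast with a Cilk-style work-stealing policy is that tasks touching the same data cluster are drained consecutively by one worker, rather than being scattered across workers by random steals.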
Lecture Notes in Computer Science, 1999
We consider recursive algorithms for symmetric indefinite linear systems. First, the difficulties with the recursive formulation of the LAPACK SYSV algorithm (which implements the Bunch-Kaufman pivoting strategy) are discussed. Next, a recursive perturbation-based algorithm is proposed and tested. Our experiments show that the new algorithm can be about two times faster than the LAPACK algorithm while performing about the same number of flops.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009
SIAM Journal on Scientific Computing, 2010
Proceedings of the 2000 international symposium on Symbolic and algebraic computation, 2000
Journal of the ACM, 2015
Asynchronous methods for solving systems of linear equations have been researched since Chazan and Miranker's [1969] pioneering paper on chaotic relaxation. The underlying idea of asynchronous methods is to avoid processor idle time by allowing the processors to continue to make progress even if not all progress made by other processors has been communicated to them. Historically, the applicability of asynchronous methods for solving linear equations has been limited to certain restricted classes of matrices, such as diagonally dominant matrices. Furthermore, analysis of these methods focused on proving convergence in the limit. Comparison of the asynchronous convergence rate with its synchronous counterpart and its scaling with the number of processors have seldom been studied and are still not well understood. In this article, we propose a randomized shared-memory asynchronous method for general symmetric positive definite matrices. We rigorously analyze the convergence rate a...
Asynchronous methods for solving systems of linear equations have been researched since Chazan and Miranker's pioneering 1969 paper on chaotic relaxation. The underlying idea of asynchronous methods is to avoid processor idle time by allowing the processors to continue to make progress even if not all progress made by other processors has been communicated to them. Historically, the applicability of asynchronous methods for solving linear equations was limited to certain restricted classes of matrices, such as diagonally dominant matrices. Furthermore, analysis of these methods focused on proving convergence in the limit. Comparison of the asynchronous convergence rate with its synchronous counterpart and its scaling with the number of processors were seldom studied, and are still not well understood. In this paper, we propose a randomized shared-memory asynchronous method for general symmetric positive definite matrices. We rigorously analyze the convergence rate and prove that it is linear, and is close to that of the method's synchronous counterpart if the processor count is not excessive relative to the size and sparsity of the matrix. We also present an algorithm for unsymmetric systems and overdetermined least-squares. Our work presents a significant improvement in the applicability of asynchronous linear solvers as well as in their convergence analysis, and suggests randomization as a key paradigm to serve as a foundation for asynchronous methods.
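The Python sketch below illustrates the general flavor of a randomized shared-memory asynchronous solver under stated assumptions (dense matrix, coordinate-wise relaxation, threads sharing one iterate without locks). It is not the paper's algorithm or analysis; it is only a minimal runnable illustration of asynchronous randomized relaxation for a symmetric positive definite system.

```python
# Minimal sketch: shared-memory asynchronous randomized relaxation for an
# SPD system Ax = b. Each worker repeatedly picks a random coordinate i and
# relaxes x[i] using whatever (possibly stale) view of x it currently sees,
# without synchronizing with the other workers.
import threading
import numpy as np

def async_randomized_relaxation(A, b, num_threads=4, updates_per_thread=100_000):
    n = len(b)
    x = np.zeros(n)          # shared iterate; reads and writes may interleave

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(updates_per_thread):
            i = rng.integers(n)
            # Relaxation step: x_i <- (b_i - sum_{j != i} A_ij x_j) / A_ii,
            # computed from a possibly stale snapshot of x.
            x[i] += (b[i] - A[i] @ x) / A[i, i]

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)      # well-conditioned SPD test matrix
    b = rng.standard_normal(n)
    x = async_randomized_relaxation(A, b)
    print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```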
Lecture Notes in Computational Science and Engineering
2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012
Direct methods for solving sparse linear systems are robust and typically exhibit good performance, but often require large amounts of memory due to fill-in. Many industrial applications use out-of-core techniques to mitigate this problem. However, parallelizing sparse out-of-core solvers poses some unique challenges because accessing secondary storage introduces serialization and I/O overhead. We analyze the data-movement costs and memory versus parallelism trade-offs in a shared-memory parallel out-of-core linear solver for sparse symmetric systems. We propose an algorithm that uses a novel memory management scheme and adaptive task parallelism to reduce the data-movement costs. We present experiments to show that our solver is faster than existing out-of-core sparse solvers on a single core, and is more scalable than the only other known shared-memory parallel out-of-core solver. This work is also directly applicable at the node level in a distributed-memory parallel scenario.
2008 Eighth IEEE International Conference on Data Mining, 2008
Lecture Notes in Computer Science, 2001
During the past few years, algorithmic improvements alone have shaved almost an order of magnitude off the time required for the direct solution of general sparse systems of linear equations. Combined with a similar increase in the performance-to-cost ratio due to hardware advances during this period, current sparse solver technology makes it possible to quickly and easily solve problems that might have been considered impractically large until recently. In this paper, we compare the performance of some commonly used software packages for solving general sparse systems. In particular, we demonstrate the consistently high level of performance achieved by WSMP, the most recent of such solvers. We compare the various algorithmic components of these solvers and show that the choices made in WSMP enable it to run two to three times faster than the best among other similar solvers. As a result, WSMP can factor some of the largest sparse matrices available from real applications in a few seconds on a 4-CPU workstation.
Applied Optimization, 1997
Proceedings of the 5th international conference on Supercomputing - ICS '91, 1991
The scalability of a parallel algorithm on a parallel architecture is a measure of its capability to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup for that problem size. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and to motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying scalability issues, and discuss their interrelationships.
SIAM Journal on Matrix Analysis and Applications, 2002
We present algorithms for the symbolic and numerical factorization phases in the direct solution of sparse unsymmetric systems of linear equations. We have modified a classical symbolic factorization algorithm for unsymmetric matrices to inexpensively compute minimal elimination structures. We give an efficient algorithm to compute a near-minimal data-dependency graph that is valid irrespective of the amount of dynamic pivoting performed during numerical factorization. Finally, we describe an unsymmetric-pattern multifrontal algorithm for Gaussian elimination with partial pivoting that uses the task- and data-dependency graphs computed during the symbolic phase. These algorithms have been implemented in WSMP, an industrial-strength sparse solver package, and have enabled WSMP to significantly outperform other similar solvers. We present experimental results to demonstrate the merits of the new algorithms.
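As a hedged illustration of what a symbolic phase computes, the sketch below implements the classic elimination-tree construction for a matrix with a symmetric sparsity pattern. This is a simplification: the paper's algorithms target unsymmetric patterns and dynamic pivoting, and the function name and test pattern here are mine.

```python
# Hedged illustration (not the paper's unsymmetric algorithm): the classic
# elimination-tree computation for a symmetric sparsity pattern, shown only
# to give the flavor of symbolic analysis. parent[j] is the first row i > j
# in which column j of the Cholesky factor has a nonzero.
def elimination_tree(n, lower_pattern):
    """lower_pattern[i] = column indices j < i with A[i, j] != 0."""
    parent = [-1] * n
    ancestor = [-1] * n            # path-compressed ancestors for speed
    for i in range(n):
        for j in lower_pattern[i]:
            # Walk up from j to the root of its current subtree, linking to i.
            while j != -1 and j < i:
                next_j = ancestor[j]
                ancestor[j] = i    # path compression
                if next_j == -1:
                    parent[j] = i
                j = next_j
    return parent

if __name__ == "__main__":
    # Arrowhead-like pattern: row 4 is coupled to every earlier column.
    pattern = {0: [], 1: [0], 2: [], 3: [2], 4: [0, 1, 2, 3]}
    print(elimination_tree(5, pattern))   # -> [1, 4, 3, 4, -1]
```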
Journal of Parallel and Distributed Computing, 1994
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithm-architecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms that have been developed for studying scalability issues, and discuss their interrelationships. For example, we derive an important relationship between time-constrained scaling and the isoefficiency function. We point out some of the weaknesses of the existing schemes for measuring scalability, and discuss possible ways of extending them.
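For reference, the standard definitions behind isoefficiency analysis can be summarized as follows; the notation is mine, chosen to match common usage rather than quoted from the paper.

```latex
% W: problem size measured as serial work, T_P: parallel runtime on p
% processors, T_o: total overhead of the parallel system.
\[
  T_o(W, p) = p\,T_P - W, \qquad
  S = \frac{W}{T_P}, \qquad
  E = \frac{S}{p} = \frac{W}{W + T_o(W, p)}.
\]
% Holding the efficiency at a fixed value E_0 requires
\[
  W = \frac{E_0}{1 - E_0}\, T_o(W, p),
\]
% and the rate at which W must grow with p to maintain this relation is the
% isoefficiency function of the algorithm-architecture combination.
```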
Journal of Computational and Applied Mathematics, 2013
Incomplete LDL* factorizations sometimes produce an indefinite preconditioner even when the input matrix is Hermitian positive definite. The two most popular iterative solvers for symmetric systems, CG and MINRES, cannot use such preconditioners; they require a positive definite preconditioner. One approach that has been extensively studied to address this problem is to force positive definiteness by modifying the factorization process. We explore a different approach: use the incomplete factorization with a Krylov method that can accept an indefinite preconditioner. The conventional wisdom has been that long-recurrence methods (like GMRES), or alternatively non-optimal short-recurrence methods (like symmetric QMR and BiCGStab), must be used if the preconditioner is indefinite. We explore the performance of these methods when used with an incomplete factorization, but also explore a lesser-known Krylov method called PCG-ODIR that is both optimal and uses a short recurrence, and can use an indefinite preconditioner. Furthermore, we propose another optimal short-recurrence method called IP-MINRES that can use an indefinite preconditioner, and a variant of PCG-ODIR, which we call IP-CG, that is more numerically stable and usually requires fewer iterations.
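The reason an indefinite preconditioner is problematic for CG and MINRES can be sketched as follows; this is the standard argument, summarized in my own notation rather than quoted from the paper.

```latex
% With a preconditioner M = L D L^*, PCG is mathematically equivalent to CG
% applied to the split-preconditioned system, and its residuals are
% orthogonal in the M^{-1}-inner product:
\[
  \hat{A}\,y = \hat{b}, \qquad
  \hat{A} = M^{-1/2} A M^{-1/2}, \quad x = M^{-1/2} y,
  \qquad
  \langle r_i, r_j \rangle_{M^{-1}} = r_i^{*} M^{-1} r_j = 0 \ \ (i \neq j).
\]
% If D (and hence M) is indefinite, M^{-1/2} does not exist as a real matrix
% and \langle \cdot, \cdot \rangle_{M^{-1}} is no longer an inner product, so
% the short-recurrence optimality of CG and MINRES breaks down -- the
% situation addressed by PCG-ODIR, IP-MINRES, and IP-CG.
```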
IEEE Transactions on Parallel and Distributed Systems, 1995
IEEE Transactions on Parallel and Distributed Systems, 1997
IBM Journal of Research and Development, 1997
Graph partitioning is a fundamental problem in several scientific and engineering applications. In this paper, we describe heuristics that improve the state-of-the-art practical algorithms used in graph-partitioning software in terms of both partitioning speed and quality. An important use of graph partitioning is in ordering sparse matrices for obtaining direct solutions to sparse systems of linear equations arising in engineering and optimization applications. The experiments reported in this paper show that the use of these heuristics results in a considerable improvement in the quality of sparse-matrix orderings over conventional ordering methods, especially for sparse matrices arising in linear programming problems. In addition, our graph-partitioning-based ordering algorithm is more parallelizable than minimum-degree-based ordering algorithms, and it renders the ordered matrix more amenable to parallel factorization.
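To illustrate how a partitioning-based fill-reducing ordering is typically organized, here is a hedged Python sketch of a nested-dissection-style ordering. A real implementation would call a graph partitioner that produces small vertex separators; the crude BFS-based bisection below is only a stand-in to keep the example self-contained, and all function names are mine.

```python
# Illustrative sketch: nested-dissection-style ordering. The BFS bisection is
# a placeholder for a real graph partitioner (e.g., METIS); it is used here
# purely to make the example runnable.
from collections import deque

def bfs_bisect(adj, nodes):
    """Split `nodes` roughly in half by BFS order and return (part_a, part_b)."""
    nodes = list(nodes)
    node_set = set(nodes)
    seen, order = {nodes[0]}, []
    q = deque([nodes[0]])
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            if v in node_set and v not in seen:
                seen.add(v)
                q.append(v)
    order += [u for u in nodes if u not in seen]   # disconnected leftovers
    half = len(order) // 2
    return set(order[:half]), set(order[half:])

def vertex_separator(adj, part_a, part_b):
    """Move boundary vertices of part_a into a separator set."""
    sep = {u for u in part_a if any(v in part_b for v in adj[u])}
    return part_a - sep, part_b, sep

def nested_dissection(adj, nodes=None, cutoff=8):
    """Elimination order: recurse on the two halves, number the separator
    vertices last so that their fill is confined to a small trailing block."""
    if nodes is None:
        nodes = set(adj)
    if len(nodes) <= cutoff:
        return sorted(nodes)
    a, b = bfs_bisect(adj, nodes)
    a, b, sep = vertex_separator(adj, a, b)
    if not a or not b:                 # bisection failed; stop recursing
        return sorted(nodes)
    return nested_dissection(adj, a, cutoff) + \
           nested_dissection(adj, b, cutoff) + sorted(sep)

if __name__ == "__main__":
    # 5-point grid graph on a 6x6 mesh as a small test problem.
    n = 6
    idx = lambda i, j: i * n + j
    adj = {idx(i, j): [] for i in range(n) for j in range(n)}
    for i in range(n):
        for j in range(n):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if 0 <= i + di < n and 0 <= j + dj < n:
                    adj[idx(i, j)].append(idx(i + di, j + dj))
    print(nested_dissection(adj))
```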