Michelle Strout | Colorado State University

Papers by Michelle Strout

Transforming loop chains via macro dataflow graphs

This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in a stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e., loop chains. The performance of stencil applications can be improved by modifying their execution schedules. However, modern architectures are increasingly constrained by memory subsystem bandwidth, so to fully realize the benefits of schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs, such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark; our fastest variant outperforms the auto-tuned variants produced by both frameworks.

CCS Concepts: • Computing methodologies → Parallel computing methodologies; Parallel algorithms; • Software and its engineering → Compilers; Macro languages.
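
As a rough illustration of the dataset-node idea, here is a minimal Python sketch (hypothetical names throughout, not the paper's implementation or cost model) that builds a tiny macro dataflow graph for a two-stage blur and fuses the two loop nodes across their shared intermediate dataset:

```python
# A minimal sketch of a macro dataflow graph with explicit dataset nodes,
# plus a fusion transformation that merges two loop nodes sharing an
# intermediate dataset. All names here are hypothetical illustrations.

class Node:
    def __init__(self, name, kind):
        self.name, self.kind = name, kind   # kind: "loop" or "dataset"

class MacroDataflowGraph:
    def __init__(self):
        self.nodes, self.edges = {}, set()  # edges: (producer, consumer)

    def add(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def connect(self, src, dst):
        self.edges.add((src, dst))

    def fuse(self, loop_a, loop_b, via_dataset):
        """Fuse loop_b into loop_a when they communicate via via_dataset;
        the intermediate dataset can then shrink to a small temporary."""
        fused = loop_a + "+" + loop_b
        self.add(fused, "loop")
        for (s, d) in list(self.edges):     # rewire external edges
            if s in (loop_a, loop_b) and d != via_dataset:
                self.edges.add((fused, d))
            if d in (loop_a, loop_b) and s != via_dataset:
                self.edges.add((s, fused))
        for n in (loop_a, loop_b, via_dataset):   # drop fused/internal nodes
            self.nodes.pop(n)
            self.edges = {(s, d) for (s, d) in self.edges
                          if s != n and d != n}
        return fused

g = MacroDataflowGraph()
for name, kind in [("in", "dataset"), ("blurx", "loop"), ("tmp", "dataset"),
                   ("blury", "loop"), ("out", "dataset")]:
    g.add(name, kind)
g.connect("in", "blurx"); g.connect("blurx", "tmp")
g.connect("tmp", "blury"); g.connect("blury", "out")
g.fuse("blurx", "blury", via_dataset="tmp")
print(sorted(g.edges))   # [('blurx+blury', 'out'), ('in', 'blurx+blury')]
```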

Set and Relation Manipulation for the Sparse Polyhedral Framework

Lecture Notes in Computer Science, 2013

The Sparse Polyhedral Framework (SPF) extends the Polyhedral Model by using the uninterpreted function call abstraction for the compile-time specification of run-time reordering transformations, such as loop and data reordering and sparse tiling approaches that schedule irregular sets of iterations across loops. The Polyhedral Model represents sets of iteration points in imperfectly nested loops with unions of polyhedra and represents loop transformations with affine functions applied to those polyhedra. Existing tools such as ISL, CLooG, and Omega manipulate polyhedral sets and affine functions; however, the ability to represent sets and functions whose constraints include uninterpreted function calls, such as those needed in the SPF, is nonexistent or severely restricted. This paper presents algorithms for manipulating sets and relations with uninterpreted function symbols to enable the Sparse Polyhedral Framework. The algorithms have been implemented in an open-source C++ library called IEGenLib (the Inspector/Executor Generator Library).
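
To make the uninterpreted-function idea concrete, the following toy Python enumerator (illustrative only; IEGenLib manipulates such sets symbolically at compile time, and this is not its API) treats rowptr and col as functions whose interpretation only arrives at run time:

```python
# Iteration-space constraints may contain uninterpreted function calls
# (here rowptr(i) and col(j)) whose values are only known at run time via
# index arrays. This toy enumerator is purely illustrative; the SPF tools
# reason about such sets at compile time rather than enumerating them.

def enumerate_set(n_rows, rowptr, col):
    """Points of {[i,j] : 0 <= i < n_rows and rowptr(i) <= j < rowptr(i+1)}.
    rowptr and col play the role of uninterpreted functions whose
    interpretation arrives at run time (a CSR matrix)."""
    for i in range(n_rows):
        for j in range(rowptr[i], rowptr[i + 1]):
            yield (i, j, col[j])   # col(j) resolved only now

# CSR index arrays for a 3x3 matrix with 4 nonzeros (hypothetical data)
rowptr = [0, 2, 3, 4]
col    = [0, 2, 1, 2]
print(list(enumerate_set(3, rowptr, col)))
```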

Sparse computation data dependence simplification for efficient compiler-generated inspectors

This paper presents a combined compile-time and runtime loop-carried dependence analysis of sparse matrix codes and evaluates its performance in the context of wavefront parallelism. Sparse computations incorporate indirect memory accesses such as x[col[j]], whose memory locations cannot be determined until runtime. The key contributions of this paper are two compile-time techniques for significantly reducing the overhead of runtime dependence testing: (1) identifying new equality constraints that result in more efficient runtime inspectors, and (2) identifying subset relations between dependence constraints such that one dependence test subsumes another, which can therefore be eliminated. Discovery of new equality constraints is enabled by taking advantage of domain-specific knowledge about index arrays, such as col[j]. These simplifications lead to automatically generated inspectors that make it practical to parallelize such computations. We analyze our simplification methods for a collection...
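
A minimal sketch of the kind of runtime inspector such an analysis enables, assuming a lower-triangular CSR matrix (illustrative, not the compiler-generated inspector from the paper): each iteration i reads x[col[j]], so the inspector assigns wavefront levels, and iterations sharing a level can run in parallel:

```python
# Runtime inspector sketch for wavefront parallelism. For a sparse
# lower-triangular solve, iteration i reads x[col[j]] for the off-diagonal
# nonzeros of row i, so i must wait for those earlier iterations. The
# inspector assigns each iteration a wavefront level; iterations in the
# same level are independent and can run in parallel.

def wavefront_levels(n, rowptr, col):
    level = [0] * n
    for i in range(n):                      # CSR rows in order
        for j in range(rowptr[i], rowptr[i + 1]):
            if col[j] < i:                  # loop-carried dependence
                level[i] = max(level[i], level[col[j]] + 1)
    return level

rowptr = [0, 1, 3, 5, 8]                    # hypothetical 4x4 lower-tri CSR
col    = [0, 0, 1, 1, 2, 0, 2, 3]
print(wavefront_levels(4, rowptr, col))     # -> [0, 1, 2, 3]
```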

Parameterized Diamond Tiling for Stencil Computations with Chapel parallel iterators

Stencil computations figure prominently in the core kernels of many scientific computations, such as partial differential equation solvers. Parallel scaling of stencil computations can be significantly improved on multicore processors using advanced tiling techniques that include the time dimension, such as diamond tiling. Such techniques are difficult to include in general-purpose optimizing compilers because of the need for interprocedural pointer and array data-flow analysis, plus the need to tune scheduling strategies and tile size parameters for each pairing of stencil computation and machine. Since a fully automatic solution is problematic, we propose to provide parameterized space and time tiling iterators through libraries. Ideally, the execution schedule or tiling code will be expressed orthogonally to the computation. This supports code reuse, easier tuning, and improved programmer productivity. Chapel iterators provide this capability implicitly. We present an advanced, parameterized tiling approach that we have implemented using Chapel parallel iterators. We show how such iterators can be used by programmers in stencil computations with multiple spatial dimensions. We also demonstrate that these new iterators provide better scaling than a traditional data-parallel schedule.
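
The following serial Python analogue sketches the schedule-as-iterator idea (no diamond tiling or parallelism here; the tile width w is a hypothetical parameter): the stencil computation is written once, and swapping the iterator changes the traversal without touching it:

```python
# The computation below is fixed; the schedule is a separate iterator,
# the role Chapel parallel iterators play in the paper. Illustrative only.

def rowmajor(T, N):
    for t in range(T):
        for i in range(1, N - 1):
            yield t, i

def space_tiled(T, N, w):
    """Per time step, visit the spatial dimension in tiles of width w."""
    for t in range(T):
        for lo in range(1, N - 1, w):
            for i in range(lo, min(lo + w, N - 1)):
                yield t, i

def jacobi(schedule, T, N):
    a = [[float(i) for i in range(N)] for _ in range(T + 1)]
    for t, i in schedule:                   # schedule chosen by the caller
        a[t + 1][i] = (a[t][i - 1] + a[t][i] + a[t][i + 1]) / 3.0
    return a[T]

# Both schedules visit the same iteration set, so results are identical.
assert jacobi(rowmajor(8, 32), 8, 32) == jacobi(space_tiled(8, 32, 5), 8, 32)
print("schedules agree")
```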

Compile-time composition of run-time data and iteration reorderings

SIGPLAN Notices, May 9, 2003

The Sparse Polyhedral Framework: Composing Compiler-Generated Inspector-Executor Code

Proceedings of the IEEE, Nov 1, 2018

Irregular applications such as big graph analysis, material simulations, molecular dynamics simulations, and finite element analysis have performance problems due to their use of sparse data structures. Inspector-executor strategies improve sparse computation performance through parallelization and data locality optimizations. An inspector reschedules and reorders data at runtime, and an executor is a transformed version of the original computation that uses the newly reorganized schedules and data structures. Inspector-executor transformations are commonly written in a domain-specific or even application-specific fashion. Significant progress has been made in incorporating such inspector-executor transformations into existing compiler transformation frameworks, thus enabling their use with compile-time transformations. However, composing inspector-executor transformations in a general way has only been done in the context of the Sparse Polyhedral Framework (SPF). Though SPF enables the general composition of such transformations, the resulting inspector and executor performance suffers due to missed specialization opportunities. This paper reviews the history and current state of the art for inspector-executor strategies and reviews how SPF enables the composition of inspector-executor transformations. Further, it describes a research vision to combine this generality in SPF with specialization to achieve composable, high-performance inspectors and executors, producing a powerful compiler framework for sparse matrix computations.
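
A minimal sketch of the inspector-executor pattern itself (illustrative; real inspectors are compiler-generated, and this row-reordering heuristic is just an example): the inspector examines the sparse structure at run time and builds a new schedule, which the transformed executor then follows:

```python
# Inspector: inspects the CSR structure at run time and produces a
# reorganized schedule (here, rows ordered by nonzero count, a simple
# load-balance heuristic). Executor: the transformed SpMV that follows it.

def inspector(rowptr):
    n = len(rowptr) - 1
    return sorted(range(n), key=lambda i: rowptr[i + 1] - rowptr[i])

def executor(order, rowptr, col, val, x):
    y = [0.0] * (len(rowptr) - 1)
    for i in order:                          # reorganized schedule
        for j in range(rowptr[i], rowptr[i + 1]):
            y[i] += val[j] * x[col[j]]
    return y

rowptr = [0, 3, 4, 6]                        # hypothetical 3x3 CSR matrix
col, val = [0, 1, 2, 1, 0, 2], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(executor(inspector(rowptr), rowptr, col, val, x=[1.0, 1.0, 1.0]))
```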

Data-Flow Analysis for MPI Programs

Message passing via MPI is widely used in single-program, multiple-data (SPMD) parallel programs. Data-flow analysis frameworks that respect the semantics of message-passing SPMD programs are needed to obtain more accurate, and in some cases correct, analysis results for such programs. We qualitatively evaluate various approaches for performing data-flow analysis on SPMD MPI programs and present a method for performing interprocedural data-flow analysis on the MPI-ICFG representation. The MPI-ICFG is an interprocedural control-flow graph (ICFG) augmented with communication edges between possible send and receive pairs.
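
As a toy illustration of analysis over such a graph (a hypothetical forward "may reach" analysis, not the paper's framework), facts can be propagated along control-flow and communication edges alike until a fixed point is reached:

```python
# Iterative data-flow sketch over a control-flow graph augmented with
# communication edges between matching sends and receives, in the spirit
# of the MPI-ICFG. Facts flow along both kinds of edges. Illustrative only.

def may_reach(nodes, cfg_edges, comm_edges, gen):
    edges = cfg_edges | comm_edges          # comm edges join send to recv
    out = {n: set(gen.get(n, ())) for n in nodes}
    changed = True
    while changed:                          # iterate to a fixed point
        changed = False
        for (src, dst) in edges:
            new = out[dst] | out[src]
            if new != out[dst]:
                out[dst], changed = new, True
    return out

nodes = {"entry", "send", "recv", "use"}
cfg_edges = {("entry", "send"), ("recv", "use")}   # two SPMD process slices
comm_edges = {("send", "recv")}                    # possible send/recv pair
print(may_reach(nodes, cfg_edges, comm_edges, gen={"send": {"x"}}))
```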

Asynchronous Dynamic Load Balancing of Tiles

Loop Chaining: A Programming Abstraction for Balancing Locality and Parallelism

There is a significant, established code base in the scientific computing community. Some of these codes have been parallelized already but are now encountering scalability issues due to poor data locality, inefficient data distributions, or load imbalance. In this work, we introduce a new abstraction called loop chaining, in which a sequence of parallel and/or reduction loops that explicitly share data are grouped together into a chain. Once specified, a chain of loops can be viewed as a set of iterations under a partial ordering. This partial ordering is dictated by data dependences that, as part of the abstraction, are exposed, thereby avoiding interprocedural program analysis. Thus a loop chain is a partially ordered set of iterations that makes scheduling and determining data distributions across loops possible for a compiler and/or runtime system. The flexibility of being able to schedule across loops enables better management of the data locality and parallelism tradeoff. In this paper, we define the loop chaining concept and present three case studies using loop chains in scientific codes: the sparse matrix Jacobi benchmark; a domain-specific library, OP2, used in full applications with unstructured grids; and a domain-specific library, Chombo, used in full applications with structured grids. Preliminary results for the Jacobi benchmark show that a loop-chain-enabled optimization, full sparse tiling, results in a speedup of as much as 2.68x over a parallelized, blocked implementation on a multicore system with 40 cores.
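
A small sketch of the abstraction (hypothetical representation, not the paper's interface): each loop in a chain declares its per-iteration reads and writes, so cross-loop dependences, and hence the partial order on iterations, fall out of the declared access patterns:

```python
# Loop chain sketch: loops declare the data each iteration reads and
# writes, so a scheduler can derive cross-loop dependences with no
# interprocedural analysis. Illustrative only.

from itertools import product

class Loop:
    def __init__(self, name, n_iters, reads, writes):
        # reads/writes map an iteration number to sets of (array, index)
        self.name, self.n_iters = name, n_iters
        self.reads, self.writes = reads, writes

def cross_loop_deps(chain):
    deps = set()
    for a, b in zip(chain, chain[1:]):      # adjacent loops in the chain
        for i, j in product(range(a.n_iters), range(b.n_iters)):
            if a.writes(i) & (b.reads(j) | b.writes(j)):
                deps.add(((a.name, i), (b.name, j)))
    return deps

# Two 4-iteration loops: L1 writes u[i]; L2 reads u[j-1] and u[j].
L1 = Loop("L1", 4, reads=lambda i: set(), writes=lambda i: {("u", i)})
L2 = Loop("L2", 4, reads=lambda j: {("u", j - 1), ("u", j)},
          writes=lambda j: {("v", j)})
print(sorted(cross_loop_deps([L1, L2])))    # the partial order across loops
```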

Rescheduling for Locality in Sparse Matrix Computations

Springer eBooks, 2001

In modern computer architectures the use of memory hierarchies causes a program's data locality to directly affect performance. Data locality occurs when a piece of data is still in a cache upon reuse. For dense matrix computations, loop transformations can be used to improve data locality. However, sparse matrix computations have non-affine loop bounds and indirect memory references, which prohibit the use of compile-time loop transformations. This paper describes an algorithm, called serial sparse tiling, that tiles at runtime. We test a runtime-tiled version of sparse Gauss-Seidel on four different architectures, where it exhibits speedups of up to 2.7x. The paper also gives a static model for determining tile size and outlines how overhead affects the overall speedup.

Reproduction Package for Code Synthesis for Sparse Tensor Format Conversion and Optimization

Artifact Digital Object Group, Feb 17, 2023

Term Graphs for Computing Derivatives in Imperative Languages

Electronic Notes in Theoretical Computer Science, May 1, 2007

Automatic differentiation is a technique for the rule-based transformation of a subprogram that computes some mathematical function into a subprogram that computes the derivatives of that function. Automatic differentiation algorithms are typically expressed as operating on a weighted term graph called a linearized computational graph. Constructing this weighted term graph for imperative programming languages such as C/C++ and Fortran introduces several challenges. Alias and definition-use information is needed to construct term graphs for individual statements and then combine them into one graph for a collection of statements. Furthermore, the resulting weighted term graph must be represented in a language-independent fashion to enable the use of AD algorithms in tools for various languages. We describe the construction and representation of weighted term graphs for C/C++ and Fortran, as implemented in the ADIC 2.0 and OpenAD/F tools for automatic differentiation.
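
As a toy example of the underlying idea (illustrative; ADIC 2.0 and OpenAD/F operate on compiler IR, not this representation), edge weights on a linearized computational graph are local partial derivatives, and reverse-mode accumulation over them yields gradients:

```python
# Reverse-mode AD over a linearized computational graph: each edge carries
# the local partial derivative of its destination with respect to its
# source; propagating adjoints backwards accumulates the chain rule.

import math

def reverse_mode(nodes, edges, output):
    """nodes: topologically ordered ids; edges: (src, dst) -> local partial."""
    adjoint = {n: 0.0 for n in nodes}
    adjoint[output] = 1.0
    for dst in reversed(nodes):             # propagate adjoints backwards
        for (src, d), w in edges.items():
            if d == dst:
                adjoint[src] += adjoint[dst] * w
    return adjoint

# f(x1, x2) = sin(x1) * x2 at x1 = 0.5, x2 = 2.0
x1, x2 = 0.5, 2.0
v1 = math.sin(x1)                           # intermediate v1 = sin(x1)
edges = {("x1", "v1"): math.cos(x1),        # dv1/dx1
         ("v1", "f"): x2,                   # df/dv1
         ("x2", "f"): v1}                   # df/dx2
grad = reverse_mode(["x1", "x2", "v1", "f"], edges, "f")
print(grad["x1"], grad["x2"])               # cos(0.5)*2.0, sin(0.5)
```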

Using the loop chain abstraction to schedule across loops in existing code

International Journal of High Performance Computing and Networking, 2019

Exposing opportunities for parallelisation while explicitly managing data locality is the primary challenge to porting and optimising computational science simulation codes to improve performance. OpenMP provides mechanisms for expressing parallelism, but it remains the programmer's responsibility to group computations to improve data locality. The loop chain abstraction, where a summary of data access patterns is included as pragmas associated with parallel loops, provides compilers with sufficient information to automate the parallelism versus data locality trade-off. We present the syntax and semantics of loop chain pragmas for indicating information about loops belonging to the loop chain and specification of a high-level schedule for the loop chain. We show example usage of the pragmas, detail attempts to automate the transformation of a legacy scientific code written with specific language constraints to loop chain codes, describe the compiler implementation for loop chain pragmas, and exhibit performance results for a computational fluid dynamics benchmark.

A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers

Structured-grid PDE solver frameworks parallelize over boxes, which are rectangular domains of cells or faces in a structured grid. In the Chombo framework, the box sizes are typically 16³ or 32³, but larger box sizes such as 128³ would result in less surface area and therefore less storage, copying, and/or ghost-cell communication overhead. Unfortunately, current on-node parallelization schemes perform poorly for these larger box sizes. In this paper, we investigate 30 different inter-loop optimization strategies and demonstrate the parallel scaling advantages of some of these variants on NUMA multicore nodes. Shifted, fused, and communication-avoiding variants for 128³ boxes result in close to ideal parallel scaling and come close to matching the performance of 16³ boxes on three different multicore systems for a benchmark that is a proxy for program idioms found in Computational Fluid Dynamics (CFD) codes.
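
The surface-to-volume argument in concrete numbers, assuming a single ghost layer for illustration:

```python
# With one ghost layer, a box of n^3 cells stores (n+2)^3 cells, so the
# ghost fraction shrinks sharply as boxes grow (a back-of-the-envelope
# check; one ghost layer is an assumption for illustration).

for n in (16, 32, 128):
    total = (n + 2) ** 3
    ghosts = total - n ** 3
    print(f"{n}^3 box: {ghosts} ghost cells, {ghosts / total:.1%} of storage")
# 16^3: ~29.8% ghosts; 32^3: ~16.6%; 128^3: ~4.5%
```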

Sparse Tiling for Stationary Iterative Methods

International Journal of High Performance Computing Applications, Feb 1, 2004

Combinatorial Algorithms Enabling Petascale Computational Science

Depth Analysis of MPI Programs

Data-flow analyses that include some model of the data flow between MPI sends and receives result in improved precision in the analysis results. One issue that arises when performing data-flow analyses on MPI programs is that the interprocedural control-flow graph (ICFG) is often irreducible due to call and return edges, and the MPI-ICFG adds further irreducibility due to communication edges. To provide an upper bound on iterative data-flow analysis complexity, one needs to have a general idea of the depth of common flow graphs. Unfortunately, computing the depth of an irreducible graph is NP-complete. In this paper, we compare the depth-based iteration bounds for several MPI benchmarks with the actual number of iterations required for two data-flow analyses. We are able to compute the depth despite the worst-case exponential complexity of the depth-analysis algorithm by first reducing the reducible parts of the flow graph and then exploring paths in the reduced graph. Our results show that on average the reduced graphs are 80% smaller than the original graphs and have 90% fewer cycle-free paths, resulting in a 10x faster algorithm. We also observe that although the number of iterations over the flow graph is bounded by the lattice height multiplied by the graph depth, the graph depth is clearly the dominating factor and provides a close approximation to the complexity of iterative data-flow analysis over MPI programs.
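
To make the depth notion concrete, here is a toy brute force consistent with the NP-completeness claim above (usable only on tiny graphs, which is why the paper reduces the reducible parts first): depth here is the maximum number of retreating edges on any cycle-free path:

```python
# Brute-force depth: enumerate cycle-free paths from the entry node and
# count retreating (back) edges along each. Exponential in general;
# illustrative only, with a hypothetical 4-node graph.

def depth(edges, back_edges, start):
    best = 0
    def walk(node, visited, retreats):
        nonlocal best
        best = max(best, retreats)
        for (s, d) in edges:
            if s == node and d not in visited:
                walk(d, visited | {d}, retreats + ((s, d) in back_edges))
    walk(start, {start}, 0)
    return best

edges = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "b")}
print(depth(edges, back_edges={("d", "b")}, start="a"))   # -> 1
```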

Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-Iteration

arXiv (Cornell University), Aug 24, 2022

This paper presents a code generator for sparse tensor contraction computations. It leverages a mathematical representation of loop nest computations in the Sparse Polyhedral Framework (SPF), which extends the polyhedral model to support non-affine computations, such as those arising with sparse tensors. SPF is extended to perform layout specification, optimization, and code generation of sparse tensor code: 1) we develop a polyhedral layout specification that decouples iteration spaces for layout and computation; and 2) we develop efficient co-iteration of sparse tensors by combining polyhedral scanning over the layout of one sparse tensor with the synthesis of code to find corresponding elements in other tensors through an SMT solver. We compare the generated code with that produced by a state-of-the-art tensor compiler, TACO. We achieve on average 1.63× faster parallel performance than TACO on sparse-sparse co-iteration and describe how to improve that to a 2.72× average speedup by switching the find algorithms. We also demonstrate that decoupling the iteration spaces of layout and computation enables additional layout and computation combinations to be supported.

CCS Concepts: • Software and its engineering → Source code generation; Domain specific languages.
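
A sketch of the scan-plus-find style of co-iteration described above (illustrative; the paper synthesizes its find code with an SMT solver, whereas this uses a plain binary search):

```python
# Sparse-sparse co-iteration: scan the layout of one sparse tensor and use
# a "find" (binary search here) to locate matching coordinates in the
# other, rather than merging both layouts. Illustrative only.

from bisect import bisect_left

def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors in sorted coordinate form."""
    total = 0.0
    for i, va in zip(idx_a, val_a):          # scanned side
        p = bisect_left(idx_b, i)            # find side
        if p < len(idx_b) and idx_b[p] == i:
            total += va * val_b[p]
    return total

# hypothetical sparse vectors in coordinate form
print(sparse_dot([1, 4, 7], [2.0, 3.0, 1.0],
                 [0, 4, 7, 9], [5.0, 1.0, 2.0, 4.0]))   # -> 5.0
```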

Schedule-independent storage mapping for loops

SIGPLAN Notices, Oct 1, 1998

This paper studies the relationship between storage requirements and performance. Storage-related dependences (also known as anti- and output dependences [27]) inhibit optimizations for locality and parallelism. Techniques such as renaming and array expansion can eliminate all storage-related dependences, but do so at the expense of increased storage. This paper introduces the universal occupancy vector (UOV) for loops with a regular stencil of dependences. The UOV provides a schedule-independent storage reuse pattern that introduces no further dependences (other than those implied by true flow dependences). UOV-mapped code requires less storage than full array expansion and only slightly more storage than schedule-dependent minimal storage. We show that determining whether a vector is a UOV is NP-complete. However, an easily constructed but possibly non-minimal UOV can be used. We also present a branch-and-bound algorithm that finds the minimal UOV while still maintaining a legal UOV at all times. Our experimental results show that the use of UOV-mapped storage, coupled with tiling for locality, achieves better performance than tiling after array expansion, and accommodates larger problem sizes than untileable, storage-optimized code. Furthermore, storage mapping based on the UOV introduces negligible runtime overhead.
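
A minimal sketch of occupancy-vector storage mapping for a 1-D three-point stencil, taking v = (2, 0) as a candidate occupancy vector (so (t, i) and (t+2, i) share storage, i.e., rows fold by t mod 2). Whether a vector is a legal, minimal UOV under any schedule is exactly what the paper's algorithms decide; this sketch only checks that the canonical schedule computes the same values as full expansion:

```python
# Full expansion stores a (T+1) x N array; mapping iteration (t, i)
# through the candidate occupancy vector v = (2, 0) folds storage to
# two rows via t mod 2. Illustrative only.

T, N = 6, 10

# full expansion: one row per time step
full = [[float(i) for i in range(N)] for _ in range(T + 1)]
for t in range(T):
    for i in range(1, N - 1):
        full[t + 1][i] = (full[t][i - 1] + full[t][i] + full[t][i + 1]) / 3.0

# occupancy-vector mapping: (t, i) and (t+2, i) share storage
folded = [[float(i) for i in range(N)] for _ in range(2)]
for t in range(T):
    cur, nxt = folded[t % 2], folded[(t + 1) % 2]
    for i in range(1, N - 1):
        nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0

assert folded[T % 2] == full[T]
print("two-row OV mapping reproduces full expansion")
```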

Combining Performance Aspects of Irregular Gauss-Seidel Via Sparse Tiling

Lecture Notes in Computer Science, 2005
