Kei Davis | Los Alamos National Laboratory
Papers by Kei Davis
ROSE represents a programmable preprocessor for the highly aggressive optimization of C++ object-oriented frameworks. A fundamental feature of ROSE is that it preserves the semantics, the implicit meaning, of the object-oriented framework's abstractions throughout the optimization process, permitting the framework's abstractions to be recognized and optimizations to capitalize upon the added value of the framework's true meaning. In contrast, a C++ compiler only sees the semantics of the C++ language and thus is severely limited in what optimizations it can introduce. The use of the semantics of the framework's abstractions avoids program analysis that would be incapable of recapturing the framework's full semantics from those of the C++ language implementation of the application or framework. For example, no level of program analysis within a C++ compiler would be expected to recognize the use of adaptive mesh refinement and introduce optimizations based upon such information. Since ROSE is programmable, additional specialized program analysis is possible, which then complements the semantics of the framework's abstractions. Enabling an optimization mechanism to use the high-level semantics of the framework's abstractions together with a programmable level of program analysis (e.g. dependence analysis), at the level of the framework's abstractions, allows for the design of high-performance object-oriented frameworks with uniquely tailored, sophisticated optimizations far beyond the limits of contemporary serial FORTRAN 77, C or C++ language compiler technology. In short, faster, more highly aggressive optimizations are possible. The resulting optimizations are literally driven by the framework's definition of its abstractions.
Since the abstractions within a framework are of third-party design, the optimizations are similarly of third-party design, specifically independent of the compiler and the applications that use the framework. The interface to ROSE is particularly simple and takes advantage of standard compiler technology. ROSE acts like a preprocessor. Since it must parse standard C++ and its use is optional, it cannot be used to introduce any new language features. ROSE reads standard C++ source code and outputs standard C++ code. Its use is always optional by design, so as not to interfere with, and to remain consistent with, the object-oriented framework. It is a mechanism to introduce optimizations only; adding language features using ROSE is by design no more possible than within the framework itself. Importantly, since ROSE generates C++ code, it does not preclude the use of other tools or mechanisms that would work with an application's source code (including template mechanisms).
The partial evaluation process requires a binding-time analysis. Binding-time analysis seeks to determine which parts of a program's result are determined when some part of the input is known. Domain projections provide a very general way to encode a description of which parts of a data structure are static (known), and which are dynamic (not static). For first-order functional languages Launchbury [Lau91a] has developed an abstract interpretation technique for binding-time analysis in which the basic abstract value is a projection. Unfortunately this technique does not generalise easily to higher-order languages. This paper develops such a generalisation: a projection-based abstract interpretation suitable for higher-order binding-time analysis.
In this work we present an initial performance evaluation of AMD and Intel's first quad-core processor offerings: the AMD Barcelona and the Intel Xeon X7350. We examine the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of intra-processor and intra-node scalability of microbenchmarks and a range of large-scale scientific applications indicates that quad-core processors can deliver an improvement in performance of up to 4x per processor, but that this is heavily dependent on the workload being processed. While the Intel processor has a higher clock rate and peak performance, the AMD processor has higher memory bandwidth and intra-node scalability. The scientific applications we analyzed exhibit a range of performance improvements from only 3x up to the full 16x speed-up over a single core. Also, we note that the maximum node performance is not necessarily achieved by using all 16 cores.
We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the 2.6.11 Linux kernel, provides the essential functionality for transparent, highly responsive, and efficient fault tolerance based on full or incremental checkpointing at system level. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 µs; and it supports incremental and full checkpoints with minimal overhead: less than 6% with full checkpointing to disk performed as frequently as once per minute.
Current disk prefetch policies in major operating systems track access patterns at the level of the file abstraction. While this is useful for exploiting application-level access patterns, file-level prefetching cannot realize the full performance improvements achievable by prefetching. There are two reasons for this. First, certain prefetch opportunities can only be detected by knowing the data layout on disk, such as the contiguous layout of file metadata or data from multiple files. Second, non-sequential access of disk data (requiring disk head movement) is much slower than sequential access, and the penalty for mis-prefetching a 'random' block, relative to that for a sequential block, is correspondingly higher.
As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. It is therefore of paramount importance to develop new software solutions to deal with the unavoidable reality of hardware faults. In this paper we will first describe the nature of the failures of current large-scale machines, and extrapolate these results to future machines. Based on this preliminary analysis we will present a new technology that we are currently developing, buffered coscheduling, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency (requiring no changes to user applications). Preliminary results show that this is attainable with current hardware.
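The claim that a large component count leads to frequent failures, even when each component is individually reliable, can be illustrated with a short calculation. The following is a minimal sketch, not taken from the paper: it assumes independent components with constant failure rates, so that the system-level MTBF is the component MTBF divided by the component count. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """MTBF of a system that fails whenever any one component fails.

    Assumption (not from the paper): components fail independently with
    constant (exponential) failure rates, so the rates simply add:
    lambda_system = component_count * lambda_component.
    """
    if component_mtbf_hours <= 0 or component_count <= 0:
        raise ValueError("MTBF and component count must be positive")
    return component_mtbf_hours / component_count


if __name__ == "__main__":
    # Hypothetical figure, for illustration only.
    component_mtbf = 1_000_000.0  # hours per component
    for n in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(component_mtbf, n)
        print(f"{n:>7} components -> system MTBF {mtbf:8.1f} h")
```

Under these assumptions, a component that fails on average once in a million hours still yields a failure somewhere in a 100,000-component machine roughly every ten hours.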
This paper is structured as follows. Section 2 gives an architectural description of BlueGene/L. Section 3 analyzes the issue of "computational noise", the effect that the operating system has on the system and application performance. Section 4 describes the performance characteristics of the communication networks. Section 5 deals with single-processor performance. Section 6 addresses application performance and scalability, including performance prediction. Most of the results are taken from a 512-node machine running at 500 MHz. Also included is a comparison of the predicted performance of BlueGene/L against the performance of ASCI Q and early results from a larger 2048-node BlueGene/L machine clocked at 700 MHz. Finally, the analysis is summarized in Section 7.
The design and implementation of a high-performance communication network are critical factors in determining the performance and cost-effectiveness of a large-scale computing system. The major issues center on the trade-off between the network cost and the impact of latency and bandwidth on application performance. One promising technique for extracting maximum application performance given limited network resources is based on overlapping computation with communication, which partially or entirely hides communication delays. While this approach is not new, there are few studies that quantify the potential benefit of such overlapping for large-scale production scientific codes. We address this with an empirical method combined with a network model to quantify the potential overlap in several codes and examine the possible performance benefit. Our results demonstrate, for the codes examined, that a high potential tolerance to network latency and bandwidth exists because of a high degree of potential overlap. Moreover, our results indicate that there is often no need to use fine-grained communication mechanisms to achieve this benefit, since the major source of potential overlap is found in independent work, that is, computation that does not depend on pending messages. This allows for a potentially significant relaxation of network requirements without a consequent degradation of application performance.
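The idea of hiding communication behind independent work can be made concrete with a toy timing model. This is a minimal sketch under stated assumptions and is not the network model used in the paper: a single step has a fixed computation time, one pending message, and a known amount of computation that does not depend on that message. The function and parameter names are hypothetical.

```python
from typing import Tuple


def step_time(compute: float, comm: float, independent: float) -> Tuple[float, float]:
    """Return (time_without_overlap, time_with_overlap) for one step.

    compute:     total computation time in the step
    comm:        time until the pending message has arrived
    independent: part of `compute` that does not depend on the pending
                 message (0 <= independent <= compute)

    Assumption: communication proceeds in the background at no CPU cost.
    """
    if not 0.0 <= independent <= compute:
        raise ValueError("independent work must lie between 0 and compute")
    no_overlap = compute + comm
    # Only the communication time not covered by independent work is exposed.
    exposed_comm = max(0.0, comm - independent)
    return no_overlap, compute + exposed_comm


if __name__ == "__main__":
    # Hypothetical timings in milliseconds.
    for comm in (2.0, 4.0, 8.0):
        plain, overlapped = step_time(compute=10.0, comm=comm, independent=6.0)
        print(f"comm={comm:4.1f} ms  no overlap={plain:5.1f} ms  overlap={overlapped:5.1f} ms")
```

In this model, any message time up to the amount of independent work is fully hidden, which is the sense in which ample independent work relaxes latency and bandwidth requirements.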
Scalable management of distributed resources is one of the major challenges in the deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel applications: parallel I/O, deterministic behavior, and responsiveness. These requirements are daunting with commodity hardware and operating systems since they were not designed to support a global, single management view of a large-scale system. In this paper we propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, which can be thought of as a coarse-grain SIMD operating system, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of the single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms.
IEEE Transactions on Computers, 2007
This paper proposes a protocol for effective coordinated buffer cache management in a multilevel cache hierarchy typical of a client/server system. Currently, such cache hierarchies are managed suboptimally: decisions about block placement and replacement are made locally at each level of the hierarchy without coordination between levels. Though straightforward, this approach has several weaknesses: 1) blocks may be redundantly cached, reducing the effective total cache size, 2) weakened locality at lower-level caches makes recency-based replacement algorithms such as LRU less effective, and 3) high-level caches cannot effectively identify blocks with strong locality and may place them in low-level caches. The fundamental reason for these weaknesses is that the locality information embedded in the streams of access requests from clients is not consistently analyzed and exploited, resulting in globally nonsystematic, and therefore suboptimal, placement and replacement of cached blocks across the hierarchy. To address this problem, we propose a coordinated multilevel cache management protocol based on consistent access-locality quantification. In this protocol, locality is dynamically quantified at the client level to direct servers to place or replace blocks appropriately at each level of the cache hierarchy. The result is that the block layout in the entire hierarchy dynamically matches the locality of block accesses. Our simulation experiments on both synthetic and real-life traces show that the protocol effectively ameliorates these caching problems. As anecdotal evidence, our protocol achieves a reduction of block accesses of 11 percent to 71 percent, with an average of 35 percent, over uniLRU, a unified multilevel cache scheme.
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, in current practice, implementations can be at user level or system level. User-level implementations are relatively easy to implement and portable, but suffer from a lack of transparency, flexibility, and efficiency, and in particular are unsuitable for the autonomic (self-managing) computing systems envisioned as the next revolutionary development in system management. In contrast, a system-level implementation can exhibit all of these desirable features, at the cost of a more sophisticated implementation, and is seen as an essential mechanism for the next generation of fault-tolerant, and ultimately autonomic, large-scale computing systems. Linux is becoming the operating system of choice for the largest-scale machines, but development of system-level checkpoint/restart mechanisms for Linux is still in its infancy, with all extant implementations exhibiting serious deficiencies for achieving transparent fault tolerance. This paper provides a survey of extant implementations in a natural taxonomy, highlighting their strengths and inherent weaknesses.
Parallel Processing Letters, 2008
In this work we present an initial performance evaluation of Intel's latest, second-generation quad-core processor, Nehalem, and provide a comparison to first-generation AMD and Intel quad-core processors Barcelona and Tigerton. Nehalem is the first Intel processor to implement a NUMA architecture incorporating QuickPath Interconnect for interconnecting processors within a node, and the first to incorporate an integrated memory controller. We evaluate the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of intra-processor and intra-node scalability of microbenchmarks, and a range of large-scale scientific applications, indicates that quad-core processors can deliver an improvement in performance of up to 4x over a single core depending on the workload being processed. However, scalability can be less when considering a full node. We show that Nehalem outperforms Barcelona on memory-intensive codes by a factor of two for a Nehalem node with 8 cores and a Barcelona node containing 16 cores. Further optimizations are possible with Nehalem, including the use of Simultaneous Multithreading, which improves the performance of some applications by up to 50%.
Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, typically comprising commodity compute nodes, ranging in size up to thousands of processors, with each node hosting an instance of the operating system (OS). Recent studies have shown that even minimal intrusion by the OS on user applications, e.g. a slowdown of user processes of less than 1.0% on each OS instance, can result in a dramatic performance degradation (50% or more) when the user applications are executed on thousands of processors.
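How a sub-1% per-node slowdown can become a 50% slowdown at scale can be seen in a simple probabilistic model. The sketch below is an illustration under stated assumptions and is not the model used in the studies cited: the application proceeds in steps that each end in a global barrier, every node is independently interrupted at most once per step with a fixed probability, and all interruptions have the same length. All names and numbers are hypothetical.

```python
def expected_slowdown(nodes: int, step_ms: float, interrupt_ms: float,
                      p_interrupt: float) -> float:
    """Expected fractional slowdown of a barrier-synchronized step.

    Assumptions (illustrative only): each of `nodes` processes is delayed by
    `interrupt_ms` with independent probability `p_interrupt` per step, and
    the step completes only when the slowest process reaches the barrier.
    """
    if nodes <= 0 or step_ms <= 0:
        raise ValueError("nodes and step_ms must be positive")
    # Probability that at least one process is interrupted during the step.
    p_any = 1.0 - (1.0 - p_interrupt) ** nodes
    return p_any * interrupt_ms / step_ms


if __name__ == "__main__":
    # Hypothetical parameters: 1 ms steps, 0.5 ms interruptions, 1% chance per step.
    for n in (1, 64, 1024, 4096):
        slowdown = expected_slowdown(n, step_ms=1.0, interrupt_ms=0.5, p_interrupt=0.01)
        print(f"{n:>5} nodes -> expected slowdown {slowdown:6.1%}")
```

With these parameters a single node loses about 0.5% of its time, but at several thousand nodes nearly every step is delayed by some node, so the whole application approaches a 50% slowdown.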
Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, soon to be reaching tens of thousands of processors. Many functionalities of system software have failed to scale accordingly: systems are becoming more complex, less reliable, and less efficient. Our premise is that these deficiencies arise from a lack of global control and coordination of the processing nodes. In practice, current parallel machines are loosely coupled systems that are used for solving inherently tightly coupled problems. This paper demonstrates that existing and future systems can be made more scalable by using BSP-like parallel programming principles in the design and implementation of the system software, and by taking full advantage of the latest interconnection network hardware. Moreover, we show that this approach can also yield great improvements in efficiency, reliability, and simplicity.
Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed architectural description of Roadrunner and a detailed performance analysis of the system. A case study of optimizing the MPI-based application Sweep3D to exploit Roadrunner's hybrid architecture is also included. The performance of Sweep3D is compared to that of the code on a previous implementation of the Cell Broadband Engine architecture, the Cell BE, and on multicore processors. Using validated performance models combined with Roadrunner-specific microbenchmarks, we identify performance issues in the early pre-delivery system and infer how well the final Roadrunner configuration will perform once the system software stack has matured.
The partial evaluation process requires a binding-time analysis. Binding-time analysis seeks to d... more The partial evaluation process requires a binding-time analysis. Binding-time analysis seeks to determine which parts of a program's result is determined when some part of the input is known. Domain projections provide a very general way to encode a description of which parts of a data structure are static (known), and which are dynamic (not static). For rst-order functional languages Launchbury Lau91a] has developed an abstract interpretation technique for bindingtime analysis in which the basic abstract value is a projection. Unfortunately this technique does not generalise easily to higher-order languages. This paper develops such a generalisation: a projection-based abstract interpretation suitable for higher-order binding-time analysis.
ABSTRACT ROSE represents a programmable preprocessor for the highly aggressive optimization of C+... more ABSTRACT ROSE represents a programmable preprocessor for the highly aggressive optimization of C++ object-oriented frameworks. A fundamental feature of ROSE is that it preserves the semantics, the implicit meaning, of the object-oriented framework's abstractions throughout the optimization process, permitting the framework's abstractions to be recognized and optimizations to capitalize upon the added value of the framework's true meaning. In contrast, a C++ compiler only sees the semantics of the C++ language and thus is severely limited in what optimizations it can introduce. The use of the semantics of the framework's abstractions avoids program analysis that would be incapable of recapturing the framework's full semantics from those of the C++ language implementation of the application or framework. Just as no level of program analysis within the C++ compiler would not be expected to recognize the use of adaptive mesh refinement and introduce optimizations based upon such information. Since ROSE is programmable, additional specialized program analysis is possible which then compliments the semantics of the framework's abstractions. Enabling an optimization mechanism to use the high level semantics of the framework's abstractions together with a programmable level of program analysis (e.g. dependence analysis), at the level of the framework's abstractions, allows for the design of high performance object-oriented frameworks with uniquely tailored sophisticated optimizations far beyond the limits of contemporary serial F0RTRAN 77, C or C++ language compiler technology. In short, faster, more highly aggressive optimizations are possible. The resulting optimizations are literally driven by the framework's definition of its abstractions. 
Since the abstractions within a framework are of third party design the optimizations are similarly of third party design, specifically independent of the compiler and the applications that use the framework. The interface to ROSE is particularly simple and takes advantage of standard compiler technology. ROSE acts like a preprocessor, since it must parse standard C++¹, and its use is optional, it can not be used to introduce any new language features. ROSE reads standard C++ source code and outputs standard C++ code. Its use is always optional, by design: so as not to interfere with and to remain consistent with the object-oriented framework. It is a mechanism to introduce optimizations only; adding language features using ROSE is by design no more possible than within the framework itself. Importantly, since ROSE generates C ++ code it does not preclude the use of other tools or mechanisms that would work with an application source code (including template mechanisms).
The partial evaluation process requires a binding-time analysis. Binding-time analysis seeks to d... more The partial evaluation process requires a binding-time analysis. Binding-time analysis seeks to determine which parts of a program's result is determined when some part of the input is known. Domain projections provide a very general way to encode a description of which parts of a data structure are static (known), and which are dynamic (not static). For rst-order functional languages Launchbury Lau91a] has developed an abstract interpretation technique for bindingtime analysis in which the basic abstract value is a projection. Unfortunately this technique does not generalise easily to higher-order languages. This paper develops such a generalisation: a projection-based abstract interpretation suitable for higher-order binding-time analysis.
In this work we present an initial performance evaluation of AMD and Intel's first quad-core proc... more In this work we present an initial performance evaluation of AMD and Intel's first quad-core processor offerings: the AMD Barcelona and the Intel Xeon X7350. We examine the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of intra-processor and intra-node scalability of microbenchmarks and a range of largescale scientific applications indicates that quad-core processors can deliver an improvement in performance of up to 4x per processor but is heavily dependent on the workload being processed. While the Intel processor has a higher clock rate and peak performance, the AMD processor has higher memory bandwidth and intra-node scalability. The scientific applications we analyzed exhibit a range of performance improvements from only 3x up to the full 16x speed-up over a single core. Also, we note that the maximum node performance is not necessarily achieved by using all 16 cores.
We describe the software architecture, technical features, and performance of TICK (Transparent I... more We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the 2.6.11 Linux kernel, provides the essential functionality for transparent, highly responsive, and efficient fault tolerance based on full or incremental checkpointing at system level. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5µs; and it supports incremental and full checkpoints with minimal overhead-less than 6% with full checkpointing to disk performed as frequently as once per minute.
Current disk prefetch policies in major operating systems track access patterns at the level of t... more Current disk prefetch policies in major operating systems track access patterns at the level of the file abstraction. While this is useful for exploiting application-level access patterns, file-level prefetching cannot realize the full performance improvements achievable by prefetching. There are two reasons for this. First, certain prefetch opportunities can only be detected by knowing the data layout on disk, such as the contiguous layout of file metadata or data from multiple files. Second, non-sequential access of disk data (requiring disk head movement) is much slower than sequential access, and the penalty for mis-prefetching a 'random' block, relative to that of a sequential block, is correspondingly more costly.
As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed ... more As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. It is therefore of paramount importance to develop new software solutions to deal with the unavoidable reality of hardware faults. In this paper we will first describe the nature of the failures of current large-scale machines, and extrapolate these results to future machines. Based on this preliminary analysis we will present a new technology that we are currently developing, buffered coscheduling, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency-requiring no changes to user applications. Preliminary results show that this is attainable with current hardware.
This paper is structured as follows. Section 2 gives an architectural description of BlueGene/L. ... more This paper is structured as follows. Section 2 gives an architectural description of BlueGene/L. Section 3 analyzes the issue of "computational noise" -the effect that the operating system has on the system and application performance. Section 4 describes the performance characteristics of the communication networks. Section 5 deals with single processor performance. Section 6 addresses application performance and scalability, including performance prediction. Most of the results are taken from a 512-node machine running at 500MHz. Also included is a comparison of the predicted performance of BlueGene/L against the performance of ASCI Q and early results from a larger 2048 node BlueGene/L machine clocked at 700MHz. Finally the analysis is summarized in section 7.
The design and implementation of a high performance communication network are critical factors in... more The design and implementation of a high performance communication network are critical factors in determining the performance and cost-effectiveness of a large-scale computing system. The major issues center on the trade-off between the network cost and the impact of latency and bandwidth on application performance. One promising technique for extracting maximum application performance given limited network resources is based on overlapping computation with communication, which partially or entirely hides communication delays. While this approach is not new, there are few studies that quantify the potential benefit of such overlapping for large-scale production scientific codes. We address this with an empirical method combined with a network model to quantify the potential overlap in several codes and examine the possible performance benefit. Our results demonstrate, for the codes examined, that a high potential tolerance to network latency and bandwidth exists because of a high degree of potential overlap. Moreover, our results indicate that there is often no need to use fine-grained communication mechanisms to achieve this benefit, since the major source of potential overlap is found in independent work-computation on which pending messages does not depend. This allows for a potentially significant relaxation of network requirements without a consequent degradation of application performance
Scalable management of distributed resources is one of the major challenges in deployment of larg... more Scalable management of distributed resources is one of the major challenges in deployment of large-scale clusters. Management includes transparent fault tolerance, efficient allocation of resources, and support for all the needs of parallel applications: parallel I/O, deterministic behavior, and responsiveness. These requirements are daunting with commodity hardware and operating systems since they were not designed to support a global, single management view of a large-scale system. In this paper we propose a small set of hardware mechanisms in the cluster interconnect to facilitate the implementation of a simple yet powerful global operating system. This system, which can be thought of as a coarse-grain SIMD operating system, allows commodity clusters to grow to thousands of nodes while still retaining the usability and responsiveness of the single-node workstation. Our results on a software prototype show that it is possible to implement efficient and scalable system software using the proposed set of mechanisms.
IEEE Transactions on Computers, 2007
This paper proposes a protocol for effective coordinated buffer cache management in a multilevel ... more This paper proposes a protocol for effective coordinated buffer cache management in a multilevel cache hierarchy typical of a client/server system. Currently, such cache hierarchies are managed suboptimally-decisions about block placement and replacement are made locally at each level of the hierarchy without coordination between levels. Though straightforward, this approach has several weaknesses: 1) Blocks may be redundantly cached, reducing the effective total cache size, 2) weakened locality at lowerlevel caches makes recency-based replacement algorithms such as LRU less effective, and 3) high-level caches cannot effectively identify blocks with strong locality and may place them in low-level caches. The fundamental reason for these weaknesses is that the locality information embedded in the streams of access requests from clients is not consistently analyzed and exploited, resulting in globally nonsystematic, and therefore suboptimal, placement and replacement of cached blocks across the hierarchy. To address this problem, we propose a coordinated multilevel cache management protocol based on consistent access-locality quantification. In this protocol, locality is dynamically quantified at the client level to direct servers to place or replace blocks appropriately at each level of the cache hierarchy. The result is that the block layout in the entirely hierarchy dynamically matches the locality of block accesses. Our simulation experiments on both synthetic and real-life traces show that the protocol effectively ameliorates these caching problems. As anecdotal evidence, our protocol achieves a reduction of block accesses of 11 percent to 71 percent, with an average of 35 percent, over uniLRU, a unified multilevel cache scheme.
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, in current practice, implementations can be at user-level or system-level. User-level implementations are relatively easy to implement and portable, but suffer from a lack of transparency, flexibility, and efficiency, and in particular are unsuitable for the autonomic (self-managing) computing systems envisioned as the next revolutionary development in system management. In contrast, a system-level implementation can exhibit all of these desirable features, at the cost of a more sophisticated implementation, and is seen as an essential mechanism for the next generation of fault-tolerant (and ultimately autonomic) large-scale computing systems. Linux is becoming the operating system of choice for the largest-scale machines, but development of system-level checkpoint/restart mechanisms for Linux is still in its infancy, with all extant implementations exhibiting serious deficiencies for achieving transparent fault tolerance. This paper provides a survey of extant implementations in a natural taxonomy, highlighting their strengths and inherent weaknesses.
Parallel Processing Letters, 2008
In this work we present an initial performance evaluation of Intel's latest, second-generation quad-core processor, Nehalem, and provide a comparison to first-generation AMD and Intel quad-core processors Barcelona and Tigerton. Nehalem is the first Intel processor to implement a NUMA architecture incorporating QuickPath Interconnect for interconnecting processors within a node, and the first to incorporate an integrated memory controller. We evaluate the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of intra-processor and intra-node scalability of microbenchmarks, and a range of large-scale scientific applications, indicates that quad-core processors can deliver an improvement in performance of up to 4x over a single core depending on the workload being processed. However, scalability can be less when considering a full node. We show that Nehalem outperforms Barcelona on memory-intensive codes by a factor of two for a Nehalem node with 8 cores and a Barcelona node containing 16 cores. Further optimizations are possible with Nehalem, including the use of Simultaneous Multithreading, which improves the performance of some applications by up to 50%.
Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, typically comprising commodity compute nodes, ranging in size up to thousands of processors, with each node hosting an instance of the operating system (OS). Recent studies have shown that even minimal intrusion by the OS on user applications, e.g. a slowdown of user processes of less than 1.0% on each OS instance, can result in a dramatic performance degradation (50% or more) when the user applications are executed on thousands of processors.
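One way to see how a sub-1% per-node intrusion could plausibly grow into a degradation of this size is a back-of-the-envelope model. The model below is an assumption made for illustration and is not taken from the abstract: the application proceeds in globally synchronized steps, each node is independently interrupted by the OS with probability `p` per step for a fixed duration `noise`, and a step finishes only when the slowest node finishes. The function names and all parameter values are hypothetical.

```python
def expected_step_time(compute, noise, p, nodes):
    """Expected duration of one synchronized step under the assumed model.

    compute: noise-free compute time per step (same on every node)
    noise:   duration of one OS interruption
    p:       probability that a given node is interrupted during a step
    nodes:   number of nodes that must all finish before the step ends
    """
    # The step is delayed if at least one node is interrupted.
    p_any = 1.0 - (1.0 - p) ** nodes
    return compute + noise * p_any


def throughput_loss(compute, noise, p, nodes):
    """Fraction of noise-free throughput lost to OS interruptions."""
    return 1.0 - compute / expected_step_time(compute, noise, p, nodes)


# Illustrative parameters (assumptions): 1 ms steps, 1 ms interruptions, p = 1%.
single = throughput_loss(compute=1.0, noise=1.0, p=0.01, nodes=1)
large = throughput_loss(compute=1.0, noise=1.0, p=0.01, nodes=1000)
```

Under these assumed parameters a single node loses slightly under 1% of its throughput, while at 1,000 nodes nearly every step is delayed by some node and the loss approaches one half. Real systems differ in interruption length, frequency, and synchronization granularity, so the numbers are indicative only.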
Ever-increasing demand for computing capability is driving the construction of ever-larger computer clusters, soon to be reaching tens of thousands of processors. Many functionalities of system software have failed to scale accordingly: systems are becoming more complex, less reliable, and less efficient. Our premise is that these deficiencies arise from a lack of global control and coordination of the processing nodes. In practice, current parallel machines are loosely-coupled systems that are used for solving inherently tightly-coupled problems. This paper demonstrates that existing and future systems can be made more scalable by using BSP-like parallel programming principles in the design and implementation of the system software, and by taking full advantage of the latest interconnection network hardware. Moreover, we show that this approach can also yield great improvements in efficiency, reliability, and simplicity.
Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed architectural description of Roadrunner and a detailed performance analysis of the system. A case study of optimizing the MPI-based application Sweep3D to exploit Roadrunner's hybrid architecture is also included. The performance of Sweep3D is compared to that of the code on a previous implementation of the Cell Broadband Engine architecture (the Cell BE) and on multicore processors. Using validated performance models combined with Roadrunner-specific microbenchmarks we identify performance issues in the early pre-delivery system and infer how well the final Roadrunner configuration will perform once the system software stack has matured.
The partial evaluation process requires a binding-time analysis. Binding-time analysis seeks to determine which parts of a program's result are determined when some part of the input is known. Domain projections provide a very general way to encode a description of which parts of a data structure are static (known), and which are dynamic (not static). For first-order functional languages Launchbury [Lau91a] has developed an abstract interpretation technique for binding-time analysis in which the basic abstract value is a projection. Unfortunately this technique does not generalise easily to higher-order languages. This paper develops such a generalisation: a projection-based abstract interpretation suitable for higher-order binding-time analysis.