The shared-thread multiprocessor

A preliminary performance study of architectural support for multithreading

Proceedings of the Thirtieth Hawaii International Conference on System Sciences, 1997

This paper presents a preliminary performance study of a hybrid multithreaded execution model that combines a software-controlled multithreaded system with hardware support for efficient context switching and thread scheduling. The hardware support is augmented with a software scheduling technique called set scheduling, and the benefit of the combination to overall performance is discussed. Set scheduling schedules multiple threads onto the hardware scheduler at once to minimize software scheduling and context-switching costs. An analytical model of the proposed multithreaded model is discussed, and simulation results of processor utilization based on the proposed model are presented. Through simulation, we find that the hybrid multithreaded execution model results in higher processor utilization than traditional software-controlled multithreading.

A Survey on Hardware and Software Support for Thread Level Parallelism

arXiv (Cornell University), 2016

To support growing massive parallelism, the functional components and capabilities of current processors are changing and will continue to do so. Today's computers are built upon multiple processing cores and run applications consisting of a large number of threads, making runtime thread management a complex process; further, each core can support multiple concurrent thread executions. Hardware and software support for threads is therefore increasingly needed to improve peak-performance capacity and overall system throughput, and has been the subject of much research. This paper surveys many of the proposed or currently available solutions for executing, distributing, and managing threads in both hardware and software. Because the nature of current applications is diverse, not every programming model is suited to harnessing the built-in massive parallelism of multicore processors; given the heterogeneity in hardware, hybrid programming models, which combine the features of the shared- and distributed-memory models, have become very promising. We first give an overview of threads, threading mechanisms, and the issues of managing them during execution. Next, we discuss different parallel programming models with respect to their explicit thread support, and review them with respect to their support for shared memory, distributed memory, and heterogeneity. Hardware support at execution time is crucial to system performance, so various types of hardware support for threads exist or have been proposed, primarily based on widely used programming models. We also discuss software support for threads, mainly aimed at increasing deterministic behavior at runtime. Finally, we conclude by discussing some common issues related to thread management.

Development of a simultaneously threaded multi-core processor

… Technologies for the …, 2005

Simultaneous Multithreading (SMT) is becoming one of the major trends in the design of future generations of microarchitectures. Its key strength comes from its ability to exploit both thread-level and instruction-level parallelism, using hardware resources efficiently. Nevertheless, SMT has its limitations: contention between threads may cause conflicts, scalability is limited, pipeline stages are added, and long-latency operations are handled inefficiently. Alternatively, Chip Multiprocessors (CMP) are highly scalable and easy to program; on the other hand, they are expensive and suffer from cache-coherence and memory-consistency problems. This paper proposes a microarchitecture that exploits parallelism at the instruction, thread, and processor levels by merging the concepts of SMT and CMP. Like a CMP, it places multiple cores on a single chip; hardware resources are replicated in each core except for the second-level cache, which is shared among all cores. The processor applies the SMT technique within each core to make full use of the available hardware resources, and communication overhead is reduced thanks to the independence between cores. Results show that the proposed microarchitecture outperforms both SMT and CMP, and that resources are more evenly distributed among running threads.

Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors

2007

The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared-memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data-sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses.
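The core idea of sharing-aware placement can be sketched with a toy clustering pass (an illustrative sketch, not the authors' actual algorithm): threads that touch a common memory region are grouped, so a scheduler could place each group on one chip and avoid the cross-chip cache accesses the abstract describes.

```cpp
#include <numeric>
#include <unordered_map>
#include <vector>

// Minimal union-find used to merge threads into sharing groups.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// regions[t] = ids of the shared memory regions thread t accesses heavily
// (in practice this signature would come from hardware performance counters).
// Returns one cluster label per thread; threads with equal labels share data.
std::vector<int> cluster_threads(const std::vector<std::vector<int>>& regions) {
    UnionFind uf(static_cast<int>(regions.size()));
    std::unordered_map<int, int> first_user;  // region id -> first thread seen
    for (int t = 0; t < static_cast<int>(regions.size()); ++t)
        for (int r : regions[t]) {
            auto [it, fresh] = first_user.emplace(r, t);
            if (!fresh) uf.unite(t, it->second);  // shares r with an earlier thread
        }
    std::vector<int> label(regions.size());
    for (int t = 0; t < static_cast<int>(regions.size()); ++t) label[t] = uf.find(t);
    return label;
}
```

For example, if threads 0, 1, and 3 transitively share regions while thread 2 works on private data, the pass yields two clusters, and a sharing-aware scheduler would co-locate the first three on one chip.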

Intel Threading Building Blocks

Shared Memory Application Programming, 2016

This paper describes two features of Intel® Threading Building Blocks (Intel® TBB) [1] that provide the foundation for its robust performance: a work-stealing task scheduler and a scalable memory allocator. Work-stealing task schedulers efficiently balance load while maintaining the natural data locality found in many applications. The Intel TBB task scheduler is available to users directly through an API and is also used in the implementation of the algorithms included in the library. In this paper, we provide an overview of the TBB task scheduler and discuss three manual optimizations that users can make to improve its performance: continuation passing, scheduler bypass, and task recycling. In the Experimental Results section, we provide performance results for several benchmarks that demonstrate the potential scalability of applications threaded with TBB, as well as the positive impact of these manual optimizations on the performance of fine-grained tasks. The task scheduler is complemented by the Intel TBB scalable memory allocator. Memory allocation can often be a limiting bottleneck in parallel applications; using the TBB scalable memory allocator eliminates this bottleneck and also improves cache behavior. We discuss details of the design and implementation of the TBB scalable allocator and evaluate its performance relative to several commercial and non-commercial allocators, showing that the TBB allocator is competitive with them.

Thread-management techniques to maximize efficiency in multicore and simultaneous multithreaded microprocessors

ACM Transactions on Architecture and Code Optimization, 2010

We provide an analysis of thread-management techniques that increase performance or reduce energy in multicore and Simultaneous Multithreaded (SMT) cores. Thread delaying reduces energy consumption by running the core containing the critical thread at maximum frequency while scaling down the frequency and voltage of the cores containing noncritical threads. In this article, we provide an insightful breakdown of thread delaying.

Multi-Core Processors: New Way to Achieve High System Performance

International Symposium on Parallel Computing in Electrical Engineering (PARELEC'06)

Multi-core processors represent an evolutionary change in conventional computing as well as setting the new trend for high-performance computing (HPC), but parallelism is nothing new. Intel has a long history with the concept of parallelism and the development of hardware-enhanced threading capabilities, and has been delivering threading-capable products for more than a decade. The move toward chip-level multiprocessing architectures with a large number of cores continues to offer dramatically increased performance and power characteristics. Nonetheless, this move also presents significant challenges. This paper describes how far the industry has progressed and evaluates some of the challenges we are facing with multi-core processors, along with some of the solutions that have been developed.

Balanced multithreading: Increasing throughput via a low cost multithreading hierarchy

Proceedings of the 37th …, 2004

A simultaneous multithreading (SMT) processor can issue instructions from several threads every cycle, allowing it to effectively hide various instruction latencies; this effect increases with the number of simultaneous contexts supported. However, each added context on an SMT processor incurs a cost in complexity, which may lead to an increase in pipeline length or a decrease in the maximum clock rate. This paper presents new designs for multithreaded processors which combine a conservative SMT implementation with a coarse-grained multithreading capability. By presenting more virtual contexts to the operating system and user than are supported in the core pipeline, the new designs can take advantage of the memory parallelism present in workloads with many threads, while avoiding the performance penalties inherent in a many-context SMT processor design. A design with 4 virtual contexts, but which is based on a 2-context SMT processor core, gains an additional 26% throughput when 4 threads are run together.

A survey of processors with explicit multithreading

ACM Computing Surveys, 2003

Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors have been announced by industry or are already in production in the areas of high-performance microprocessors, media processors, and network processors.