Christos Antonopoulos | UNIVERSITY OF THESSALY, GREECE
Papers by Christos Antonopoulos
This paper presents the architecture and implementation of a nanothreading interface in the kernel of the Linux operating system for Intel Pentium-based symmetric multiprocessors. The nanothreading interface aims at achieving scalability of parallel programs in multiprogrammed shared memory multiprocessors, where multiple parallel and sequential programs with diverse characteristics and resource requirements execute simultaneously.
Hardware designers and engineers typically need to explore a multi-parametric design space in order to find the best configuration for their designs, using simulations that can take weeks to months to complete. For example, designers of special purpose chips need to explore parameters such as the optimal bit width and data representation. This is the case for the development of complex algorithms such as Low-Density Parity-Check (LDPC) decoders used in modern communication systems.
Lecture Notes in …, Jan 1, 2005
This topic covers innovative aspects as well as improvements in already known techniques in algorithms, programming models, design methods and languages that relate to the development of parallel programs. In the call-for-papers, we stressed several innovative aspects including novel techniques to assemble parallel software from reusable parallel components or from existing sequential code without compromising efficiency, and techniques to adapt parallel software to available resources as well as to the features of ...
… (ICCAD), 2011 IEEE …, Jan 1, 2011
The problem of automatically generating hardware modules from high-level application representations has been at the forefront of EDA research during the last few years. In this paper, we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators, based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore, a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match the application's memory access pattern and computational complexity, and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers, thereby expanding the scope of FPGAs beyond the realm of hardware design.
Multimedia and Expo …, Jan 1, 2011
Newer video compression standards provide higher video quality and greater compression efficiency than their predecessors. Their increased complexity can be offset by leveraging all levels of available parallelism, task- and data-level, using available off-the-shelf hardware, such as the current generation's chip multiprocessors. As we move to more cores, though, scalability issues arise and need to be tackled in order to take advantage of the abundant computational power.
… (DAC), 2011 48th …, Jan 1, 2011
In this paper, we propose a design paradigm for energy-efficient and variation-aware operation of next-generation multicore heterogeneous platforms. The main idea behind the proposed approach lies in the observation that not all operations are equally important in shaping the output quality of various applications and of the overall system. Based on this observation, we suggest that all levels of the software design stack, including the programming model, compiler, operating system (OS) and run-time system ...
Most modern processors offer hardware support for monitoring performance events related to the interaction of applications with specific subunits of the processor. The insight attained from performance monitoring counters is useful for both application programmers and processor manufacturers. Programmers typically employ them as a powerful tool for post-mortem analysis, identification and resolution of performance bottlenecks in their applications. Processor manufacturers, on the other hand, can collect valuable information on the performance of their products while the latter are used in production environments. This knowledge is then exploited during the design phase of future products.
… (FCCM), 2011 IEEE …, Jan 1, 2011
Accelerators, such as field programmable gate arrays (FPGAs) and graphics processing units (GPUs), are special purpose processors designed to speed up compute-intensive sections of applications. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth. In this paper, we compare the performance of these architectures, presenting a performance study of SEAL, a fast, software-oriented encryption algorithm on a Virtex-6 FPGA, a Graphics Processor ...
Parallel & Distributed …, Jan 1, 2010
Wide-angle (fisheye) lenses are often used in virtual reality and computer vision applications to widen the field of view of conventional cameras. Those lenses, however, distort images. For most real-world applications the video stream needs to be transformed in real time (20 frames/sec or better) back to the natural-looking, central perspective space. This paper presents the implementation, optimization and characterization of a fisheye lens distortion correction application on three platforms: a conventional, homogeneous ...
… and Expo (ICME), …, Jan 1, 2010
Modern multimedia workloads provide increased levels of quality and compression efficiency at the expense of substantially increased computational complexity. It is important to leverage the off-the-shelf emerging multi-core processor architectures and exploit all levels of parallelism of such workloads in order to achieve real-time functionality at a reasonable cost. This paper presents the implementation, optimization and characterization of the AVS video decoder on Intel Core i7, a quad-core, hyper-threaded, chip ...
In Proc. of the IEEE …, Jan 1, 2005
Most scientific applications have high degrees of parallelism and thread-level parallel execution appears to be a natural choice for executing these applications on systems composed of SMT processors. Unfortunately, contention for shared resources limits the performance advantages of multithreading on current SMT processors, thus leading to marginal utilization of multiple hardware threads and even slowdown due to multithreading. We show, through a rigorous evaluation with hardware monitoring counters on a real multi-SMT system, that in traditionally scalable parallel applications conflicting resource requirements are, due to the high degree of resource sharing, accountable for deeply suboptimal performance. Motivated by this observation, we investigate the use of alternative forms of multithreaded execution, including adaptive thread throttling and speculative runahead execution, to make better use of the resources of SMT processors. Alongside the evaluation, we propose new methods to integrate these techniques into the same binary to maximize performance on multi-SMT systems. Our study shows that combining adaptive throttling and speculative precomputation with regular thread-level parallelization leads to significant performance improvements in parallel codes which suffer from inter-thread interference and contention on SMTs.
High Performance …, Jan 1, 2005
Recent advancements in processor technology such as Simultaneous Multithreading (SMT) and Chip Multiprocessors (CMP) enable parallel processing on a single chip. These processors are used as building blocks of shared-memory UMA and NUMA multiprocessor systems, or even clusters of multiprocessors. New programming languages and tools are necessary to help programmers manage the complexities introduced by systems with multigrain and multilevel execution capabilities. This paper introduces Factory, an object-oriented parallel programming substrate which allows programmers to express parallelism, but relieves them of having to manage it. Factory is written in C++ without introducing any extensions to the language. Instead, it leverages existing constructs from C++ to express parallel computations. As a result, it is highly portable and does not require compiler support. Moreover, Factory offers programmability and performance comparable with already established multithreading substrates.
Proc. of the 2000 …, Jan 1, 2000
In this paper we present an integrated environment for the efficient support of dynamic parallelism with OpenMP on top of Linux-based SMPs. This environment consists of an OpenMP-compliant Fortran77 compiler, a run-time threads library and a modified Linux kernel. The functionality provided by our run-time threads library is used by the NanosCompiler, which converts OpenMP Fortran77 programs to equivalent Fortran77 programs with calls to the library. The NanosCompiler-generated applications use a shared arena as a communication path with the OS kernel. This kind of communication facilitates the support of dynamic parallelism, resulting in performance scalability under multiprogramming.
Proceedings of the …, Jan 1, 2007
Multithreaded programs executing on modern high-end computing systems have many potential avenues to adapt their execution to improve performance, energy consumption, or both. Program adaptation occurs any time multiple execution modes are available to the application and one is selected based on information collected during program execution. As a result, some degree of online or offline analysis is required to decide how best to adapt. There are a variety of tradeoffs to consider when deciding which form of analysis to use, as the overheads they carry can vary widely in degree as well as type, as can their effectiveness.
Proc. of the 2004 IEEE …, Jan 1, 2004
We introduce a protocol for dynamically migrating memory pages in home-based Software DSM systems. In these systems each page has a designated home node; yet our protocol allows a node that heavily modifies a page to become its new home. The process is dynamic and totally transparent to the applications programmer. The benefits of our page migration mechanism include the reduction of remote page modifications, faster memory accesses, and less communication overhead.
Proceedings of the 6th …, Jan 1, 2011
OpenCL is an industry-supported standard for writing programs that execute on multicore platforms as well as on accelerators, such as GPUs or the SPEs of the Cell BE. In this paper we introduce GLOpenCL, a unified development framework which supports OpenCL on both homogeneous, shared memory, as well as on heterogeneous, distributed memory multicores. The framework consists of a compiler, based on the LLVM compiler infrastructure, and a run-time library, sharing the same basic architecture across all target ...
… on Numerical Grid …, Jan 1, 2007
Scalable and locality-aware multiprocessor memory allocators are critical for harnessing the potential of emerging multithreaded and multicore architectures. This paper evaluates two state-of-the-art generic multithreaded allocators, designed for both scalability and locality, against custom allocators written to optimize the multithreaded implementation of parallel mesh generation algorithms. We use three algorithms that differ in their communication/synchronization requirements. The implementations of all three algorithms are heavily dependent on dynamically allocated pointer-based data structures and all three use optimized internal memory allocators based on application-specific knowledge. For our study we used memory allocators which are implemented and evaluated on two real multiprocessors with a multi-SMT (quad Hyperthreaded Intel) and a multi-CMP/SMT (dual IBM Power5) organization. Our results indicate that properly engineered generic memory allocators can come close to, or sometimes exceed (in sequential allocation), the performance of custom multi-threaded allocators. These results suggest that in the near future we should be able to develop generic multi-threaded allocators that can adapt to application characteristics and increase productivity without compromising performance.
Submitted to the …, Jan 1, 2006
… Conference on Cluster …, Jan 1, 2005
Software DSMs (SDSMs) are an appealing alternative to message passing, since they facilitate the programmability of clusters. However, the ease of programming comes at the expense of performance. Although accesses to data that reside in the memory of remote nodes are transparent to the programmer, they suffer from significantly higher latencies compared to local accesses. As a consequence, it is desirable to move data as close as possible to the nodes that need them most.
HPCLAB-TR-021298, Dec 2, 1998
This paper presents the architecture and implementation of a nanothreading interface in the kernel of the Linux operating system for Intel Pentium-based symmetric multiprocessors. The nanothreading interface aims at achieving scalability of parallel programs in multiprogrammed shared memory multiprocessors, where multiple parallel and sequential programs with diverse characteristics and resource requirements execute simultaneously. The main idea of the nanothreading interface is to let parallel programs and the kernel exchange critical scheduling information through shared memory with minimal overhead, in order to let parallel programs adapt to dynamically changing resources and ensure that all programs running in the system will minimize their idle time and always make progress along their critical path. We evaluate both the overhead of the low-level nanothreading mechanisms and the efficiency of the nanothreading interface in terms of system throughput, using multiprogrammed workloads with parallel benchmarks. Our results substantiate the efficiency of our implementation and demonstrate that the nanothreading kernel provides solid improvements over the native Linux SMP kernel.