Arnaldo Azevedo - Academia.edu
Papers by Arnaldo Azevedo
The SARC architecture is composed of multiple processor types and a set of user-managed direct memory access (DMA) engines that let the runtime scheduler overlap data transfer and computation. The runtime system automatically allocates tasks on the heterogeneous cores and schedules the data transfers through the DMA engines. SARC's programming model supports various highly parallel applications, with matching support from specialized accelerator processors.
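The overlap of DMA transfers and computation that this abstract describes can be illustrated with a simple double-buffering loop. This is a minimal sketch only, assuming hypothetical dma_get_async, dma_wait, and process_block helpers rather than the actual SARC runtime API:

```c
/* Minimal double-buffering sketch: overlap DMA transfers with computation.
 * dma_get_async(), dma_wait() and process_block() are hypothetical names
 * standing in for the user-managed DMA engines and task code described above. */
#include <stddef.h>

#define BLOCK 4096

extern void dma_get_async(void *dst, const void *src, size_t n, int tag); /* assumed */
extern void dma_wait(int tag);                                            /* assumed */
extern void process_block(const char *data, size_t n);                    /* assumed */

void stream_blocks(const char *remote, size_t nblocks)
{
    static char buf[2][BLOCK];
    int cur = 0;

    dma_get_async(buf[cur], remote, BLOCK, cur);         /* prefetch first block */
    for (size_t i = 0; i < nblocks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nblocks)                              /* start next transfer early */
            dma_get_async(buf[nxt], remote + (i + 1) * BLOCK, BLOCK, nxt);
        dma_wait(cur);                                    /* block only on the buffer used now */
        process_block(buf[cur], BLOCK);                   /* compute while the other DMA is in flight */
        cur = nxt;
    }
}
```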
Avances en Sistemas e Informática, 2009
An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism (TLP), to exploit the large numbers of cores future CMPs are expected to contain. As a case study we investigate the parallel scalability of the H.264 decoding process. Previously proposed parallelization strategies such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism are not sufficiently scalable. We therefore propose a novel strategy, called 3D-Wave, which is mainly based on the observation that inter-frame dependencies have a limited spatial range. Because of this, certain MBs of consecutive frames can be decoded in parallel. The 3D-Wave strategy allows 4000 to 9000 MBs to be processed in parallel, depending on the input sequence. We also perform a case study to assess the practical value and possibilities of the 3D-Wave strategy. The results show that our strategy provides sufficient parallelism to effi...
© The Author(s) 2008. This article is published with open access at Springerlink.com. An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism, to exploit the large numbers of cores future chip multiprocessors (CMPs) are expected to contain. As a case study we investigate the parallelism available in video decoders, an important application domain now and in the future. Specifically, we analyze the parallel scalability of the H.264 decoding process. First we discuss the data structures and dependencies of H.264 and show what types of parallelism it allows to be exploited. We also show that previously proposed parallelization strategies such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism are not sufficiently scalable. Based on the observation that inter-frame dependencies have a limited spatial range we propose a new parallelization strategy, called Dynamic 3D-W...
This paper presents a study of the performance scalability of a macroblock-level parallelization of the H.264 decoder for High Definition (HD) applications on a multiprocessor architecture. We have implemented this parallelization on a cache-coherent Non-Uniform Memory Access (cc-NUMA) shared memory multiprocessor (SMP) and compared the results with the theoretical expectations. Three different scheduling techniques were analyzed: static, dynamic, and dynamic with tail-submit. The dynamic scheduling approach with the tail-submit optimization presents the best performance, obtaining a maximum speed-up of 9.5 using 24 processors. A detailed profiling analysis showed that thread synchronization is one of the limiting factors for achieving better parallel scalability. The paper includes an evaluation of the impact of using blocking synchronization APIs like POSIX threads and POSIX real-time extensions. Results showed that macroblock-level parallelism, as a very fine-grain form of Thread-Level...
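The dynamic-scheduling-with-tail-submit scheme evaluated in this paper can be sketched as follows. This is an illustrative outline under assumed names (decode_mb, task_queue_push/pop, a per-MB dependency counter), not the paper's actual implementation:

```c
/* Illustrative sketch of macroblock (MB) dynamic scheduling with a tail-submit
 * optimization: after finishing an MB, a worker decodes one newly-ready
 * neighbor directly instead of paying the queue round-trip for it.
 * decode_mb(), task_queue_push() and task_queue_pop() are hypothetical helpers. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { int x, y; } mb_t;

extern void decode_mb(mb_t mb);                   /* assumed decoder kernel */
extern void task_queue_push(mb_t mb);             /* assumed shared work queue */
extern bool task_queue_pop(mb_t *mb);
extern int  mb_cols, mb_rows;
extern atomic_int deps[];                         /* remaining dependency count per MB,
                                                     mb_rows * mb_cols entries */

/* Decrement a neighbor's dependency count; return true if it became ready. */
static bool release(mb_t mb)
{
    return atomic_fetch_sub(&deps[mb.y * mb_cols + mb.x], 1) == 1;
}

void worker(void)
{
    mb_t mb;
    while (task_queue_pop(&mb)) {
        for (;;) {
            decode_mb(mb);
            bool have_next = false;
            mb_t right    = { mb.x + 1, mb.y };
            mb_t downleft = { mb.x - 1, mb.y + 1 };
            if (mb.x + 1 < mb_cols && release(right)) {        /* tail-submit: keep the   */
                mb = right; have_next = true;                  /* right neighbor ourselves */
            }
            if (mb.x > 0 && mb.y + 1 < mb_rows && release(downleft)) {
                if (have_next) task_queue_push(downleft);      /* hand extra work to others */
                else { mb = downleft; have_next = true; }
            }
            if (!have_next) break;                             /* nothing ready: back to queue */
        }
    }
}
```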
The X4CP32 is a novel coarse-grained, RPU-based, runtime-reconfigurable, general-purpose microprocessor. It consists of three programming levels, based on a hierarchical array of easily and quickly reconfigurable entities. It brings a new concept of runtime reconfiguration and programming, which is its main strength. Although it is most effective in arithmetic-heavy applications, it is suited for virtually any task an application may demand, making it a solid option for a general-purpose microprocessor.
In this paper we present possibilities to parallelize the Deblocking Filter (DF) of the H.264 video codec and report results on the Decoupled Threaded Architecture (DTA). We exploited all the available parallelism in the code in order to make it suitable for the DTA architecture. Experimental results show that significant speedup can be achieved and that the DTA architecture can efficiently exploit the available parallelism.
The X4CP32 is an architecture that combines the parallel and reconfigurable paradigms. It consists of a grid of Reconfigurable and Programming Units (RPUs), responsible for all the processing and program flow. This paper presents architectural modifications to maximize the computational use of the Cells in an RPU. A change to a very long instruction word (VLIW) philosophy in the RPU was implemented to reach this objective. These changes raise the instructions per cycle (IPC) of the RPU from 0.5 to 1 with no area overhead and no impact on clock frequency.
2010 First IEEE Latin American Symposium on Circuits and Systems (LASCAS), 2010
This article presents an architecture for a motion vector predictor for the H.264/AVC standard, Main profile. The motion vector predictor is one of the most important modules of motion compensation. This architecture was developed to work at 100 MHz, providing a processing rate capable of decoding HDTV in real time. The hardware is composed of a bank of registers and a state machine operating over the registered data. The design was synthesized for a Xilinx Virtex-II Pro FPGA and for ASIC TSMC 0.18 μm technology, reaching maximum operating frequencies of 133 MHz and 129 MHz, respectively.
An important question is whether emerging and future applications exhibit sufficient parallelism, in particular thread-level parallelism (TLP), to exploit the large numbers of cores future CMPs are expected to contain. As a case study we investigate the parallelism available in video decoders, an important application domain now and in the future. Specifically, we analyze the parallel scalability of the H.264 decoding process. First we discuss the data structures and dependencies of H.264 and show what types of parallelism it allows to be exploited. We also show that previously proposed parallelization strategies such as slice-level, frame-level, and intra-frame macroblock (MB) level parallelism are not sufficiently scalable. Based on the observation that inter-frame dependencies have a limited spatial range we propose a new parallelization strategy, called Dynamic 3D-Wave. It allows certain MBs of consecutive frames to be decoded in parallel. Using this new strategy we analyze the limits to the available MB-level parallelism in H.264. Using real movie sequences we find a maximum MB parallelism ranging from 4000 to 7000. We also perform a case study to assess the practical value and possibilities of a highly parallelized H.264 application. The results show that H.264 exhibits sufficient parallelism to efficiently exploit the capabilities of future manycore CMPs.
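The core of the Dynamic 3D-Wave idea, that an MB may start once its intra-frame wavefront predecessors and the motion-compensated reference area in the previous frame are decoded, can be expressed as a readiness test. The sketch below is a simplified illustration with hypothetical names and a single motion vector per MB; a real decoder must consider all partitions and reference frames:

```c
/* Illustrative readiness test for the Dynamic 3D-Wave strategy described above.
 * An MB can be decoded once its intra-frame neighbors are done and the
 * reference region addressed by its motion vector in the previous frame
 * has already been decoded. Names and structures here are assumptions. */
#include <stdbool.h>

#define MB_SIZE 16

typedef struct { int frame, x, y; } mb_id;
typedef struct { int dx, dy; } mv_t;            /* motion vector in pixels */

extern bool mb_done(mb_id mb);                  /* assumed decode-status lookup */
extern int  mb_cols, mb_rows;

/* Intra-frame 2D-wave condition: left and upper-right neighbors finished. */
static bool intra_ready(mb_id mb)
{
    mb_id left = { mb.frame, mb.x - 1, mb.y };
    mb_id upr  = { mb.frame, mb.x + 1, mb.y - 1 };
    return (mb.x == 0 || mb_done(left)) &&
           (mb.y == 0 || mb.x + 1 >= mb_cols || mb_done(upr));
}

/* Inter-frame condition: the MB in the reference frame covering the
 * motion-compensated area (plus an interpolation margin) is decoded. */
static bool inter_ready(mb_id mb, mv_t mv, int margin)
{
    int rx = (mb.x * MB_SIZE + mv.dx + MB_SIZE - 1 + margin) / MB_SIZE;
    int ry = (mb.y * MB_SIZE + mv.dy + MB_SIZE - 1 + margin) / MB_SIZE;
    if (rx < 0) rx = 0;
    if (ry < 0) ry = 0;
    if (rx >= mb_cols) rx = mb_cols - 1;
    if (ry >= mb_rows) ry = mb_rows - 1;
    mb_id ref = { mb.frame - 1, rx, ry };
    return mb.frame == 0 || mb_done(ref);
}

bool mb_ready(mb_id mb, mv_t mv, int margin)
{
    return intra_ready(mb) && inter_ready(mb, mv, margin);
}
```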
International Journal of Embedded and Real-Time Communication Systems, 2010
In many kernels of multimedia applications, the working set is predictable, making it possible to schedule the data transfers before the computation. Many other kernels, however, process data that is known only just before it is needed or have working sets that do not fit in the scratchpad memory. Furthermore, multimedia kernels often access two- or higher-dimensional data structures, and conventional software caches have difficulty exploiting the data locality exhibited by these kernels. For such kernels, the authors present a Multidimensional Software Cache (MDSC), which stores 1- to 4-dimensional blocks to mimic in cache the organization of the data structure. Furthermore, it indexes the cache using the matrix indices rather than linear memory addresses. The MDSC also makes use of the lower overhead of Direct Memory Access (DMA) list transfers and allows exploiting known data access patterns to reduce the number of accesses to the cache. The MDSC is evaluated using GLCM, providing an 8% perf...
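A minimal sketch of the MDSC look-up described above, assuming a direct-mapped cache of 2D blocks and a hypothetical dma_get block-fetch helper; the sizes and hash are illustrative, not the authors' implementation:

```c
/* Sketch of a 2D software-cache look-up in the spirit of the MDSC:
 * the cache is indexed by matrix (block) indices instead of linear addresses,
 * and a miss triggers a DMA fetch of the whole block into local store. */
#include <stdint.h>

#define BX 16               /* block width in elements  */
#define BY 16               /* block height in elements */
#define SETS 64             /* direct-mapped, one block per set */

typedef struct {
    int     tag_i, tag_j;   /* block coordinates currently cached (-1 = empty) */
    int16_t data[BY][BX];   /* local-store copy of the block */
} mdsc_line;

static mdsc_line cache[SETS];

extern void dma_get(void *dst, const int16_t *src_base,
                    int bi, int bj, int stride);     /* assumed DMA block fetch */

void mdsc_init(void)
{
    for (int s = 0; s < SETS; s++)
        cache[s].tag_i = cache[s].tag_j = -1;
}

/* Return a pointer to element (i, j) of a matrix kept in main memory. */
int16_t *mdsc_ref(const int16_t *base, int stride, int i, int j)
{
    int bi = i / BY, bj = j / BX;                    /* block indices */
    mdsc_line *ln = &cache[(bi * 31 + bj) % SETS];   /* simple 2D hash */

    if (ln->tag_i != bi || ln->tag_j != bj) {        /* miss: fetch block via DMA */
        dma_get(ln->data, base, bi, bj, stride);
        ln->tag_i = bi;
        ln->tag_j = bj;
    }
    return &ln->data[i % BY][j % BX];                /* hit path: index inside block */
}
```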
In this dissertation we present methodologies and evaluations aimed at increasing the efficiency of video coding applications on heterogeneous many-core processors composed of SIMD-only, scratchpad-memory-based cores. Our contributions span three fronts: thread-level parallelism strategies for many-cores, identification of bottlenecks for SIMD-only cores, and a software cache for scratchpad-memory-based cores. First, we present the 3D-Wave parallelization strategy for video decoding that scales for many-core processors. It is based on the observation that dependencies between frames are related to the motion compensation kernel and that motion vectors are usually within a small range. The 3D-Wave strategy combines macroblock-level parallelism with frame- and slice-level parallelism by overlapping the decoding of frames while dynamically managing macroblock dependencies. The 3D-Wave was implemented and evaluated in a simulated many-core embedded processor consisting o...
In many kernels of multimedia applications, the working set is predictable, making it possible to schedule the data transfers before the computation. Many other kernels, however, process data that is known only just before it is needed or have working sets that do not fit in the scratchpad memory. Furthermore, multimedia kernels often access two- or higher-dimensional data structures, and conventional software caches have difficulty exploiting the data locality exhibited by these kernels. For such kernels, the authors present a Multidimensional Software Cache (MDSC), which stores 1- to 4-dimensional blocks to mimic in cache the organization of the data structure. Furthermore, it indexes the cache using the matrix indices rather than linear memory addresses. The MDSC also makes use of the lower overhead of Direct Memory Access (DMA) list transfers and allows exploiting known data access patterns to reduce the number of accesses to the cache. The MDSC is evaluated using GLCM, providing an 8% performan...
Lecture Notes in Computer Science, 2011
In this paper we propose an instruction to accelerate software caches. While DMAs are very efficient for predictable data sets that can be fetched before they are needed, they introduce a large latency overhead for computations with unpredictable access behavior. Software caches are advantageous when the data set is not predictable but exhibits locality. However, software caches also incur a large overhead. Because the main overhead is in the access function, we propose an instruction that replaces the look-up function of the software cache. This instruction is evaluated using the Multidimensional Software Cache and two multimedia kernels, GLCM and H.264 Motion Compensation. The results show that the proposed instruction accelerates the software cache access time by a factor of 2.6. This improvement translates to a speedup of 2.1 for GLCM and 1.28 for MC, compared with the IBM software cache.
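The software-cache access overhead targeted by the proposed instruction is concentrated in the look-up function, roughly as sketched below. The structures and constants are illustrative; the point is that the hash, tag compare, and hit/miss branch are what a single hardware look-up instruction would replace:

```c
/* Illustrative software-cache look-up: the hashing, tag compare and hit/miss
 * branch below are the per-access overhead that the proposed look-up
 * instruction collapses into one operation. Structures are hypothetical. */
#include <stdint.h>

#define WAYS 4
#define SETS 256

typedef struct {
    uint32_t tag[WAYS];
    void    *line[WAYS];
} sc_set;

extern sc_set sc_dir[SETS];
extern void  *sc_miss(uint32_t addr);          /* assumed slow path: DMA refill */

void *sc_lookup(uint32_t addr)
{
    uint32_t set = (addr >> 7) & (SETS - 1);   /* software hash of the address */
    uint32_t tag = addr >> 15;
    sc_set *s = &sc_dir[set];

    for (int w = 0; w < WAYS; w++)             /* software tag compare: this loop */
        if (s->tag[w] == tag)                  /* is what a hardware look-up      */
            return s->line[w];                 /* instruction would replace       */

    return sc_miss(addr);                      /* miss handler stays in software */
}
```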
2009 International Symposium on System-on-Chip, 2009
In this paper we propose an instruction to accelerate software caches. While DMAs are very efficient for predictable data sets that can be fetched before they are needed, they introduce a large latency overhead for computations with unpredictable access behavior. Software caches are advantageous when the data set is not predictable but exhibits locality. However, software caches also incur a large overhead. Because the main overhead is in the access function, we propose an instruction that replaces the look-up function of the software cache. This instruction is evaluated using the Multidimensional Software Cache and two multimedia kernels, GLCM and H.264 Motion Compensation. The results show that the proposed instruction accelerates the software cache access time by a factor of 2.6. This improvement translates to a speedup of 2.1 for GLCM and 1.28 for MC, compared with the IBM software cache.
Proceedings of the 16th Symposium on Integrated Circuits and Systems Design (SBCCI 2003), 2003
Proceedings International Parallel and Distributed Processing Symposium, 2003
Proceedings. 15th Symposium on Computer Architecture and High Performance Computing, 2003
The X4CP32 is a parallel/reconfigurable microprocessor with two programming levels. Although it is a general-purpose microprocessor, it has the reliable performance of a reconfigurable architecture. This paper ...
2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009
The Cell processor consists of a general-purpose core and eight cores with a complete SIMD instruction set. Although originally designed for multimedia and gaming, it is currently being used for a much broader range of applications. In this paper we evaluate whether the Cell SPEs could benefit significantly from a scalar processing unit, using two methodologies. In the first methodology the scalar processing overhead is eliminated by replacing all scalar data types by the quadword data type. This methodology is feasible only for relatively small kernels. In the second methodology SPE performance is compared to the performance of a similarly configured PPU, which supports scalar operations. Experimental results show that the scalar processing overhead ranges from 19% to 57% for small kernels and from 12% to 39% for large kernels. Solutions to eliminate this overhead are also discussed.
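The first methodology, eliminating scalar overhead by promoting scalars to quadwords, can be illustrated with a small reduction kernel. The sketch uses GCC vector extensions as a stand-in for the SPU's native quadword types and is an assumption-laden illustration, not the paper's code:

```c
/* Sketch of the first methodology: removing scalar-processing overhead by
 * promoting scalar variables to quadwords, so a SIMD-only datapath never has
 * to insert/extract individual elements. GCC vector extensions stand in for
 * the SPU's native quadword types. */
typedef int v4si __attribute__((vector_size(16)));   /* one 128-bit quadword */

/* Scalar version: on an SPE each += would cost extra shuffle/insert work. */
int sum_scalar(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Quadword version: the accumulator itself is a quadword, so the loop body
 * maps directly onto SIMD adds; n is assumed to be a multiple of 4, so n4 is
 * the number of quadwords. */
int sum_quadword(const v4si *a, int n4)
{
    v4si acc = {0, 0, 0, 0};
    for (int i = 0; i < n4; i++)
        acc += a[i];                            /* element-wise SIMD add */
    return acc[0] + acc[1] + acc[2] + acc[3];   /* single reduction at the end */
}
```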